The built-in Ephesoft classification methods work well in most situations, but there will always be documents that need a bit of additional smarts to classify or merge correctly. A common example is multi-page invoices that repeat the same header and footer information (invoice number, date, PO number, etc.) on every page, so the only difference between pages is the line-item data. To Ephesoft, every page looks like a first page, so it splits the document into many single-page invoices.
A way around this is to use Keyword Classification with a rule to start a new document on a unique Invoice Number, but then you must make sure you have KV Page Process rules that accurately extract the invoice number on all of your documents. This can be a challenge to manage, especially since you also need Key Value Extraction rules to extract the invoice number, which doubles your work for any new invoice number variant.
A better solution would be to write some code that is flexible enough to be used in any project, at any part of the workflow, with any of the document data available. This would allow you to classify or merge documents on any number of criteria and quickly fix those problem documents that would be time-consuming or impossible to handle with the built-in Ephesoft features. With the previous invoice example, we could simply have this code run after key value extraction, merge all consecutive invoices with the same invoice number, and call it good.
Luckily, this code has already been written!
Features and Usage
First the code goes through all the documents in a batch and checks them against a set of criteria. The document is reclassified to the document type specified in the rule with the highest priority that meets all the criteria.
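To make the classification pass concrete, here is a minimal sketch of the idea, not the actual script. The `Rule` class, the map-of-fields document model, and the choice that a larger number means higher priority are all assumptions made for illustration; the real script reads its rules from XML and works on Ephesoft batch data.

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Predicate;

class ClassificationSketch {

    /** Hypothetical rule: a target document type, a priority, and criteria that must all hold. */
    static final class Rule {
        final String targetType;
        final int priority;
        final List<Predicate<Map<String, String>>> criteria;

        Rule(String targetType, int priority, List<Predicate<Map<String, String>>> criteria) {
            this.targetType = targetType;
            this.priority = priority;
            this.criteria = criteria;
        }

        /** A rule matches only when every one of its criteria accepts the document. */
        boolean matches(Map<String, String> document) {
            return criteria.stream().allMatch(c -> c.test(document));
        }
    }

    /** Returns the target type of the highest-priority matching rule, or empty if none match. */
    static Optional<String> classify(Map<String, String> document, List<Rule> rules) {
        return rules.stream()
                .filter(r -> r.matches(document))
                // Assumption: a larger number means a higher priority.
                .max(Comparator.comparingInt((Rule r) -> r.priority))
                .map(r -> r.targetType);
    }

    /** Illustrative rules only; the field names are made up for this sketch. */
    static List<Rule> exampleRules() {
        Predicate<Map<String, String>> hasInvoiceNumber = d -> d.containsKey("InvoiceNumber");
        Predicate<Map<String, String>> mentionsCredit =
                d -> d.getOrDefault("OcrText", "").contains("CREDIT");
        return List.of(
                new Rule("Invoice", 1, List.of(hasInvoiceNumber)),
                new Rule("Credit Memo", 2, List.of(hasInvoiceNumber, mentionsCredit)));
    }

    public static void main(String[] args) {
        Map<String, String> doc = Map.of("InvoiceNumber", "1001", "OcrText", "CREDIT MEMO");
        // Both rules match, so the one with the higher priority decides the type.
        System.out.println(classify(doc, exampleRules()).orElse("(unchanged)"));
    }
}
```

A document that matches no rule comes back as an empty `Optional`, which in this sketch means its type is left alone.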
Next, all consecutive documents are compared against a set of merging rules. If the document pair meets all the criteria, then the documents are merged together.
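The merge pass can be pictured with a similarly small sketch. Again, this is an illustration under assumptions rather than the script itself: the `Doc` class and its fields are hypothetical, and the example criteria (same type, same non-empty invoice number) simply mirror the invoice scenario from the introduction.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

class MergeSketch {

    /** Hypothetical document: a type, an invoice number, and a list of page identifiers. */
    static final class Doc {
        final String type;
        final String invoiceNumber;
        final List<String> pages;

        Doc(String type, String invoiceNumber, List<String> pages) {
            this.type = type;
            this.invoiceNumber = invoiceNumber;
            this.pages = new ArrayList<>(pages);
        }
    }

    /** Merges each document into its predecessor when every criterion holds for the pair. */
    static List<Doc> mergeConsecutive(List<Doc> batch, List<BiPredicate<Doc, Doc>> criteria) {
        List<Doc> result = new ArrayList<>();
        for (Doc current : batch) {
            Doc previous = result.isEmpty() ? null : result.get(result.size() - 1);
            boolean merge = previous != null
                    && criteria.stream().allMatch(c -> c.test(previous, current));
            if (merge) {
                // Append the pages to the surviving (earlier) document.
                previous.pages.addAll(current.pages);
            } else {
                // Copy so the input batch is never modified.
                result.add(new Doc(current.type, current.invoiceNumber, current.pages));
            }
        }
        return result;
    }

    /** Example criteria: same document type and the same non-empty invoice number. */
    static List<BiPredicate<Doc, Doc>> sameInvoiceCriteria() {
        BiPredicate<Doc, Doc> sameType = (a, b) -> a.type.equals(b.type);
        BiPredicate<Doc, Doc> sameNumber =
                (a, b) -> !a.invoiceNumber.isEmpty() && a.invoiceNumber.equals(b.invoiceNumber);
        return List.of(sameType, sameNumber);
    }

    public static void main(String[] args) {
        List<Doc> batch = List.of(
                new Doc("Invoice", "1001", List.of("PG0")),
                new Doc("Invoice", "1001", List.of("PG1")),
                new Doc("Invoice", "1002", List.of("PG2")));
        for (Doc d : mergeConsecutive(batch, sameInvoiceCriteria())) {
            System.out.println(d.invoiceNumber + " " + d.pages);
        }
    }
}
```

Because each document is compared with the already merged predecessor, a run of three or more matching pages collapses into a single document in one pass.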
Configurable via XML
The same configuration files can be re-used at multiple steps if needed
Configuration files can be re-used and tweaked for similar projects
Exposes handy batch/document data
Document Type
Page Level Field values
Document Level Field values
Email Headers
Original File Name
Drop Folder Path
OCR Data
Supports multiple criteria per rule
Rule prioritization
Operators
Same - consecutive documents are the same type or have the same value
Distance - a value is within a Levenshtein distance of another value
Equals/Not Equals - a value equals or doesn't equal another value
Matches/Not Matches - a value matches or doesn't match a regular expression
Contains/Not Contains - a value contains or doesn't contain another value
Starts/ends with - a value starts or ends with another value
Has/doesn't have a value
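For a feel of what the operators above boil down to, here is a rough sketch of how they might be evaluated on plain strings. The operator names, the `evaluate` and `withinDistance` helpers, and the choice to treat Matches as a full-string regular expression match are all assumptions for this example; the real names and semantics live in the script and its XML configuration.

```java
import java.util.regex.Pattern;

class OperatorSketch {

    /** Classic dynamic-programming Levenshtein edit distance between two strings. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance operator: true when the two values are at most maxDistance edits apart. */
    static boolean withinDistance(String value, String other, int maxDistance) {
        return levenshtein(value, other) <= maxDistance;
    }

    /** Evaluates one of the simple string operators; operator names here are hypothetical. */
    static boolean evaluate(String operator, String value, String operand) {
        switch (operator) {
            case "equals":       return value.equals(operand);
            case "not-equals":   return !value.equals(operand);
            // Assumption: "matches" requires the whole value to match the expression.
            case "matches":      return Pattern.matches(operand, value);
            case "not-matches":  return !Pattern.matches(operand, value);
            case "contains":     return value.contains(operand);
            case "not-contains": return !value.contains(operand);
            case "starts-with":  return value.startsWith(operand);
            case "ends-with":    return value.endsWith(operand);
            case "has-value":    return !value.trim().isEmpty(); // operand is ignored
            default:
                throw new IllegalArgumentException("Unknown operator: " + operator);
        }
    }

    public static void main(String[] args) {
        // An OCR slip (letter O for zero) is one edit away from the clean value.
        System.out.println(withinDistance("INV-1001", "INV-1O01", 1));
        System.out.println(evaluate("matches", "INV-1001", "INV-\\d{4}"));
    }
}
```

The Distance operator is the one that earns its keep on OCR data, since a single misread character would defeat a strict Equals comparison.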
I typically put this in the Document Assembler, Extraction, and Export scripts in every project. If I don't end up needing the helpers, they can remain turned off in the script configuration. The Document Assembler script handles the bulk of the classification and merging duties. The Extraction script can then classify and merge based on extracted values. Finally, the Export script can double-check that documents look good after an operator has touched them (it can often share a configuration with the Extraction script).
Code and configuration
The code and base configuration files can be found here: https://gist.github.com/jhotmann/3af8bc2bd1cf367d213f4b1704d0e782
When the script first runs, it also creates a properties file in the script-config directory that can be used to turn classification and merging on or off and to control which XML files the script uses.
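The create-on-first-run pattern for that properties file could look roughly like the sketch below. The property keys, the default values, and the file name are hypothetical placeholders chosen for this example; check the generated file in your own script-config directory for the real ones.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class ScriptConfigSketch {

    /** Hypothetical defaults: both helpers stay off until someone enables them. */
    static Properties defaults() {
        Properties p = new Properties();
        p.setProperty("classification.enabled", "false");
        p.setProperty("merging.enabled", "false");
        p.setProperty("classification.config", "classification-rules.xml");
        p.setProperty("merging.config", "merge-rules.xml");
        return p;
    }

    /** Loads the properties file, writing the defaults first if it does not exist yet. */
    static Properties loadOrCreate(Path file) {
        try {
            if (Files.notExists(file)) {
                if (file.getParent() != null) {
                    Files.createDirectories(file.getParent());
                }
                try (OutputStream out = Files.newOutputStream(file)) {
                    defaults().store(out, "Hypothetical script settings");
                }
            }
            // Keys missing from the file fall back to the defaults.
            Properties loaded = new Properties(defaults());
            try (InputStream in = Files.newInputStream(file)) {
                loaded.load(in);
            }
            return loaded;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Reads a boolean switch; anything other than "true" counts as off. */
    static boolean isEnabled(Properties settings, String key) {
        return Boolean.parseBoolean(settings.getProperty(key, "false"));
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("script-config");
        Properties settings = loadOrCreate(dir.resolve("classify-merge.properties"));
        System.out.println("classification on? " + isEnabled(settings, "classification.enabled"));
        System.out.println("merge rules file: " + settings.getProperty("merging.config"));
    }
}
```

Defaulting both switches to off matches the idea that the helpers can sit unused in every project until they are needed.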
Jordan Hotmann, Developer