Improve Ephesoft Classification & Document Assembly

The built-in Ephesoft classification methods work really well in most situations, but there will always be documents that require a bit of additional smarts to classify or merge correctly. A common example is a multi-page invoice that repeats the same header and footer information (invoice number, date, PO number, etc.) on every page, so the only difference between pages is the line-item data. To Ephesoft, every page looks like a first page, so it splits the document into many single-page invoices.


A way around this is to use Keyword Classification with a rule that starts a new document on a unique invoice number, but then you must make sure you have KV Page Process rules that accurately extract the invoice number on all of your documents. This can be a challenge to manage, especially when you also need Key Value Extraction rules to extract the invoice number, which doubles your work for every new invoice number variant.


A better solution would be to write some code that is flexible enough to be used in any project, at any part of the workflow, with any of the document data available. This would allow you to classify or merge documents on any number of criteria and quickly fix those problem documents that would be time-consuming or impossible to handle with the built-in Ephesoft features. With the previous invoice example, we could simply have this code run after key value extraction, merge all consecutive invoices with the same invoice number, and call it good.


Luckily, this code has already been written!


Features and Usage


First, the code goes through all the documents in a batch and checks each one against a set of classification rules. A document is reclassified to the document type specified by the highest-priority rule whose criteria it meets in full.

Next, each pair of consecutive documents is compared against a set of merging rules. If the pair meets all the criteria of a rule, the two documents are merged together.
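The two passes described above can be sketched roughly as follows. This is not the code from the gist: `Doc`, `ClassifyRule`, `classify`, and `merge` are hypothetical names, documents are modeled as plain in-memory objects rather than Ephesoft's batch XML, and a larger priority number is assumed to win.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiPredicate;
import java.util.function.Predicate;

class TwoPassSketch {

    /** Hypothetical in-memory stand-in for one document in a batch. */
    static class Doc {
        String type;
        int pageCount;
        final Map<String, String> fields = new HashMap<>();

        Doc(String type, int pageCount, String invoiceNumber) {
            this.type = type;
            this.pageCount = pageCount;
            if (invoiceNumber != null) {
                fields.put("InvoiceNumber", invoiceNumber);
            }
        }

        /** Returns the field value, or an empty string when it is missing. */
        String field(String name) {
            return fields.getOrDefault(name, "");
        }
    }

    /** Hypothetical classification rule: every criterion must hold for it to apply. */
    static class ClassifyRule {
        final int priority;
        final String targetType;
        final List<Predicate<Doc>> criteria;

        ClassifyRule(int priority, String targetType, List<Predicate<Doc>> criteria) {
            this.priority = priority;
            this.targetType = targetType;
            this.criteria = criteria;
        }

        boolean appliesTo(Doc doc) {
            return criteria.stream().allMatch(c -> c.test(doc));
        }
    }

    /** Pass 1: reclassify each document using the highest-priority rule that applies. */
    static void classify(List<Doc> batch, List<ClassifyRule> rules) {
        for (Doc doc : batch) {
            ClassifyRule best = null;
            for (ClassifyRule rule : rules) {
                // Assumption: a larger number means a higher priority.
                if (rule.appliesTo(doc) && (best == null || rule.priority > best.priority)) {
                    best = rule;
                }
            }
            if (best != null) {
                doc.type = best.targetType;
            }
        }
    }

    /** Pass 2: fold a document into its predecessor when the pair meets all criteria. */
    static List<Doc> merge(List<Doc> batch, List<BiPredicate<Doc, Doc>> criteria) {
        List<Doc> result = new ArrayList<>();
        for (Doc doc : batch) {
            Doc previous = result.isEmpty() ? null : result.get(result.size() - 1);
            if (previous != null && criteria.stream().allMatch(c -> c.test(previous, doc))) {
                previous.pageCount += doc.pageCount; // the merged document keeps the first one's fields
            } else {
                result.add(doc);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Doc> batch = new ArrayList<>(Arrays.asList(
                new Doc("Unknown", 1, "1001"), new Doc("Unknown", 1, "1001"),
                new Doc("Unknown", 1, "1002"), new Doc("Unknown", 1, null)));
        Predicate<Doc> hasNumber = d -> !d.field("InvoiceNumber").isEmpty();
        classify(batch, Arrays.asList(
                new ClassifyRule(1, "Other", Collections.emptyList()),
                new ClassifyRule(10, "Invoice", Arrays.asList(hasNumber))));
        BiPredicate<Doc, Doc> sameType = (a, b) -> a.type.equals(b.type);
        BiPredicate<Doc, Doc> sameNumber =
                (a, b) -> hasNumber.test(a) && a.field("InvoiceNumber").equals(b.field("InvoiceNumber"));
        for (Doc doc : merge(batch, Arrays.asList(sameType, sameNumber))) {
            System.out.println(doc.type + " " + doc.field("InvoiceNumber") + " pages=" + doc.pageCount);
        }
    }
}
```

In this sketch the two single-page invoices numbered 1001 end up as one two-page invoice, which is the multi-page invoice fix described earlier. Because each document is compared with the already merged predecessor, a long run of matching pages collapses into a single document.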

  • Configurable via XML

  • The same configuration files can be re-used at multiple steps if needed

  • Configuration files can be re-used and tweaked for similar projects

  • Exposes handy batch/document data

      • Document Type

      • Page Level Field values

      • Document Level Field values

      • Email Headers

      • Original File Name

      • Drop Folder Path

      • OCR Data

  • Supports multiple criteria per rule

  • Rule prioritization

  • Operators

      • Same - consecutive documents are the same type or have the same value

      • Distance - a value is within a given Levenshtein distance of another value

      • Equals/Not Equals - a value equals or doesn't equal another value

      • Matches/Not Matches - a value matches or doesn't match a regular expression

      • Contains/Not Contains - a value contains or doesn't contain another value

      • Starts/Ends With - a value starts or ends with another value

      • Has/Doesn't Have a Value - a value is present or missing
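To make the operators listed above concrete, here is a minimal sketch of how such comparisons could be evaluated. The `Op` enum and the method names are hypothetical and do not come from the gist. It also assumes that Matches means the whole value must match the regular expression, and that a blank value counts as having no value.

```java
import java.util.regex.Pattern;

class OperatorSketch {

    /** Hypothetical operator names mirroring the feature list. */
    enum Op {
        EQUALS, NOT_EQUALS, MATCHES, NOT_MATCHES, CONTAINS, NOT_CONTAINS,
        STARTS_WITH, ENDS_WITH, HAS_VALUE, NO_VALUE
    }

    /** Evaluates one criterion; a null value is treated as an empty string. */
    static boolean test(Op op, String value, String operand) {
        String v = value == null ? "" : value;
        switch (op) {
            case EQUALS:       return v.equals(operand);
            case NOT_EQUALS:   return !v.equals(operand);
            case MATCHES:      return Pattern.matches(operand, v); // assumption: whole-value match
            case NOT_MATCHES:  return !Pattern.matches(operand, v);
            case CONTAINS:     return v.contains(operand);
            case NOT_CONTAINS: return !v.contains(operand);
            case STARTS_WITH:  return v.startsWith(operand);
            case ENDS_WITH:    return v.endsWith(operand);
            case HAS_VALUE:    return !v.trim().isEmpty();
            case NO_VALUE:     return v.trim().isEmpty();
            default:           throw new IllegalArgumentException("Unknown operator: " + op);
        }
    }

    /** Classic two-row dynamic-programming Levenshtein distance. */
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Distance operator: true when the two values differ by at most maxDistance edits. */
    static boolean withinDistance(String a, String b, int maxDistance) {
        return levenshtein(a, b) <= maxDistance;
    }

    public static void main(String[] args) {
        System.out.println(test(Op.STARTS_WITH, "INV-10045", "INV-"));   // true
        System.out.println(test(Op.MATCHES, "INV-10045", "INV-\\d+"));   // true
        System.out.println(levenshtein("INV-10045", "INV-1OO45"));       // 2
        System.out.println(withinDistance("INV-10045", "INV-1OO45", 2)); // true
    }
}
```

The Distance operator is the one that earns its keep on OCR data: an invoice number read as `INV-1OO45` on one page and `INV-10045` on the next is two edits apart, so a small distance threshold still treats the pages as belonging together.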


I typically put this code in the Document Assembler, Extraction, and Export scripts in every project. If I don't end up needing the helpers, they can remain turned off in the script configuration. The Document Assembler script handles the bulk of the classification and merging duties. The Extraction script can then classify and merge based on extracted values. Finally, the Export script can double-check that documents look good after an operator has touched them; it can often share a configuration with the Extraction script.


Code and configuration


The code and base configuration files can be found here: https://gist.github.com/jhotmann/3af8bc2bd1cf367d213f4b1704d0e782


When the script is first run, it also creates a properties file in the script-config directory. Use that file to turn classification and merging on or off and to control which XML files the script uses.
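As a rough illustration of that create-on-first-run behavior, the sketch below writes a properties file with defaults when none exists and reads it back otherwise. The property keys, file names, and the off-by-default choice are all assumptions made for this example; check the generated file in your own script-config directory for the real names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

class ScriptConfigSketch {

    /** Loads the properties file, creating it with defaults on the first run. */
    static Properties loadOrCreate(Path file) throws IOException {
        Properties props = new Properties();
        if (Files.notExists(file)) {
            // Hypothetical keys and file names; both helpers are assumed to start turned off.
            props.setProperty("classification.enabled", "false");
            props.setProperty("merging.enabled", "false");
            props.setProperty("classification.rules", "classification-rules.xml");
            props.setProperty("merging.rules", "merge-rules.xml");
            Files.createDirectories(file.toAbsolutePath().getParent());
            try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
                props.store(out, "Classification and merging helpers");
            }
            return props;
        }
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        return props;
    }

    /** Reads a boolean switch; a missing key counts as turned off. */
    static boolean isEnabled(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key, "false"));
    }

    public static void main(String[] args) throws IOException {
        // A temporary directory stands in for the script-config directory.
        Path dir = Files.createTempDirectory("script-config");
        Properties props = loadOrCreate(dir.resolve("classify-merge.properties"));
        System.out.println("classification on: " + isEnabled(props, "classification.enabled"));
        System.out.println("merge rules file: " + props.getProperty("merging.rules"));
    }
}
```

Keeping the switches in a file next to the script means the same script can be dropped into several workflow steps and enabled only where it is needed.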



Jordan Hotmann, Developer