Straight Through Processing for Document Automation


Table of Contents

Overview
Desktop Automation Drives Expectations
Where the Standard View of Unattended Automation Fails
Document Classification and Unattended Automation
A Different Way of Thinking About Unattended Automation
Taking a New Approach
Leveraging Unattended Automation
Automating Simple Tasks
Automating Task Verification
Achieving STP for Complex Tasks
STP Approach Based on Data Science
Rethinking Traditional RPA
Designing and Measuring a System
Looking at Efficiencies
Mortgage Document Classification and Separation Workflow
Manual Processes vs. STP in Mortgage Processing
Accounts Payable Workflow
Intelligent Capture
Automation without Manual Intervention
Identifying the Possibilities of Document Classification
Range of Documents
Degree of Variance
Available Information
Evaluating Possibilities with Data Extraction
STP Performance and What to Examine
Confidence Scores are not Probabilistic
A Practical Example: Invoice Data Extraction
What About Probability?
The Ugly Truth
Over Reliance on OCR Tools
Character-level and Word-level Accuracy Limitations
STP Requires Reliable Confidence Scores
OCR Designed for Intelligent Capture
First Step: Test Data
Second Step: Gather Test Data
Step Three: Configuring the System
Step Four: Conducting the Test
Step Five: Analyzing the Results
Is that All?
Summary: Data Entry Savings vs. Savings from STP

Straight Through Processing - Document Automation

Overview

Straight through processing (STP) is an automated electronic process that is used by enterprises large and small. Ideally, STP allows for an entire workflow process, from initiation to the final output or results, to be attained free of human intervention. This appears deceptively simple. For very basic, routine tasks, straight through processing can be achieved. However, for more complex processes, STP has proved elusive.

This eBook examines the gap between expectations and realities that many enterprises face today in achieving STP and how to bridge this gap through newer solutions and a paradigm shift. STP requires a different way of thinking. For example, instead of thinking about information within documents as a page-level construct, consider the data as the data, irrespective of the page on which it exists. These and other subtle paradigm shifts are discussed along with applied technologies to achieve levels of automation that weren’t possible even recently. Proven use cases are also presented to help illustrate the different ways in which straight through processing is successfully being achieved today.

Desktop Automation Drives Expectations

One of the solution areas driving expectations the most, in terms of business processes, is robotic process automation, or RPA. RPA emerged from the tried-and-true use of scripting and macros to automate simple and rote tasks such as moving data from one application GUI to another. This is most commonly referred to as desktop automation. RPA excels at these types of automated tasks because it offers an easy way to “record” a user’s actions and play them back repeatedly with high throughput. The result is binary: either the task is completed successfully as a whole or it is not. This type of automation is known as unattended automation, where the tasks don’t require input from a person, or require it only in rare situations. Organizations enthusiastically adopted RPA for all of their simple processes for obvious reasons: it reduces workload and allows staff to be either reduced or reallocated.

Unfortunately, only so many processes are simple and routine enough to support unattended automation via any type of solution. This is where expectations currently do not match the capability of solutions. RPA and business process automation vendors have started to talk of artificial intelligence, using words like “cognitive” to convey the notion. Let’s take, for example, a process that involves more complex parsing of information: the information contained within documents. A common and complex use case is in lending.

Where the Standard View of Unattended Automation Fails

A mortgage lender has a lot of processes that involve documents. The origination process is arguably one of the most document-intensive processes that exists; there can be a few hundred to over a thousand different documents involved from the initiation of the process to the funding and completion of a loan. Can this complex process be automated to the extent that humans are not required? In other words, is such a process a candidate for unattended automation?

If we define unattended using the principles of RPA, where a task is either executed completely and correctly or it is not, then the answer is decidedly “no.” This is because document-based data is complex. There are often many different data elements on a single page, leaving a significant potential for error. Take, for instance, the HUD 1008 form, otherwise known as the “Uniform Underwriting and Transmittal Summary.” While relatively simple, this form can have over 40 different data fields to locate and extract. The probability that any system can accurately locate all of the fields and extract them correctly is very low. If the document is more complex, such as an appraisal or an unstructured document where there are no easy ways to locate data, then the likelihood of an error on a single page rises dramatically. So using the “RPA-inspired” definition, even the most advanced machine learning system would most likely provide less than 5% unattended automation.

Document Classification & Unattended Automation

So what about document classification? Document classification gives a binary answer about a document type: a document either is the given type or it is not. In a simple document classification task, single documents are submitted as individual files, and the only task involved is to identify them. Here high levels of unattended automation using the RPA definition can be achieved.

In most cases, document classification is more complex because many documents arrive within a single PDF file, requiring not only document classification but also document separation. A document might be classified correctly, but erroneously include a page from another document. The result is that a large loan file consisting of several hundred documents is most likely to contain multiple errors requiring human intervention. Even if humans are involved in these tasks, we often see errors between 2% and 9% that significantly impact the ability to have high degrees of page-level automation.

This page-based approach is actually how most advanced capture systems are configured: to deliver document and/or page-level results that are reviewed completely by human staff. This process is what is typically referred to as attended automation or humans in the loop. The best case scenario using this approach is that the staff are simply pressing the ENTER key a lot if the answers are correct. While this process is typically more efficient than performing data entry, it is hardly the highest level of automation that can be achieved. We should be able to completely eliminate a lot of this review and rote pressing of the ENTER key.

A Different Way of Thinking About Unattended Automation

Can true unattended automation really be achieved with complex document-oriented processes? The answer is “yes,” but it requires a different type of thinking. Instead of thinking about information within documents as a page-level construct, consider the data as the data, irrespective of the page on which it exists. Instead of thinking about a HUD 1008 as a page of data, consider each data element as an entity in and of itself, just as you might think about a cell within a record of a database as a single element. Instead of visualizing a group of documents and pages with data, consider the loan packet itself as a complete database record of a loan consisting of several hundred or thousand data elements.

Is it possible to achieve high levels of unattended automation at the data element level? Yes, it is, and at rates of 80% and higher. This is because optimized, advanced document automation can divide document tasks at the individual data element level into two groups: results accurate enough to be used immediately by other systems, and results that need to be reviewed by a person. Each data element can essentially have a pass/fail, just like the simple tasks normally involved with RPA. Document or page-level decisions are irrelevant. The key is to consider each individual task.

Taking a New Approach

The reality is that complex processes with many potential areas of failure, especially those that involve document-based information, are more likely to be approached as attended automation. This will remain true unless we can break free of the concept of the page-based mentality, and separate the data from the modality in which the data happens to be presented.

At Parascript, we practically invented the concept of data-element level automation to achieve high levels of process efficiency and cost savings.

Why is Straight Through Processing (STP) important?

Certainly, it is the gains that STP offers in helping to eliminate data entry, but STP also eliminates the need to manually verify the data entry. In the first section, we discussed the two main types of automation, attended and unattended, and the appropriate use cases for each. This section focuses on STP, why it is important in process automation and what it can do for your organization.

Leveraging Unattended Automation

Automation of any type is valuable in that it reduces the amount of menial tasks, leaving time for higher value work. However, automation for automation’s sake is only part of the entire equation. Unattended automation is used in cases where the outcome of a task is essentially binary: either the task was completed successfully or it failed. The majority of such tasks can move “straight through” in a fully automated manner. This capability relies upon a very important factor: the ability to determine with high precision that a task was not just executed, but that it was executed correctly.

Automating Simple Tasks

For simple tasks, we accomplish this through task validation. For example, rather than create a script to provision an email account and presume that everything was done correctly, validation routines can be created to send an email and verify that it was sent and received. We can verify that the email address created matches the established naming convention and that it is for the intended employee. All these checks identify whether this task is successful or not.

Automating Task Verification

These checks are critical because they verify the actions of an automated task. After all, we certainly don’t want to rely blindly upon our robotic process automation to provision information technology resources when tasks may be done incorrectly. Without automated task verification, the alternative would be to manually verify the output of each task. Just as you would not expect to “approve” every single action of an autonomous car, organizations clearly do not expect to manually check every single result. Without a high level of STP, true automation is never achieved.
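The email provisioning checks described above can be sketched in a few lines of Python. This is only an illustration: the `ProvisionedAccount` record, the `first.last@domain` naming convention, and the `example.com` domain are assumptions made for the example, not part of any particular RPA product.

```python
from dataclasses import dataclass


@dataclass
class ProvisionedAccount:
    """Result reported by a hypothetical provisioning script."""
    employee_id: str
    first_name: str
    last_name: str
    email: str
    test_message_delivered: bool  # outcome of a send-and-receive check


def expected_address(first_name: str, last_name: str, domain: str) -> str:
    """Assumed naming convention: first.last@domain, all lowercase."""
    return f"{first_name}.{last_name}@{domain}".lower()


def validate_provisioning(account, intended_employee_id, domain="example.com"):
    """Return the list of failed checks; an empty list means the task passed."""
    failures = []
    if account.employee_id != intended_employee_id:
        failures.append("account belongs to the wrong employee")
    if account.email != expected_address(account.first_name, account.last_name, domain):
        failures.append("address does not match the naming convention")
    if not account.test_message_delivered:
        failures.append("test email was not sent and received")
    return failures
```

A task with no failed checks can move straight through; anything else is routed to a person, so only the exceptions require manual attention.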

Achieving STP for Complex Tasks

What about more complex tasks that cannot be completely automated (like the loan origination process discussed in the previous section)? Unlike simple tasks that can be easily automated and verified, the process of locating and extracting 20-30 data elements on a document can rarely be summarized as a simple pass/fail. Even if all the data on a single document could be located and extracted, what type of validation can be performed to determine whether it is all correct?

Achieving STP on complex document-oriented processes is not as straightforward as it is for simpler tasks. Unfortunately, this complexity has resulted in a significant number of organizations using advanced capture solutions to automate tasks only to find that they need to verify every single result. The upside is that while the answers to achieving STP are not as straightforward as what you expect with simpler tasks, it is definitely possible. It just requires a different approach: one based upon data science.

STP Approach Based on Data Science

Unlike rules-based task automation, document-oriented task automation typically involves more sophisticated technologies including computer vision, classifier algorithms and pattern matching. While none of these technologies can deliver 100% certainty, well-designed systems can achieve predictability. This allows organizations to statistically measure the results of particular configurations so that the output is reliable, typically at 95% accuracy or greater.

System predictability is based on the careful curation of sample data that is representative of the larger population of production data. Using that set, the data is analyzed to arrive at the right mix of algorithms used to create a particular task configuration. The results of the configuration are then analyzed against expected output to identify the “signature” of accurate data. From there, the system distinguishes correct data from incorrect data at accuracy levels that reach 99%. This means that even more complex “cognitive tasks” can benefit from straight through processing at the discrete task level, saving organizations millions of dollars every year.

In order to truly benefit from automation, organizations need as much straight through processing as possible. This requires significantly minimizing the need for any review whatsoever. To do that, success depends on reliably separating correct output from the output that requires manual intervention. No one wants to constantly tell a self-driving car what to do.

How Much Straight Through Processing Can We Expect in Document Automation?

In the previous section, we discussed how approaching complex automation tasks such as document automation requires a data science approach to create a system that has predictable and reliable output. It is not simply a case of 100% pass or fail. Rather, there are subtleties that have to be addressed.
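One way to picture the sample-based tuning described in this section is a small calibration routine that scans a labeled test set and finds the lowest score boundary at which the accepted output still meets a target accuracy. The sketch below is a simplified illustration under that assumption; the function name, the `(score, is_correct)` sample format and the default target are hypothetical and do not describe any vendor's actual method.

```python
def calibrate_threshold(samples, target_accuracy=0.99):
    """Pick the lowest score boundary whose accepted output meets the target.

    samples: list of (confidence_score, is_correct) pairs from a labeled
    test set that is representative of production data.
    Returns (threshold, stp_rate); (None, 0.0) if no boundary qualifies.
    """
    ranked = sorted(samples, key=lambda s: s[0], reverse=True)
    best = (None, 0.0)
    correct = 0
    for i, (score, is_correct) in enumerate(ranked, start=1):
        correct += int(is_correct)
        # Only evaluate at the end of a run of equal scores, so that a
        # threshold never splits results that share the same score.
        if i < len(ranked) and ranked[i][0] == score:
            continue
        if correct / i >= target_accuracy:
            # Everything scoring at or above `score` would be accepted.
            best = (score, i / len(ranked))
    return best
```

The second value returned is the share of the sample that would pass without review at that boundary, which serves as a rough estimate of the STP rate to expect if production data resembles the sample.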

Rethinking Traditional RPA

Approaching document automation requires a different level of thinking than “traditional” RPA automation. First, it is not realistic to expect a pass/fail at the page or document level for activities that involve automation of data entry. Even with today’s advances in machine learning, the probability that any single page will enjoy 100% accurate data location and extraction is very low: less than 5%. Only with page or document-level classification (where no document separation is involved) can you apply such a “binary” approach. For data extraction, where there are multiple chances of success or failure on a single page, automation must be measured at the data field level, not at the page or document level.

As an example, let’s take a flexible spending account process where the goal is to automate the location and extraction of five key data fields on receipts of various sizes and formats. Expecting a high percentage of receipts where all the data is processed accurately would be the wrong approach. Even the most refined deep learning algorithms would have problems reliably locating and outputting all five fields on a majority of receipts. Maybe 5% to 10% of receipts would have complete success. However, the system could locate and output 85% or more of all fields across all receipts. It looks something like this (these numbers are hypothetical, but based on real tests):

Successful receipt-level extraction on    Outcome
1 field                                   98+%
2 fields                                  85+%
3 fields                                  50+%
4 fields                                  25+%
5 fields                                  <15%

So looking at any data entry automation project in terms of a full page or document is not ideal because most pages or documents will require some level of review.
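A rough calculation shows why page-level success falls off so quickly as fields are added. The sketch below assumes that each field is extracted correctly with the same probability and independently of the others, which is a simplification; the 93% per-field figure is hypothetical and chosen only for illustration.

```python
def all_fields_correct(per_field_accuracy: float, fields_per_page: int) -> float:
    """Chance that every field on a page is right, assuming independent fields."""
    return per_field_accuracy ** fields_per_page


# Illustrative only: a hypothetical 93% per-field accuracy.
for fields in (1, 5, 20, 40):
    print(fields, round(all_fields_correct(0.93, fields), 3))
```

Under these assumptions a 40-field page comes out fully correct only about 5% of the time, even though 93% of the individual fields are right. That gap is the reason to measure at the field level.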

Designing & Measuring a System

You cannot design and measure a system based solely on the ability to achieve either a pass or fail 100% of the time. This is unlike designing automation for very rote, defined tasks for which it is easy to determine pass/fail scenarios. When it comes to employing “cognitive” types of computing, we have to deal with predictions, not absolutes. Machine learning makes mistakes that are hard to plan for or comprehend. The more complex the task is, the more complex the potential outcomes. For any particular data field on a page, there is a probability of a system’s ability to achieve success.

Confidence Score and Thresholds

This is where the concepts of “confidence scores” and “confidence thresholds” come into play. For any advanced capture system, the output for each answer will be accompanied by a confidence score. This score is based on a lot of inferences such as complexity of the location, whether there are other candidate answers, range of potential answers and many more.

Confidence scores are meant to be used to construct a statistical measure of all output so that it is possible to create a boundary score, often called a confidence threshold, which determines with a certain level of precision whether output for a given field is correct or incorrect. Without confidence scores, we would have to manually review all output, and that certainly doesn’t sound like straight through processing.

STP Measured at the Data Field Level

The amount of straight through processing needs to be measured at the data field level and requires identification of confidence thresholds for each data field type in order to automatically identify good data from bad. This means that workflows employing document automation should emphasize review only on the data fields that require it. The remainder should be shuttled straight into downstream processes.
Overall benefits are measured not on how many pages or documents can move straight through, but how much overall data entry is removed from the equation. It stands to reason that if an organization is performing data entry on 100 million data fields per month, reducing 50% of that workload at a data field level represents significant value.
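As a minimal sketch of field-level routing, the snippet below sends each extracted field either straight through or to review based on a threshold for its field type. The field names, threshold values and record layout are hypothetical; in practice, thresholds would come from testing against representative sample data.

```python
# Hypothetical per-field-type thresholds, assumed to come from prior testing.
THRESHOLDS = {"invoice_number": 0.90, "total_amount": 0.95}
DEFAULT_THRESHOLD = 1.01  # unknown field types can never pass, so they go to review


def route_fields(extracted):
    """Split extracted fields into straight-through and needs-review lists.

    extracted: list of dicts with 'field', 'value' and 'confidence' keys.
    """
    straight_through, needs_review = [], []
    for item in extracted:
        threshold = THRESHOLDS.get(item["field"], DEFAULT_THRESHOLD)
        if item["confidence"] >= threshold:
            straight_through.append(item)
        else:
            needs_review.append(item)
    return straight_through, needs_review


def stp_rate(straight_through, needs_review):
    """Share of fields that needed no manual handling."""
    total = len(straight_through) + len(needs_review)
    return len(straight_through) / total if total else 0.0
```

Reviewers then see only the fields in the second list, and the STP rate is reported per field rather than per page or per document.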

Let’s take another look at the amount of automation that can be achieved for the above review process. This time, let’s focus on the total percentage of fields instead of the total number of receipts (again, this is hypothetical, but based on real tests).

Successful receipt-level extraction on    Outcome
95% of all fields                         80+%
85% of all fields                         85+%
75% of all fields                         ~90%
50% of all fields                         >95%

Looking at Efficiencies

Looking at efficiencies at the data field level, we can see that there is an 85% probability that 95% of all data fields can be correctly located and extracted. If the number of fields is 100 million, that means almost 81 million data fields (0.85 × 0.95 × 100 million ≈ 80.75 million) do not require data entry or review. That is a significant level of straight through processing.

What does STP Look Like?

Exploring what straight through processing (STP) looks like in document process automation requires examining actual examples of successful STP. In this next section, we go into detail, providing before-and-after examples of processes that make use of STP. The level of STP for document-based tasks varies. And yet, significant amounts of automation are still achievable; you just have to apply different measurements.

Mortgage Document Classification & Separation Workflow

The processes involved with originating or servicing a loan involve a lot of paperwork, even if that “paper” is now a bunch of digital documents. From proofs of income and assets to other supporting documentation including appraisals, individual documents can number into the hundreds and thousands for commercial loans. Common across all of the various scenarios is the need to distinguish one document from another.

1. Verification Process

In a modern mortgage origination process, a lender will receive, often piecemeal, a number of different documents. Upon receipt, the task turns to verifying that the document is correctly submitted (it is the document requested) and meets certain criteria, such as a pay stub issued within the last two months. This process often incurs lag depending on when the document is submitted, in what manner it is submitted, and how many documents need to be reviewed.

2. Document Separation

Another related task is the process of separating one document from another. Lenders and servicers like to deal with individual documents, so if more than one document is submitted as a single file, that file is “burst” into individual files where each file represents a single document. This process can take quite a bit of time since the person performing the task must first identify the document, then page through the file to locate the last page, and then create an individual file.

3. Potential for Error

For both tasks, it is almost always a process of first classifying the document, and then separating it into an individual computer file. For both of these tasks there is a significant opportunity to introduce error in addition to delays. For document classification, a credit card statement showing liabilities may incorrectly be identified as a bank statement showing assets. This introduces unnecessary delays and customer frustration. For document separation, the credit card statement may be correctly identified, but an error could be made such as incorrectly including the last page showing the balance with another document. This affects downstream processes that rely upon verifying total liabilities, which need immediate access to the page where the value resides.

Manual Processes vs. STP in Mortgage Processing

If these tasks are completed manually, there is no STP since every component is touched by a person. When looking at this same process with automation, we can introduce a measurement for each task. For document classification, the measurement can be a simple percentage of documents classified correctly vs. incorrectly. We use sophisticated statistical models to measure accuracy and establish a “threshold” that governs which documents have a high likelihood of being classified correctly (often at 98% or 99% accuracy), and therefore can move through with no manual effort involved.

If a single loan servicing staff member can process 100 documents per hour and 100,000 documents are received in an eight-hour period, the servicer must employ approximately 125 staff to process them during an 8-hour workday. With automation measured with the document-level success criteria, we can suppose that a well-tuned system can effectively remove 70% of that work.

If document separation is involved, it is more complex since we must add a new measurement to the percentage of documents successfully classified. In this scenario, the system cannot produce more than 70% STP for document classification. This is the absolute maximum, presuming 100% STP for document separation; however, this is never the case because there will also be errors, even with sophisticated systems. The key is that the error is knowable, and therefore controllable. Presuming that the system can accurately separate 85% of documents (again measured at 98% or 99% accuracy), this means that about 56% of total effort is eliminated and can go straight through with no manual review.

In this scenario, the time to separate documents reduces the throughput of each staff member to around 60 documents per hour. So total daily staff time is reduced from around 1,660 hours to 733. This sounds impressive. Even more impressive is that, through measurable, controllable automation, we are able to sustain a 98% to 99% accuracy rate for that output. Compared with manual processes that can range in accuracy from 93% to 95%, automation effectively reduces error by around 75% (from 5%-7% error to 1%-2%).
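The staffing figures in this example can be reproduced with simple arithmetic. The sketch below takes the text's own inputs as given (100,000 documents, 100 or 60 documents per hour, about 56% of effort eliminated); the function name is hypothetical, and the error-rate midpoints of 6% and 1.5% are an assumption used to illustrate the "around 75%" figure.

```python
def staff_hours(documents: int, docs_per_hour: float) -> float:
    """Total staff hours needed to handle a batch of documents manually."""
    return documents / docs_per_hour


documents = 100_000

# Classification only: 100 documents per hour over an 8-hour workday.
staff_needed = staff_hours(documents, 100) / 8

# Classification plus separation: throughput drops to about 60 per hour.
manual_hours = staff_hours(documents, 60)

# Taking the stated figure of about 56% of effort eliminated as given.
remaining_hours = manual_hours * (1 - 0.56)

# Relative error reduction, using assumed midpoints of the stated ranges.
error_reduction = (0.06 - 0.015) / 0.06

print(staff_needed, round(manual_hours), round(remaining_hours), round(error_reduction, 2))
```

With these inputs, the manual baseline works out to roughly 1,667 staff hours per day and the remainder to roughly 733, which lines up with the figures quoted above.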

Accounts Payable Workflow

Accounts Receivable/Accounts Payable (AR/AP), which operates on the data itself, is a bit more involved when measuring STP. As previously discussed, we must measure not by the percentage of invoices that have data accurately extracted, but by the percentage of data that is extracted correctly. This takes us to measuring not by document, but by data field; there is no such thing as STP at the invoice level, but significant gains in productivity can still be achieved.

Invoice Data Extraction: Accuracy

This is because with a tuned system, certain data can be accepted as accurate and not require any data review or entry. This leaves only a relatively small percentage of data to be managed. For invoices, Parascript software can automate data entry of about 85% of the data fields from these transactional documents. This leaves only 15% of the remaining data on average for manual data entry.

Let’s say an organization deals with 5000 documents per day. If it takes 25 seconds to locate and perform data entry on 15 fields, automation can reduce the number of fields handled manually to about two, resulting in a total daily time savings of almost 30 hours. Again, the key measure is the reduction in the amount of manual tasks, this time measured at field-level data entry, not invoice-level work.
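The time savings in this example can be checked with a short calculation. The sketch assumes that data entry time scales evenly across fields, which the text does not state; the function and parameter names are hypothetical.

```python
def daily_hours_saved(docs_per_day, seconds_per_doc, fields_per_doc, manual_fields_after):
    """Hours of data entry removed per day, assuming equal time per field."""
    seconds_per_field = seconds_per_doc / fields_per_doc
    saved_per_doc = (fields_per_doc - manual_fields_after) * seconds_per_field
    return docs_per_day * saved_per_doc / 3600


# 15% of 15 fields left for manual entry is 2.25 fields, or "about two".
print(round(daily_hours_saved(5000, 25, 15, 0.15 * 15), 1))
```

Leaving 15% of the 15 fields for manual entry gives a saving of roughly 29.5 hours per day under this assumption, in line with the "almost 30 hours" quoted above.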

Intelligent Capture

Intelligent capture software is one of the few applications out there that can offer real cost savings along with improved data accuracy. The key to measuring those gains from a straight through processing perspective is to understand the tasks involved so that measurements can be taken at the individual task level and then rolled up to understand the real ROI.

Automation Without Manual Intervention

In this section, we delve further into the important aspects of document automation, particularly the documents themselves, that determine the upper limit of automation. Essentially, the benefit of adding document automation to your processes all boils down to one question: how much work can I automate that requires no manual intervention?

The coy (and accurate) answer is always, “it depends.” This is typically due to the nature of the task and the attributes of the documents. There are some general rules of thumb, built over years of experience with document automation projects, that you can use. These rules are based upon two common automation tasks: document classification and data extraction.

Identifying the Possibilities of Document Classification

Most enterprises have the need to organize documents for some reason or another. Sometimes, it is to support the ability to locate information when needed. Other times, document classification is involved within a given process to make it faster and more controllable.

A classic “nightmare” scenario might be the need to manually sift through and categorize a room full of bankers boxes of documents to support pre-trial discovery. However, regardless of the need, the ultimate aim is to remove the necessity for documents to be organized manually because of the associated issues of cost, time and accuracy.

Within document classification, a number of attributes affect any system’s performance. These include:

1. the range of documents within the scope;
2. the degree of variance within the document types; and
3. the information that can be used to classify these documents.

Range of Documents

An organization that only needs to sort through and organize three types of documents will realize a significant difference in performance from an organization that needs to address several hundred different document types. This is because with the introduction of each new document type to a classification project, you introduce the possibility of the classifier mistakenly assigning a document to the wrong document class/type.

Simply put, the more potential classes involved, the greater the potential for confusion, and ultimately, error. Exactly how much confusion and error are caused by each new document class is hard to calculate, but we do know from experience that the error rate is not linear.

For example, in a mortgage document classification project, we might find that a bank can deal with 200 different documents. For each classification task, the classifier evaluates a given document to determine if it belongs to one of the 200 identified document classes. This task is quite a bit more complex than the task of assigning a given document to one of two or three classes. This is because as we add a new document class, we add potential for overlap between the characteristics of one document class and another, or worse, several other document types.

Degree of Variance

While there are always differences between two different document types, there are also potential differences between two documents of the same type. For instance, the document class of “Credit Report” can be considered a single type. However, within that type, there are as many variations in terms of data and layout as there are organizations providing credit reports. That is, there is no single format where we can always anticipate the same data.

As a result, there are different key attributes that might distinguish a credit report from Experian from one from TransUnion. Just like the potential for error when we’re dealing with multiple document types, the degree of variance within a document type introduces the possibility of error.

Available Information

Some documents are easy to classify just by looking at them. For instance, receipts have a typical shape and data. Invoices typically include tables somewhere in the middle of the page. Other documents, such as text-heavy agreements, require more analysis to determine the correct document class assignment. As a general rule, the more attribute-based information that is distinct to a particular document class, the better. When document classes combine many different and distinct attributes, we can realize fairly reliable results.

Examples of Document Types

For instance, an invoice can be distinct based on the layout (table in the middle, numeric data on the bottom right and address block on the top half), text (presence of the word invoice), and non-text data such as a logo. For an agreement, we rely much more on text that might be shared with other document types, so the ability to correctly assign the document class is hampered. Generally, classifiers of all types do better when a document class has a distinct set of attributes, and the more, the better.

Most document classification projects, even complex ones such as mortgage classification, can get 70% or more STP with enough sample data, time for analysis, configuration and refinement. Generally speaking, your classification results drop by a fraction of a percentage with each new document type, but the calculation is not linear. A few document types can achieve 90% or more while 500 to 800 may get somewhere around 70% STP.

Evaluating Possibilities with Data Extraction

When it comes to data extraction, it all comes down to the variance of the data from two major standpoints: data type and data location. Data types can mean differences within the date format such as American vs. common European formats. Or, it may mean variance between typed and handwritten data. Data location typically speaks to whether the document in question is structured (such as a form), semi-structured (such as an invoice or remittance), or unstructured (such as an agreement or contract).

Even structured forms can have variance in data location due to differences in how the document was scanned. If it started as paper, a host of image quality problems can present themselves. Or, the way that the information was manually entered can have an impact. We have all seen examples of forms where the person wrote data well outside of the box.

Just as with document classification, the ability to realize high STP rates is heavily dependent upon the variance, with the higher variance documents providing less reliable results than their low variance counterparts. As a rule, structured forms can start with 80% or more STP rising to 95% or more while unstructured documents might only start with 40%-50% STP.
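To make the date-format variance mentioned above concrete, here is a minimal Python sketch. The format labels and the helper name `interpretations` are hypothetical and chosen only for illustration; the sketch simply shows that the same characters on a page can be valid under both an American and a European reading.

```python
from datetime import datetime

# Hypothetical labels for two common date layouts (an assumption for illustration).
FORMATS = {"american": "%m/%d/%Y", "european": "%d/%m/%Y"}

def interpretations(text):
    """Return every calendar date the text could represent under FORMATS."""
    found = {}
    for name, pattern in FORMATS.items():
        try:
            found[name] = datetime.strptime(text, pattern).date()
        except ValueError:
            # The text is not a valid date under this layout.
            pass
    return found

ambiguous = interpretations("03/04/2020")    # valid under both layouts
unambiguous = interpretations("6/13/2020")   # there is no month 13
```

A value that parses under both layouts cannot be resolved from the characters alone, which is one reason higher variance in data type tends to lower achievable STP.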

STP Performance and What to Examine

There are no exact answers for how much STP you can achieve without doing a good amount of work. And yet, it is definitely possible to put boundaries around potential STP performance by examining, at a high level, the number of documents within a project and the degree of variance within each document type. After that, you can start to put together a framework listing out the documents by type, the estimated number of variants, the document structure and data types. From there, you can provide rough estimates based upon the guidelines provided above to get a good sense of what is practical in terms of overall project STP objectives.

Technology Behind Straight Through Processing

Technical Side of STP: Confidence Scores and Thresholds

Delving into the underlying intelligent capture technology behind achieving STP requires examining confidence scores. In client engagements and in conversations with analysts and consultants, the topic of confidence scores almost always comes up. Far too often, there is a lack of understanding of what confidence scores actually are and how they are used. A common question is something like, “do you display confidence scores to users?” This question reveals a fundamental lack of understanding, so let’s dig deeper into some common misunderstandings and the realities.

Confidence Scores are not Probabilistic

Why is displaying a confidence score not useful? Because a score by itself has no specific ability to communicate whether the associated data output is accurate or not. A confidence score of 80 does not mean that it “has an 80% probability of being correct.” So if this is the case, what good are they?

First, let’s use an analogy. Suppose there was a test for which a student got 30 questions correct and they come to you to ask what the resulting grade is. Could you give them a grade? No, you’d likely ask how many total questions there were on the exam. Confidence scores are similar in that you need more context in order to make them useful. Instead of knowing the total number of questions on the exam, you need to understand the full range of confidence scores for any particular answer value.

A Practical Example: Invoice Data Extraction

For instance, let’s suppose we have 1000 invoices that we use to configure and measure a system. For this project, we wish to use confidence scores to determine when the “Invoice Total” field is likely to be accurate vs. inaccurate. Rather than evaluate a single “Invoice Total” confidence score from the thousand, we evaluate the scores for all 1000 invoices. Doing this, we can evaluate which “Invoice Total” answers are correct and which ones are incorrect, noting the confidence score ranges for the correct and incorrect answers. We can then order the answers by confidence score from largest to smallest. It might look something like this:

Answer    Correct?   Score
102.70    Y          78
95.00     Y          72
34.23     Y          65
54.36     Y          55
28.55     N          35
250.75    N          28
136.12    N          10

In this sorted view of “Invoice Total” answers, we can see that answers with scores below 55 are more likely incorrect while answers with scores of 55 or above are likely correct. Of course the total sorted list would consist of 1000 answers, but you get the point. The only way to use confidence scores is to take a sizable set of output and perform this exercise for each data field. Using this example, a score of 55 doesn’t mean 55% probability of being accurate; it simply means that, based upon analysis, it is likely to be correct.
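The sorting exercise above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `pick_threshold` and its rule (choose the cutoff that misroutes the fewest answers, where an answer at or above the cutoff is accepted) are hypothetical, not a description of any particular product. The rows are the seven sample answers from the table.

```python
# (answer, correct?, confidence score) for the seven sample "Invoice Total" answers.
rows = [
    ("102.70", True, 78),
    ("95.00", True, 72),
    ("34.23", True, 65),
    ("54.36", True, 55),
    ("28.55", False, 35),
    ("250.75", False, 28),
    ("136.12", False, 10),
]

def pick_threshold(rows):
    """Return (threshold, errors) for the cutoff that misroutes the fewest answers.

    An answer is accepted when its score is >= threshold. Errors are wrong
    answers that were accepted plus correct answers that were sent to review.
    """
    best = None
    for t in sorted({score for _, _, score in rows}):
        wrong_accepted = sum(1 for _, ok, s in rows if s >= t and not ok)
        right_rejected = sum(1 for _, ok, s in rows if s < t and ok)
        errors = wrong_accepted + right_rejected
        if best is None or errors < best[1]:
            best = (t, errors)
    return best

threshold, errors = pick_threshold(rows)
```

On the seven sample rows this selects a threshold of 55 with no misrouted answers. On a real set of 1000 invoices the correct and incorrect score ranges normally overlap, so the chosen threshold reflects a trade-off rather than a clean split.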

What About Probability?

Up until now, we have only shown how to use confidence scores to separate likely accurate data from inaccurate data. So how do we calculate probability? The answer is that we must analyze a large sample set of representative data (1000 invoices is a good number). “Representative” means that your sample data resembles your actual data. And then, you have to perform a lot more analysis to get to an understanding of individual score accuracy.

For instance, once again using the analogy of the exam, you will calculate the accuracy rate of all answers above and below the threshold. You might find that 450 answers out of 500 that are above the threshold are accurate. This equates to an accuracy rate of 90%. And then, you might group each answer by the score and find that, out of 50 answers with a score of 85, 48 are correct, yielding an accuracy rate of 96%. You might calculate all of these numbers. Ultimately, you can get to a level of understanding to calculate the accuracy rate for each confidence score. This is how confidence scores should be evaluated and used.

The reality is that this is a lot of work that few organizations perform due to the complexity and time requirements.
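The two calculations described above (accuracy of everything above the threshold, and accuracy grouped by individual score) can be sketched as follows. The function names are hypothetical, and the data set is synthetic: it is constructed only to reproduce the counts quoted in the text (450 of 500 above the threshold, 48 of 50 at a score of 85). The threshold of 55 and the score of 60 used for the remaining answers are assumptions.

```python
from collections import defaultdict

def accuracy_above(results, threshold):
    """Accuracy rate of all answers whose score is at or above the threshold."""
    accepted = [ok for score, ok in results if score >= threshold]
    return sum(accepted) / len(accepted) if accepted else None

def accuracy_by_score(results):
    """Accuracy rate for each individual confidence score value."""
    counts = defaultdict(lambda: [0, 0])  # score -> [correct, total]
    for score, ok in results:
        counts[score][0] += int(ok)
        counts[score][1] += 1
    return {score: correct / total for score, (correct, total) in counts.items()}

# Synthetic (score, correct?) pairs matching the counts in the text.
results = (
    [(85, True)] * 48 + [(85, False)] * 2      # 50 answers at score 85, 48 correct
    + [(60, True)] * 402 + [(60, False)] * 48  # the other 450 answers above threshold
)

overall = accuracy_above(results, threshold=55)  # 450 / 500
per_score = accuracy_by_score(results)           # per_score[85] is 48 / 50
```

Per-score rates computed this way are only as trustworthy as the number of samples behind each score, which is part of why the text calls for a large, representative sample set.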

The Ugly Truth

Confidence scores, by themselves, don’t mean anything, yet many organizations act under that presumption. This behavior results in either increased risk or increased cost. That is, the risk of letting bad data go through (e.g., all incorrect answers with a score of 80 go through) or spending a lot of effort to review each answer (e.g., the system never produces high confidence scores).

Most of the time, organizations don’t trust the system because they are not able to adequately measure it, so they verify 100% of the system output. We have also observed systems that cannot produce reliable confidence scores at all, so there is no ability to use them; this also forces 100% manual review to ensure accuracy.

This is not the definition of straight through processing.

Does Your OCR Suffer From Low Confidence?

This section focuses on the reliability of OCR tools. The fact is that most Intelligent Capture software is complex. The result is that few organizations really plan to achieve straight through processing (STP) through the use of field-level confidence score thresholds. What happens if an organization invests the necessary time and resources to achieve STP? Will it be successful? The answer is, more often than not, “no.”

In the previous section, I ended with an ominous warning (“the ugly truth”) regarding use of confidence scores and potential lack of reliability. The effect is that organizations often review 100% of their data and, in most implementations, this unfortunate fact is the only way to assure high quality data.

One of the most common inquiries that we receive is about the ability to improve a process that already uses “OCR.” (Note: I place OCR in quotes because there is a lot more that goes into Intelligent Capture than use of OCR.) The main problem typically revolves around the desire to capture more data than the current system.
A typical objective associated with this desire is to also reduce the amount of manual data entry. Often, this then leads to the question of how much data they must currently review and correct. The answer is almost always 100%. That is, 100% of data is reviewed by staff who then make a determination of whether particular data fields need to be corrected. After examining the situation, we typically find that the output along with associated confidence scores are not up to the task of achieving reliable automated verification. Why is this?

Over Reliance on OCR Tools

There is a good reason most people closely associate Intelligent Capture solutions with OCR: most of these solutions use OCR and probably grew out of solutions offering more simple capabilities such as document imaging. Many times organizations create their own solutions using OCR toolkits. For simple needs, creating a custom solution using off-the-shelf OCR is completely reasonable. For instance, using OCR to convert images into searchable text is a perfect task for most OCR toolkits.

However, when it comes to using OCR for turning document-based information into highly-reliable structured data, the requirements quickly outstrip the capabilities of OCR toolkits and the developers using them. This is because OCR was primarily created to solve a single task: transcribe text on a scanned document into machine-readable text. In this regard, little to no attention is paid to the accuracy of specific data fields. Accuracy is measured at the character and/or word level for all the data on the page.

Character-level and Word-level Accuracy Limitations

The focus on character-level and word-level accuracy generally results in good accuracy for converting images into searchable text. However, it provides very limited capability to output reliable confidence scores at a data field level versus at a character or word level. In other words, use of off-the-shelf OCR tools may get you accurate page-level data. And yet, without significant modification of the OCR tools themselves and additional development on top of the OCR results, a project (that requires knowing when data is accurate or not) cannot attain straight through processing.

Another problem is that many solutions implement a lot of rules that focus on validation of data, but these rules are run only after receiving output from OCR. So solutions might check output against a dictionary or other list of expected values, or process the output using pattern recognition to detect if the output is accurate. These efforts don’t do anything for the confidence score itself; scores from OCR are not changed and therefore, they cannot be used to establish a reliable threshold. The only way to potentially improve reliability of confidence scores is to use this type of validation during the process of recognition, which can help the OCR engine make a better selection when presenting the correct answer.

STP Requires Reliable Confidence Scores

In the previous section, we covered how confidence scores can be used to identify a threshold that separates, at a statistical level, good data from bad. After analysis of a sizable amount of data, we can set thresholds that represent specific accuracy levels. For instance, with invoice data extraction, we can set a threshold of 70 for “Total Amount” that means 88% of all Total Amounts are extracted with 98% accuracy. That is a fairly precise statement. And precision requires reliability.

What happens when OCR cannot output reliable field-level confidence scores? Instead of getting a range of data and scores that allow sorting to identify a threshold, such as what is seen in the first table, you end up getting a range that offers no ability to identify a threshold at all, as can be seen in the second table.

Range of Data & Reliable Scores with Threshold

Answer    Correct?   Score
102.70    Y          78
95.00     Y          72
34.23     Y          65
54.36     Y          55
28.55     N          35
250.75    N          28
136.12    N          10

Range of Data & Unreliable Scores where no Threshold can be Established

Answer    Correct?   Score
102.70    N          78
95.00     Y          72
34.23     N          65
54.36     Y          55
28.55     N          35
250.75    Y          28
136.12    Y          10

Note that in both cases, the accuracy level of output is the same: four of seven fields are correct. But the output in the second table is unreliable when confidence scores are examined. Plenty of inaccurate data have relatively high scores and vice versa. The example of amounts is probably an easier task, but the more complex the data, the more difficult it is for general OCR tools to output reliable field-level confidence scores. Without reliability, organizations are stuck reviewing 100% of their data in order to ensure accuracy.
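One way to put a number on the difference between the two tables is a pairwise concordance check: for every pairing of a correct answer with an incorrect one, does the correct answer carry the higher score? This metric is not described in the text; it is swapped in here purely as an illustration (it is equivalent to the area under the ROC curve), and the function name `concordance` is hypothetical.

```python
# (score, correct?) pairs copied from the two tables above.
reliable = [(78, True), (72, True), (65, True), (55, True),
            (35, False), (28, False), (10, False)]
unreliable = [(78, False), (72, True), (65, False), (55, True),
              (35, False), (28, True), (10, True)]

def concordance(rows):
    """Fraction of (correct, incorrect) pairs where the correct answer scores higher.

    1.0 means a single threshold can separate good data from bad perfectly;
    values near or below 0.5 mean the scores carry little usable signal.
    Ties count as half a win.
    """
    good = [s for s, ok in rows if ok]
    bad = [s for s, ok in rows if not ok]
    if not good or not bad:
        return None
    wins = sum(1.0 if g > b else 0.5 if g == b else 0.0
               for g in good for b in bad)
    return wins / (len(good) * len(bad))

reliable_value = concordance(reliable)      # all 12 pairs are ordered correctly
unreliable_value = concordance(unreliable)  # only 3 of 12 pairs are
```

Both tables have the same accuracy (four of seven), yet the first scores 1.0 and the second 0.25, which matches the observation that identical accuracy can hide very different score reliability.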

OCR Designed for Intelligent Capture

The previous example isn’t just representative of custom projects that use off-the-shelf OCR tools. Intelligent Capture solutions can have the same problems for the same reasons. While these solutions offer a bevy of functions designed to improve accuracy and validate the data, the same unreliability problems of confidence scores remain.

That is where a special breed of OCR can help. Designed for data fields and not for page-level results, special-purpose OCR is built to work with specific data on a page, each with a particular data type, and each with its own range of potential values and business requirements. All this additional information is used during the data field recognition process, not after. This type of specialized OCR is trained on specific data types to not only provide high levels of accuracy, but to also output confidence scores that are reliable at a data field level. This means that with enough analysis, very stringent thresholds can be identified and used to control output. This ensures that only suspect data is reviewed, leaving the bulk of accurate data to pass straight into business workflows and applications.

Banks all over the world use Parascript software because of both its accuracy and reliability. This results in over 90% cost reductions. Can organizations enjoy automation with off-the-shelf OCR? Sure. But if the ultimate goal is to achieve high levels of automation without the need for manual intervention, OCR tools fall well short.

Practical Steps for a PoC to Assess STP

In the next section, there are practical steps for conducting a Proof of Concept (PoC) as guidance on assessing the straight through processing (STP) potential of any intelligent capture system. The main value provided by intelligent capture software is the ability to extract as much unstructured, document-oriented information as possible at the highest levels of accuracy. So while the user experience and operational management capabilities are also important, if the software fails to deliver high levels of accurate data, then you might as well stay with a manual process. When we need to accurately measure a system, we need good test data with which to observe system performance. This is called “ground truth data.”

First Step: Test Data

Ground truth data is essentially sample data that has the answer key. For instance, if you plan to test and compare the ability to process invoice data, then the ground truth data will consist of samples of invoices along with the actual value of each field you wish to extract for each sample invoice. Each sample would look something like this:

File Name: Invoice1234.PDF
Invoice Date: 6/13/2020
Invoice Number: 1234
Invoice Amount: 2112.00

Second Step: Gather Test Data

Your test data should be taken from real production examples. While artificial test data could potentially substitute for real data, it is typically insufficient to adequately represent the true nature of your documents. The amount of test data you need to reliably measure any system realistically depends on the amount of variance or differences observed in each document type. The more test data you have, the more accurate your measurements will be. That is, 500 samples should be a bare minimum in order to reliably understand if a given system actually performs in production the way it performs in testing.
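A ground truth record like the one shown in the first step, together with the 500-sample minimum from the second step, might be represented as follows. This is a minimal sketch: the class name, field names and helper functions are hypothetical, and values are kept as strings so they can be compared character for character with system output.

```python
from dataclasses import dataclass

MIN_SAMPLES = 500  # the "bare minimum" suggested in the text

@dataclass(frozen=True)
class GroundTruthRecord:
    """One sample document plus the known-correct value of each field."""
    file_name: str
    invoice_date: str
    invoice_number: str
    invoice_amount: str

def build_answer_key(records):
    """Index records by file name, rejecting duplicate entries."""
    key = {}
    for record in records:
        if record.file_name in key:
            raise ValueError(f"duplicate ground truth for {record.file_name}")
        key[record.file_name] = record
    return key

def is_large_enough(answer_key, minimum=MIN_SAMPLES):
    """True when the sample set meets the suggested minimum size."""
    return len(answer_key) >= minimum

# The sample record from the text.
sample = GroundTruthRecord("Invoice1234.PDF", "6/13/2020", "1234", "2112.00")
answer_key = build_answer_key([sample])
```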

Step Three: Configuring the System

Generally, it is far from practical to configure systems on your own. While all intelligent capture is designed to supply structured data from unstructured documents, the manner in which you configure systems can vary widely. Therefore, the ability to learn a number of systems to the degree at which you can configure a highly-tuned system is not realistic. This is where the vendors come in. They understand their software better than anyone and can provide the best support in terms of configuring a system for a test run; so it is always best, if possible, to have the vendor supply the configuration.

One word of caution, however. It is possible to create a configuration that works well in a test, but is not practical to use in production. To put it bluntly, a PoC can be gamed. For instance, if your need is to classify a wide variety of documents such as mortgage loan files, it is possible to configure a system using rules or templates based on sample data that does not work well in a real-world production environment where the variety of documents goes well beyond what is used in a PoC.

There are two ways to deal with this. The first is to always ask for information regarding how the software was configured to understand if the configuration only works for the PoC or if the configuration methodology would also work in production. The second is to create two sample sets that are similar in characteristics (e.g., number of samples, similar amounts of variance, similar document types, etc.), but are completely different sets of files. Provide one set, called the “training set,” to the vendor and then use the other set, the “test set,” to actually test the system.

You will want to provide a reasonable amount of time between providing the training set and actually conducting the test run; maybe a couple of weeks at most depending on the complexity of the PoC. Simple data extraction PoCs should only require a week or less of configuration and preparation by the vendor.
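The second safeguard above (two similar but completely different sets of files) can be sketched as a split that is done separately within each document type, so both sets keep a similar mix. The function name, the even 50/50 split, the fixed seed and the file names are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def split_samples(samples, seed=0):
    """Split (file_name, doc_type) pairs into disjoint training and test sets.

    Files are shuffled and halved within each document type so that both
    sets contain a similar mix of types.
    """
    by_type = defaultdict(list)
    for file_name, doc_type in samples:
        by_type[doc_type].append(file_name)

    rng = random.Random(seed)  # fixed seed keeps the split repeatable
    training, test = [], []
    for doc_type in sorted(by_type):
        files = sorted(by_type[doc_type])
        rng.shuffle(files)
        half = len(files) // 2
        training += files[:half]
        test += files[half:]
    return training, test

# Hypothetical file names, for illustration only.
samples = [(f"invoice_{i}.pdf", "invoice") for i in range(6)]
samples += [(f"appraisal_{i}.pdf", "appraisal") for i in range(4)]
training_set, test_set = split_samples(samples)
```

Only the training set would go to the vendor; the test set stays with you until the day of the test.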

Step Four: Conducting the Test

The big day has arrived, or has it? Technically, conducting the test is not the main focus; rather, examining the results is the main priority. Conducting the test should merely be the day you provide the test set to the vendor. The vendor receives this set and processes the results. Be clear with the vendor regarding how much time is allowed to produce results. What you do not want is to have the vendor re-configure the system to improve actual results. One way you can reduce the likelihood of gamesmanship is to ask for the configured software to be delivered to you, where the software is installed in an environment you control. From there, the vendor can be authorized to run the test data through the system only, with no room for chicanery.

Step Five: Analyzing the Results

Analysis of the results is the big focus since the true value of an intelligent capture system is delivering the most data with the highest levels of accuracy. Because you have your ground truth data for your test deck, you have a very easy way of comparing each system’s output in an apples-to-apples manner. Ask for the results to be delivered in a simple structured format such as:

File Name        Field 1 - Date   Field 2 - Invoice   Field 3 - Amount
Invoice123.pdf   01/10/2020       12345               $34.56
Invoice345.pdf   03/15/2020       ABCDE               $123.45
...              ...              ...                 ...

Each cell of the table is the actual value of the file name and corresponding field. From here you can directly compare each system’s structured results with your answer key to identify the total accuracy of the system. The total system accuracy is the number of accurate data fields output divided by the total number of data fields, or:

System Accuracy = (# of Accurate Output Fields) / (Total # of All Fields)

For instance, if your sample set of 500 invoices has 1500 total fields (one each for date, invoice number and amount), and System A outputs 800 accurate fields, then System A Accuracy = 800/1500 or 53%. Once you have compared each system’s output to the answer key, you have a solid understanding of accuracy for each system. If some data fields are more important than others, using the same comparison method, you can understand accuracy at a field level by calculating, for instance, the accuracy of Invoice Amount: divide the number of accurate outputs for Amount by the total number of Amount fields.
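The comparison against the answer key can be sketched as below, taking system accuracy as accurate output fields divided by all fields. The function name `system_accuracy` and the dictionary layout are hypothetical, the comparison is a strict string match (a real evaluation might first normalize formats), and the system output shown is invented to include one wrong amount.

```python
def system_accuracy(truth, output):
    """Compare system output with ground truth.

    Both arguments map file name -> {field name: value}. Returns the overall
    accuracy and a per-field accuracy dictionary. A field missing from the
    output counts as inaccurate.
    """
    total = 0
    accurate = 0
    per_field = {}  # field -> (accurate count, total count)
    for file_name, fields in truth.items():
        produced = output.get(file_name, {})
        for field, expected in fields.items():
            hit = produced.get(field) == expected
            total += 1
            accurate += hit
            c, n = per_field.get(field, (0, 0))
            per_field[field] = (c + hit, n + 1)
    overall = accurate / total if total else 0.0
    by_field = {f: c / n for f, (c, n) in per_field.items()}
    return overall, by_field

# Ground truth uses the two sample rows from the table above.
truth = {
    "Invoice123.pdf": {"date": "01/10/2020", "invoice": "12345", "amount": "$34.56"},
    "Invoice345.pdf": {"date": "03/15/2020", "invoice": "ABCDE", "amount": "$123.45"},
}
# Hypothetical system output with one misread amount.
output = {
    "Invoice123.pdf": {"date": "01/10/2020", "invoice": "12345", "amount": "$34.56"},
    "Invoice345.pdf": {"date": "03/15/2020", "invoice": "ABCDE", "amount": "$128.45"},
}

overall, by_field = system_accuracy(truth, output)  # 5 of 6 fields match
example_from_text = 800 / 1500                      # about 0.53, as in the text
```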

Is that All?

If your goal is to reduce the amount of data entry (but not the amount of data verification), the analysis described in Step 5 will help you understand how much data entry can be avoided. However, if your goal is to completely avoid even having staff review output, you need to use another piece of data: the confidence score.

In the previous section we discussed how confidence scores can be used to understand how much data can be processed with no need to review it. We do that by using a sorted list of output and confidence scores (sorted by confidence score) at a data field level, to identify the confidence score threshold. It is this threshold that determines whether data should be reviewed by staff or proceed straight through with no review.

The process is just like it is described in the previous section with the exception that you perform this analysis on each system’s output. You may find that one system produces lower overall system accuracy, but has the ability to provide more straight through processing. For instance, after reviewing output by system accuracy and confidence score, we might find the following:

System Review     System A   System B
System Accuracy   80%        90%
STP Rate          70%        65%

So how can System A deliver higher STP with a lower system accuracy? The answer all comes down to the reliability of confidence scores. It is possible that System A is able to associate higher confidence scores with accurate data in a more reliable way than System B. So even though System B’s overall output is more accurate, it cannot tell with much precision which of that data is accurate. This forces a higher amount of accurate data to be verified by staff.
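The System A versus System B outcome can be reproduced with a small sketch. Everything here is an assumption made for illustration: the function name `stp_rate`, the 100% accuracy target, and two synthetic 20-field result sets built only so that the accuracy and STP figures match the table above.

```python
def stp_rate(results, target_accuracy):
    """Find the lowest threshold whose accepted data meets the accuracy target.

    results is a list of (score, correct?) pairs. Returns (threshold, rate),
    where rate is the share of all fields that pass straight through.
    Returns (None, 0.0) if no threshold meets the target.
    """
    ranked = sorted(results, key=lambda r: r[0], reverse=True)
    best_threshold, best_rate = None, 0.0
    correct = 0
    for i, (score, ok) in enumerate(ranked, start=1):
        correct += int(ok)
        # Only evaluate a cutoff once every answer sharing this score is counted.
        is_group_end = i == len(ranked) or ranked[i][0] != score
        if is_group_end and correct / i >= target_accuracy:
            best_threshold, best_rate = score, i / len(ranked)
    return best_threshold, best_rate

def accuracy(results):
    return sum(ok for _, ok in results) / len(results)

# Synthetic data: System A is less accurate but its high scores are trustworthy.
system_a = ([(s, True) for s in range(86, 100)]
            + [(50, False), (45, True), (44, True), (40, False), (30, False), (20, False)])
# System B is more accurate, but one wrong answer carries a high score of 80.
system_b = ([(s, True) for s in range(87, 100)] + [(80, False)]
            + [(s, True) for s in range(66, 71)] + [(10, False)])

a_threshold, a_stp = stp_rate(system_a, target_accuracy=1.0)
b_threshold, b_stp = stp_rate(system_b, target_accuracy=1.0)
```

In this construction System A is 80% accurate with 70% STP, while System B is 90% accurate with only 65% STP, because a single high-scoring wrong answer forces System B’s threshold above five answers that were in fact correct.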

Summary: Data Entry Savings vs. Savings from STP

The ultimate decision to focus on data entry savings vs. savings from STP is made on business needs including the nature of a given process. Every PoC or evaluation of intelligent capture should always and objectively measure system accuracy. Without a solid understanding of that single data point, you will have no understanding of system capability and ultimately the potential of success for your projects.
