Data Science with Intelligent Capture

DATA SCIENCE INTELLIGENT CAPTURE
Leveraging Smart Learning
PARASCRIPT
© 2021 Parascript Management, Inc. All rights reserved.

Page of Contents
OVERVIEW // WHAT IT'S ALL ABOUT
INTELLIGENT CAPTURE // LIVING UP TO A PROMISE
PRECISION IS THE KEY TO SUCCESS
INTELLIGENT CAPTURE IS DATA SCIENCE
WHAT ACCURACY NUMBERS REALLY MEAN
IN DATA SCIENCE, IT STARTS WITH DATA
IDENTIFYING GOOD DATA FROM BAD DATA
WHY CONFIDENCE SCORES ARE IMPORTANT
USING CONFIDENCE SCORES
SEPARATING GOOD DATA FROM BAD
INTELLIGENT CAPTURE REALIZED

OVERVIEW: WHAT IT IS ALL ABOUT

Expectations for today’s digital workforce automation are centered around higher speed and efficiency. As an enabling component for complex document-oriented robotic processes, intelligent capture must process as much document-based data as possible in a 100% unattended automation state. The return on investment lives or dies on this ability. Yet most organizations’ use of intelligent capture still involves a significant amount of data verification by human staff.

With all the automation available to organizations either on premise or via a cloud service, why does intelligent capture still have a problem living up to its promise? When it comes to intelligent capture, organizations are not interested in implementing workflows that require staff to manually sort documents and enter data. They are interested in removing as much of the manual labor as possible. Explore here how to attain true unattended document automation with high accuracy.

Intelligent Capture: Living Up to the Promise

In a recent AIIM survey of professionals who manage intelligent capture systems for their organizations, almost 40% of respondents selected, as either their first or second choice, that the accuracy of their system was not good enough. This is unsurprising given that over 45% selected, as their first or second choice, that complexity of configuration was a significant problem. If systems are overly complex, then it is reasonable to expect that they will not deliver as promised.

With all of the automation available to organizations either on premise or via a cloud service, why does intelligent capture still have a problem living up to its promise? When it comes to intelligent capture, organizations are not interested in implementing workflows that still require staff to manually sort documents and enter data. They are interested in removing as much of the manual labor as possible. This is different from, for instance, a CRM system, where there is little to no expectation of removing staff. There, the focus is on making human-centric workflows efficient and controlled, not on eliminating the human element.

With intelligent capture, the single biggest factor of success is the ability of the system to automate as much document-based data as accurately as possible without involving staff. This is called unattended automation. The return on investment lives or dies on this ability. This means that if you have 10 million invoices per year, your objective is to understand how much of this data can be accurately extracted. The ability to process document-oriented data with no human intervention is known as straight through processing (STP). If you use intelligent capture without the data science approach, it is likely that 5% of your volume could be consuming 50% of your workflow resources.

Precision is the Key to Success

In many respects, intelligent capture software resembles a data analytics platform. Both are measured on the quality of their results. Both require significant attention to sample data. Both require specialists who understand how to configure and measure the system. However, unlike data analytics, many organizations have approached configuration of an intelligent capture system using a few sample documents with only minimal analysis of the output.

The reality is that those organizations end up with almost zero automation due to the lack of reliability of the output. This is because using a few samples only allows a functional configuration: one that employs rules, but that has not been analyzed and optimized. The result is that these organizations essentially review 100% of the data.

Here we see a system that runs OCR on everything, but the output is so unreliable that all data is manually verified before it gets output. Even though 100% review is required, the system still ends up with some level of error due to problems with manual verification. Humans aren’t perfect, even when reviewing OCR output.

The ultimate goal for intelligent capture is to have as much data flow through the system as possible, leaving only a small amount for the staff to handle. This maximizes the amount of data that goes straight through. It requires organizations to take a data analytics approach, spending time on curating adequate sample sets and evaluating the results of the system.

Below we see a system properly configured and optimized to reliably classify and extract data using statistical measurement, delivering a high percentage of straight through processing. Some data still requires manual verification, but since the system has been configured and optimized using proper statistical methodologies, a large amount of data can go straight through.

In order to get to this level of reliability and efficiency, we need to use sample sets that accurately reflect production data. So unlike most, if not all, other systems that organizations employ, intelligent capture is about precision of document automation, which requires a significant amount of attention to data analysis and system optimization.

Intelligent Capture is Data Science

If you are only using intelligent capture to convert images into text, then you don’t have to spend time on the precision of your data. However, you are also not getting the full benefit of intelligent capture. Intelligent capture is not just OCR. It is the domain of technology focused on automating specific document processes, including identifying and sorting documents as well as locating specific data elements within documents and reliably extracting them. Intelligent capture can be applied to scanned documents as well as born-digital documents such as Word files or emails, which don’t require OCR at all.

Since intelligent capture is all about significantly reducing manual work, understanding its value comes down to answering one single question: how much work can flow straight through without any manual intervention? This question can often be answered with a single number, e.g., “85% flows straight through.” In reality, arriving at this single number or percentage involves performing many more calculations.

What Accuracy Numbers Really Mean

Accuracy has two components: the field-level ability to locate a particular data field on a page and, once the capture system locates a data field, the ability to transcribe the information successfully. Because some data is valued at a higher level than other data, system accuracy is measured at the data field level: calculate the percentage of fields located for each field and multiply that by the percentage of transcription accuracy.

READ RATE
This is the field-level percentage of data that the system is able to extract accurately, whether using OCR or other means. For structured forms where there is good image quality, the percentage can be quite high, as high as a 95% to 99% read rate. For variably-structured documents such as invoices, read rates are typically lower, depending upon the system’s ability to apply the appropriate algorithm to locate any single data field.

PAGE-LEVEL ACCURACY
This is the measurement of how many fields are read correctly at a page level. For instance, if a system reads 8 out of 10 fields on average for a single-page invoice, it has a page-level accuracy rate of 80%.

FIELD-LEVEL ACCURACY
This is the measurement of read rate for each field. Since some data is more important than other data, organizations often will prioritize performance for specific fields.

CONFIDENCE THRESHOLD
This is the field-level setting that governs whether data is accepted as accurate and sent straight through or sent for review / manual handling. The ability to reliably set thresholds to achieve specific accuracy rates is the single largest factor in achieving any level of unattended automation.
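To make these measurements concrete, the following is a minimal sketch of how they could be computed once system output has been compared against an answer key. The record layout, the function names and the sample data are assumptions made for illustration; they are not taken from any particular capture product.

```python
from collections import defaultdict


def field_level_accuracy(results):
    """Summarize per-field accuracy from an answer-key comparison.

    `results` is an iterable of (field_name, located, transcribed_correctly)
    tuples, one per field occurrence in the sample set (hypothetical layout).
    Returns {field_name: (location_rate, transcription_rate, combined_rate)}.
    """
    counts = defaultdict(lambda: [0, 0, 0])  # [total, located, correct]
    for field, located, correct in results:
        tally = counts[field]
        tally[0] += 1
        if located:
            tally[1] += 1
            if correct:
                tally[2] += 1

    summary = {}
    for field, (total, located, correct) in counts.items():
        location_rate = located / total
        # Transcription accuracy is measured only on fields that were located.
        transcription_rate = correct / located if located else 0.0
        summary[field] = (
            location_rate,
            transcription_rate,
            location_rate * transcription_rate,
        )
    return summary


def page_level_accuracy(fields_read_correctly, fields_on_page):
    """Share of a page's fields that were read correctly."""
    return fields_read_correctly / fields_on_page


# Illustrative data: the field is located on 8 of 10 invoices and
# transcribed correctly on 6 of those 8.
sample = (
    [("invoice_total", True, True)] * 6
    + [("invoice_total", True, False)] * 2
    + [("invoice_total", False, False)] * 2
)
rates = field_level_accuracy(sample)
invoice_page_accuracy = page_level_accuracy(8, 10)
```

In this made-up sample, the field is located 80% of the time and transcribed correctly 75% of the time once located, so only about 60% of the values are usable end to end. The page-level call mirrors the 8-out-of-10 invoice example above.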

IN DATA SCIENCE, IT STARTS WITH DATA

When the primary focus for a system is precision, finding the right data to use for both configuration and measurement of the system is essential. This is where data science comes in. You have probably heard of concepts such as statistical significance or margin of error. These are often used with polling and other survey-based research. For intelligent capture, we use similar measurements for a similar reason: to achieve precision.

There are two issues that must be addressed when evaluating the appropriate sample set to use. The first is the coverage of the range of document types and the second is the coverage of variance of documents within each type. These are represented in the graph below.

[Figure: Unrepresentative Sample Set. Horizontal axis: Range of Document Types. Vertical axis: Number of Document Layouts. Annotations: the sample set includes only a portion of the document types; document types unknown to the system; number of samples that will be collected for each document type; amount of different examples of a single document type in your sample set.]
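As a rough illustration of how margin of error translates into a sample count, the sketch below uses the standard survey formula for estimating a proportion, n = z²·p(1−p)/e², with an optional finite population correction. This is the generic polling calculation, offered here as an assumption about how one might size a measurement set; it is not a description of any vendor's sampling procedure. The function name and default values are illustrative.

```python
import math


def sample_size(margin_of_error, z=1.96, p=0.5, population=None):
    """Samples needed to estimate a proportion (e.g., a field's accuracy).

    margin_of_error: desired half-width of the interval, e.g. 0.05 for +/-5%.
    z: z-score for the confidence level (1.96 corresponds to about 95%).
    p: expected proportion; 0.5 is the most conservative choice.
    population: if given, apply the finite population correction.
    """
    n = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


n_5pct = sample_size(0.05)                           # +/-5% at ~95% confidence
n_3pct = sample_size(0.03)                           # +/-3% at ~95% confidence
n_small_batch = sample_size(0.05, population=1000)   # only 1,000 documents exist
```

Tightening the margin of error raises the required sample count quickly, which is one reason sample collection has to be planned per document type rather than improvised from a handful of examples.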

In trying to estimate the savings rate of the US population, it would be irresponsible for a research group to restrict data collection to only one part of the nation. The results would not represent the true savings rate if the focus was on a single slice of the country. It is the same for intelligent capture.

Let’s say an organization wishes to automate classification of mortgage documents that include an appraisal, tax form, credit report, application and a good faith estimate. If the sample set used for configuration does not reflect the range of document types to be automated in production (as represented in the graph below), then the resulting configuration will have a low rate of document classification. In this scenario, if the sample set does not include examples of appraisals, then these documents will be incorrectly classified. The result is that a large percentage, if not all, of these documents will go to exception, requiring staff to evaluate and organize them.

If the sample set does not include examples of the variation within any given document type, then a large percentage of data fields will go to exception, requiring staff to perform data entry. For example, let’s say the organization receives over 1000 different variants (or layouts) of an appraisal yet only uses 20-30 examples in its sample set to configure the system (represented by the blue line). There is a high probability that the configured system will not locate the needed data on a large number of these documents. The result is that, while perhaps they are properly classified, each appraisal document will require manual data entry.

The key to a reliable configuration is to use reliable sample data. The graph below indicates that all document types are represented in the samples and the number of samples within each document type reflects their variance.

Ultimately, the task of collecting and curating adequate sample sets is part art and part science. The only way to be 100% certain that your sample set properly represents your production data is to analyze everything. This is impractical and prohibitively expensive. Instead, we use statistical methods to get as close as possible without incurring a significant amount of cost in terms of both time and money.

[Figure: Ideal Representative Sample Set. Horizontal axis: Range of Document Types. Vertical axis: Number of Document Layouts. Annotations: the sample set adequately covers the distribution of document types; the sample set adequately covers the % of document variants; amount of different examples of a single document type in your sample set; the actual amount of different versions of a single document type in your production data.]
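The two coverage questions (are all document types present, and is the variation within each type reflected) can be checked with very little code once rough estimates exist. The sketch below assumes you can estimate the number of layout variants per document type in production; the dictionaries, names and figures are hypothetical and simply echo the appraisal example above.

```python
def coverage_report(estimated_variants, sampled_variants):
    """Fraction of estimated layout variants covered by the sample set.

    Both arguments map a document type to a count of distinct layouts:
    the estimated number in production and the number present in the samples.
    """
    report = {}
    for doc_type, estimated in estimated_variants.items():
        sampled = sampled_variants.get(doc_type, 0)
        # Cap at 1.0 in case the production estimate was too low.
        report[doc_type] = min(sampled / estimated, 1.0) if estimated else 1.0
    return report


def missing_types(estimated_variants, sampled_variants):
    """Document types expected in production but absent from the samples."""
    return sorted(
        doc_type
        for doc_type in estimated_variants
        if sampled_variants.get(doc_type, 0) == 0
    )


# Hypothetical estimates: 1000 appraisal layouts but only 25 in the samples,
# and no credit reports collected at all.
production = {"appraisal": 1000, "tax_form": 12, "credit_report": 40}
samples = {"appraisal": 25, "tax_form": 12}

coverage = coverage_report(production, samples)
absent = missing_types(production, samples)
```

A report like this only flags obvious gaps. Deciding how many samples per type are enough remains the statistical judgment described above.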

IDENTIFYING GOOD DATA FROM BAD DATA

Once we have gathered a good sample set and configured the software, it is time to evaluate the results in order to optimize the system. Our sample data along with the “answer key” allow us to compare the results of the system to the correct answer. This allows us to calculate the read rate for each field. We also spend a lot of time analyzing another number called the confidence score.

If you are a technical person who has worked with OCR software, then you have probably heard of and even made use of a confidence score. All OCR software provides character-level and word-level confidence scores. These scores give the developer an indication of how strongly the OCR software believes its answer to be correct. The scores are not probabilities, so a score of 80 does not mean an 80% probability of being correct.

These character and word scores can be useful. However, when it comes to actual data extraction (not simply converting an image to text), another confidence score comes into play: the data field confidence score. Just as page-level OCR produces character and word scores, software focused on data extraction produces a field confidence score. FormXtra.AI is focused on field-level data location and extraction, which differs from more generic full-page OCR software such as ABBYY FineReader, Nuance OmniPage SDK or the OCR available through Google, Amazon and Microsoft.

The field-level confidence score uses the raw OCR character- and word-level scores and synthesizes them with other available information to arrive at a final score produced by the software. This other information can be a data type (e.g., numeric, letters), a format (e.g., phone number vs. credit card number), etc. When it comes to achieving true automation, these confidence scores are critical. Unfortunately, most solutions cannot support true automation. To understand why, and the potentially significant negative impact on your project, read on.

WHY CONFIDENCE SCORES ARE IMPORTANT

In using a field-level confidence score, the main objective is to identify a threshold that separates good data from bad data. Good data is a correct answer, meaning an accurate, literal transcription of the field as represented on the page. If the input document has a date of birth of 1/1/1970, the field into which the data is transcribed should contain 1/1/1970 as well. A confidence score is assigned and output by the OCR engine for each field answer.

The field-level confidence score uses the raw OCR character- and word-level scores and synthesizes them with other available information to arrive at a final score. This other information can be, for example, the expected data type (such as numerals or letters) and format (such as phone number versus credit card number). For instance, if the answer to a phone number field provides a confidence score for each digit, a field-level confidence score assembles the individual scores and combines them with other information about the field, such as the expected length of the number (in this case 10 digits) and potentially its formatting, resulting in a single confidence score for the phone number.

When evaluating a field-level confidence score, for instance, the OCR engine might output the date of birth (DOB) value as 12/5/2008 along with a confidence score of 60. The field confidence scoring for each data element should output a consistent range of scores for correct answers. These scores should be higher than the scores for incorrect answers, so that if you evaluated the results for 100 DOB fields and sorted them according to the confidence score of each, the correct answers would, on average, have higher confidence scores than the incorrect answers.

Although confidence scores are used to distinguish likely correct answers from likely incorrect answers, confidence scores are not probabilistic: a score of 60 does not mean that there is a 60 percent likelihood that the answer is correct. In reality, no OCR engine can produce a perfect correlation between a confidence score and whether or not the answer is correct. There will be instances where a correct answer has a low confidence score. Regardless, with tuned systems, the results should indicate an obvious score threshold where the majority of answers above it are correct and the majority of answers below it are incorrect.
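One simple way to check whether a field's confidence scores behave as described is to compare the scores of correct and incorrect answers directly. The sketch below is an assumption about how such a check might look. The data is invented, and the pairwise ranking measure is a generic statistic rather than something prescribed by any OCR engine.

```python
from statistics import mean


def score_separation(answers):
    """Compare confidence scores of correct vs. incorrect answers.

    `answers` is a list of (confidence_score, is_correct) pairs for one field.
    Returns (mean_correct, mean_incorrect, ranked_higher), where ranked_higher
    is the fraction of correct/incorrect pairs in which the correct answer
    received the higher score.
    """
    correct = [score for score, ok in answers if ok]
    incorrect = [score for score, ok in answers if not ok]
    if not correct or not incorrect:
        raise ValueError("need at least one correct and one incorrect answer")

    pairs = [(c, i) for c in correct for i in incorrect]
    ranked_higher = sum(1 for c, i in pairs if c > i) / len(pairs)
    return mean(correct), mean(incorrect), ranked_higher


# Hypothetical DOB field results: (confidence score, matches answer key).
dob_answers = [
    (92, True), (88, True), (81, True), (77, True), (40, True),
    (79, False), (60, False), (55, False),
]
mean_correct, mean_incorrect, ranked_higher = score_separation(dob_answers)
```

Here the correct answers score higher on average, yet one correct answer sits at 40 and one incorrect answer at 79. This is the imperfect correlation the text describes, and it is why a threshold has to be chosen from measured results rather than assumed.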

USING CONFIDENCE SCORES

Once we understand field-level confidence scores, we can measure and tune field-level accuracy for higher quality data results. To be effective and reliable, this confidence score analysis should be based on several hundred to several thousand samples, ensuring the analysis includes the broadest array of variances in document quality and layout.

In the above example, the information we are concerned with is in the last three columns: the first two are the transcription of the field (date of birth) from the OCR engine and the confidence score for that field (also generated by the OCR engine). The final column shows the result of analysis as to whether the OCR answer matches exactly what is on the document image. Once this is done, answers can be ordered by confidence score from high to low in order to identify the optimal threshold.

To find the optimal threshold, you must calculate the accuracy provided at a specified threshold. We measure the actual accuracy of a specified threshold by dividing the number of OCR answers above the threshold that are accurate by the number of all answers provided above the threshold. In this scenario, all but one answer with a score equal to or above 74 are correct. There is also an answer below the threshold of 74 that is correct. Therefore, the majority of data can be segmented into two groups: one with a field-level confidence score of 74 or above and one with scores of less than 74. Separation of data into these two groups is the goal of using confidence scores.
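The threshold calculation described above can be sketched in a few lines. The answer list below is hypothetical, constructed only to mirror the scenario in the text (all but one answer at or above 74 correct, and one correct answer below 74). The function names and the target accuracy are likewise assumptions.

```python
def accuracy_at(answers, threshold):
    """Accuracy and straight-through share at a given threshold.

    `answers` is a list of (confidence_score, is_correct) pairs.
    Accuracy = correct answers at or above the threshold divided by all
    answers at or above the threshold. Returns (accuracy, share_accepted);
    accuracy is None when nothing meets the threshold.
    """
    accepted = [ok for score, ok in answers if score >= threshold]
    if not accepted:
        return None, 0.0
    return sum(accepted) / len(accepted), len(accepted) / len(answers)


def lowest_threshold_meeting(answers, target_accuracy):
    """Lowest observed score that reaches the target accuracy when used as
    the threshold, returned as (threshold, accuracy, share_accepted)."""
    for threshold in sorted({score for score, _ in answers}):
        accuracy, share = accuracy_at(answers, threshold)
        if accuracy is not None and accuracy >= target_accuracy:
            return threshold, accuracy, share
    return None


# Hypothetical DOB results: (confidence score, matches answer key).
dob_answers = [
    (95, True), (91, True), (88, True), (85, False), (82, True), (80, True),
    (78, True), (74, True), (70, False), (66, True), (61, False), (52, False),
]
accuracy_74, share_74 = accuracy_at(dob_answers, 74)
best = lowest_threshold_meeting(dob_answers, target_accuracy=0.85)
```

Choosing the lowest threshold that still meets the accuracy target keeps the straight-through share as large as possible. Because accuracy does not always rise smoothly as the threshold increases, a real analysis would inspect the whole curve over several hundred to several thousand samples rather than rely on a single pass like this one.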

SEPARATING GOOD DATA FROM BAD

The use of consistently reliable OCR confidence scores (i.e., erroneous data consistently has lower confidence scores than accurate data) to determine breakpoints is called establishing confidence thresholds, and it allows for true unattended automation of document processing. This ensures high accuracy and completely removes the need for manual verification of the majority of your data. Only data that falls below the identified confidence threshold (74 in this case) is likely to be inaccurate and must be manually reviewed. Due to differences in data fields, it is possible and realistic that some fields can use a low confidence threshold while others require a higher threshold; it all depends upon the analysis. Perhaps the date of birth field has a threshold of 74, but the social security field needs a threshold of 88.

In some cases, OCR software cannot produce sufficiently consistent field confidence scores to establish an ordered list of answers that allows selection of a single confidence score threshold (where answers above the threshold are mostly accurate). The picture above shows a case where there are too many incorrect answers with relatively high confidence scores, and vice versa for correct answers. When confidence scores are unreliable, an ordered list of answers based upon confidence scores produces many incorrect answers above, and many correct answers below, any threshold.

When this is the case, rather than having accurate data flow through from OCR to, say, the data warehouse without the need for verification, all data is forced to go through manual review. Even if most of the data is correct, the extra review is costly, and there is a higher probability that manual review will not identify all incorrect data due to human error.
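Once thresholds have been established per field, applying them is straightforward. The following sketch shows one way the routing decision could be expressed, using the two example thresholds mentioned above. The field names, the routing labels and the choice to send unconfigured fields to review are illustrative assumptions.

```python
# Per-field thresholds from the example above; the field names are hypothetical.
THRESHOLDS = {"date_of_birth": 74, "social_security": 88}


def route(field, score, thresholds=THRESHOLDS):
    """Decide whether one extracted field can go straight through.

    Fields with no configured threshold are sent to manual review, on the
    assumption that an unanalyzed field should not be trusted.
    """
    threshold = thresholds.get(field)
    if threshold is None or score < threshold:
        return "manual_review"
    return "straight_through"


def route_document(fields, thresholds=THRESHOLDS):
    """Route every field of a document.

    `fields` maps field name -> (extracted_value, confidence_score).
    """
    return {
        name: route(name, score, thresholds)
        for name, (_value, score) in fields.items()
    }


# The same score of 80 passes for one field and fails for the other.
decisions = route_document({
    "date_of_birth": ("1/1/1970", 80),
    "social_security": ("123-45-6789", 80),
})
```

The same score leads to different outcomes for the two fields, which is the point of setting thresholds field by field from analysis rather than using one global cutoff.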

INTELLIGENT CAPTURE REALIZED

Achieving true automation with intelligent capture involves a lot more than just evaluating features and configuring the system. To create a reliable configuration, you must employ data science to gather an appropriate sample set with which to configure, measure and optimize for straight through processing of document-based tasks. The good news is that Parascript has a reliable process to shepherd your organization through this journey, and much of it is automated using machine learning. This is based upon our decades of experience using these advanced technologies. The result is the highest levels of unattended automation with the lowest upfront investment.

CONTACT US TODAY
www.parascript.com
888.225.0169
[email protected]
PARASCRIPT © 2021 Parascript, LLC. All rights reserved.