Data Science with Intelligent Capture (14/16)

DATA SCIENCE INTELLIGENT CAPTURE SEPARATINGGOOD DATA FROMBAD ThisabilityforOCRto consistentlyoutputreliableconfidencescores(i.e.,erroneousdata consistentlyhas lowerconfidencescoresthanaccuratedata)todeterminebreakpointsiscalled establishingconfidencethresholds and allows for true unattended automation of document processing. This ensures high accuracy and completelyremoving the need for manual verification of the majority of your data. Only data that falls below the identifiedconfidence threshold(74inthiscase)isprobablyinaccurateandmustbemanuallyreviewed.Due to differences in data fields, it is possible and realistic that some fields can use a low confidence thresholdwhileothersrequireahigherthreshold;italldependsupontheanalysis.Perhaps, the date of birth field has a threshold of 74, but the social security field needs a threshold of88. In some cases, OCR software cannot produce sufficiently consistent field confidence scores to establish an ordered list of answers that allow selection of a single confidence score threshold (whereanswers above the threshold are mostly accurate). The picture above shows a case where there are too many incorrect scores with relatively higher confidence scores and vice versa for correct scores. When confidence scores are unreliable, an ordered list of answers based upon confidence scores producesmany incorrect answers above and correct answers below any threshold. When this is the case, rather than having accurate data to flow through from OCR to, say, the data warehouse without the need for verification, all data is forced to go through manual review. Even if most of the data is correct, the extra review is costly, and there is a higher probability that manual review will not identify all incorrect data due to human error. 14 PARASCRIPT

Data Science with Intelligent Capture - Page 14

Data Science with Intelligent Capture Page 13 Page 15