Data Science with Intelligent Capture (12/16)

WHY CONFIDENCE SCORES ARE IMPORTANT

In using a field-level confidence score, the main objective is to identify a threshold that separates good data from bad data. Good data is a correct answer, meaning an accurate, literal transcription of the field as represented on the page. If the input document has a date of birth of 1/1/1970, the field into which the data is transcribed should contain 1/1/1970 as well.

A confidence score is assigned and output by the OCR engine for each field answer. The field-level confidence score takes the raw OCR character- and word-level scores and synthesizes them with other available information to arrive at a final score. This other information can be, for example, the expected data type (such as numerals or letters) and format (such as phone number versus credit card number). For instance, if the answer to a phone number field comes with a confidence score for each digit, the field-level confidence score combines those individual scores with other information about the field, such as the expected length of the number (in this case 10 digits) and potentially its formatting, to produce a single confidence score for the phone number.

When evaluating a field-level confidence score, the OCR engine might, for instance, output the date of birth (DOB) value as 12/5/2008 along with a confidence score of 60. The field confidence scoring for each data element should output a consistent range of scores for correct answers. These scores should be higher than the scores for incorrect answers, so that if you evaluated the results for 100 DOB fields and sorted them by confidence score, the correct answers would, on average, have higher confidence scores than the incorrect answers.

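
The phone number example can be made concrete with a small sketch. This is not how any particular OCR engine computes its field-level score; the function name, the blending rule (average mixed with the weakest character), and the penalty factor are all assumptions chosen only to illustrate combining character-level scores with an expected length and format.

```python
import re
from statistics import mean

def field_confidence(text, char_scores, pattern, penalty=0.5):
    """Hypothetical field-level score: character scores plus a format check."""
    if len(text) != len(char_scores):
        raise ValueError("one score per character is required")
    if not char_scores:
        return 0.0
    # Blend the average with the weakest character so that a single
    # doubtful digit pulls the whole field score down (assumed rule).
    base = 0.5 * mean(char_scores) + 0.5 * min(char_scores)
    # Scale the score down when the value does not match the expected
    # format, here "exactly 10 digits" for a phone number (assumed penalty).
    if re.fullmatch(pattern, text) is None:
        base *= penalty
    return round(base, 1)

PHONE_PATTERN = r"\d{10}"  # assumed expected format for the field

full_number = field_confidence("5551234567", [90] * 10, PHONE_PATTERN)
short_number = field_confidence("555123456", [90] * 9, PHONE_PATTERN)
print(full_number, short_number)
```

Both answers have identical character-level scores, but the nine-digit answer receives a lower field-level score because it fails the expected-length check. That is the kind of extra information a field-level score can draw on.
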
Although confidence scores are used to distinguish likely correct answers from likely incorrect answers, confidence scores are not probabilistic -- a score of 60 does not mean that there is a 60 percent likelihood that the answer is correct. In reality, no OCR engine can produce a perfect correlation between a confidence score and whether or not the answer is correct. There will be instances where a correct answer has a low confidence score. Regardless, with tuned systems, the results should indicate an obvious score threshold where the majority of answers above it are correct and the majority of answers below it are incorrect.
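
One way to look for such a threshold is to take a sample of fields whose answers have been checked by hand and test each observed score as a candidate cut-off. The sketch below assumes that setup; the sample data is invented for illustration, and the selection rule (maximize the number of answers that fall on the expected side of the threshold) is one simple choice among several.

```python
def best_threshold(samples):
    """samples: list of (score, is_correct) pairs from a hand-checked set.

    Returns (threshold, agreement), where agreement counts correct answers
    scoring at or above the threshold plus incorrect answers scoring below it.
    """
    best = (None, -1)
    for candidate in sorted({score for score, _ in samples}):
        agreement = sum(
            (score >= candidate) == is_correct for score, is_correct in samples
        )
        if agreement > best[1]:  # on a tie, keep the lower candidate
            best = (candidate, agreement)
    return best

# Invented sample: note the correct answer with a low score of 38.
samples = [
    (95, True), (90, True), (82, True), (75, True),
    (70, False), (50, False), (45, False), (38, True), (20, False),
]

threshold, agreement = best_threshold(samples)
print(threshold, agreement, len(samples))
```

For this sample the chosen threshold is 75, with 8 of the 9 answers on the expected side. The one exception is the correct answer that scored 38, which is the imperfect correlation described above.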
