Text Scanning: A Basic Helpsheet
Electronic Text Center
University of Virginia
Charlottesville, VA 22904
Optical Character Recognition: The Process
Optical Character Recognition (OCR) converts scanned images into text. It works well on most 20th-century and 19th-century typefaces. With earlier printed material, or with poor reproductions of any typeface, the OCR software begins to encounter time-consuming obstacles. Broken letters, ligatures, digraphs, uneven inking, and antiquated letterforms may be unrecognized by the software, and each unrecognized character adds time to the proofing and correction stage of your project.
Try a test scan before going ahead with any large amount of text. A little experimenting at first can result in a lower error rate (and therefore less to correct in proofreading). Your results should be good with most modern typefaces, but even with clean text of a decent type size there will be occasional errors, and the error rate increases as the text's size and clarity decrease. Altering the brightness and resolution can improve results, but little can be done with a badly faded photocopy or a 17th- or 18th-century typeface.
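One way to put a number on a test scan is to proofread a short passage by hand and compare it with the raw OCR output. The sketch below is only an illustration, not part of the Etext Center's workflow: the function names are hypothetical, and it assumes you have both versions of the passage as plain strings. It computes a character error rate as the edit distance (insertions, deletions, and substitutions) divided by the length of the corrected text.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,                      # delete ref_char
                current[j - 1] + 1,                   # insert hyp_char
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the proofread reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    proofread = "The old clock"    # hand-corrected passage (illustrative)
    ocr_output = "The olcl clock"  # raw OCR output with d misread as cl
    print(f"{character_error_rate(proofread, ocr_output):.3f}")
```

Running the same passage through the scanner at a few brightness and resolution settings and comparing the resulting rates gives a rough basis for choosing settings before committing to a large job.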
Anything that disrupts the integrity of a letter's shape can cause an error, although the software has some ability to compensate. Breaks in letters (and sometimes ornate italics) can cause what you will come to recognize as distinctive OCR errors -- a d getting read as cl, a 1 or ! as l, an m as in, or an e as c.
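Because these misreadings follow recognizable patterns, they can be used to suggest corrections during proofreading. The following is a minimal sketch under stated assumptions: the confusion pairs are the ones listed above, while the function name and the small word list are hypothetical stand-ins for whatever dictionary you have at hand. It tries undoing one confusion at a time and keeps only results found in the word list.

```python
# (misread, intended) pairs taken from the examples above:
# d read as cl, 1 or ! read as l, m read as in, e read as c.
CONFUSIONS = [("cl", "d"), ("l", "1"), ("l", "!"), ("in", "m"), ("c", "e")]


def candidate_corrections(token, vocabulary):
    """Return known words reachable from token by undoing one OCR confusion.

    vocabulary is an assumed set of accepted words; it is not supplied
    by any OCR package and must be provided by the caller.
    """
    candidates = set()
    for misread, intended in CONFUSIONS:
        start = token.find(misread)
        while start != -1:
            # Replace this single occurrence of the misread sequence.
            candidate = token[:start] + intended + token[start + len(misread):]
            if candidate in vocabulary:
                candidates.add(candidate)
            start = token.find(misread, start + 1)
    return sorted(candidates)


if __name__ == "__main__":
    words = {"old", "time", "the", "clock"}  # illustrative word list
    for suspect in ["olcl", "tiine", "thc"]:
        print(suspect, candidate_corrections(suspect, words))
```

The sketch is meant for tokens that are not already valid words; applied to a correct word such as "clock" it could wrongly propose "dock" if that word is in the list. It also undoes only one confusion per token, so a word with two misread letters would need a second pass or manual correction.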
Optical Character Recognition: Some Sample Scans
If you are new to OCR scanning, you might want to look at sample scans, which consist of digital images of pages and the corresponding results of the OCR process.
Optical Character Recognition: The Etext Center Equipment
The Electronic Text Center currently has three Epson and two Fujitsu scanners connected to Pentium PCs, using Epson Scan or ScandAll software for graphics scanning and OmniPage Pro or ABBYY FineReader for text scanning (optical character recognition, or OCR). PDF creation is also available using Adobe Acrobat. In the Electronic Text Center, the icons for these applications are located on the desktop. Additionally, the two Fujitsu scanners have automatic document feeder attachments for processing large amounts of text.