[IMAGE: Setting Up A Course Website]


Optical Character Recognition (OCR)


What is OCR?

Optical Character Recognition (OCR) is a process of scanning printed pages as images on a flatbed scanner and then using OCR software to recognize the letters as ASCII text. The OCR software has tools for both acquiring the image from a scanner and recognizing the text.

Ideal Source Material for OCR

OCR works best with originals or very clear copies and with mono-spaced fonts like Courier. If you have a choice, use source material with the following characteristics:

  • Font size of 12 points or larger.
  • Black text on a white background.
  • A clean copy, not a fuzzy multi-generation copy from a copy machine.
  • A standard typeface (Times New Roman, etc.). Fancy fonts may not be recognized.
  • Single column layout.
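
The criteria above can be treated as a pre-scan checklist. The following is only a minimal sketch: the `SourcePage` fields and the `ocr_warnings` function are hypothetical names invented for illustration, and the values are entered by hand after looking at the page, not detected by any OCR software.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SourcePage:
    """Hypothetical hand-entered description of a page to be scanned."""
    font_size_pt: float
    black_on_white: bool
    clean_copy: bool
    standard_font: bool
    single_column: bool


def ocr_warnings(page: SourcePage) -> List[str]:
    """Return the ways in which a page falls short of the ideal source material."""
    warnings = []
    if page.font_size_pt < 12:
        warnings.append("font smaller than 12 points")
    if not page.black_on_white:
        warnings.append("not black text on a white background")
    if not page.clean_copy:
        warnings.append("fuzzy or multi-generation copy")
    if not page.standard_font:
        warnings.append("fancy or unusual font")
    if not page.single_column:
        warnings.append("multi-column layout")
    return warnings


# Example: a 10 point photocopy of a photocopy.
page = SourcePage(font_size_pt=10, black_on_white=True, clean_copy=False,
                  standard_font=True, single_column=True)
for warning in ocr_warnings(page):
    print("Expect more OCR errors:", warning)
```

An empty list means the page meets every criterion listed; it does not guarantee error-free recognition.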
OCR Limitations
  • Using text from a source with a font size smaller than 12 points or from a fuzzy copy will result in more errors.
  • Except for tab stops and paragraph marks, most document formatting is lost during text scanning (bold, italic, and underline are sometimes recognized).
  • The output from a finished text scan will be a single-column editable text file. This text file will always require spellchecking and proofreading, as well as reformatting to the desired final layout.
  • Scanning plain text files or printouts from a spreadsheet usually works, but the text must be imported into a spreadsheet and reformatted to match the original.
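
Because tab stops usually survive the scan, a scanned spreadsheet printout can often be turned into a file that a spreadsheet program will import. The sketch below assumes the OCR software saved each tab stop as a literal tab character, which may not hold for every package; the function name `tabbed_text_to_csv` and the sample data are made up for illustration. The result still needs proofreading and reformatting to match the original.

```python
import csv
import io


def tabbed_text_to_csv(ocr_text: str) -> str:
    """Convert tab-separated OCR output into CSV text for spreadsheet import."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    for line in ocr_text.splitlines():
        if not line.strip():
            continue  # skip blank lines left between rows
        # One cell per tab stop; trim stray spaces around each cell.
        cells = [cell.strip() for cell in line.split("\t")]
        writer.writerow(cells)
    return out.getvalue()


# Made-up sample of what a scanned spreadsheet printout might look like.
sample = "Item\tQty\tPrice\nPaper\t10\t4.50\n\nToner\t2\t39.00\n"
print(tabbed_text_to_csv(sample))
```

Cells that contain commas are quoted automatically by the `csv` module, so they stay in a single column when imported.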
What Source Material Doesn't Work Well for OCR?
  • Forms (especially with boxes and check boxes)
  • Very small text
  • Multi-generation fuzzy or blurry copies from a copy machine
  • Mathematical formulas
  • Draft copies of documents with hand-written revisions
  • Fancy text and unusual fonts
  • Handwritten text


Maintained by: Gene Gatlin (gatlinet@plu.edu)
Last Update: 09/23/97