From the Labs: Scanning Text

HP develops an approach to segment text and noise from annotated machine printed documents.

  • October 2009
  • By Vantika Dixit

Results: The Center for Unified Biometrics and Sensors at the State University of New York, Buffalo, USA, and HP Labs India, Bangalore, are working on an approach to segment handwritten text, machine-printed text, and noise from annotated machine-printed documents. Experimental results on an imbalanced data set show that the approach achieves an overall recall of 96.33 percent.

Why it matters: Retrieval of machine-printed documents can rely on high OCR accuracy, but retrieval of noisy annotated documents that contain both handwritten and machine-printed text remains a challenge, because document retrieval in the context of handwriting has not been widely explored. The Markov Random Fields (MRF) approach paves the way for preprocessing mixed documents, an important step in the design of systems for OCR, author identification, signature verification, and document retrieval.

Methods: Each binarized document is first segmented into patches, which are small snippets of the image. The original binarized image is dilated with a window of a specific size, and the bounding box of each connected component after dilation is defined as a patch. The window size is chosen empirically so that a patch typically represents a handwritten or machine-printed word. Three categories of features are then used to classify each patch into one of three classes: handwritten text, machine-printed text, or noise. Finally, the initial labels are refined using MRF: each document is modeled as a random field that consists of its patches, and the refinement is based on the concept of a system of neighbors and belief propagation.
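The patch extraction step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the researchers' implementation: the rectangular window, the 8-connectivity, the function names `dilate` and `extract_patches`, and the toy `page` are all hypothetical, since the article does not specify them.

```python
from collections import deque

def dilate(image, win_h, win_w):
    """Binary dilation with a win_h x win_w rectangular window (odd sizes assumed)."""
    rows, cols = len(image), len(image[0])
    rh, rw = win_h // 2, win_w // 2
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            if image[r][c]:
                # Switch on every pixel covered by the window centered here.
                for rr in range(max(0, r - rh), min(rows, r + rh + 1)):
                    for cc in range(max(0, c - rw), min(cols, c + rw + 1)):
                        out[rr][cc] = 1
    return out

def extract_patches(image, win_h=1, win_w=3):
    """Return inclusive bounding boxes (top, left, bottom, right) of the
    8-connected components of the dilated image, in scan order."""
    dilated = dilate(image, win_h, win_w)
    rows, cols = len(dilated), len(dilated[0])
    seen = [[False] * cols for _ in range(rows)]
    patches = []
    for r in range(rows):
        for c in range(cols):
            if not dilated[r][c] or seen[r][c]:
                continue
            # Breadth-first search over one component, tracking its extent.
            top, left, bottom, right = r, c, r, c
            queue = deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and dilated[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            patches.append((top, left, bottom, right))
    return patches

# Toy binarized "page": two groups of strokes separated by a wide gap.
page = [
    [1, 0, 1, 0, 0, 0, 0, 1, 0, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 1, 1],
]
patches = extract_patches(page, win_h=1, win_w=3)
print(patches)
```

With the 1 x 3 window the narrow gaps between strokes close up and each group becomes a single patch, while a 1 x 1 window (no dilation) leaves the first group split into separate strokes. That is the trade-off behind choosing the window size empirically.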

Next steps: Experiments show that the MRF method has better classification performance than a single classifier. Future work includes use of smaller patches and use of other classification techniques.
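To illustrate why refining labels over neighboring patches can change the outcome of a per-patch classifier, here is a minimal sketch of sum-product belief propagation on a small pairwise MRF. Everything specific in it is an assumption made for illustration: the article does not give the potentials, the neighborhood graph, or the belief propagation variant used, so the agreement weights `same` and `diff`, the function name `refine_labels`, and the example scores are hypothetical.

```python
LABELS = ("handwritten", "machine-printed", "noise")

def refine_labels(unary, edges, same=1.0, diff=0.5, iterations=10):
    """Refine per-patch class scores with loopy sum-product belief propagation.

    unary[i][x] is the classifier score of patch i for label x (assumed positive).
    edges lists pairs of neighboring patches. same/diff are assumed pairwise
    weights that favor neighbors sharing a label.
    """
    n, k = len(unary), len(LABELS)
    neighbors = {i: [] for i in range(n)}
    for a, b in edges:
        neighbors[a].append(b)
        neighbors[b].append(a)
    # messages[(i, j)][x]: what patch i tells neighbor j about label x.
    messages = {(i, j): [1.0 / k] * k for i in neighbors for j in neighbors[i]}
    for _ in range(iterations):
        updated = {}
        for (i, j) in messages:
            msg = []
            for xj in range(k):
                total = 0.0
                for xi in range(k):
                    # Combine i's own evidence with messages from its
                    # other neighbors (everyone except j).
                    incoming = 1.0
                    for h in neighbors[i]:
                        if h != j:
                            incoming *= messages[(h, i)][xi]
                    pair = same if xi == xj else diff
                    total += unary[i][xi] * pair * incoming
                msg.append(total)
            norm = sum(msg)
            updated[(i, j)] = [m / norm for m in msg]
        messages = updated
    refined = []
    for i in range(n):
        belief = list(unary[i])
        for h in neighbors[i]:
            for x in range(k):
                belief[x] *= messages[(h, i)][x]
        refined.append(LABELS[max(range(k), key=lambda x: belief[x])])
    return refined

# Three patches in a row; the middle one is ambiguous on its own.
unary = [
    [0.90, 0.05, 0.05],
    [0.30, 0.30, 0.40],
    [0.90, 0.05, 0.05],
]
edges = [(0, 1), (1, 2)]
initial = [LABELS[max(range(3), key=lambda x: scores[x])] for scores in unary]
refined = refine_labels(unary, edges)
print(initial)
print(refined)
```

In this toy case the middle patch is labeled noise when scored in isolation, but its two confidently handwritten neighbors pull it to handwritten after refinement. Stronger or weaker agreement weights would change how readily a patch follows its neighbors.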
