Overview
AUTHOR
Rolf INGOLD: Professor - Computer Science Department, University of Fribourg (Switzerland)
INTRODUCTION
Document image analysis and recognition is a scientific discipline that brings together a range of computer techniques aimed at reconstructing the content of a document from its image. Although it was long confined to character recognition, it now has much broader objectives, ranging from simple document classification to complete content interpretation, indexing and re-editing. The ultimate goal of document image recognition is thus to generate a high-level representation of the content as a structured document, in a form suited to the intended application.
By way of introduction, consider a page from a scientific book (figure 1a) that needs to be "hypertextualized", i.e. produced as an electronic version with hypertext links for navigation. Such an application requires determining the logical structure of the book, i.e. its hierarchical organization into chapters, sections and paragraphs, and identifying definitions, exercise statements, experiment descriptions, formulas, etc. Figure 1b shows this structure at page level, while figure 1c illustrates the resulting hierarchical structure. It is this hierarchy that hypertext navigation relies on.
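To make the idea of a logical structure more concrete, the following minimal sketch represents such a hierarchy as a tree and derives one navigation link per titled element. It is an illustration only, not the method described in the article: the class `LogicalNode`, the functions `walk`, `anchor` and `navigation_links`, the anchor naming scheme and the sample book are all assumptions introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

Path = Tuple[int, ...]


@dataclass
class LogicalNode:
    """One element of a document's logical structure (hypothetical model)."""
    kind: str                      # e.g. "book", "chapter", "section", "definition"
    title: str = ""                # empty for untitled elements such as paragraphs
    children: List["LogicalNode"] = field(default_factory=list)


def walk(node: LogicalNode, path: Path = ()) -> Iterator[Tuple[Path, LogicalNode]]:
    """Yield (path, node) pairs in reading order; path holds 1-based child indexes."""
    yield path, node
    for i, child in enumerate(node.children, start=1):
        yield from walk(child, path + (i,))


def anchor(path: Path) -> str:
    """Build a link target such as 'n-1-2' from a path (naming scheme is an assumption)."""
    return "n-" + "-".join(map(str, path)) if path else "n-root"


def navigation_links(root: LogicalNode) -> List[str]:
    """Return one indented 'kind: title -> #anchor' line per titled node."""
    return [
        f"{'  ' * len(path)}{node.kind}: {node.title} -> #{anchor(path)}"
        for path, node in walk(root)
        if node.title
    ]


# Invented sample content, standing in for the structure recovered from page images.
book = LogicalNode("book", "Sample textbook", [
    LogicalNode("chapter", "Chapter 1", [
        LogicalNode("section", "Section 1.1", [
            LogicalNode("paragraph"),
            LogicalNode("definition", "Definition 1"),
        ]),
        LogicalNode("exercise", "Exercise 1"),
    ]),
])

for line in navigation_links(book):
    print(line)
```

In this sketch the tree is written by hand; in a recognition system it would be the output of the analysis, and the same traversal could then produce a table of contents or cross-reference targets.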
Traditionally, document recognition has been applied primarily to paper documents for which no electronic form was available. Today, these techniques are also recognized as particularly useful for restructuring unstructured or poorly structured electronic documents, working from an image rendered synthetically, for example by a PostScript print engine.
From a historical point of view, it is interesting to note that optical character recognition predates the development of computer technology: patents were filed as early as the 19th century, and a demonstration prototype was reported in 1916. The first computerized approaches to character recognition date back to the early 1960s; for example, the first mail sorting machine (limited to typed addresses) was installed in the USA in 1965. However, the major developments date back to the advent of office automation in the 1980s.