dfg_logo


Die Grenzboten

Die Grenzboten, 28. Jahrgang, 2. Semester 1. Band, Leipzig 1869; SuUB, Bremen.

InVenod

In the last two decades many digitisation projects have been carried out. Whole books and entire collections of libraries have been transformed into a digital format. The means for effectively analysing these documents are still limited. Methods for processing those documents for which established character recognition methods fail are required. In particular, this concerns documents with mixed languages, special and unknown symbols, non-standard character sets, especially rare fonts, such as those found in ancient documents.

This projects develops a system for automated analysis, preparation and processing of documents without the employment of typical optical character recognition methods. Instead, the analysis and comparison of glyphs solely relies on visual features, enabling the processing of arbitrary symbols and fonts. As a result, printed documents can be efficiently stored, transferred as well as reformatted for different output media.







Document reproduction

In the following, some of the results are illustrated for a number of documents, ranging over half a millennium, beginning with Gutenberg's Bible in 1454, and as the most recently example, a typewriter document from the twentieth century. The following images show the different appearances of those documents, each one having its individual characteristics:

Ueberblick Aller Beispiele



Some of the most frequent errors that occur during the reproduction process are shown in the following figure. The reproduced glyphs are printed in red on the original document page in order to illustrate the reproduction errors.

Example Defects


Analysing the broad spectrum of documents beginning with the invention of the printing press until today, it shows that there exist no two documents, nor a fortiori any books, which are based on the same layout structure and which suffer from the very same defects. The diversity of books is rather comparable to the diversity of fingerprints: From the distance they may look quite similar, but the differences are substantial when extracting each single glyph for the purpose of reproduction.

The proposed approach analyses documents solely on the basis of the visual appearance, and thus, enables the processing of documents of any languages of different times, as far as the glyphs clearly separate as in Latin printed documents. This allows to arrive at a broad picture of the range of issues document image processing methods have to deal with in the diversity of books found in the last six centuries. Despite of those differences a number of main problems could been identified and analysed within books of different times.

Moreover, the proposed methodology, although still suffering from some problems, has been shown to be a useful strategy to manage documents in future in order to deal with huge amounts of image data. The compression of documents depends on their quality and the chosen visual features, which are employed in order to group glyphs of the same category into the same cluster. Additionally, the extracted glyphs can be transformed into the SVG format, and thus, allows the reproduction of documents on different devices which just need to be equipped with SVG rendering software. Eventually, the assignments of symbols to the resulting clusters even enables to arrive at searchable documents.



Publications

Jan-Hendrik Worch, Björn Gottfried (2014)
Choosing shape features by means of genetic algorithms
for glyph-clustering of historical documents.

International Journal of Computer Applications.
102(3):1-6, Foundation of Computer Science.

Björn Gottfried, Lothar Meyer-Lerbs (2011)
Towards the Processing of Historic Documents.
In: Raffaela Bernadi, Sally Chambers, Björn Gottfried,
Frederique Segont, Ilya Zaihrayeu
Advanced Language Technologies for Digital Libraries.
LNCS 6699, Springer, 15-28.

Jannis Stoppe, Björn Gottfried (2011)
Skeleton Comparisons - The Junction Neighbourhood Histogram.
In: Tompa, F. et al. (Eds.):
ACM Symposium on Document Engineering (DocEng 2011).
Mountain View, California, USA, September 19-22, 2011, ACM

Lothar Meyer-Lerbs, Arne Schuldt, Björn Gottfried (2010)
Glyph Extraction from Historic Document Images.
In: A. Antonacopoulos, M. J. Gormish and R. Ingold (Eds.):
ACM Symposium on Document Engineering (DocEng 2010).
Manchester, United Kingdom, September 21-24, 2010, ACM, pp. 227-230.



Funding

German Research Foundation (DFG)
GZ: GO 2023/3-2

© September 22nd, 2014 - Björn Gottfried Universität Bremen