Die Grenzboten, 28. Jahrgang, 2. Semester 1. Band, Leipzig 1869; SuUB, Bremen.
InVenod
In the last two decades many digitisation projects have been carried out. Whole books and entire collections of libraries have been transformed into a digital format. The means for effectively analysing these documents are still
limited. Methods for processing those documents for which established character recognition methods fail are required. In particular, this concerns documents with mixed languages, special and unknown symbols, non-standard character sets,
especially rare fonts, such as those found in ancient documents.
This projects develops a system for automated analysis, preparation and processing of documents without the employment of typical optical character recognition methods. Instead, the analysis and comparison of glyphs solely relies on visual features, enabling the processing of arbitrary symbols and
fonts. As a result, printed documents can be efficiently stored, transferred as well as reformatted for different output media.
Document reproduction
In the following, some of the results are illustrated for a number of documents, ranging over half a millennium, beginning with Gutenberg's Bible in 1454, and as the most recently example, a typewriter document from the twentieth century. The following images show the different appearances of those documents, each one having its individual characteristics:
The proposed approach analyses documents solely on the basis of the visual appearance, and thus, enables the processing of documents of any languages of different times, as far as the glyphs clearly separate as in Latin printed documents. This allows to arrive at a broad picture of the range of issues document image processing methods have to deal with in the diversity of books found in the last six centuries. Despite of those differences a number of main problems could been identified and analysed within books of different times.
Moreover, the proposed methodology, although still suffering from some problems, has been shown to be a useful strategy to manage documents in future in order to deal with huge amounts of image data. The compression of documents depends on their quality and the chosen visual features, which are employed in order to group glyphs of the same category into the same cluster. Additionally, the extracted glyphs can be transformed into the SVG format, and thus, allows the reproduction of documents on different devices which just need to be equipped with SVG rendering software. Eventually, the assignments of symbols to the resulting clusters even enables to arrive at searchable documents.
Publications
Jan-Hendrik Worch, Björn Gottfried (2014)
Choosing shape features by means of genetic algorithms
for glyph-clustering of historical documents.
International Journal of Computer Applications.
102(3):1-6, Foundation of Computer Science.
Björn Gottfried, Lothar Meyer-Lerbs (2011)
Towards the Processing of Historic Documents.
In: Raffaela Bernadi, Sally Chambers, Björn Gottfried,
Frederique Segont, Ilya Zaihrayeu
Advanced Language Technologies for Digital Libraries.
LNCS 6699, Springer, 15-28.
Jannis Stoppe, Björn Gottfried (2011)
Skeleton Comparisons - The Junction Neighbourhood Histogram.
In: Tompa, F. et al. (Eds.):
ACM Symposium on Document Engineering (DocEng 2011).
Mountain View, California, USA, September 19-22, 2011, ACM
Lothar Meyer-Lerbs, Arne Schuldt, Björn Gottfried (2010)
Glyph Extraction from Historic Document Images.
In: A. Antonacopoulos, M. J. Gormish and R. Ingold (Eds.):
ACM Symposium on Document Engineering (DocEng 2010).
Manchester, United Kingdom, September 21-24, 2010, ACM, pp. 227-230.
Funding
German Research Foundation (DFG)
GZ: GO 2023/3-2