ICFHR2018 Competition on Automated Text Recognition on a READ Dataset


Motivation

Automated Text Recognition (ATR) has made huge progress within the last few years. Even for complex historical documents, character error rates (CER) below 10% can be achieved. In corresponding competitions (e.g. the recent ICDAR2017 HTR), the training data was usually taken from the same document as the test data.

In practical applications, by contrast, there is typically no ground truth available for the specific document to be transcribed. Consequently, in order to train an ATR system that produces reasonably low error rates, a certain number of pages (possibly up to several hundred) would have to be transcribed in good quality. Because creating ground truth requires substantial human effort, this is both expensive and time-consuming.

On the other hand, many text corpora have already been transcribed and published. This raises the question of to what extent such (more or less public) datasets could be used to pre-train a rather universal ATR system, so as to minimize the amount of additional training data needed to adapt it properly to a specific document. Moreover, the dependency between the amount of available document-specific training data and the resulting CER gain is of obvious practical interest.

In order to encourage further research towards robust ATR systems, i.e. systems that can properly deal with distinct scripts, this competition aims to emulate such application scenarios by providing a new, rather heterogeneous dataset containing various documents from different writers, time periods and languages. The documents were taken from the transcription platform Transkribus, which is currently under further development in the EU Horizon-2020 project READ.

The transcription accuracy of submissions will be evaluated on each of those documents as an entire test set. For the document-specific adjustment training described above, however, only a few pages of the respective document will be made available beforehand.
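As an illustration of the CER metric mentioned above, the following is a minimal sketch that assumes the common definition: the total character-level edit distance between reference and recognized lines, divided by the total number of reference characters. The function names `levenshtein` and `cer` are illustrative only, and this is not necessarily the exact evaluation procedure used in the competition.

```python
from typing import Sequence


def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions and
    substitutions needed to turn `ref` into `hyp`."""
    prev = list(range(len(hyp) + 1))  # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to the empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Character error rate over a set of text lines (assumed definition:
    summed edit distance divided by summed reference length)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have equal length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(levenshtein(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars
```

Under this definition, a CER below 10% means fewer than one character edit per ten reference characters, accumulated over all lines of a test set.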

Data

Every dataset consists of heterogeneous documents (see Figure 1 below for examples and Appendix A for further details). Each document was written by a single writer, but the documents stem from different time periods and languages.

Data samples
Figure 1. Five sample line-images.

A. Dataset Reference

Konzilsprotokolle A
Konzilsprotokolle B
Konzilsprotokolle C
Barlach
OEAW
  • Minutes of the Österreichische Akademie der Wissenschaften (Austrian Academy of Sciences), middle of the 19th century
  • Provided by the Österreichische Akademie der Wissenschaften
  • http://www.oeaw.ac.at
Kochbuch
  • German recipe book from the beginning of the 20th century (Lotte Meyer/Laika)
Schiller
  • Letters from Friedrich von Schiller
Goethe
  • Letters from Johann Wolfgang von Goethe
Peirce
  • Writings of Charles S. Peirce
  • Provided by Indiana University – Purdue University Indianapolis
Ibsen
  • Letters (epistolary) from Camilla Collett
  • Provided by the National Library of Norway
Bentham
McGahern
Christensson
Ricordi
Topelius
  • Writings of Zacharias Topelius
  • Provided by the Society of Swedish Literature in Finland
  • http://topelius.fi
Munch
Janauschek
  • Handwritten notes of an (anonymous) historian of the 19th century
Semper
  • Clean copy of Gottfried Semper, written by a secretary
  • Provided by the Institute for the History and Theory of Architecture at the Eidgenössische Technische Hochschule (ETH) Zürich
Patzig
Bibeluebersetzer
  • Translation of the Bible by the anonymous so-called Österreichischer Bibelübersetzer
  • http://manuscripta.at
St. Gallen
Schwerin

News

May 16, 2018:
The competition remains open beyond the ICFHR deadline. Feel free to submit new methods.

April 15, 2018:
test data available

January 22, 2018:
competition is open and training data available

Important Dates

January 22, 2018:
competition opens

January 22, 2018:
training data available

April 15, 2018:
test data available

May 1, 2018:
deadline for submitting results on the test data

May 16, 2018:
provide a brief system description

August 5-8, 2018:
Results announced at ICFHR 2018

Organizers

Tobias Strauß

[University of Rostock, CITlab – Computational Intelligence Technology Lab] 

Gundram Leifert

[Computational Intelligence Technology Laboratory / CITlab, Rostock, Germany] 

READ Partner

Tobias Hodel

[Staatsarchiv des Kantons Zürich] 

Researcher in Digital Humanities and Digital History.

Roger Labahn

[University of Rostock, CITlab – Computational Intelligence Technology Lab]