ICFHR2018 Competition on Automated Text Recognition on a READ Dataset


Automated Text Recognition (ATR) has made huge progress within the last few years. Even for complex historical documents, character error rates (CER) below 10% have been achieved. In recent competitions (e.g. ICDAR2017 HTR), the training data was usually taken from the same document as the test data. In practical applications, however, ground truth for the specific document to be transcribed is typically not available: to train an ATR system that achieves reasonably low error rates, a certain number of pages (possibly up to several hundred) would have to be transcribed in sufficient quality. This is both expensive and time-consuming, since the human effort required to create ground truth is quite high.

On the other hand, many text corpora have already been transcribed and published. This raises the question of to what extent such (more or less public) datasets could be used to pre-train a rather universal ATR system, so that the amount of additional training data required for its subsequent document-specific use can be minimized. Moreover, the relation between the amount of available document-specific training data and the achievable CER is of obvious practical interest.

To encourage further research towards ATR systems that are robust with respect to different scripts, this competition aims to emulate such application scenarios by providing a new, rather heterogeneous dataset containing various documents from different writers, time periods and languages. The documents were taken from the EU Horizon 2020 project READ. Transcription accuracy will be evaluated on those various documents as test sets, but for training purposes as described above, only a few pages of the respective test set will be made available beforehand.
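
Since transcription accuracy is discussed here in terms of the character error rate, the following minimal sketch shows one common way to compute it: the Levenshtein (edit) distance between reference and recognized text, summed over all lines and divided by the total number of reference characters. This is an illustration under that assumption only; the exact normalization and text preprocessing used by the competition's evaluation are not specified in this text, and the function names `edit_distance` and `character_error_rate` are hypothetical.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1 each."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance between ref[:i] and the empty string
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r by h (or match)
            ))
        prev = curr
    return prev[-1]


def character_error_rate(references: Sequence[str],
                         hypotheses: Sequence[str]) -> float:
    """Total edit distance divided by total reference length (assumed definition)."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references must contain at least one character")
    total_errors = sum(edit_distance(r, h)
                       for r, h in zip(references, hypotheses))
    return total_errors / total_chars
```

Normalizing by the total reference length, rather than averaging per-line rates, keeps very short lines from dominating the score. Note that with this definition the value can exceed 1 when the recognized text is much longer than the reference.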


The dataset consists of heterogeneous documents (see Figure 1 for examples). Each document was written by a single writer, but the documents come from different time periods and are written in various languages.

Details about the origin of the documents can be found at the end of this page.

Data samples
Figure 1. Five sample line-images of the proposed dataset.

A. Datasets

Konzilsprotokolle A
Konzilsprotokolle B
Konzilsprotokolle C
  • Minutes of the Österreichische Akademie der Wissenschaften (Austrian Academy of Sciences), middle of the 19th century
  • Provided by the Österreichische Akademie der Wissenschaften
  • http://www.oeaw.ac.at
  • German recipe book from the beginning of the 20th century (Lotte Meyer/Laika)
  • Letters from Friedrich von Schiller
  • Letters from Johann Wolfgang von Goethe

  • Writings of Charles S. Peirce
  • Provided by Indiana University – Purdue University Indianapolis
  • Letters (epistolary) from Camilla Collett
  • Provided by the National Library of Norway
  • Writings of Zacharias Topelius
  • Provided by the Society of Swedish Literature in Finland
  • http://topelius.fi
  • Handwritten notes of an (anonymous) historian of the 19th century
  • Clean copy of Gottfried Semper, written by a secretary
  • Provided by the Institute for the History and Theory of Architecture at the Eidgenössische Technische Hochschule (ETH) Zürich

  • Translation of the Bible by the anonymous, so-called Österreichischer Bibelübersetzer
  • http://manuscripta.at
St. Gallen


January 22, 2018:
competition is open and training data available

Important Dates

January 22, 2018:
competition opens

January 22, 2018:
training data available

April 15, 2018:
test data available

May 1, 2018:
deadline for submitting results on the test data

August 5-8, 2018:
results announced at ICFHR 2018


Tobias Strauß

[University of Rostock, CITlab – Computational Intelligence Technology Lab] 

Gundram Leifert

[Computational Intelligence Technology Laboratory / CITlab, Rostock, Germany] 

READ Partner

Tobias Hodel

[Staatsarchiv des Kantons Zürich] 

Researcher in Digital Humanities and Digital History.