ICFHR2018 Competition on Automated Text Recognition on a READ Dataset


Training and test data set

Note: The data may only be used for the competition.

The data is given as line images, an info-file (containing, among other things, a surrounding polygon) and the ground truth. Please do not use other data to improve the results.

In order to provide a common character set over such different documents, the standard Unicode Normalization Form Compatibility Decomposition – NFKD (see http://www.unicode.org/reports/tr15) is applied to the ground truth. This normalization decomposes Unicode characters wherever possible, e.g.: ü decomposes to u and a combining diaeresis (U+0308), ſ normalizes to s, æ remains æ.

Furthermore, we omit any nonspacing mark characters (combining diaeresis U+0308, combining ring above U+030A, combining cedilla U+0327, ...). Thus, we finally restrict the alphabet to 102 characters (instead of the 192 characters that appear in the original ground truth data). In the above example, the diaeresis (U+0308) will be omitted since it is a nonspacing mark. The other characters resulting from the decomposition of ü, ſ and æ are left untreated.
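For illustration, the following Python sketch reproduces this normalization with the standard unicodedata module; it is not the organizers' tooling, only a minimal example of NFKD followed by the removal of nonspacing marks.

import unicodedata

def normalize_ground_truth(text):
    # Decompose wherever possible, e.g. ü -> u + U+0308, ſ -> s, æ stays æ.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop all nonspacing marks (Unicode category "Mn"): combining diaeresis,
    # ring above, cedilla, ...
    return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

print(normalize_ground_truth("über"))  # -> "uber"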

The training data is divided into a general set (from 16 documents) and a set of pages (from 5 documents) written in the same scripts as the test set. The training set comprises roughly 25 pages per document (the precise number varies such that the number of contained characters is almost equal per document). The test set contains 5 documents with 15 pages each.

Task

The competition is offered as a single track. As already mentioned above, document-specific training data is available for each document in the test set. In order to analyze the impact of document-specific training data, participants shall submit 4 transcriptions per test set, based on 0, 1, 4 or 16 additional (specific) training pages, respectively. Here, 0 pages corresponds to a baseline system without additional training.

Info file

Besides the txt-file containing the ground truth text and the jpg-image file, there is an info-file containing some additional information:

  • imageName: name of the original image
  • text region id and line id
  • readingOrder: the index of the line according to the reading order of the specific text region (not always reliable)
  • shift between (0,0) in the original image and (0,0) of the line image
  • baseline of the text
  • Mask: a surrounding polygon whose closure contains the written text

As usual, (0,0) is in the upper left corner. The custom and trafo entries are a kind of header. conf can be ignored (if it appears). The most important piece of information is clearly the mask. As already mentioned, the mask was generated automatically and may not be totally accurate.
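A minimal Python sketch for reading such an info-file is given below. The on-disk layout is assumed here to be simple "key: value" lines; please check against the files shipped with the training data, since the exact format is not restated in this description.

def parse_info_file(path):
    # Collect all "key: value" entries of one info-file into a dict
    # (layout assumed, see the note above).
    info = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if ":" in line:
                key, value = line.split(":", 1)
                info[key.strip()] = value.strip()
    return info

# Hypothetical usage; the mask (surrounding polygon) is the most important field.
# info = parse_info_file("some_line_image.info")
# mask = info.get("Mask")
# reading_order = info.get("readingOrder")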

Result file format

The result file is a tar archive consisting of 20 text files, one file per document and number of document-specific training pages. The file names must be composed of the document id (which is the folder name, consisting of 5 digits) and the number of additional training pages: for Konzilsprotokolle B (id 30865) with 16 pages from the training document Konzilsprotokolle B, this will be 30865_16.txt.

The text file must contain the base file name of the line image and the transcribed text, separated by a SPACE, as shown below:

Example result file
30887_0220_1068678_region_1495467404389_1312_r1001 1828 Oct. 8
30887_0220_1068678_region_1495467404389_1312_r1002 Blackstone or Civil Code
30887_0220_1068678_region_1495467409019_1313_line_1495467702705_1315 Subject matters
30887_0220_1068678_region_1495467409019_1313_line_1495467704409_1316 III. Services

The participants may test the correctness of their file format by uploading a tar-file containing the text files 30865_0.txt, 30865_1.txt, 30865_4.txt and 30865_16.txt to the testtrack.

The tar-file should not contain directories, only the result files, which have to be UTF-8 encoded.
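The following Python sketch illustrates how such a flat, UTF-8 encoded tar archive could be produced; the transcriptions themselves and the file names in the usage example are placeholders.

import os
import tarfile

def write_results(transcriptions, out_tar="results.tar", tmp_dir="."):
    # transcriptions maps (document id, number of additional training pages)
    # to a list of (line image base name, transcribed text) pairs.
    paths = []
    for (doc_id, n_pages), lines in transcriptions.items():
        path = os.path.join(tmp_dir, f"{doc_id}_{n_pages}.txt")   # e.g. 30865_16.txt
        with open(path, "w", encoding="utf-8") as f:              # must be UTF-8
            for base_name, text in lines:
                f.write(f"{base_name} {text}\n")                  # base name SPACE text
        paths.append(path)
    with tarfile.open(out_tar, "w") as tar:
        for path in paths:
            tar.add(path, arcname=os.path.basename(path))         # keep the archive flat

# Hypothetical call with a single line of document 30865 and 0 additional pages:
# write_results({("30865", 0): [("30865_0001_..._r1001", "some transcription")]})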

Evaluation procedure

The quality of the results is measured in terms of the well-known character error rate (CER). To identify the winner of the competition, we calculate the overall CER across all test sets and all four transcriptions. However, to draw further conclusions about the underlying questions, we may also use the individual CERs in more advanced assessments.
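As a reference, the CER of a system output against the ground truth can be computed as in the short Python sketch below (Levenshtein edit distance divided by the number of reference characters); the organizers' exact evaluation tooling may aggregate differently.

def edit_distance(a, b):
    # Standard Levenshtein distance with a rolling row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def cer(hypotheses, references):
    # Total edit operations over all lines divided by total reference characters.
    edits = sum(edit_distance(h, r) for h, r in zip(hypotheses, references))
    return edits / sum(len(r) for r in references)

print(cer(["Blackstone on Civil Code"], ["Blackstone or Civil Code"]))  # 1/24 ≈ 0.042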

Subtrack 1.2 - testtrack

  • Training list Konzilsprotokolle B (you have to login and follow the competition first)
  • Test list Konzilsprotokolle B (you have to login and follow the competition first)
  • Submit a new method for evaluation (you have to login and follow the competition first)
  • View results for all available methods

News

May 16, 2018:
The competition remains open beyond the ICFHR deadline. Feel free to submit new methods.

April 15, 2018:
test data available

January 22, 2018:
competition is open and training data available

Important Dates

January 22, 2018:
competition opens

January 22, 2018:
training data available

April 15, 2018:
test data available

May 1, 2018:
deadline for submitting results on the test data

May 16, 2018:
provide a brief system description

August 5-8, 2018:
Results announced at ICFHR 2018

Organizers

Tobias Strauß

University of Rostock, CITlab – Computational Intelligence Technology Lab

Gundram Leifert

Computational Intelligence Technology Laboratory / CITlab, Rostock, Germany

READ Partner

Tobias Hodel

Staatsarchiv des Kantons Zürich

Researcher in Digital Humanities and Digital History.