ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset


The "ICFHR2016 Competition on Handwritten Text Recognition on the READ Dataset" was organized within the framework of the ICFHR 2016 competitions by the Pattern Recognition and Human Language Technologies research centre, in collaboration with the READ partners. The competition has remained open on the scriptnet platform since the results were announced at ICFHR. The contest aims to bring together researchers working on off-line Handwritten Text Recognition (HTR) and to provide them with a suitable benchmark for comparing their techniques on the task of transcribing typical historical handwritten documents. Previous editions of this contest were organized at ICFHR 2014 (Sánchez, 2014) and at ICDAR 2015 (Sánchez, 2015).

The proposed dataset is a subset of documents from the Ratsprotokolle collection, which is composed of minutes of council meetings held from 1470 to 1805 (about 30,000 pages) and will be used in the READ project. The dataset is written in Early Modern German. The number of writers is unknown. The handwriting in this collection is complex enough to challenge HTR software.

The dataset for this competition is composed of 450 pages; most pages consist of a single text block that poses many difficulties for line detection and extraction. The dataset is divided into two batches for the competition: one for training and one for testing.

The first batch is composed of 400 pages. The ground truth for this set is provided in PAGE format (Pletschacher, 2010), annotated at line level.

The second batch is a test set of 50 pages that will be kept hidden and released in due time, in order to obtain the results to be evaluated and compared.

Description and goals

The systems entering this contest should try to obtain the most accurate recognition results in the test partition.

The available data for the first batch will consist of:

-The original images of all the training pages
-The PAGE file corresponding to each page image. For each text line in this image, the PAGE file contains a baseline and an automatically obtained bounding polygon (Romero, 2015), and the corresponding diplomatic transcript. All baselines have been checked and corrected manually.
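As an illustration of how the per-line annotations described above might be read, here is a minimal Python sketch. It assumes each TextLine element carries a Coords and a Baseline child with a `points` attribute ("x,y x,y ...") and a TextEquiv/Unicode child holding the transcript; the actual competition files may follow a different PAGE schema version, and the embedded sample (including its coordinates and text) is made up for this example. The function name `read_lines` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Made-up sample in the assumed layout; not taken from the competition data.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="1000" imageHeight="1500">
    <TextRegion id="r1">
      <TextLine id="r1_l1">
        <Coords points="10,20 200,20 200,60 10,60"/>
        <Baseline points="10,50 200,52"/>
        <TextEquiv><Unicode>example transcript</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
""".strip()


def local(tag):
    """Return the tag name without its XML namespace prefix."""
    return tag.rsplit("}", 1)[-1]


def parse_points(text):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in text.split()]


def read_lines(xml_text):
    """Collect id, bounding polygon, baseline and transcript of each text line."""
    root = ET.fromstring(xml_text)
    lines = []
    for elem in root.iter():
        if local(elem.tag) != "TextLine":
            continue
        record = {"id": elem.get("id"), "polygon": [], "baseline": [], "text": ""}
        for child in elem:
            name = local(child.tag)
            if name == "Coords":
                record["polygon"] = parse_points(child.get("points", ""))
            elif name == "Baseline":
                record["baseline"] = parse_points(child.get("points", ""))
            elif name == "TextEquiv":
                for sub in child:
                    if local(sub.tag) == "Unicode":
                        record["text"] = sub.text or ""
        lines.append(record)
    return lines


for line in read_lines(SAMPLE):
    print(line["id"], line["baseline"], line["text"])
```

Matching on the local tag name rather than a fixed namespace is a deliberate choice here, so the sketch does not depend on which PAGE schema release a file declares.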

The test images, with the transcript fields empty, will eventually be provided in the same format as the first batch for evaluation purposes.

Several submissions per participant will be allowed, and all the results will be considered when presenting the competition results. The tokenization of the transcripts in each submission must follow that of the training data as closely as possible. With each submission, the participant must provide a brief description of the main characteristics of the submitted system. The final goal is to analyze the different proposals of the participants.

Evaluation modalities

The evaluation will be performed on the transcription results provided by each recognition system. The evaluation metric is a linear combination of the Word Error Rate (WER) and the Character Error Rate (CER), weighted 50% each, computed between the reference transcript and the transcript provided by the system for each line. The winner is the system that obtains the lowest value of this linear combination on the test set.
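The combined metric can be sketched in Python as follows. This is not the official scoring tool: it assumes that words are obtained by splitting on whitespace, that WER and CER are each computed as total edit distance over total reference length across all lines, and that no further text normalization is applied. The function names and the example strings are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Total edit distance divided by total reference length over all lines."""
    errors = 0
    length = 0
    for ref, hyp in zip(refs, hyps):
        ref_tokens, hyp_tokens = tokenize(ref), tokenize(hyp)
        errors += edit_distance(ref_tokens, hyp_tokens)
        length += len(ref_tokens)
    if length == 0:
        raise ValueError("reference transcripts are empty")
    return errors / length


def combined_score(refs, hyps):
    """Return (0.5 * WER + 0.5 * CER, WER, CER) for paired line transcripts."""
    wer = error_rate(refs, hyps, str.split)  # assumed whitespace tokenization
    cer = error_rate(refs, hyps, list)       # one token per character
    return 0.5 * wer + 0.5 * cer, wer, cer


# Made-up example lines, one reference and one hypothesis per text line.
references = ["the cat sat"]
hypotheses = ["the cat sit"]
score, wer, cer = combined_score(references, hypotheses)
print(f"WER={wer:.4f} CER={cer:.4f} combined={score:.4f}")
```

In this toy case one of three words and one of eleven characters differ, so WER is 1/3, CER is 1/11, and the combined value is their average.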

Two tracks are planned in this competition:

-Restricted track: in this track the participants can use only the data provided by the organizers for training and tuning their systems
-Unrestricted track: in this track the participants can use any data of their choice

News

The competition is open in scriptnet

Important Dates

1 March 2016 Competition opens, start of registration period, training data available, baseline system available.

31 May 2016 Registration deadline (no more participants will be admitted after this date).

12 June 2016 Test data available.

24 June 2016 Deadline for systems results.

26 June 2016 Deadline for sending short description of the submitted systems.

Oct 23-26, 2016 Winners and final ranking of all teams will be made public at the ICFHR 2016 conference.

Dec 2016 The competition is open on the scriptnet platform.

Organizers

Verónica Romero

[Universitat Politècnica de València] 

Member of the Pattern Recognition and Human Language Technology Research Center of the Universitat Politècnica de València.

Enrique Vidal

[Universitat Politècnica de València] 

PhD in Physics from the Universitat de València (Spain), 1985. Full professor of Computer Science in the Universitat Politècnica de València. Member of the IEEE and a fellow of the IAPR.

Joan Andreu Sánchez

[Universitat Politècnica de València] 

Joan Andreu Sánchez is a professor at the Universitat Politècnica de València and a researcher in the Pattern Recognition and Human Language Technologies (PRHLT) research center.