Member of the Pattern Recognition and Human Language Technology Research Center of the Universitat Politècnica de València.
ICDAR2017 Competition on Handwritten Text Recognition on the READ Dataset (ICDAR2017 HTR)
This competition, proposed for ICDAR 2017, addresses a scenario that is common for some collections: transcripts exist at page level for many pages, but these transcripts are not aligned with the text lines in the document images. The problem is then to automatically align these transcripts with the corresponding line images so that an HTR system can subsequently be trained. In this scenario, it is feasible to manually annotate line images with accurate transcripts for only a few pages.
The proposed competition is as follows:
- Training. The idea is to simulate a typical situation in which a small amount of ground truth (GT) has been carefully prepared, and other existing training material is available with GT at page or region level but not at line level. Two batches of training data are supplied to the entrants:
- Train-A Dataset of pages with manually revised baselines and the corresponding transcripts associated to them. This batch is small, 50 pages.
- Train-B Dataset of pages without any layout or text line information. The corresponding transcripts are provided at page level with line breaks. It has 10k pages, though for convenience it is divided into two 5k page batches.
- Test Two tracks are defined: a traditional track with usual evaluation and an advanced track corresponding to a more realistic but more challenging scenario.
- Test-A Traditional track: a batch of page images annotated with baselines will be provided for evaluation. WER/CER will be used to measure performance and compare the systems. The information will be provided in PAGE format as in Train-A, but without transcripts. The participants have to submit PAGE files with the transcripts included in each line. A linear combination of WER and CER will be used for evaluation of this track.
- Test-B Advanced track: the pages will contain information about the geometry of the regions in which text lines are to be detected and recognized. This set consists of two parts:
- Test-B1 The same batch of images as in Test-A but with PAGE files that do not contain baseline information.
- Test-B2 A completely new batch of images.
The participants have to submit PAGE files for all of Test-B, including information about the detected lines and the corresponding recognition. The subdivision into Test-B1 and Test-B2 is of no concern to the participants: all of Test-B should be processed using exactly the same pipeline. Test-B1 is used for control; thus, it is not allowed to use the baselines from Test-A, or even the number of lines in each region, to aid the detection. The winner will be decided only on the results on batch Test-B2. The organizers will check and publish results on both Test-B1 and Test-B2. Inconsistent results between these two subsets will result in a request for explanation from the competitors and may be subject to disqualification.
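For the traditional track (Test-A), the evaluation is described above as a linear combination of WER and CER. The following is a minimal sketch of how such a score could be computed, not the organizers' evaluation tool. The weight `alpha`, the whitespace tokenization for words, and the function names are assumptions made for illustration; the announcement does not specify them.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference token
                curr[j - 1] + 1,         # insertion of a hypothesis token
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by the reference length."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate, assuming whitespace-separated words."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate, counting every character including spaces."""
    return error_rate(list(ref), list(hyp))


def combined_score(ref: str, hyp: str, alpha: float = 0.5) -> float:
    """Hypothetical linear combination; alpha is an assumed weight."""
    return alpha * wer(ref, hyp) + (1.0 - alpha) * cer(ref, hyp)
```

With the reference `"the cat sat"` and the hypothesis `"the cat sit"`, one of three words and one of eleven characters differ, so the two rates are 1/3 and 1/11 respectively.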
Evaluation will be performed with BLEU at region level (or some variant of BLEU that takes into account errors at character level), concatenating the lines provided by the participants. The reading order of the lines affects the performance, so the participants must take care to include the lines in the reading order detected by their system. The reading order in the PAGE files is given implicitly by the order of the XML TextLine elements within each TextRegion.
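To illustrate how the implicit reading order translates into a region-level transcript, here is a minimal sketch that reads a PAGE document and concatenates the text of the TextLine elements in document order within each TextRegion. It is an illustration only, not the official evaluation code: the function name `region_texts`, the choice of a single space as the line separator, and the fallback region identifiers are assumptions. Tags are matched by local name so that the sketch does not depend on a particular PAGE schema version.

```python
import xml.etree.ElementTree as ET
from typing import Dict


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree adds to tags."""
    return tag.rsplit("}", 1)[-1]


def line_text(line: ET.Element) -> str:
    """Text of a TextLine, taken from its own TextEquiv/Unicode child only."""
    for child in line:
        if local_name(child.tag) == "TextEquiv":
            for sub in child:
                if local_name(sub.tag) == "Unicode":
                    return sub.text or ""
    return ""


def region_texts(page_xml: str) -> Dict[str, str]:
    """Map each TextRegion id to its lines joined in XML order."""
    root = ET.fromstring(page_xml)
    regions: Dict[str, str] = {}
    for region in root.iter():
        if local_name(region.tag) != "TextRegion":
            continue
        # Direct children keep the order in which they appear in the file,
        # which is the reading order the evaluation relies on.
        lines = [line_text(child) for child in region
                 if local_name(child.tag) == "TextLine"]
        # Assumed fallback key when a region carries no id attribute.
        key = region.get("id", f"region{len(regions)}")
        regions[key] = " ".join(lines)  # separator is an assumption
    return regions
```

Because the lines are joined exactly as they appear in the file, swapping two TextLine elements in the submitted XML changes the region string and therefore any n-gram based score computed on it.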
The dataset considered for training in this competition is the Alfred Escher Letter Collection (AEC). For test, the documents will not be from the AEC, but will be letters from the same period as the AEC.
The test data is now available for both the traditional track and the advanced track.
Remember to add your e-mail address to the followers of this competition if you want to be continuously informed of news.
There is a remark regarding the data provided for this competition:
In this edition, the quality (and the resolution) of the images for some batches is not as good as in previous editions. For the preparation of this competition, we received the images that you have available, and the ground truth (GT) was prepared for these images by taking advantage of existing GT material (transcripts).
This issue may occur with both the training data and the test data. For the test data, please note that the images come from different collections, and therefore the image resolution may not be the same for all test images.
Regarding the resolution of the images, low-resolution images are very frequent in archives (thousands of images, according to the archives involved in READ). This is because many collections were scanned some time ago, and some of these collections are currently not being scanned again (documents not currently available, low budgets, different priorities, ...). So this is a real problem, affecting many collections residing in archives, that needs to be addressed.
We apologize for not providing this information in advance.
3/4/2017 The training data is now available
7/2/2017 ICDAR2017 Competition on Handwritten Text Recognition on announcement
3 April 2017: competition opens
3 April 2017: training data available
15 June 2017: registration deadline
30 June 2017: test data available
14 July 2017: deadline for submitting results on the test data