Information for Researchers No. 25 | 28 May 2014
Coordinated Funding Initiative for the Further Development of Optical Character Recognition Processes

Optical character recognition (OCR) processes allow machine-readable full text to be automatically generated from digitised images. The use of digital full texts has become indispensable in many academic disciplines, particularly in humanities research. The Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) is funding the digitisation of historically important texts, for example printed works in German dating from the 16th, 17th and 18th centuries (VD 16, VD 17, VD 18). The aim of the funding initiative to improve OCR technology is to establish standardisable processes and support the creation of reference corpora in order to optimise the production of full texts from these digitised images.

Current infrastructure challenges in automatic text recognition have less to do with the technical improvement of individual OCR engines and more to do with a lack of relevant training material for these engines (reference corpora and lexical resources), non-standardised workflows for full text generation, a lack of interoperability of processes and formats, and inadequate verifiability of the accuracy rates of OCR results. Reference corpora, as well as tools and processes, must be freely accessible, reusable and transparent in order to achieve long-term improvements in the academic usability of full texts.

The purpose of this two-stage announcement is to improve, and where necessary standardise, techniques for full text generation. The objective in the first phase is to create a coordination structure. Proposals for the coordination project should describe the thematic modules and task areas of a coordinated process. Project proposals may then be submitted on the basis of this process in an open announcement for phase two, the implementation of the individual modules. The coordination project proposal should also include a strategy for the interaction of the modules.

The DFG is inviting institutions with experience in this area to submit proposals for the coordination project.

The coordination proposal must include the following:

  • Conceptual design of individual thematic and task areas (modules)
  • Explanation of the objectives for the individual thematic and task areas (in particular, whether the intention is to use standard solutions or competing approaches to solve a problem)
  • Specification of the individual modules in terms of the constructive participation of third parties (usually commercial service providers)
  • Mechanisms for coordinating the different modules and projects
  • Composition of the coordinating body and if appropriate the advisory board, with due consideration of academic expertise
  • Allocation of tasks within the coordination consortium
  • Timetable for the sequence of modules

The conceptual design for the individual thematic and task areas should take into account the fact that the central materials include both digitised images from the VDs and printed works from the 19th century. The following fields and problems should be addressed, drawing on relevant experience and results from the national and international context:

  • Establishment and expansion of genre-, epoch-, language- and if appropriate type-specific corpora and lexical resources
  • Further development of open source OCR engines
  • Improvement of post-correction applications
  • Establishment of practical workflows for crowdsourcing, i.e. the integration of (academic) users, particularly in post-correction, enrichment and finishing of full texts
  • Standardisation of workflows, if necessary using special use cases; addressing gaps in the workflow and design of reusable processes
  • Further development of processes for text, image and structure recognition
  • Enabling the interoperability of data formats for the purposes of importing, exporting and storage
  • Processes for the persistent identification and long-term archiving of full texts
  • Methods for standardised, verifiable version control
  • Creation of transparency through calculation of accuracy rates/error rates; if appropriate, suggestions for adapting the DFG Practical Guidelines in this area
  • Further development or adaptation of visualisation tools such as the DFG Viewer
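
The accuracy and error rates mentioned in the list above can be made verifiable by comparing OCR output against a manually corrected reference text. The announcement does not prescribe a metric; the following is a minimal sketch that assumes a character error rate based on edit distance, a commonly used measure. The function names and the sample strings are illustrative assumptions, not part of any DFG guideline.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning reference into hypothesis."""
    # previous[j] holds the distance between the reference prefix processed
    # so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Accuracy expressed as 1 minus the character error rate."""
    return 1.0 - character_error_rate(reference, hypothesis)


if __name__ == "__main__":
    # Hypothetical example: an OCR engine confuses a long s with 'f'.
    ground_truth = "Forschung"
    ocr_output = "Forfchung"
    print(f"CER: {character_error_rate(ground_truth, ocr_output):.3f}")
    print(f"Accuracy: {character_accuracy(ground_truth, ocr_output):.3f}")
```

Because the error rate is normalised by the reference length, it can exceed 1 when the OCR output contains many spurious characters, so a rate computed this way should always be reported together with the reference corpus it was measured against.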

Tasks of the coordination consortium during the announcement phase for the modules (phase two) and the project phase of funded projects:

  • Overall supervision of the initiative
  • Advising the applicants on the individual module projects
  • Organising meetings and workshops for coordination within the initiative
  • Representing the initiative to external parties and DFG decision-making bodies
  • Collating and documenting results and producing recommendations for DFG decision-making bodies

All persons and institutions eligible to submit proposals within LIS funding programmes (Scientific Library Services and Information Systems) are eligible to submit proposals for the coordination project. Proposals should preferably be submitted by a consortium of relevant information infrastructure institutions working in close partnership. The number of participants in the consortium should be kept small. The research community should also be appropriately represented in the coordination body. The institutions involved in the successful coordination proposal are also eligible to submit proposals for the individual modules. Initial funding for the coordination project can be approved for up to three years. An extension is possible.

Proposals are subject to the terms described in Guidelines 12.13, "Scientific Information Management Tools and Methods". Proposals should be written in accordance with Guidelines 12.01, "Proposal Preparation Instructions – Project Proposals in the Area of Scientific Library Services and Information Systems".

Notices of intent announcing the intended submission of a proposal should be submitted by 1 September 2014. Proposals for the coordination project should be submitted by 1 November 2014.

Further Information

On 12-13 March 2014 a workshop on “Techniques for Improving OCR Results” was held in Bonn. This announcement is based on the results of this workshop:

Guidelines 12.13, "Scientific Information Management Tools and Methods", and Guidelines 12.01, "Proposal Preparation Instructions – Project Proposals in the Area of Scientific Library Services and Information Systems" can be found at:

DFG programme contacts: