Automated text recognition, also known as optical character recognition (OCR), captures text from digital images automatically and thus generates searchable and analyzable data. The Mannheim University Library has many years of experience in digitization and in using various text recognition software.
The Research Data Center is happy to support researchers at the University of Mannheim throughout the entire workflow, from digitization through layout and text recognition to training specialized models and structuring the data.
| Tool | Cost model | Properties | Particularly suitable for |
| | Fee-based/commercial | Text and layout recognition; good layout analysis | Modern prints, complex layout |
| | Open Source | Graphical user interface for Kraken; intuitive use | Historical prints and manuscripts, including non-Latin script |
| | Fee-based/commercial | Text recognition; image and video analysis; for manuscripts and prints | Prints and manuscripts |
| | Open Source | Command line-based text recognition software; optimised for historical and non-Latin written material | Historical prints and manuscripts, including non-Latin script |
| | Open Source | Graphical user interface for various open source text recognition programmes | Historical prints and manuscripts |
| | Open Source | Command line programme for text recognition of PDF files; uses Tesseract as OCR engine | Historical/ |
| | Open Source | Modular, command line-based text recognition software | Historical prints |
| | Open Source | Web-based text recognition platform; good universal models; currently no follow-up training possible | Historical/ |
| | Open Source | Command line-based text recognition software; suitable for large data sets | Historical/ |
| | Fee-based/commercial | Comprehensive text recognition and transcription platform; with intuitive user interface | Historical manuscripts and tables |
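Several of the open source tools listed above are command line-based, which makes them easy to script for larger batches of scans. The following is a minimal sketch of that idea, not an official workflow of the library: it assumes that the Tesseract engine mentioned in the table is installed and available as `tesseract` on the PATH, and that the scans are PNG files in one folder. The function names `build_tesseract_command` and `recognise_folder` and the folder layout are hypothetical and chosen for illustration only.

```python
from __future__ import annotations

import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image: Path, output_base: Path, lang: str = "eng") -> list[str]:
    """Return the argument list for one Tesseract run (hypothetical helper).

    Tesseract appends the file extension itself, so output_base is given
    without a suffix. lang is a Tesseract language code such as "eng" or "deu".
    """
    return ["tesseract", str(image), str(output_base), "-l", lang]


def recognise_folder(folder: Path, lang: str = "eng") -> list[Path]:
    """Run Tesseract on every PNG in folder and return the expected text files.

    Assumption: the matching language model is installed alongside Tesseract.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")

    text_files = []
    for image in sorted(folder.glob("*.png")):
        # Strip only the final ".png" so that "page.1.png" becomes "page.1".
        output_base = image.with_suffix("")
        subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
        # Tesseract writes plain text to "<output_base>.txt" by default.
        text_files.append(output_base.parent / (output_base.name + ".txt"))
    return text_files
```

Calling `recognise_folder(Path("scans"), lang="deu")` would process a hypothetical `scans` folder with the German model. Historical prints usually need a specialized model, so the language code is the part most likely to need adjusting.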
Here you will find instructions and materials on various open source text recognition programmes and transcription platforms. It is a collection of useful references; not all resources were created by Mannheim University Library itself.
As part of the OCR-D project, transcription guidelines were drawn up that define three levels for transcribing historical documents. The levels differ in how faithfully they reproduce the original. The guidelines can be found on the OCR-D project homepage. You can also find a guideline for publishing your own training data on GitHub.
Here you will find ground truth data for training or retraining your own models:
A virtual keyboard with the required special characters can also be helpful when creating ground truth. You can also find virtual keyboards for different transcription platforms on GitHub.
If we can support you or if you have any questions, please do not hesitate to contact us.