Automated Text Recognition – Extracting Data via OCR/HTR
Automated or optical text recognition (OCR) is used to automatically capture text from digital images and thus generate searchable and analyzable data. The Mannheim University Library has many years of experience in digitization and in using various text recognition software.
The Research Data Center is happy to support researchers at the University of Mannheim along the entire workflow, from digitization to layout and text recognition, as well as in training specialized models and structuring the data.
Services
- Consulting on automated text recognition (OCR) for research projects
- OCR Recommender
- Open OCR consultation hour: every 2nd Thursday of the month, from 3 to 4 p.m., no registration required (link to Zoom meeting: ocr-bw.bib.uni-mannheim.de/sprechstunde, meeting ID: 682 8185 1819, passcode: 443071).
Selection of text recognition and transcription platforms
| Tool | Cost model | Properties | Particularly suitable for |
| --- | --- | --- | --- |
| | fee-based/commercial | Text and layout recognition; good layout analysis | Modern prints, complex layout |
| | Open Source | Graphical user interface for Kraken; intuitive use | Historical prints and manuscripts, including non-Latin script |
| | fee-based/commercial | Text recognition; image and video analysis; for manuscripts and prints | Prints and manuscripts |
| | Open Source | Command line-based text recognition software; optimised for historical and non-Latin written material | Historical prints and manuscripts, including non-Latin script |
| | Open Source | Graphical user interface for various open source text recognition programmes | Historical prints and manuscripts |
| | Open Source | Command line programme for text recognition of PDF files; uses Tesseract as OCR engine | Historical/modern prints |
| | Open Source | Modular, command line-based text recognition software | Historical prints |
| | Open Source | Web-based text recognition platform; good universal models; currently no follow-up training possible | Historical/modern prints and manuscripts |
| | Open Source | Command line-based text recognition software; suitable for large data sets | Historical/modern prints |
| | fee-based/commercial | Comprehensive text recognition and transcription platform; with intuitive user interface | Historical manuscripts and tables |
Instructions and materials for various OCR software
Here you will find instructions and materials on various open source text recognition programmes and transcription platforms. It is a collection of useful references; not all resources were created by Mannheim University Library itself.
eScriptorium
- All GitHub documentation of the Mannheim University Library on eScriptorium (German)
- Local installation (Windows/Linux) (German)
- Local installation (macOS) (English)
- User manuals
- Video: Introduction to eScriptorium (German)
- Model transfer from Transkribus to eScriptorium (German)
Tips for creating ground truth (training data)
As part of the OCR-D project, transcription guidelines were drawn up that define three transcription levels for historical documents. The levels differ in how faithfully they reproduce the original. The guidelines can be found on the OCR-D project homepage. A guideline for publishing your own training data is also available on GitHub.
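To give a sense of what differing degrees of faithful reproduction can mean in practice, here is a minimal Python sketch that reduces a close transcription to a more normalized one. The character mapping and the function name `simplify` are hypothetical and chosen for illustration only; they are not taken from the OCR-D guidelines, which define the actual rules for each level.

```python
import unicodedata

# Hypothetical mapping for illustration only; NOT the OCR-D rule set.
SIMPLIFY = {
    "\u017f": "s",        # long s -> round s
    "a\u0364": "\u00e4",  # a + combining small e -> a-umlaut
    "o\u0364": "\u00f6",  # o + combining small e -> o-umlaut
    "u\u0364": "\u00fc",  # u + combining small e -> u-umlaut
    "\ua75b": "r",        # r rotunda -> r
}

def simplify(text: str) -> str:
    """Reduce a close transcription to a more normalized form."""
    for old, new in SIMPLIFY.items():
        text = text.replace(old, new)
    # Compose any remaining base letter + combining mark pairs.
    return unicodedata.normalize("NFC", text)

faithful = "Wa\u017f\u017fer vnd Bu\u0364cher"
print(simplify(faithful))  # prints "Wasser vnd Bücher"
```

Whichever level is chosen, it should be applied consistently across a training set, since a model learns exactly the characters that appear in its ground truth.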
Here you will find ground truth for training or retraining your own models:
- OCR & Ground-Truth-Resources
- HTR United
- Ground-Truth for Charlottenburger Amtsschrifttum
- Ground-Truth for digital copies of the Mannheim University Library
- Ground-Truth for digital copies of the Tübingen University Library
- IAM Database for manuscripts
A virtual keyboard with the required special characters can also be helpful when creating ground truth. Virtual keyboards for different transcription platforms are available on GitHub.
Projects and collaborations
- Cooperation project on text recognition and data structuring with the Chair of Economic History (Prof. Streb)
- Cooperation project on manuscript recognition with the Chair of Late Medieval and Early Modern Studies (Prof. Kümper)
If we can support you or if you have any questions, please do not hesitate to contact us.