Automated Text Recognition – Extracting Data via OCR/HTR

Automated text recognition, often referred to as optical character recognition (OCR), captures text from digital images and turns it into searchable and analyzable data. Mannheim University Library has many years of experience in digitization and in using a range of text recognition software.

The Research Data Center is happy to support researchers at the University of Mannheim along the entire workflow, from digitization through layout and text recognition to training specialized models and structuring the data.

Services

Consulting on automated text recognition (OCR) for research projects
OCR Recommender
Open OCR consultation hour: every 2nd Thursday of the month, from 3 to 4 p.m., without registration (link to Zoom meeting: ocr-bw.bib.uni-mannheim.de/sprechstunde, meeting ID: 682 8185 1819, ID code: 443071).

Selection of text recognition and transcription platforms

Tool	Cost model	Properties	Particularly suitable for
ABBYY FineReader	fee-based/commercial	Text and layout recognition; good layout analysis	Modern prints, complex layout
eScriptorium	Open Source	Graphical user interface for Kraken; intuitive use	Historical prints and manuscripts, including non-Latin script
Google Vision	fee-based/commercial	Text recognition; image and video analysis; for manuscripts and prints	Prints and manuscripts
Kraken	Open Source	Command line-based text recognition software; optimised for historical and non-Latin written material	Historical prints and manuscripts, including non-Latin script
OCR4All	Open Source	Graphical user interface for various open source text recognition programmes	Historical prints and manuscripts
OCRmyPDF	Open Source	Command line programme for text recognition of PDF files; uses Tesseract as OCR engine	Historical/modern prints
OCR-D	Open Source	Modular, command line-based text recognition software	Historical prints
PERO-OCR	Open Source	Web-based text recognition platform; good universal models; currently no follow-up training possible	Historical/modern prints and manuscripts
Tesseract	Open Source	Command line-based text recognition software; suitable for large data sets	Historical/modern prints
Transkribus	fee-based/commercial	Comprehensive text recognition and transcription platform; with intuitive user interface	Historical manuscripts and tables
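
Several of the tools in the table above are driven from the command line. As a minimal sketch, assuming Tesseract and OCRmyPDF are installed together with the German language data (`deu`), the following Python helpers assemble their basic invocations. The helper names and file names are hypothetical and serve only as illustration; consult the linked guides for the options relevant to your material.

```python
import shutil
import subprocess


def build_tesseract_command(image, output_base, lang="deu"):
    """Build the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image), str(output_base), "-l", lang]


def build_ocrmypdf_command(input_pdf, output_pdf, lang="deu"):
    """Build the argument list for: ocrmypdf -l LANG INPUT OUTPUT."""
    return ["ocrmypdf", "-l", lang, str(input_pdf), str(output_pdf)]


def run_if_available(command):
    """Run the command only if its executable is on PATH.

    Returns True if the command was executed, False if the tool is missing.
    """
    if shutil.which(command[0]) is None:
        return False
    subprocess.run(command, check=True)
    return True
```

For example, `run_if_available(build_tesseract_command("page.png", "page"))` would write the recognized text to `page.txt` if Tesseract is installed, and return `False` without doing anything otherwise.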

Instructions and materials for various OCR software
Here you will find instructions and materials on various open source text recognition programmes and transcription platforms. This is a collection of useful references; not all resources were created by Mannheim University Library itself.
eScriptorium
All GitHub documentation by Mannheim University Library on eScriptorium (German)
Local installation (Windows/Linux) (German)
Local installation (macOS) (English)
User manuals
German
English
Video: Introduction to eScriptorium (German)
Model transfer from Transkribus to eScriptorium (German)
OCR-D
User and installation guide
OCRmyPDF
User and installation guide (Windows/Linux) (German)
Tesseract
All GitHub documentation by Mannheim University Library on Tesseract (German)
User and installation guide (Linux) (German)
User and installation guide (Windows) (German)
Tips for creating ground truth (training data)
As part of the OCR-D project, transcription guidelines were drawn up that define three transcription levels for historical documents. The levels differ in how faithfully they reproduce the original. The guidelines can be found on the OCR-D project homepage. A guideline for publishing your own training data is available on GitHub.
The following resources provide ground truth for training or retraining your own models:
OCR & Ground-Truth-Resources
HTR United
Ground-Truth for Charlottenburger Amtsschrifttum
Ground-Truth for digital copies of the Mannheim University Library
Ground-Truth for digital copies of the Tübingen University Library
IAM Database for manuscripts
A virtual keyboard with the required special characters can be helpful when creating ground truth. Virtual keyboards for different transcription platforms are available on GitHub.
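
To decide which special characters a virtual keyboard needs to offer, it can help to take stock of the characters that already occur in existing transcriptions. The following is a minimal sketch, not part of any of the platforms mentioned: the function names are hypothetical, and applying NFC normalization is an assumption made here so that composed and decomposed forms of the same character are counted together. Check the transcription guidelines you follow before normalizing your own data.

```python
import unicodedata
from collections import Counter


def special_characters(text):
    """Count non-ASCII, non-whitespace characters after NFC normalization."""
    normalized = unicodedata.normalize("NFC", text)
    return Counter(ch for ch in normalized if ord(ch) > 127 and not ch.isspace())


def describe(counter):
    """Return (character, Unicode name, count) tuples, most frequent first."""
    return [
        (ch, unicodedata.name(ch, "UNKNOWN"), count)
        for ch, count in counter.most_common()
    ]
```

Calling `describe(special_characters(line))` on each ground truth line gives a list of the characters, with their Unicode names, that a keyboard layout for the material would have to cover.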

In our FAQs you will find answers to the most frequently asked questions about automated text recognition and the software used in OCR-BW.

If the answer you are looking for is not listed, simply contact us by e-mail.

Projects and cooperations

Cooperation project on text recognition and data structuring with the Chair of Economic History (Prof. Streb)
Cooperation project on manuscript recognition with the Chair of Late Medieval and Early Modern Studies (Prof. Kümper)

If we can support you or if you have any questions, please do not hesitate to contact us.

Contact

Forschungsdatenzentrum (FDZ)

Team: Irene Schumm, Phil Kolbe, David Morgan, Thomas Schmidt, Renat Shigapov, Christos Sidiropoulos, Vasilka Stoilova, Larissa Will

University of Mannheim
Universitätsbibliothek Mannheim
Schloss Schneckenhof West
68161 Mannheim

E-mail: forschungsdaten@uni-mannheim.de
Web: www.bib.uni-mannheim.de/en/teaching-and-research/research-data-center-fdz

