Installing additional language packs

OCRmyPDF uses Tesseract for OCR, and relies on its language packs for languages other than English.

Tesseract provides trained language data for more than 100 languages.

You can often find packages that provide language packs:

# Display a list of all Tesseract language packs
apt-cache search tesseract-ocr

# Debian/Ubuntu users
apt-get install tesseract-ocr-chi-sim  # Example: Install Chinese Simplified language pack

You can then pass the -l LANG argument to OCRmyPDF to tell it which languages to look for. To request multiple languages, use either -l eng+fra (English and French) or -l eng -l fra. LANG is a Tesseract language code, so French is fra rather than fre.
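As a minimal sketch of the steps above, the following shell snippet joins a list of language codes into a single -l argument and reports any code that `tesseract --list-langs` does not list. The helper names `join_langs` and `missing_langs` and the file names input.pdf and output.pdf are illustrative assumptions, not part of OCRmyPDF or Tesseract.

```shell
#!/bin/sh

# join_langs: join language codes with "+", e.g. "eng fra" -> "eng+fra".
# The body runs in a subshell so the IFS change does not leak out.
join_langs() (
    IFS=+
    printf '%s\n' "$*"
)

# missing_langs: print each requested code that is not reported by
# `tesseract --list-langs`. Older Tesseract versions write the list to
# stderr, so both streams are captured. If tesseract is not installed,
# every requested code is reported as missing.
missing_langs() {
    installed=$(tesseract --list-langs 2>&1) || installed=""
    for lang in "$@"; do
        printf '%s\n' "$installed" | grep -qxF -- "$lang" || printf '%s\n' "$lang"
    done
}

# Example: check English and French, then show the resulting command.
for lang in $(missing_langs eng fra); do
    echo "Language pack not found: $lang" >&2
done
echo "ocrmypdf -l $(join_langs eng fra) input.pdf output.pdf"
```

The final line only prints the command; remove the `echo` to actually run OCRmyPDF on your own files.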

Known limitations

As of v4.2, users of OCRmyPDF working with languages outside the Latin alphabet should use the following syntax:

ocrmypdf -l eng+ell --output-type pdf --pdf-renderer tesseract input.pdf output.pdf

The reasons for this are:

  • The latest version of Ghostscript (9.19 as of this writing) has unfixed bugs in Unicode handling that generate invalid character maps, so Ghostscript cannot be used for PDF/A conversion
  • The default “hocr” PDF renderer does not handle Asian fonts properly
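The workaround above could be applied automatically with a small wrapper. The sketch below is one possible approach under stated assumptions: the function name `extra_flags` and the variable `NON_LATIN_LANGS` are hypothetical, and the list of non-Latin language codes is an illustrative, incomplete selection of Tesseract codes (`ell` is Tesseract's code for modern Greek). Only the flags `--output-type pdf` and `--pdf-renderer tesseract` come from the text above.

```shell
#!/bin/sh

# Illustrative, incomplete list of Tesseract codes for non-Latin scripts.
# This list is an assumption; extend it for the languages you use.
NON_LATIN_LANGS="ell rus chi_sim chi_tra jpn kor ara heb"

# extra_flags: given a language argument such as "eng+ell", print the
# workaround flags if any of its codes appears in NON_LATIN_LANGS.
# Prints nothing when all codes are absent from the list.
extra_flags() {
    for lang in $(printf '%s' "$1" | tr '+' ' '); do
        case " $NON_LATIN_LANGS " in
            *" $lang "*)
                printf '%s\n' "--output-type pdf --pdf-renderer tesseract"
                return 0
                ;;
        esac
    done
}

# Example: show the command that would be run for English plus Greek.
langs="eng+ell"
echo "ocrmypdf -l $langs $(extra_flags "$langs") input.pdf output.pdf"
```

For a Latin-only argument such as eng+fra the function prints nothing, so the command keeps OCRmyPDF's defaults.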