GreekOCR Toolkit: Global Functions

The toolkit defines a number of free functions that are not image methods. They are defined in different modules and can be imported in a Python script with

from gamera.toolkits.greekocr.compare import levenshtein, errorrate
from gamera.toolkits.greekocr.unicode_teubner import unicode_to_teubner

Conversion of Greek Unicode to LaTeX

unicode_to_teubner

Converts the given Unicode string to a LaTeX document body, using the Teubner style for representing Greek characters and accents. Signature:

unicode_to_teubner(unicode_text)

The returned LaTeX code does not contain the LaTeX header. To create a complete LaTeX document, you can use the following code:

# LaTeX header (raw strings keep every backslash literal)
print(r"\documentclass[10pt]{article}")
print(r"\usepackage[polutonikogreek]{babel}")
print(r"\usepackage[or]{teubner}")
print(r"\begin{document}")
print(r"\selectlanguage{greek}")

# document body
print(unicode_to_teubner(unicode_string))

# LaTeX footer
print(r"\end{document}")
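If the complete document is needed as a single string (for example, to write it to a file) rather than printed line by line, the same header and footer can be wrapped around the body in a small helper. The following is only a sketch: wrap_teubner_document is a hypothetical name and not part of the toolkit, and the body argument is assumed to be the return value of unicode_to_teubner.

```python
# Hypothetical helper, NOT part of the GreekOCR toolkit: wraps a LaTeX
# body (assumed to come from unicode_to_teubner) in the header and
# footer shown above and returns the whole document as one string.
def wrap_teubner_document(body):
    header = "\n".join([
        r"\documentclass[10pt]{article}",
        r"\usepackage[polutonikogreek]{babel}",
        r"\usepackage[or]{teubner}",
        r"\begin{document}",
        r"\selectlanguage{greek}",
    ])
    footer = r"\end{document}"
    return header + "\n" + body + "\n" + footer
```

With the toolkit installed, a call such as wrap_teubner_document(unicode_to_teubner(unicode_string)) would then yield the full document text.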

Measuring recognition rates with ground truth data

levenshtein

Computes the Levenshtein distance (aka edit distance) between the two Unicode strings s1 and s2. Signature:

levenshtein(s1, s2)

This implementation differs from the plugin edit_distance in the Gamera core in two respects:

  • the Gamera core function is implemented in C++ and currently only works with ASCII strings
  • this implementation is written in pure Python and therefore somewhat slower, but it works with Unicode strings.

For details about the algorithm, see http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
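To illustrate what such a pure Python computation can look like, here is a minimal sketch of the common two-row dynamic programming scheme. It is not necessarily the toolkit's own implementation; levenshtein_sketch is a hypothetical name chosen for this example.

```python
# Illustrative sketch only; the toolkit's levenshtein may be written
# differently. Works on Unicode strings because it compares code points.
def levenshtein_sketch(s1, s2):
    if len(s1) < len(s2):
        s1, s2 = s2, s1  # iterate over the longer string, keep rows short
    # previous[j] = distance between the empty prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        current = [i]  # distance between s1[:i] and the empty string
        for j, c2 in enumerate(s2, 1):
            insertion = previous[j] + 1
            deletion = current[j - 1] + 1
            substitution = previous[j - 1] + (c1 != c2)
            current.append(min(insertion, deletion, substitution))
        previous = current
    return previous[-1]


# "kitten" -> "sitting" needs two substitutions and one insertion.
print(levenshtein_sketch(u"kitten", u"sitting"))  # prints 3
```

Only two rows of the distance table are kept at any time, so memory use grows with the length of the shorter string rather than with the product of both lengths.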

errorrate

Returns the edit distance between the two given Unicode strings divided by the length of the first string (the ground truth). Signature:

errorrate(groundtruth, ocr)
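As a worked example of this definition, the sketch below computes the ratio without requiring Gamera. Both function names are hypothetical stand-ins, not the toolkit's code. Returning a fraction (rather than a percentage) and raising an error for an empty ground truth are assumptions made for this example.

```python
# Illustrative stand-ins, NOT the toolkit's implementation.
def _edit_distance(a, b):
    # Single-row dynamic programming edit distance.
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        diagonal, row[0] = row[0], i
        for j, cb in enumerate(b, 1):
            diagonal, row[j] = row[j], min(row[j] + 1,
                                           row[j - 1] + 1,
                                           diagonal + (ca != cb))
    return row[-1]


def errorrate_sketch(groundtruth, ocr):
    # Assumption: an empty ground truth is rejected instead of dividing by zero.
    if len(groundtruth) == 0:
        raise ValueError("ground truth must not be empty")
    return float(_edit_distance(groundtruth, ocr)) / len(groundtruth)


# Ground truth "λόγος" versus an OCR result that lost the accent ("λογος"):
# one substitution over five characters.
groundtruth = u"\u03bb\u03cc\u03b3\u03bf\u03c2"
ocr = u"\u03bb\u03bf\u03b3\u03bf\u03c2"
print(errorrate_sketch(groundtruth, ocr))  # prints 0.2
```

Because the distance is divided by the length of the ground truth only, the ratio under this definition can exceed 1 when the OCR output is much longer than the ground truth.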