Investigating OCR

July 6, 2007


Since 0.5.1, I have been investigating OCR for Gnome through Gnome Scan. The most advanced software is OCRopus. OCRopus is to Tesseract what HTML is to plain text. And in fact, OCRopus outputs HTML :). OCRopus is currently based on the famous Tesseract OCR engine, but some hOCR code is in the repo, and more is to come.

Just as with Gegl, I make sure that even the alpha software I use in gnome-scan is at least packaged, either in an official repository or by myself. Tesseract is packaged, but OCRopus requires Tesseract SVN, which does not build and has a bug (public headers include config.h). The build failure has a patch waiting for inclusion, and I provided a patch fixing the latter issue.

The main issue is that OCRopus has never published a release. It only has an SVN repo. OCRopus uses Jam without any rule to generate a distribution package (like make distcheck from automake). After some research, I decided to add an automake build system to OCRopus. Tesseract has two build systems: one for developers, one for distributors. Automake is still the best solution for distributors.

I posted a fairly big but incomplete patch to the OCRopus mailing list, which received some attention. It seems that Tesseract and OCRopus are very close to each other. Neither is very active. I hope that gnome-scan will nudge the developers of those projects and bring them closer to users. OCRopus has only a command line tool; it should provide a library. Tesseract provides 11 libraries; it should provide only one!

So, gnome-scan OCR is kind of blocked by upstream projects that need some love. Both projects need to modify their design a bit. I think I will provide a “protocropus” plugin for gnome-scan, which will provide a simple bridge between the various command line OCR tools or libraries (including OCRopus), without depending on OCRopus.
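
To make the bridge idea a bit more concrete, here is a minimal Python sketch of how such a layer could wrap command line OCR tools. It is not gnome-scan code: the `BACKENDS` registry and the `recognize` function are hypothetical names invented for this illustration, and the only assumption about a real tool is the classic `tesseract IMAGE OUTPUTBASE` invocation, which writes its result to `OUTPUTBASE.txt`.

```python
import subprocess
import tempfile
from pathlib import Path

# Hypothetical registry: backend name -> function building the command line.
# Each backend is assumed to write its result to "<outbase>.txt", as the
# classic "tesseract IMAGE OUTPUTBASE" invocation does.
BACKENDS = {
    "tesseract": lambda image, outbase: ["tesseract", str(image), str(outbase)],
}


def recognize(image, backend="tesseract", backends=BACKENDS):
    """Run a command line OCR tool on `image` and return the recognized text."""
    build_argv = backends[backend]
    with tempfile.TemporaryDirectory() as tmp:
        outbase = Path(tmp) / "page"
        # check=True raises CalledProcessError if the tool exits with an error.
        subprocess.run(build_argv(Path(image), outbase), check=True)
        return outbase.with_suffix(".txt").read_text(encoding="utf-8")
```

With Tesseract installed, `recognize("scan.png")` would return the recognized text. Supporting another tool would only mean adding an entry to the registry, so the caller never depends on one particular OCR engine.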


2 Responses to “Investigating OCR”

  1. Anonymous Says:

    Many projects are moving away from automake to something usable.

    Don't do the reverse. Some examples: Scribus, Inkscape (on Windows at least, but there are continuous discussions about moving away from automake, because nobody is really familiar with it and it requires some magic).

    Automake is like CVS: it definitely needs a replacement, just as CVS was replaced by SVN, git, etc.

    What about cmake?

  2. Bronzecat Says:

    When searching for OCR software for Linux, and more precisely Gnome, finding such professional software is often pure luck!
    Please keep working on it; it's a great piece of professional software that, I am totally certain, many people need.