Investigating OCR

July 6, 2007


Since 0.5.1, i’m investigating OCR for Gnome through Gnome Scan. The most advanced software is OCRopus. OCRopus is to tesseract what HTML is to plain/text. And in fact, OCRopus output HTML :). OCRopus is currently based on the famous tesseract OCR engin, but some hocr code is in the repo, and more is to come.

Just like Gegl, i ensure that even alpha softwares i use in gnome-scan are at least packaged, either in an official repos or by my own. Tesseract is packaged, but OCRopus require tesseract SVN which does not built and has a bug (public headers includes config.h). The built failure received a patch waiting for inclusion, and i provide a patch fixing the last issue.

The main issue is that OCRopus has never publish a release. It has only SVN repo. OCRopus uses Jam without any rules to generate a distribution package (like make distcheck from automake). After some research, i decided to add automake build system to OCRopus. Tesseract has two build system : one for developer, one for distributor. Automake is still the best solution for distributor.

I provided a quite big but incomplete patch at OCRopus mailing list which received some attention. It seems that tesseract and OCRopus are very close together. Both are not very active. I hope that gnome-scan will tick those project developers and put them to users. OCRopus has only a command line tool, it should provide a library. Tesseract provide 11 libraries, it should provide only one !

So, gnome-scan OCR is kind of blocked by upstream project that’s needs some love. Both projects needs to modify a bit their design. I think i will provide a “protocropus” plugin for gnome-scan which will provide a simple bridge between the various command line OCR tools or libraries (including OCRopus), but not depending on OCRopus.