Choosing an OCR system

August 10, 2007

Hi everybody,

Following the comments on my last post about the state of OCR in Gnome, I feel the need to clarify where I stand on supporting proprietary software.

Manifesto: Gnome Scan is part of Gnome and thus part of GNU. Yes, Gnome Scan has GNU in its name, and that’s not for fashion. Gnome Scan’s goal is to provide a libre scanning infrastructure for the GNU OS on top of (rocking) Gnome technologies such as Gtk+ and GEGL. Gnome Scan also uses non-GNU free software, such as SANE for accessing scanners and now OCRopus for OCR.

Someone will say: “Why choose OCRopus? OCR-Shop or IRIS Toolkit rocks!”

Yes, free OCR engines are years behind proprietary software; that is true. However, using a proprietary solution won’t help them catch up. Paying for an SDK in order to add value to proprietary software, without even receiving any income in return, is just crazy! It’s up to the respective companies to provide support for their software. Please don’t complain that I don’t use your proprietary software. I do accept that Gnome OCR must make room for every OCR engine, simply because none is perfect (especially the libre ones).

The comments about supporting different OCR engines are fair. Taking this feedback into account, I plan to build an API for Gnome OCR just as GtkPrint does for printing and Gnome Scan does for scanning, i.e. in a modular fashion. This is a change from my preliminary plan to provide this library in OCRopus itself. However, I’m pretty sure I will only support OCRopus myself. Just like SANE in Gnome Scan up to 0.4, Gnome Scan currently uses OCRopus and only OCRopus (i.e. hardcoded) for OCR. Even worse, AbiScan itself uses OCRopus directly. That’s an experimental solution; comments are welcome.
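To make “modular” a bit more concrete, here is a minimal sketch of a pluggable engine interface, written in Python for brevity. It is only an illustration: the names OcrEngine, register_engine, get_engine and DummyEngine are hypothetical and are not part of Gnome Scan, Gnome OCR or OCRopus.

```python
from abc import ABC, abstractmethod


class OcrEngine(ABC):
    """Hypothetical backend interface; not an actual Gnome OCR API."""

    name = "abstract"

    @abstractmethod
    def recognize(self, image_path: str) -> str:
        """Return the recognized document as an HTML string."""


# Registry mapping an engine name to its backend class.
_ENGINES = {}


def register_engine(cls):
    """Class decorator that makes a backend selectable by name."""
    _ENGINES[cls.name] = cls
    return cls


def get_engine(name: str) -> OcrEngine:
    """Instantiate the backend registered under `name`."""
    try:
        return _ENGINES[name]()
    except KeyError:
        raise ValueError(f"no OCR engine named {name!r}") from None


@register_engine
class DummyEngine(OcrEngine):
    """Stand-in backend used only to exercise the registry."""

    name = "dummy"

    def recognize(self, image_path: str) -> str:
        return f"<p>text recognized from {image_path}</p>"
```

With such a layout, the application only ever calls get_engine(name).recognize(path); supporting another engine means adding one more registered class, without touching the callers.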

Asking for a libre OCR API is very important. That’s one of Gnome Scan’s values. OCRopus and libre OCR engines need love. Don’t refuse them what they need 😉

I hope everyone understands my point of view and Gnome Scan’s goals, and feels free to comment. Feedback makes me happy :).

Regards,
Étienne.

State of OCR in Gnome

August 7, 2007

Hi everybody,

My work on flegita-gimp does not mean I forgot OCR, which is, IMHO, the first-class feature of scanning. Writing AbiScan cleared up my vision of how to design the Gnome OCR UI. Before writing AbiScan, I was wondering how to integrate OCR into Gnome Scan. I was really worried because Gnome Scan is designed to pass an image (as a GeglBuffer) to the application, not text, HTML or whatever the OCR output format is. I decided to write AbiScan and use ocropus directly instead of going through Gnome Scan.

This led me to figure out how Gnome will get OCR and an OCR UI. AbiScan uses the ocropus command line tool; the idea is to use a library providing a common OCR UI instead. This library should be shipped by OCRopus. Why OCRopus and not Gnome Scan? Because I think this library depends more on OCRopus than on Gnome Scan. I may provide an OCR sink in Gnome Scan to help plug Gnome Scan and OCRopus together, but that does not cover the whole UI and OCR interaction part, which should rely heavily on OCRopus itself, just like the OCRopus command line tool.

Publishing AbiScan seems to have raised questions from users. At the risk of repeating the OCRopus website, let me explain a bit about OCRopus. OCRopus is not an OCR engine. OCRopus is a document analysis and OCR system. Instead of writing its own OCR engine, it uses existing ones, especially tesseract, with more to come. The difference between OCRopus and an OCR engine is exactly the same as between HTML and plain text. HTML contains semantics, formatting and the text itself, while plain text contains only … text! So if you ever read a comparison between OCRopus and e.g. gocr or ocrad, you can laugh at it. Well, in fact, ocrad has a minimal layout analyser for text columns, but it is not as advanced as the OCRopus layout analyser.
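To illustrate the HTML versus plain text point, here is a small Python sketch that flattens an HTML fragment to bare text using only the standard library. The sample markup is made up for the example and is not actual OCRopus output.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect character data and discard every tag."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())


def flatten(html: str) -> str:
    """Return only the text of `html`, joined with single spaces."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Made-up sample: a heading and a paragraph with a bold amount.
sample = "<h1>Invoice</h1><p>Total: <b>42</b> EUR</p>"
print(flatten(sample))
```

Once flattened, nothing says any more that “Invoice” was a heading or that the amount was emphasized; that lost layer is what a document analysis system adds on top of a bare OCR engine.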

Regards,
Étienne.

AbiScan Preview

August 6, 2007

Hi all,

After about one week of lazy effort, I managed to produce a preliminary version of AbiScan on top of OCRopus. I produced a screencast video of direct OCR import into an Abiword frame. This is very buggy, but very exciting too :).

I must thank the #abiword people, especially Dominic Lachowicz, Marc Mauer, Martin Sevior, jean, sum1 and Hubert Figuière. Thanks also go to the OCRopus and Gegl people for their work and advice.

I provide the AbiScan patch against abiword-plugins SVN. The plugin does not work if abiword uses the G_MODULE_BIND_LAZY flag; this is a bug in abiscan, not in abiword. I provide a patch against abiword SVN removing the g_module_open flags, but it will hopefully never be merged.

If you want to try it, follow these steps:

  1. Install tesseract-ocr from SVN, with the patch I provide in the tesseract BTS;
  2. Install ocropus;
  3. Install Gegl SVN;
  4. Install Gnome Scan SVN;
  5. Install abiword SVN with the g-module-open-flags.diff patch;
  6. Install abiword-plugins SVN with the abiscan.diff patch;
  7. Launch Abiword;
  8. Choose Insert > Import from scanner and follow the steps.

Warning: it’s really buggy.

  • Gnome Scan does not handle the device list very well if you open the dialog several times.
  • OCRopus does not provide any API, so the plugin uses system() and isn’t able to monitor progress. OCRopus might take a very long time.
  • Sometimes, it eats tons of memory.
  • Currently, it loses formatting; that’s due to a bug in pasteFromBuffer() on HTML import. I had to choose between pasting into the existing document and losing formatting, or opening the temporary OCRopus HTML file directly.
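On the system() limitation: a blocking call gives no chance to update a progress indicator or to give up on a runaway job. A possible workaround, sketched here in Python rather than in the plugin’s own language, is to start the tool as a child process and poll it. The function name run_with_polling and its parameters are hypothetical, and the demo runs a dummy Python child instead of OCRopus, since I don’t want to guess at its command line.

```python
import subprocess
import sys
import tempfile
import time


def run_with_polling(argv, poll_interval=0.1, timeout=60.0, on_tick=None):
    """Run `argv`, calling `on_tick(elapsed)` until the process exits.

    Returns (exit_code, stdout_text); kills the child after `timeout` seconds.
    """
    started = time.monotonic()
    # Send output to a temporary file so a large result cannot fill a pipe
    # and stall the child while we are only polling.
    with tempfile.TemporaryFile(mode="w+") as out:
        proc = subprocess.Popen(argv, stdout=out)
        while proc.poll() is None:
            elapsed = time.monotonic() - started
            if elapsed > timeout:
                proc.kill()
                proc.wait()
                raise TimeoutError(f"command exceeded {timeout} s")
            if on_tick is not None:
                on_tick(elapsed)  # e.g. pulse a progress bar here
            time.sleep(poll_interval)
        out.seek(0)
        return proc.returncode, out.read()


if __name__ == "__main__":
    # Dummy child standing in for a slow OCR run.
    child = [sys.executable, "-c", "import time; time.sleep(0.5); print('<html></html>')"]
    code, html = run_with_polling(child, on_tick=lambda t: print(f"waiting {t:.1f} s"))
    print(code, html.strip())
```

This only tells the UI that the tool is still running, not how far along it is; real progress reporting would still need an API from OCRopus itself.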

Bug reports are very welcome; please file bugs against the gnome-scan product in Gnome bugzilla, under the abiscan component. Note that OCRopus prefers 150dpi images.

Anyway, that’s a rough draft with the key features provided by Gnome Scan and OCRopus: tight integration into the application and advanced OCR.



Regards,
Étienne


E Ultreïa !

Back from scouting

July 30, 2007

Hi all Gnome lovers,

I’m back from 17 days of scouting in nature. This was great. I published some photos of the camp at Faye in Nièvre. I came back last Friday and was exhausted.

I haven’t resumed Gnome Scan development yet. I’ll take the time to think about the future of Gnome Scan, especially OCR. Sadly, there was not much work on OCRopus during the past 3 weeks. I wonder how to pass data to the application. OCRopus output is HTML with OCR tags. That’s useful but not very clean. I wonder how to integrate that into AbiWord.
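As a rough idea of what an application could do with such output, here is a Python sketch that pulls text lines and their bounding boxes out of HTML. It assumes the hOCR convention, where a line is an element with class ocr_line and a title attribute of the form “bbox x0 y0 x1 y1”; I have not checked that against what OCRopus currently emits, and the names HocrLines, parse_bbox and hocr_lines are made up for the example.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract `bbox x0 y0 x1 y1` from a title attribute, or return None."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrLines(HTMLParser):
    """Collect (bbox, text) pairs for elements whose class is `ocr_line`."""

    def __init__(self):
        super().__init__()
        self.lines = []    # finished (bbox, text) pairs
        self._tag = None   # tag name of the line element being read
        self._depth = 0    # nesting depth of that same tag name
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self._tag is not None:
            if tag == self._tag:
                self._depth += 1
            return
        if "ocr_line" in (attrs.get("class") or "").split():
            self._tag, self._depth = tag, 1
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_endtag(self, tag):
        if self._tag is None or tag != self._tag:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace left over from inline markup.
            text = " ".join("".join(self._text).split())
            self.lines.append((self._bbox, text))
            self._tag = None

    def handle_data(self, data):
        if self._tag is not None:
            self._text.append(data)


def hocr_lines(html):
    """Return a list of (bbox, text) pairs found in `html`."""
    parser = HocrLines()
    parser.feed(html)
    parser.close()
    return parser.lines
```

A word processor plugin could then walk these pairs to rebuild paragraphs or place text frames, instead of pasting the raw HTML.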

So, my plan for the end of the summer is to implement OCR, rotation, and the Gimp and Abiword plugins.

Investigating OCR

July 6, 2007

Hi,

Since 0.5.1, I’ve been investigating OCR for Gnome through Gnome Scan. The most advanced software is OCRopus. OCRopus is to tesseract what HTML is to plain text. And in fact, OCRopus outputs HTML :). OCRopus is currently based on the famous tesseract OCR engine, but some hocr code is in the repo, and more is to come.

Just as with Gegl, I make sure that even the alpha software I use in gnome-scan is at least packaged, either in an official repository or by myself. Tesseract is packaged, but OCRopus requires tesseract SVN, which does not build and has a bug (public headers include config.h). The build failure has a patch awaiting inclusion, and I provide a patch fixing the latter issue.

The main issue is that OCRopus has never published a release. It only has an SVN repo. OCRopus uses Jam without any rule to generate a distribution package (like make distcheck from automake). After some research, I decided to add an automake build system to OCRopus. Tesseract has two build systems: one for developers, one for distributors. Automake is still the best solution for distributors.

I posted a fairly big but incomplete patch to the OCRopus mailing list, which received some attention. It seems that tesseract and OCRopus are very close to each other. Neither is very active. I hope that gnome-scan will prod those projects’ developers and bring the projects to users. OCRopus has only a command line tool; it should provide a library. Tesseract provides 11 libraries; it should provide only one!

So, gnome-scan OCR is more or less blocked by upstream projects that need some love. Both projects need to modify their design a bit. I think I will provide a “protocropus” plugin for gnome-scan, which will be a simple bridge to the various command line OCR tools or libraries (including OCRopus) without depending on OCRopus.
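The bridge idea can be sketched in a few lines of Python. This is only an assumption about how such a plugin might be organised: the class name CommandLineBridge and the “{image}” placeholder are hypothetical, and no real OCR tool’s command line is encoded here, since each tool’s arguments would have to be taken from its own documentation.

```python
import subprocess


class CommandLineBridge:
    """Map engine names to argv templates; `{image}` marks the input path."""

    def __init__(self):
        self._templates = {}

    def add(self, name, argv_template):
        """Register a tool under `name` with its argument list."""
        self._templates[name] = list(argv_template)

    def engines(self):
        """Return the registered engine names, sorted."""
        return sorted(self._templates)

    def recognize(self, name, image_path):
        """Run the named tool on `image_path` and return what it prints."""
        argv = [arg.replace("{image}", image_path) for arg in self._templates[name]]
        done = subprocess.run(argv, capture_output=True, text=True, check=True)
        return done.stdout
```

The bridge itself knows nothing about any particular engine, so it does not depend on OCRopus; each tool is just one more template registered with add().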

Étienne.