ocr – muellis blog

I had the pleasure of using a “Bookeye” book scanner. It’s a huge device which helps with scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of the good 100 pages that I scanned.

Unfortunately, the light was very bright, so the scanner scanned “through” the open pages, revealing the back sides of the pages. That’s not very cool, and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point in digitising this book was to actually have it searchable and copy&pastable.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids as it can recognise text on paper and different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences hoping that it will become useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.

cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install

cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install

cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
cd ocropus/
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
cat > ~/bin/leptheaders <<EOF
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything with it because I can’t make Lua load the scripts from the share/ directory of the prefix. Too sad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD licensed Russian OCR software that was once one of the leading tools to do OCR.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.
bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

Apparently takes “HTML annotated OCR data” and bundles that, together with the image, to a PDF.

cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.
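That fix can also be applied from the shell instead of an editor. What follows is only a sketch: the function name `add_gif_ldflag` is made up for illustration, and it assumes that a line appended to the end of the top-level Makefile is picked up by the link step.

```shell
# Hypothetical helper: append "LDFLAGS += -lgif" to a Makefile once.
# Running it a second time leaves the file unchanged.
add_gif_ldflag() {
    makefile="$1"
    flag='LDFLAGS += -lgif'
    # -x: match the whole line, -F: treat the pattern as a fixed string
    if ! grep -qxF -e "$flag" "$makefile"; then
        printf '%s\n' "$flag" >> "$makefile"
    fi
}
```

After `add_gif_ldflag /tmp/ei/Makefile`, the `make && make install` step can simply be repeated.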

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁
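Assuming both tools build, a single page could be pushed through cuneiform and hocr2pdf roughly as follows. This is a sketch, not something verified on the scanned book: `ocr_page` is a made-up name, the language defaults to `fra` only because that is what the other attempts in this post use, and whether cuneiform reads anything other than BMP depends on how it was built.

```shell
# Hypothetical helper: OCR one page image into a searchable one-page PDF.
# Usage: ocr_page IMAGE OUTPUT.pdf [LANGUAGE]
ocr_page() {
    image="$1"
    pdf="$2"
    lang="${3:-fra}"
    hocr="${pdf%.pdf}.hocr"

    # cuneiform writes hOCR (HTML annotated with text positions) to the -o file
    cuneiform -l "$lang" -f hocr -o "$hocr" "$image" || return 1

    # hocr2pdf reads the hOCR from stdin and bundles it with the image given by -i
    hocr2pdf -i "$image" -o "$pdf" < "$hocr"
}
```

With the install prefixes from above on `PATH` and `LD_LIBRARY_PATH`, a call like `ocr_page page-001.bmp page-001.pdf` would leave `page-001.hocr` next to the PDF, which is handy for checking what the OCR step actually recognised.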

gscan2pdf

As a full suite, it can import pictures or PDFs and run an OCR program (tesseract or gocr) on them. The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work at all:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straightforward:
cd /tmp/
wget -O archivista_20101218.iso 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up a Firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF, probably because it’s a web-based thing and either the server rejects big uploads or the CGI just times out, or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything to a CD and packaged software mentioned above. Unfortunately, very old versions were used, i.e. cuneiform is in version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it to be very useful to poke around that thing.
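The extraction could be scripted roughly like this. It is a sketch under assumptions: `extract_live_squash` is a made-up name, and it assumes that `live.squash` sits at the top level of the ISO, as the `file` call above suggests.

```shell
# Hypothetical helper: unpack the SquashFS image from an ISO without
# needing a kernel that can mount SquashFS 3.0.
# Usage: extract_live_squash IMAGE.iso MOUNTPOINT DESTDIR
extract_live_squash() {
    iso="$1"
    mnt="$2"
    dest="$3"

    mkdir -p "$mnt" || return 1
    # Mount the ISO9660 image in userspace
    fuseiso "$iso" "$mnt" || return 1

    # -d names the directory to unpack into
    status=0
    unsquashfs -d "$dest" "$mnt/live.squash" || status=$?

    # Always unmount again, even if the extraction failed
    fusermount -u "$mnt"
    return "$status"
}
```

For example, `extract_live_squash /tmp/archivista_20101218.iso ~/empty /tmp/archivista-root` would leave the appliance's root filesystem in `/tmp/archivista-root`, where the target directory name is just a placeholder.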

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.
cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a web browser which showed a web interface to the WebOCR functionality.
I extracted the necessary scripts, which wrap tools like cuneiform, ghostscript and friends. Compared to the Archivista box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0, which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my good 100-page PDF.

Some custom script

It tries to do the job and in fact did quite well, although it’s quite slow as well. It lacks proper support for spawning multiple commands in parallel.
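Since each page can be processed independently, the work could be spread over several processes with `xargs -P`. This is a sketch, assuming an `xargs` that supports `-0` and `-P` (GNU and BSD both do) and page images named `pg_*.png`, which is the naming that the script's trace output further down suggests; `run_on_pages` is a made-up name.

```shell
# Hypothetical helper: run COMMAND once per page image, JOBS at a time.
# Usage: run_on_pages JOBS DIR COMMAND [ARGS...]
# Each file name is appended as the last argument of COMMAND.
run_on_pages() {
    njobs="$1"
    dir="$2"
    shift 2
    # find searches DIR and its subdirectories;
    # -print0 / -0 keep file names with spaces intact;
    # -n 1 passes one file per invocation, -P runs up to $njobs at once
    find "$dir" -name 'pg_*.png' -print0 | xargs -0 -n 1 -P "$njobs" "$@"
}
```

For example, `run_on_pages 4 . sh -c 'cuneiform -l fra -f hocr -o "$1.hocr" "$1"' sh` would keep up to four cuneiform processes busy at a time.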

After you have installed the dependencies as mentioned above, you can run it:
wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found
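The quotes in the trace mean that pdfjoin received the literal string `pg_*.png.pdf`: either the pattern is quoted in the script, or it matched nothing and the shell passed it through unexpanded. A more defensive variant might look like the sketch below, which reuses the pdfjoin options from the trace. `join_pages` is a made-up name and this is not the script's actual code.

```shell
# Hypothetical helper: join all per-page PDFs in DIR into OUTPUT.
# Usage: join_pages OUTPUT.pdf DIR
join_pages() {
    out="$1"
    dir="$2"
    # Unquoted glob: the shell expands it into one argument per file,
    # in lexical order (so page numbers should be zero-padded)
    set -- "$dir"/pg_*.png.pdf
    # With no match the pattern stays literal; catch that case early
    if [ ! -e "$1" ]; then
        echo "join_pages: no pg_*.png.pdf files in $dir" >&2
        return 1
    fi
    pdfjoin --fitpaper --tidy --outfile "$out" "$@"
}
```

Failing early with a message of its own makes it obvious whether the page PDFs are missing or pdfjoin itself is the problem.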

Having overcome that problem, the following pdfjoin doesn’t work for an unknown reason. After having replaced pdfjoin manually, I realised that the script had sampled the pages down, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.

It’s a mess.

To conclude…

I still don’t have a properly OCRed version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam do their jobs very well. But it appears that they are not knit together well enough to form a useful tool for OCRing a given PDF. Requirements would be, for example, that there is no loss of quality of the scanned images, that the number of programs to be called is reduced to a minimum and that everything is able to do batch processing. So far, I couldn’t find anything that fulfils these requirements. If you know anything or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.

Alright, the following stuff is probably only funny, if you know German and Germans a bit. At least I had to laugh a couple of times, so you might enjoy that as well 🙂

I received a PDF with some weird English translations of German idioms and I tried to extract the text information from that, so I stumbled upon a page explaining how to do OCR with free software on Linux. I got the best results using Tesseract with the German language set, but I had to refine the result (leaving some typos intact).

that’s me sausage = ist mir wurst
go where the pepper grows = geh hin wo der pfeffer wächst
I think my pig whizzles = ich glaub mein schwein pfeift
sorry, my english is under all pig = entschuldige, mein englisch ist unter aller sau
now can come what want…i ready = letzt kann kommen was will, ich bin fertig
I think I spider = ich glaub ich spinne
the devil will i do = den teufel werd ich tun
what too much is, is too much = was zu viel ist, ist zu viel
my lovely mister singing club = mein lieber herr gesangsverein
don’t walk me on the nerves = geh mir nicht auf die nerven
come on…jump over your shadow = komm schon…spring ueber deinen schatten
you walk me animally on the cookie = du gehts mir tierisch auf den keks
there my hairs stand up to the mountain = da stehen mir die haare zu berge
tell me nothing from the horse = erzaehl mir keinen vom pferd
don’t ask after sunshine = trag nicht nach sonnenschein
free like the motto: you me too = frei nach dem Motto, du mich auch
I have the nose full = ich hab die nase voll
lt’s not good cherry-eating with you = es ist nicht gut kirschen essen mit dir
it’s going up like smiths cat = es geht ab wie Schmidts katze
to thunderweather once more = zum Donnerwetter noch mal
not from bad parents = nicht von schlechten eltern
now it goes around the sausage = jetzt geht’s um die wurst
there you on the woodway = da bist du auf dem holzweg
good thing needs while = gut ding braucht weile
holla the woodfairy = holla die waldfee
we are sitting all in the same boot = wir sitzen alle im selben boot
don’t make you a head = mach dlr keinen kopf
there run me the water in the mouth together = da läuft rnlr das wasser im mund zusammen
I understand just train-station = ich versteh nur bahnhof
I hold it in head not out = ich halt’s im kopf nicht aus
shame you what = scham dich was
there we have the salad = da haben wir den salat
end good, everything good = ende gut, alles gut
zip you together = reiß dich zusammen
now butter by the fishes = jetzt mal butter bei die flsche
he made himself me nothing you nothing out of the dust = er machte sich mir nichts, dir nichts aus dem Staub
I belive you have the ass open = ich glaub du hast den Arsch auf!
you make me nothing for = du machst mir nichts vor
that makes me so fast nobody after = das macht mir so schnell keiner nach
I see black for you = ich seh schwarz fur dich
so a pig-weather = so ein Sauwetter
you are really the latest = du bist wirklich das letzte
your are so a fear-rabbit = du bist so ein angsthase
everybody dance after your nose = alle tanzen nach deiner nase
known home luck alone = trautes Heim, Glueck allein
I think I hear not right = Ich denk Ich hör nicht richtig
that have you your so thought = das hast du dir so gedacht
give not so on = gib nicht so an
heaven, ass and thread! = Himmel, Arsch und Zwirn’
of again see = auf wiedersehen
Human Meier = Mensch Meier
now we sit quite beautiful in the ink = jetzt sitzen wir ganz schoen in der Tinte
you have not more all cups in the board = du hast nicht mehr alle Tassen im Schrank
around heavens will = um Himmels willen
you are heavy in order = du bist schwer in Ordnung
l wish you what = ich wünsch dir was
she had a circleroundbreakdown = sie hatte einen kreislaufzusammenbruch
you are a blackdriver = du bist ein schwarzfahrer
I know me here out = ich kenn mich hier aus
l fell from all clouds = Ich fiel aus allen Wolken
that I not laugh = das ich nicht lache
no one can reach me the water = niemand kann mir das wasser relchen
that’s absolut afterfullpullable = das ist absolut nachvollziehbar
give good eight = gib gut acht
not the yellow of the egg = nicht das gelbe vom Ei
come good home = komm gut heim
evererything in the green area = alles im gruenen bererch
I die for Blackforrestcherrycake = Ich sterbe fuer Schwarzwalderkirschtorte
how too always = wie auch immer
I make you ready! = Ich mach dlch fertig!
I laugh me death = ich lach mich tot
it walks me icecold the back down = es lauft mir eiskalt den rücken runter
always with the silence = Immer mit der Ruhe
that’s one-wall-free = das Ist einwandfrei
I’m foxdevilswild = lch bin fuchsteufelswild
here goes the mail off = hier geht die post ab
me goes a light on = mir geht ein licht auf
it‘s highest railway = es ist hoechste Eisenbahn
