OCRing a scanned book

I had the pleasure of using a “Bookeye” book scanner. It’s a huge device which helps with scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of the good 100 pages that I scanned.

Unfortunately, the light was very bright, so the scanner scanned “through” the open pages, revealing the back sides of the pages. That’s not very cool, and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point in digitising this book was to actually have it searchable and copy&pastable.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids, as it can recognise text on paper, different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences, hoping that they will be useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.


cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install


cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install


cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
cat > ~/bin/leptheaders <<EOF
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
cd ocropus/
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything with it because I can’t make Lua load the scripts from the share/ directory of the prefix. Too bad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD-licensed Russian OCR software that was once one of the leading tools to do OCR.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.

bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

It apparently takes “HTML annotated OCR data” and bundles that, together with the image, into a PDF.


cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁
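
For a single page, the cuneiform and hocr2pdf steps could be glued together roughly like this. It is only a sketch: the function name ocr_page is made up, the flags are the ones I know from the command-line help of both tools, and whether cuneiform can read your image format depends on how it was built.

```shell
#!/bin/sh
# Sketch: OCR one page image with cuneiform and wrap the result into a PDF.
# Assumes cuneiform and hocr2pdf are on PATH (e.g. from the install
# prefixes used above). ocr_page is a hypothetical helper name.

# ocr_page IMAGE LANG OUT_PDF
ocr_page() {
    image=$1
    lang=$2
    out_pdf=$3
    hocr="${out_pdf%.pdf}.hocr"

    # -l selects the recognition language, -f the output format,
    # -o the output file; the image comes last.
    cuneiform -l "$lang" -f hocr -o "$hocr" "$image" || return 1

    # hocr2pdf reads the hOCR markup on stdin and combines it with
    # the image given via -i into the PDF given via -o.
    hocr2pdf -i "$image" -o "$out_pdf" < "$hocr"
}
```

A call would then look like `ocr_page page-001.png fra page-001.pdf`, leaving page-001.hocr next to the PDF for inspection.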

gscan2pdf

As a full suite, it can import pictures or PDFs and use one of the OCR programs (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work very well:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straightforward:

cd /tmp/
wget 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up a firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF probably because it’s a web based thing and either the server rejects big uploads or the CGI just times out or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything to a CD and packaged software mentioned above. Unfortunately, very old versions were used, i.e. cuneiform is in version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it to be very useful to poke around that thing.
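
The extraction can be done without root, roughly like this. Treat it as a sketch: extract_live_root is a made-up name, the path of the SquashFS inside the ISO is passed in as a parameter because it may differ, and I assume fuseiso, fusermount and unsquashfs (from squashfs-tools) are installed.

```shell
#!/bin/sh
# Sketch: unpack the root filesystem of a live ISO without asking the
# kernel to mount the (too old) SquashFS.
# extract_live_root is a hypothetical helper name.

# extract_live_root ISO SQUASH_PATH_INSIDE_ISO DEST_DIR
extract_live_root() {
    iso=$1
    squash_rel=$2
    dest=$3
    mnt=$(mktemp -d) || return 1

    # Mount the ISO9660 image in userspace.
    fuseiso "$iso" "$mnt" || { rmdir "$mnt"; return 1; }

    # unsquashfs unpacks the image into the directory given with -d.
    unsquashfs -d "$dest" "$mnt/$squash_rel"
    status=$?

    # Unmount the FUSE mount again and remove the temporary mount point.
    fusermount -u "$mnt"
    rmdir "$mnt"
    return $status
}
```

For the image above, that would be something like `extract_live_root /tmp/archivista_20101218.iso live.squash /tmp/archivista-root`, assuming live.squash sits in the top-level directory of the ISO.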

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.

cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a web browser which showed a web interface to the WebOCR functionality.
I extracted the necessary scripts, which wrap tools like cuneiform, ghostscript and friends. Compared to the Archivista Box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0, which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my PDF of a good 100 pages.

Some custom script

There is a custom script that tries to do the job, and it did in fact do quite well, although it’s quite slow as well. It lacks proper support for spawning multiple commands in parallel.
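
Running the per-page work in parallel does not need much. Here is a minimal sketch using find and xargs; run_parallel is a made-up name, the pg_*.png naming is taken from the script’s own temporary files, and -print0, -maxdepth, -0, -r and -P are GNU (and mostly BSD) extensions rather than POSIX.

```shell
#!/bin/sh
# Sketch: run one command per page image, several at a time.
# run_parallel is a hypothetical helper name.

# run_parallel NJOBS DIR COMMAND [ARGS...]
# Calls "COMMAND ARGS... FILE" once for every pg_*.png directly in DIR.
run_parallel() {
    njobs=$1
    dir=$2
    shift 2
    # -print0/-0 keep odd file names intact, -n 1 passes one file per
    # invocation, -P runs up to NJOBS invocations at once and
    # -r skips the command entirely if no file matches.
    find "$dir" -maxdepth 1 -name 'pg_*.png' -print0 |
        xargs -0 -r -n 1 -P "$njobs" "$@"
}
```

For example, `run_parallel 4 . sh -c 'cuneiform -l fra -f hocr -o "$1.hocr" "$1"' sh` would keep four cuneiform processes busy; the trailing `sh` fills `$0` so that the file name ends up in `$1`.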

After you have installed the dependencies as mentioned above, you can run the script:

wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found

Once I had overcome that problem, the next pdfjoin call didn’t work for an unknown reason. After replacing pdfjoin manually, I realised that the script had sampled the pages down, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.
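
The trace line above shows pdfjoin receiving the literal string pg_*.png.pdf. I don’t know which of the two usual causes applied in the script, but this small sketch (count_args and demo_glob are made-up names) shows both ways a pattern can reach a command unexpanded:

```shell
#!/bin/sh
# Sketch: how a glob pattern reaches a command, quoted vs. unquoted.
# count_args and demo_glob are hypothetical helper names.

# count_args prints how many arguments it received.
count_args() {
    echo $#
}

demo_glob() {
    dir=$(mktemp -d) || return 1
    : > "$dir/pg_1.png.pdf"
    : > "$dir/pg_2.png.pdf"
    (
        cd "$dir" || exit 1
        # Quoted: the command gets the literal string pg_*.png.pdf.
        echo "quoted:   $(count_args 'pg_*.png.pdf')"
        # Unquoted: the shell expands the pattern to the matching files.
        echo "unquoted: $(count_args pg_*.png.pdf)"
        # Unquoted without a match: the pattern is passed on literally too.
        echo "no match: $(count_args nothing_*.png.pdf)"
    )
    rm -rf "$dir"
}

demo_glob
```

With a default sh this prints 1, 2 and 1 for the three cases, so the error can mean either a quoted pattern or simply that no pg_*.png.pdf files existed in that directory at that point.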

It’s a mess.

To conclude…

I still don’t have a properly OCRed version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam do their job very well. But it appears that they are not very well knit together to form a useful tool to OCR a given PDF. Requirements would be, for example, that there is no loss of quality of the scanned images, that the number of programs to be called is reduced to a minimum and that everything is able to do batch processing. So far, I couldn’t find anything that fulfills those requirements. If you know of anything or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.
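
To make the requirements a bit more concrete, the split and join around the per-page OCR could look roughly like this. It is a sketch, not a tested tool: ocr_pdf is a made-up name, pdfimages (from poppler) is my assumption for getting the scans out without resampling, and it assumes one image per page and fewer than 1000 pages so that the numbered file names sort correctly.

```shell
#!/bin/sh
# Sketch: split a scanned PDF into its page images, run a per-page
# command on each one, and join the resulting one-page PDFs.
# ocr_pdf is a hypothetical helper name.

# ocr_pdf IN_PDF OUT_PDF PAGE_COMMAND [ARGS...]
# PAGE_COMMAND is called as "PAGE_COMMAND ARGS... IMAGE OUT_PAGE_PDF".
ocr_pdf() {
    in_pdf=$1
    out_pdf=$2
    shift 2
    work=$(mktemp -d) || return 1

    # Extract the embedded scans instead of re-rendering the pages;
    # -j writes JPEG data as-is, other images come out as PPM/PBM.
    pdfimages -j "$in_pdf" "$work/pg" || { rm -rf "$work"; return 1; }

    # The glob is expanded once, so the PDFs created in the loop
    # are not picked up as input.
    for image in "$work"/pg-*; do
        [ -e "$image" ] || continue
        "$@" "$image" "$image.pdf" || { rm -rf "$work"; return 1; }
    done

    # pdfimages numbers its files pg-000, pg-001, ..., so the glob
    # lists the pages in order.
    pdftk "$work"/pg-*.pdf cat output "$out_pdf"
    status=$?
    rm -rf "$work"
    return $status
}
```

Any per-page command that takes an image and the name of the PDF to write can be plugged in as PAGE_COMMAND, which keeps the lossless split and join separate from the choice of OCR engine.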

Engrish

Alright, the following stuff is probably only funny if you know German and Germans a bit. At least I had to laugh a couple of times, so you might enjoy it as well 🙂

I received a PDF with some weird English translations of German idioms and I tried to extract the text information from that, so I stumbled upon a page explaining how to do OCR with free software on Linux. I got the best results using Tesseract with the German language set, but I had to refine the result (leaving some typos intact).

  • that’s me sausage = ist mir wurst
  • go where the pepper grows = geh hin wo der pfeffer wächst
  • I think my pig whizzles = ich glaub mein schwein pfeift
  • sorry, my english is under all pig = entschuldige, mein englisch ist unter aller sau
  • now can come what want…i ready = letzt kann kommen was will, ich bin fertig
  • I think I spider = ich glaub ich spinne
  • the devil will i do = den teufel werd ich tun
  • what too much is, is too much = was zu viel ist, ist zu viel
  • my lovely mister singing club = mein lieber herr gesangsverein
  • don’t walk me on the nerves = geh mir nicht auf die nerven
  • come on…jump over your shadow = komm schon…spring ueber deinen schatten
  • you walk me animally on the cookie = du gehts mir tierisch auf den keks
  • there my hairs stand up to the mountain = da stehen mir die haare zu berge
  • tell me nothing from the horse = erzaehl mir keinen vom pferd
  • don’t ask after sunshine = trag nicht nach sonnenschein
  • free like the motto: you me too = frei nach dem Motto, du mich auch
  • I have the nose full = ich hab die nase voll
  • lt’s not good cherry-eating with you = es ist nicht gut kirschen essen mit dir
  • it’s going up like smiths cat = es geht ab wie Schmidts katze
  • to thunderweather once more = zum Donnerwetter noch mal
  • not from bad parents = nicht von schlechten eltern
  • now it goes around the sausage = jetzt geht’s um die wurst
  • there you on the woodway = da bist du auf dem holzweg
  • good thing needs while = gut ding braucht weile
  • holla the woodfairy = holla die waldfee
  • we are sitting all in the same boot = wir sitzen alle im selben boot
  • don’t make you a head = mach dlr keinen kopf
  • there run me the water in the mouth together = da läuft rnlr das wasser im mund zusammen
  • I understand just train-station = ich versteh nur bahnhof
  • I hold it in head not out = ich halt’s im kopf nicht aus
  • shame you what = scham dich was
  • there we have the salad = da haben wir den salat
  • end good, everything good = ende gut, alles gut
  • zip you together = reiß dich zusammen
  • now butter by the fishes = jetzt mal butter bei die flsche
  • he made himself me nothing you nothing out of the dust — er machte sich mir nichts, dir nichts aus dem Staub
  • I belive you have the ass open — ich glaub du hast den Arsch auf!
  • you make me nothing for = du machst mir nichts vor
  • that makes me so fast nobody after = das macht mir so schnell keiner nach
  • I see black for you = ich seh schwarz fur dich
  • so a pig-weather = so ein Sauwetter
  • you are really the latest = du bist wirklich das letzte
  • your are so a fear-rabbit = du bist so ein angsthase
  • everybody dance after your nose = alle tanzen nach deiner nase
  • known home luck alone = trautes Heim, Glueck allein
  • I think I hear not right = Ich denk Ich hör nicht richtig
  • that have you your so thought = das hast du dir so gedacht
  • give not so on = gib nicht so an
  • heaven, ass and thread! = Himmel, Arsch und Zwirn’
  • of again see = auf wiedersehen
  • Human Meier = Mensch Meier
  • now we sit quite beautiful in the ink = jetzt sitzen wir ganz schoen in der Tinte
  • you have not more all cups in the board = du hast nicht mehr alle Tassen im Schrank
  • around heavens will = um Himmels willen
  • you are heavy in order = du bist schwer in Ordnung
  • l wish you what = ich wünsch dir was
  • she had a circleroundbreakdown = sie hatte einen kreislaufzusammenbruch
  • you are a blackdriver = du bist ein schwarzfahrer
  • I know me here out = ich kenn mich hier aus
  • l fell from all clouds = Ich fiel aus allen Wolken
  • that I not laugh = das ich nicht lache
  • no one can reach me the water = niemand kann mir das wasser relchen
  • that’s absolut afterfullpullable = das ist absolut nachvollziehbar
  • give good eight = gib gut acht
  • not the yellow of the egg = nicht das gelbe vom Ei
  • come good home = komm gut heim
  • evererything in the green area = alles im gruenen bererch
  • I die for Blackforrestcherrycake = Ich sterbe fuer Schwarzwalderkirschtorte
  • how too always = wie auch immer
  • I make you ready! = Ich mach dlch fertig!
  • I laugh me death = ich lach mich tot
  • it walks me icecold the back down = es lauft mir eiskalt den rücken runter
  • always with the silence = Immer mit der Ruhe
  • that’s one-wall-free = das Ist einwandfrei
  • I’m foxdevilswild = lch bin fuchsteufelswild
  • here goes the mail off = hier geht die post ab
  • me goes a light on = mir geht ein licht auf
  • it‘s highest railway = es ist hoechste Eisenbahn
This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported.