OCRing a scanned book

I had the pleasure of using a “Bookeye” book scanner. It’s a huge device that helps with scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of the good 100 pages that I scanned.

Unfortunately, the light was very bright, so the scanner scanned “through” the open pages and revealed their back sides. That’s not very cool, and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point in digitising this book was to actually have it searchable and copy-and-pastable.

There seem to be multiple options to do OCR on images:


covered already


Apparently OCRopus is supposed to be tesseract on steroids, as it can recognise text on paper and different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences, hoping that they will be useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.

cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install

cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install

cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
cd ocropus/
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
# Work around it with a tiny helper script that prints the Leptonica include directory.
cat > ~/bin/leptheaders <<'EOF'
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything with it, because I can’t make Lua load the scripts from the share/ directory of the prefix. Too bad. It looked very promising.


This is an interesting thing. It’s a BSD-licensed Russian OCR software that was once one of the leading tools to do OCR.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.

bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.


Apparently, it takes “HTML annotated OCR data” and bundles that, together with the image, into a PDF.

cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it :-(


As a full suite, gscan2pdf can import pictures or PDFs and use one of the OCR programs mentioned above (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work well at all:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform :-(

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straightforward:

cd /tmp/
wget 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots into Firefox with a nice web UI which seems to be quite functional. However, I can’t upload my >100MB PDF, probably because it’s a web-based thing and either the server rejects big uploads, or the CGI just times out, or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation :-( But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything onto a CD and packaged the software mentioned above. Unfortunately, very old versions were used, i.e. cuneiform is at version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it very useful to poke around in that thing.
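The version mismatch from the dmesg output above can also be spotted without mounting anything, by reading the version out of the image’s superblock. The following is only a sketch: the function name squashfs_version is made up, and the field offsets (two little-endian 16-bit values at bytes 28 and 30) are an assumption about the SquashFS superblock layout, not something taken from the Archivista image.

```shell
# Hypothetical helper: print "MAJOR.MINOR" of a little-endian SquashFS image.
# Assumption: s_major and s_minor are 16-bit fields at byte offsets 28 and 30.
squashfs_version() {
    # Read the four bytes as unsigned decimals, independent of host endianness.
    set -- $(od -An -tu1 -j28 -N4 "$1")
    [ "$#" -eq 4 ] || return 1
    echo "$(( $1 + 256 * $2 )).$(( $3 + 256 * $4 ))"
}

# Example:
#   squashfs_version ~/empty/live.squash
```

If the assumption holds, the image above would report the same 3.0 that file shows, which is exactly what a kernel with only SquashFS 4.0 support refuses to mount.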

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.


For some reason, the WatchOCR folks built an ISO image as well, probably to run it as an appliance.

cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a web browser which showed a web interface to the WebOCR functionality.
I extracted the necessary scripts, which wrap tools like cuneiform, ghostscript and friends. Compared to the Archivista box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0, which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my PDF of a good 100 pages.

Some custom script

This one tries to do the job and in fact did it quite well, although it’s quite slow too. It lacks proper support for spawning multiple commands in parallel.
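The missing parallelism could be approximated from the outside with xargs -P, which both GNU and BSD xargs provide. This is a minimal sketch and not part of pdfocr.sh: the function name run_parallel, the ocr_page command and the pg_*.png naming are assumptions for illustration.

```shell
# Hypothetical helper: run one command per file, several at a time.
# usage: run_parallel JOBS COMMAND FILE...
run_parallel() {
    njobs=$1
    cmd=$2
    shift 2
    # Nothing to do without files.
    [ "$#" -gt 0 ] || return 0
    # One file name per NUL-terminated record, so spaces in names survive.
    printf '%s\0' "$@" | xargs -0 -n 1 -P "$njobs" "$cmd"
}

# Example (assumes ocr_page is an executable on PATH that takes one image):
#   run_parallel 4 ocr_page pg_*.png
```

Each page is independent of the others during OCR, so handing one image to each worker is the simplest split; joining the results still has to happen afterwards, in page order.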

After you have installed the dependencies as mentioned above, you can run the script:

wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found
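
The error is consistent with the glob reaching pdfjoin inside single quotes, so that the shell never expands it. Whether that is the actual bug in pdfocr.sh is an assumption; the sketch below only shows the difference, using a made-up count_args helper and throwaway files.

```shell
# Hypothetical helper: print how many arguments it received.
count_args() { echo "$#"; }

demo=$(mktemp -d)
cd "$demo" || exit 1
: > pg_1.png.pdf
: > pg_2.png.pdf

quoted=$(count_args 'pg_*.png.pdf')   # the literal string: one argument
unquoted=$(count_args pg_*.png.pdf)   # expanded by the shell: one per file
echo "quoted: $quoted, unquoted: $unquoted"
```

With the quotes, the program receives the pattern itself as a file name, which matches the “pg_*.png.pdf not found” message.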

Once I had overcome that problem, the next pdfjoin call failed for an unknown reason. After replacing pdfjoin manually, I realised that the script had downsampled the pages, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.

It’s a mess.

To conclude…

I still don’t have a properly OCRed version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam each do their job very well. But it appears that they are not knit together well enough to form a useful tool for OCRing a given PDF. Requirements would be, for example, that there is no loss of quality in the scanned images, that the number of programs to be called is reduced to a minimum, and that everything is able to do batch processing. So far, I couldn’t find anything that fulfils those requirements. If you know of anything, or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.
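
To make the requirements a bit more concrete, here is a rough sketch of what such glue might look like. It is not a finished tool: the function name ocr_pdf is made up, it assumes a pdfimages that supports -png, a cuneiform build that can read PNG files, and a scan with exactly one image per page, and it does not check whether hocr2pdf re-encodes the page image.

```shell
# Hypothetical glue: OCR every page image of a scanned PDF into a searchable PDF.
# usage: ocr_pdf INPUT.pdf OUTPUT.pdf [LANGUAGE]
ocr_pdf() {
    src=$1
    dst=$2
    lang=${3:-eng}
    work=$(mktemp -d) || return 1

    # Extract the embedded page images as PNG, i.e. without lossy recompression.
    pdfimages -png "$src" "$work/pg" || return 1

    for img in "$work"/pg-*.png; do
        [ -e "$img" ] || continue
        # Recognise the text and keep its position on the page (hOCR).
        cuneiform -l "$lang" -f hocr -o "$img.html" "$img" || return 1
        # Combine the recognised text with the page image into a one-page PDF.
        hocr2pdf -i "$img" -o "$img.pdf" < "$img.html" || return 1
    done

    # Join the single-page PDFs in page order and clean up on success.
    pdftk "$work"/pg-*.png.pdf cat output "$dst" && rm -rf "$work"
}

# Example:
#   ocr_pdf buch-test-1.pdf buch-test-1.ocr.pdf fra
```

It calls four programs and handles a whole PDF in one go, but everything runs sequentially and nothing here removes the see-through back sides of the pages.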

BAföG, PDF and Evince – Decrypted PDF documents

In Germany, students may apply for BAföG, which basically means they receive money for their studies. In order to apply, you have to fill out lots of forms. They provide PDFs with forms that you can (at least in theory) fill out. Well, filling them out with Evince works quite well, but saving doesn’t. It complains that the document is encrypted. WTF?

It’s a form provided by the government. You wouldn’t think that there is anything subject to DRM and that they stop you from actually saving a filled document. Producing the document in the first place was paid for by us citizens, so I’d fully expect to be at least allowed to save the filled form. I don’t request the sources of that document (well, I like the idea but I probably couldn’t do anything with them anyway), only that my government helps me fill out all those forms and doesn’t unnecessarily restrict me.

So I wrote to those folks at the office, stating that they had accidentally restricted me from saving the form. I received an answer quite quickly:

leider handelt es sich hier nicht um ein Versehen. Die Speicherbarkeit der Formulare unterliegt einem Rechtekonzept des Programm-Herstellers, nach welchem ab einer gewissen Abrufzahl das Abspeichern der Formulare nicht kostenfrei möglich ist.

Unterschiedliche Freewares bieten jedoch die Möglichkeit, die vorhandenen Formblätter auf dem eigenen PC abzuspeichern. Beispielhaft wird Ihnen auf dem Internet-Auftritt hierzu ein entsprechendes Softwarepaket zum kostenfreien Download genannt

Sorry for the German. The translation is roughly: “Unfortunately, this is not an accident. Whether the forms can be saved is governed by the program vendor’s rights concept, under which saving the forms is no longer possible free of charge once a certain number of downloads is reached. Various freeware tools, however, let you save the forms on your own PC; the website names one such software package, available as a free download, as an example.” So if many people actually download the PDF file, that Digital Restrictions Management requires the office to not allow people to save the forms. Erm. Yes. I haven’t verified this, but I fully expect the authoring software “Adobe LiveCycle Designer ES 8.2” to have a very weird license that makes us citizens suffer from those stupid restrictions. This, ladies and gentlemen, is why we need Free Software. And we need governments to stop using proprietary software with such absurd licenses.

Apparently, there are a few DRM technologies within PDF. One of them is a set of stupid flags inside the document that tell the viewer whether you are allowed to, say, print or fill in forms in the document. And it was heavily discussed what to do about those, because they can be silently ignored.
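
To make those flags a bit more concrete: in PDF’s standard security handler, the permissions live in a single integer, the /P entry of the encryption dictionary, where each bit grants one right. The sketch below decodes such a value. The function name pdf_perms and the sample values are illustrative, and the bit positions are quoted from the PDF 1.7 specification rather than read out of the BAföG forms.

```shell
# Hypothetical helper: decode a PDF /P permission integer.
# Bit numbers are 1-based as in the PDF specification; a set bit means "allowed".
pdf_perms() {
    p=$1
    for entry in 3:print 4:modify 5:copy 6:annotate 9:fill-forms; do
        bit=${entry%%:*}
        name=${entry#*:}
        if [ $(( p & (1 << (bit - 1)) )) -ne 0 ]; then
            echo "$name: allowed"
        else
            echo "$name: denied"
        fi
    done
}

pdf_perms -3904   # every listed right denied
pdf_perms -4      # every listed right allowed
```

Nothing in the file enforces these bits; it is up to the viewer to honour them, which is why ignoring them is possible at all.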

Anyway, I came across Ubuntu bug 477644, which mentions QPDF, a tool to manipulate PDFs while preserving their content. So if you go and download all those PDFs with forms, and run “qpdf --decrypt input.pdf output.pdf” on them, you can save your filled form.

pushd /tmp/
# Download each form and write a decrypted copy next to it.
for f in 1 1_anlage_1 1_anlage_2 2 3 4 5 6 7 8; do
    wget --continue "http://www.das-neue-bafoeg.de/intern/upload/formblaetter/nbb_fbl_${f}.pdf"
    qpdf --decrypt "/tmp/nbb_fbl_${f}.pdf" "/tmp/nbb_fbl_${f}_decrypted.pdf"
done
popd

I’ve prepared that and you can download the fillable and savable decrypted BAfoeG Forms from here:

Hope you can use it.