pdf – muellis blog

Redacting PDFs to remove tracking information

Uh, it’s been a while. Let me try to pick up blogging again…

I got in touch with ISO standards, also called “norms”. In particular, the information security norms from the ISO 27000 family. I won’t talk much about them in particular but rather about their delivery and how to improve their visual appearance.

During the last three years, many things were converted to online-only formats. Conferences were held exclusively online, work was performed from the hopefully cosy places of one’s home, and trainings were given remotely, with no physical interaction whatsoever.

As part of one of those trainings, I received a set of ISO norms as a PDF file. The PDF renders nicely, despite some fonts not being embedded (looking at you, Helvetica!). One annoying problem, though, is that the document contains more ink than necessary. In particular, some tracking information is printed on the left border of each page. It’s also true for documents obtained via “Perinorm” or “Nautos” by German scholars.

[Screenshot: document with tracking information in the left margin]

I prefer real paper for reading, so I intend to print the PDF. At the same time, I want printing to be only as expensive as it needs to be. Because I will keep the printed pages safely at home, I have no need for the tracking information on the left border of each page. If nothing else, I don’t want to have to run to the printer straight away, before a colleague can fetch my printout and get me in trouble for sharing it somewhere. So I decided to save myself some toner and some electricity by removing all the information that is not part of the actual standard.

One approach is to open the document up in some PDF modification software and delete the offending objects from each page. But there are some obstacles. Firstly, the documents are “encrypted”, so my Master PDF editor complains:

[Screenshot: MasterPDF complaining about the “encryption”]

Bummer. The encryption, depending on the standard, is anything from a weak RC4 to modern AES. Ignoring the problems with cryptography in PDF, it can actually be pretty secure. So my best guess was to launch “pdfcrack”. I was prepared to wait for a few days or weeks. After all, I had waited so many years to read those documents that a few weeks more or less wouldn’t matter much to me. To my surprise, though, pdfcrack returned immediately, reporting that the password was the empty string. Well.. Okay.. Then I could launch pdfcpu decrypt and finally edit the document. Selecting and deleting the object on one page was easily done.
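Since the password turned out to be empty, a cheap first check before any brute-force run is to see whether the file simply opens without a password. The following is a minimal sketch, assuming qpdf is installed; it swaps in qpdf for the pdfcrack and pdfcpu steps described above, and the function name is made up for illustration.

```shell
# Hypothetical helper (not part of the workflow described above): try to
# write a decrypted copy without supplying any password. Assumes qpdf is
# installed. A non-zero exit status means the file could not be opened,
# for example because a real password is required.
decrypt_with_empty_password() {
    qpdf --decrypt "$1" "$2"
}
```

Calling `decrypt_with_empty_password iso.pdf iso.decrypted.pdf` (file names are placeholders) and checking the exit status tells you whether cracking is needed at all.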

The next problem, though, was that I had to deal with about 100 pages. Unfortunately, I could not find anything like a macro for MasterPDF-Editor. As in “Select this text”, “Delete”, “Scroll to next page”, “repeat”. Surely, going through each page and manually selecting the offending object and deleting it would be a waste of my time, as XKCD readers will appreciate.

Fortunately, removing content from the left border is a somewhat solved problem. I played around with pdfcrop, trying to first remove and then re-add the border. But that was all messy and didn’t work out, anyway.

I resorted to using “iText: The Leading PDF Library for Developers”:
$ cat RemoveContentInRectangle.java


package com.itextpdf.samples.sandbox.parse;

import com.itextpdf.kernel.colors.ColorConstants;
import com.itextpdf.kernel.geom.Rectangle;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfReader;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.pdfcleanup.PdfCleanUpLocation;
import com.itextpdf.pdfcleanup.PdfCleaner;

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class RemoveContentInRectangle {

    public static void main(String[] args) throws IOException {
        for (String arg : args) {
            System.out.println("Arg: " + arg);
        }
        if (args.length != 2) {
            System.err.println("Usage: RemoveContentInRectangle <input.pdf> <output.pdf>");
            System.exit(1);
        }
        new RemoveContentInRectangle().manipulatePdf(args[0], args[1]);
    }

    protected void manipulatePdf(String src, String dest) throws IOException {
        PdfDocument pdfDoc = new PdfDocument(new PdfReader(src), new PdfWriter(dest));

        List<PdfCleanUpLocation> cleanUpLocations = new ArrayList<>();

        for (int page = 1; page <= pdfDoc.getNumberOfPages(); page++) {
            // The arguments of the PdfCleanUpLocation constructor: the number of the page to be cleaned up,
            // a Rectangle (x, y, width, height) defining the area on the page we want to clean up,
            // a color which will be used while filling the cleaned area.
            PdfCleanUpLocation location = new PdfCleanUpLocation(page,
                    new Rectangle(5, 5, 15, 990), ColorConstants.WHITE);
            cleanUpLocations.add(location);
        }

        PdfCleaner.cleanUp(pdfDoc, cleanUpLocations);

        pdfDoc.close();
    }
}

I adjusted the rectangle arguments by trial and error until I had a satisfactory result. That’s a bit annoying but worked well enough for me. Another annoyance is getting hold of the dependencies, but with a bit of searching it should be possible to obtain the jar files:
$ java -cp /usr/share/java/slf4j-api.jar:itext/kernel-7.2.2.jar:itext/commons-7.2.2.jar:itext/io-7.2.2.jar:itext/layout-7.2.2.jar:itext/cleanup-3.0.0.jar RemoveContentInRectangle.java iso.pdf foo.pdf
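The rectangle values in the Java file could also be derived from a measurement instead of pure trial and error. The sketch below assumes that the `Rectangle` arguments are x, y, width and height in PDF points (1/72 inch); `mm_to_pt` is a made-up helper, and the 10 mm strip is only an example value, not a measurement of the actual documents.

```shell
# Hypothetical helper: convert millimetres to PDF points
# (1 pt = 1/72 inch, 1 inch = 25.4 mm).
# LC_ALL=C keeps the decimal separator a dot regardless of locale.
mm_to_pt() {
    LC_ALL=C awk -v mm="$1" 'BEGIN { printf "%.2f\n", mm * 72 / 25.4 }'
}

mm_to_pt 10     # width of an example 10 mm strip, in points
mm_to_pt 297    # height of an A4 page, in points
```

If the pages are A4 (an assumption), the page height comes out at roughly 842 points, so a rectangle height of 990 simply reaches beyond the top of the page.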

Before being able to process the files, I gave them a good rinse, because some files were causing trouble:

$ for pdf in ../*.PDF; do gs -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS="/ebook" -sOutputFile="$pdf.rewrite.pdf" "$pdf"; done
And finally, for good measure, try to get rid of some metadata, to make a file weigh a bit less on the hard-drive:

If you want to restore the document’s metadata because you don’t like how pdfcpu messed with it, the following could help:
$ for pdf in *.PDF; do exiftool -tagsFromFile "$pdf" "$pdf.cleaned.pdf"; done

But you could also delete all metadata (don’t use exiftool for cleaning PDF metadata without flattening the PDF afterwards).
After that treatment, I can print the PDF on the central office printer without fearing anybody taking my printout and getting me in trouble.

If you happen to have some documents that you want to free from tracking data, especially such standards, I’d be happy to assist you.

OCRing a scanned book

I had the pleasure to use a “Bookeye” book scanner. It’s a huge device which helps scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of my good 100 pages that I’ve scanned.

Unfortunately, the light was very bright, so the scanner scanned “through” the open pages, revealing the back sides of the pages. That’s not very cool, and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point in digitising this book was to actually have it searchable and copy-and-pastable.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids as it can recognise text on paper and different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences hoping that it will become useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.

cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install

cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install

cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
cat > ~/bin/leptheaders <<EOF
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
cd ocropus/
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything because I can’t make Lua load the scripts from the share/ directory of the prefix. Too sad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD-licensed Russian OCR software that was once one of the leading OCR tools.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.
bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

Apparently takes “HTML annotated OCR data” and bundles that, together with the image, to a PDF.

cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁

gscan2pdf

As a full suite, it can import pictures or PDFs and use one of the OCR programs mentioned above (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work well at all:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straightforward:
cd /tmp/
wget -O /tmp/archivista_20101218.iso 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up Firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF, probably because it’s a web-based thing and either the server rejects big uploads or the CGI just times out or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything onto a CD and packaged the software mentioned above. Unfortunately, very old versions were used, e.g. cuneiform is at version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it very useful to poke around in that thing.

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.
cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a web browser which showed a web interface to the WebOCR functionality.
I extracted the necessary scripts, which wrap tools like cuneiform, ghostscript and friends. Compared to the Archivista Box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0, which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my roughly 100-page PDF.

Some custom script

A custom script that tries to do the job did in fact work quite well, although it’s quite slow, too. It lacks proper support for spawning multiple commands in parallel.
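One way to get that kind of parallelism without restructuring much is `xargs -P`. A minimal sketch, assuming an `xargs` that supports `-P` (GNU and BSD versions do) and page file names without whitespace; `run_per_page` and the `JOBS` variable are hypothetical names and not part of pdfocr.sh.

```shell
# Hypothetical helper: read one file name per line on stdin and run the
# given command once per name, with up to $JOBS (default 4) jobs at a time.
run_per_page() {
    xargs -P "${JOBS:-4}" -n 1 "$@"
}
```

Something like `printf '%s\n' pg_*.png | run_per_page sh -c 'cuneiform -l fra -f hocr -o "$0.html" "$0"'` would then OCR several pages at once. The cuneiform flags here are written from memory, so check `cuneiform --help` before relying on them.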

After you have installed the dependencies as mentioned above, you can run the script:
wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found
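The trace shows pdfjam receiving the literal string `pg_*.png.pdf` rather than a list of files. That happens either when the pattern is quoted or when nothing matches it in the current directory. A small demonstration of the quoted versus unquoted case, using throwaway files in a temporary directory (`count_args` and the file names are made up for illustration and are not taken from pdfocr.sh):

```shell
# Hypothetical helper: print how many arguments it was called with.
count_args() { echo "$#"; }

# Throwaway directory with two stand-in page files.
demo_dir=$(mktemp -d)
touch "$demo_dir/pg_0001.png.pdf" "$demo_dir/pg_0002.png.pdf"

# Unquoted: the shell expands the pattern into the matching file names.
unquoted=$(cd "$demo_dir" && count_args pg_*.png.pdf)
# Quoted: the command receives the literal pattern as a single argument.
quoted=$(cd "$demo_dir" && count_args 'pg_*.png.pdf')

echo "unquoted: $unquoted argument(s), quoted: $quoted argument(s)"
rm -r "$demo_dir"
```

With the two stand-in files, the unquoted call sees two arguments and the quoted call sees one, which is the situation pdfjam complains about.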

Even after I had overcome that problem, the subsequent pdfjoin call failed for an unknown reason. After replacing pdfjoin manually, I realised that the script had downsampled the pages, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.

It’s a mess.

To conclude…

I still don’t have a properly OCRed version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam do their jobs very well. But it appears that they are not knit together well enough to form a useful tool to OCR a given PDF. Requirements would be, for example, that there is no loss of quality in the scanned images, that the number of programs to be called is reduced to a minimum and that everything can do batch processing. So far, I couldn’t find anything that fulfills those requirements. If you know anything or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.

BAföG, PDF and Evince – Decrypted PDF documents

In Germany, students may apply for BAföG, which basically means they receive money for their studies. In order to apply, you have to fill out lots of forms. They provide PDFs with forms that you can (at least in theory) fill out. Well, filling out with Evince works quite well, but saving doesn’t. It complains that the document is encrypted. WTF?

It’s a form provided by the government. You wouldn’t think that there is anything subject to DRM and that they would stop you from actually saving a filled document. Producing the document in the first place was paid for by us citizens, so I’d fully expect to be at least allowed to save the filled form. I don’t request the sources of that document (well, I like the idea but I probably couldn’t do anything with them anyway) but only that my government helps me fill out all those forms and that it doesn’t unnecessarily restrict me.

So I wrote those folks at the office, stating that they’ve accidentally restricted me saving the form. I received an answer quite quickly:

(Translated from German:) Unfortunately, this is not an oversight. Whether the forms can be saved is governed by a rights concept of the software vendor, according to which saving the forms is not possible free of charge once a certain number of downloads has been reached.

Various freeware programs do, however, offer the possibility to save the existing forms on your own PC. By way of example, a suitable software package is named on the website for free download.

In other words: it’s not an accident. The “program vendor’s rights management” is responsible for that. And if many people actually download the PDF file, that Digital Restrictions Management requires the office to not allow people to save the forms. Erm. Yes. I haven’t verified this, but I fully expect the authoring software “Adobe LiveCycle Designer ES 8.2” to have a very weird license that makes us citizens suffer from those stupid restrictions. This, ladies and gentlemen, is why we need Free Software. And we need governments to stop using proprietary software with such absurd licenses.

Apparently, there are a few DRM technologies within PDF. One of them is a set of stupid flags inside the document that tell you whether you are allowed to, say, print or fill forms in the document. And it was heavily discussed what to do about those, because they can be silently ignored.
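To make those flags a bit more concrete: with the standard security handler they live in a single integer, the `/P` entry of the encryption dictionary, where a set bit means the operation is permitted. Below is a minimal sketch that decodes such a value, with bit positions as given in the user access permissions table of the PDF specification (ISO 32000-1). The function name and the labels are made up, and the sample value is illustrative rather than taken from the BAföG forms.

```shell
# Hypothetical helper: list the operations a /P permission value allows.
# Each entry is "<bit value>:<label>"; bit 3 of the spec has value 4, etc.
pdf_permissions() {
    for entry in 4:print 8:modify 16:copy 32:annotate 256:fill-forms \
                 512:extract-for-accessibility 1024:assemble 2048:print-high-quality; do
        bit=${entry%%:*}     # numeric bit value before the colon
        name=${entry#*:}     # label after the colon
        if [ $(( $1 & bit )) -ne 0 ]; then
            echo "$name"
        fi
    done
}

pdf_permissions -3900    # prints "print": only bit 3 (value 4) is set
```

Nothing in the file format enforces these bits. A reader has to choose to honour them, which is why they can be silently ignored.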

Anyway, I came across Ubuntu bug 477644, which mentions QPDF, a tool to manipulate PDFs while preserving their content. So if you go and download all those PDFs with forms, and do a “qpdf --decrypt input.pdf output.pdf” on them, you can save your filled form.
pushd /tmp/
for f in 1 1_anlage_1 1_anlage_2 2 3 4 5 6 7 8; do
    wget --continue "http://www.das-neue-bafoeg.de/intern/upload/formblaetter/nbb_fbl_${f}.pdf"
    qpdf --decrypt "/tmp/nbb_fbl_${f}.pdf" "/tmp/nbb_fbl_${f}_decrypted.pdf"
done
popd

I’ve prepared that and you can download the fillable and savable decrypted BAfoeG Forms from here:

Hope you can use it.