Perfectly scale an image to the rest of a page with LaTeX

I had the following problem for a long time: I wanted to embed a picture into a page and automatically have it scaled to the maximum size that possibly fits the page, but not more. Obviously, simply doing a

\includegraphics[width=\textwidth]{myimage}

wouldn’t do the job, because if the image is taller than it is wide, it would grow beyond the page. One could use the information from the \textheight register, e.g. like

\includegraphics[width=\textwidth,height=\textheight,keepaspectratio=true]{myimage}

But that doesn’t take already existing text into account, i.e. some description above the image that you definitely want to have on the same page.

So Simon cooked up a macro that would allow me to do exactly what I wanted by creating a new box, getting its height and subtracting that from \textheight. Lovely. Here’s the code:

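% Needs the calc package (\settototalheight and the subtraction in the
% height key) and graphicx (\includegraphics).
\newlength{\textundbildtextheight}
 
% #1: text above the image, #2: image file
\newcommand{\textundbild}[2]{%
% Measure the total height of the text when typeset
\settototalheight\textundbildtextheight{\vbox{#1}}%
#1
\vfill
\begin{center}
% Full text width, but no taller than what is left of the page
\includegraphics[width=\textwidth,keepaspectratio=true,height=\textheight-\the\textundbildtextheight]{#2}
\end{center}
\vfill
}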

I’m sure it’s not very correct and it’s possible to make it not work properly, but it does the job very well for me as you can see on the following rendered pages:
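
The arithmetic behind the macro can be sketched in a few lines of Python 3. This is only an illustration of what keepaspectratio plus the reduced height amount to, not what LaTeX does internally. The function names and all numbers are made up for the example.

```python
def remaining_height(text_height, used_height):
    """Height left on the page below the preceding text (never negative)."""
    return max(text_height - used_height, 0.0)


def fit_size(img_w, img_h, avail_w, avail_h):
    """Scale (img_w, img_h) to the largest size that fits into
    (avail_w, avail_h) without changing the aspect ratio."""
    scale = min(avail_w / img_w, avail_h / img_h)
    return img_w * scale, img_h * scale


# Hypothetical numbers in points: a 400x600 pt portrait image, a text block
# of 345x550 pt, and 150 pt already used by the description above the image.
avail_h = remaining_height(550.0, 150.0)
print(fit_size(400.0, 600.0, 345.0, avail_h))
```

For a portrait image the height is the limiting dimension, so the image ends up narrower than \textwidth. For a landscape image the width limits it and the \vfill commands absorb the leftover space.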


DIN A4 Page
DIN A5 Page
DIN A6 Page

And well, the content of the image is a bit ugly, too, but if you know a nice bullshit bingo generator, let me know.

RFID Workshop at CampusGruen’s Datenschutzkongress

I was asked to give a workshop about RFID for the CampusGruen Datenschutzkongress in Hamburg. So I did 🙂

I used the opportunity to introduce the audience to the basics of RFID, i.e. what technologies exist and what they are used for. Also, I took arguments from pro and anti RFID groups to have them discussed.

You can have a look at the slides, although I doubt that they make much sense without actually having heard what was said. We spent a good two hours talking about and discussing my twenty-something slides. Thanks again to the interested audience.

Afterwards, we had a small hacking session. I brought some RFID readers, tags, a passport, etc. and we used all that to play around. We also scanned some wallets to find out whether anybody had unwanted chips in their wallet.

GNOME 3 Launch Party in Hamburg

For the new GNOME-3 love we will have a release party in Hamburg, just like many other places all over Germany and the whole world!

If you want to join the fun, be at the Attraktor, the local hackerspace. The address is Mexikoring 21, 22999 Hamburg, Germany, Europe, Earth, Solar System. Find more detailed instructions on how to get there here. The party starts on Friday, 2011-04-08, at 18:00 and is open-ended.

We have a page in the local wiki to describe the event and further planning will take place there: http://wiki.attraktor.org/Termin:GNOME-3-Launch-Party. As for the program: we intend to have a small introductory talk to show off what new user experience GNOME-3 will bring to the people. Afterwards, we will distribute GNOME-3 images to be put on pendrives so that you can try GNOME-3. Finally, we’ll sit around, have some beers and snacks and discuss the new and shiny GNOME 🙂

Besides the GNOME-3 images, we’ll have GNOME-3 goodies to give away! Thanks a lot to the GNOME Foundation for making that possible! So show up early to claim your goodies!

So I expect you to be there 🙂

Sifting through a lot of similar photos

To keep the amount of photos in my photo library sane, I had to sift through many pictures and get rid of redundant ones. I defined redundancy as many pictures taken at the same time. Thus I had to pick one of the redundant pictures and delete the other ones.

My strategy so far was to use Nautilus and Eye of GNOME to spot pictures of the same group and delete all but the best one.

I realised that photos usually show the same picture if they were shot at the same time, i.e. many quick shots one after another. I also realised that usually the best photograph was the biggest one in terms of bytes in JPEG format.

To automate the whole selection and deletion process, I hacked together a tiny script that stupidly groups files in a directory according to their mtime and deletes all but the biggest one.

Before deletion, it will show the pictures with eog and ask whether or not to delete the other pictures.

It worked quite well and helped to quickly weed out 15% of my pictures 🙂

I played around with another method: Getting the difference of the histograms of the images, to compare the similarity. But as the pictures were shot with a different exposure, the histograms were quite different, too. Hence that didn’t work out very well. But I’ll leave it in, just for reference.
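
To illustrate why that comparison is sensitive to exposure, here is a minimal Python 3 sketch of the root-mean-square difference between two histograms. It uses toy 4-bin lists instead of real PIL histograms so that it runs without any image files. The function name and the numbers are made up for the example.

```python
import math


def histogram_rms(hist_a, hist_b):
    """Root-mean-square difference of two equally long histograms."""
    if len(hist_a) != len(hist_b):
        raise ValueError("histograms must have the same number of bins")
    squared = sum((a - b) ** 2 for a, b in zip(hist_a, hist_b))
    return math.sqrt(squared / len(hist_a))


# Toy histograms standing in for PIL.Image.open(path).histogram().
normal = [10, 20, 30, 40]
same = [10, 20, 30, 40]
brighter = [0, 10, 20, 70]  # same scene, counts shifted towards bright bins

print(histogram_rms(normal, same))      # identical histograms
print(histogram_rms(normal, brighter))  # same scene, different exposure
```

The second value is clearly non-zero although both toy histograms are meant to describe the same scene, which is the effect described above.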

So if you happen to have a similar problem, feel free to grab the following script 🙂

#!/usr/bin/env python
 
import collections
import math
import os
from os.path import join, getsize, getmtime
import operator
import subprocess
import sys
 
 
 
 
subprocess.Popen.__enter__ = lambda self: self
subprocess.Popen.__exit__ = lambda self, type, value, traceback: self.kill()
 
directory = '.'
THRESHOLD = 3
GET_RMS = False
 
mtimes = collections.defaultdict(list)
 
def get_picgroups_by_time(directory='.'):
 
	for root, dirs, files in os.walk(directory):
		for name in files:
			fname = join(root, name)
			mtime = getmtime(fname)
			mtimes[mtime].append(fname)
 
	# It's gotten a bit messy, but OrderedDict is only available from Python 2.7/3.1 on, hence this manually built sorted list.
	picgroups = [v for (k, v) in sorted([(k, v) for k, v in mtimes.iteritems() if len(v) >= THRESHOLD])]
 
	return picgroups
 
def get_picgroups(directory='.'):
	return get_picgroups_by_time(directory)
 
picgroups = get_picgroups(directory)
 
print 'Got %d groups' % len(picgroups)
 
def get_max_and_picgroups(picgroups):
	for picgroup in picgroups:
		max_of_group = max(picgroup, key=lambda x: getsize(x))
		print picgroup
		print 'max: %s: %d' % (max_of_group, getsize(max_of_group))
 
		if GET_RMS:
			import PIL.Image
			last_pic = picgroup[0]
			for pic in picgroup[1:]:
				image1 = PIL.Image.open(last_pic).histogram()
				image2 = PIL.Image.open(pic).histogram()
 
				rms = math.sqrt(reduce(operator.add, map(lambda a,b: (a-b)**2, image1, image2))/len(image1))
 
				print 'RMS %s %s: %s' % (last_pic, pic, rms)
 
				last_pic = pic
		yield (max_of_group, picgroup)
 
 
max_and_picgroups = get_max_and_picgroups(picgroups)
 
 
def decide(prompt, decisions):
	import termios, fcntl, sys, os, select
 
	fd = sys.stdin.fileno()
 
	oldterm = termios.tcgetattr(fd)
	newattr = oldterm[:]
	newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
	termios.tcsetattr(fd, termios.TCSANOW, newattr)
 
	oldflags = fcntl.fcntl(fd, fcntl.F_GETFL)
	fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)
 
	print prompt
 
	decided = None
	try:
		while not decided:
			r, w, e = select.select([fd], [], [])
			if r:
				c = sys.stdin.read(1)
				print "Got character", repr(c)
				decision_made = decisions.get(c, None)
				if decision_made:
					decision_made()
					decided = True
 
	finally:
	    termios.tcsetattr(fd, termios.TCSAFLUSH, oldterm)
	    fcntl.fcntl(fd, fcntl.F_SETFL, oldflags)
 
for max_of_group, picgroup in max_and_picgroups:
	cmd = ['eog', '-n'] + picgroup
	print 'Showing %s' % ', '.join(picgroup)
 
	def delete_others():
		to_delete = picgroup[:]
		to_delete.remove(max_of_group)
		print 'deleting %s' % ', '.join (to_delete)
		[os.unlink(f) for f in to_delete]
 
	with subprocess.Popen(cmd) as p:
		decide('%s is max, delete others?' % max_of_group, {'y': delete_others, 'n': lambda: ''})
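
The script above is written for Python 2. As a rough sketch, the core selection logic (group by mtime, keep the biggest file) could look like this in Python 3. It works on plain (path, mtime, size) records rather than on the file system, so nothing gets deleted. The function names and the sample records are invented for the illustration.

```python
import collections


def group_by_mtime(files, threshold=3):
    """Group (path, mtime, size) records by mtime.

    Only groups with at least `threshold` members are returned,
    ordered by mtime.
    """
    groups = collections.defaultdict(list)
    for path, mtime, size in files:
        groups[mtime].append((path, size))
    return [groups[m] for m in sorted(groups) if len(groups[m]) >= threshold]


def split_keep_delete(group):
    """Return (path to keep, paths to delete): keep the largest file."""
    keep = max(group, key=lambda item: item[1])[0]
    return keep, [path for path, _ in group if path != keep]


# Made-up records; in real use they would come from os.walk(),
# os.path.getmtime() and os.path.getsize().
records = [
    ("a.jpg", 100, 2000), ("b.jpg", 100, 3500), ("c.jpg", 100, 1800),
    ("d.jpg", 200, 900),
]
for group in group_by_mtime(records):
    print(split_keep_delete(group))
```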

GNOME @ FOSDEM 2011

I am very excited about having attended this year’s FOSDEM. Unfortunately, times were a bit busy so I am a bit late reporting about it, but I still want to state a couple of things.

I'm going to FOSDEM, the Free and Open Source Software Developers' European Meeting (I wonder how that image will look in 2012 😉 )

First of all, I am very happy that our GNOME booth went very well. Thanks to Frederic Peters and Frederic Crozat for manning the booth almost all the time. I tried to organise everything remotely and I’d say I partly succeeded. We got stickers, t-shirts and staff for the booth. We lacked presentation material and instructions for the booth though. But it still worked out quite well. Next time, I’d try to communicate more clearly who is doing what to prevent duplicate work and ensure that people know who is responsible for what.

Secondly, I’d like to thank Canonical for their generosity in sponsoring a GNOME Event Box. After the original one went missing, Canonical put together a PC, a projector, a monitor and lots of other stuff for us to be able to show off GNOME-3. The old Box, however, turns out to be back again *yay*!

Sadly, we will not represent GNOME at the upcoming CeBIT. But we will be at LinuxTag, at the latest.

Anyway, during FOSDEM, we got a lot of questions about GNOME 3 and Ubuntu, e.g. whether it will be easily possible to run GNOME 3 on Ubuntu. I hope we can make it possible to have a smooth transition from Unity to GNOME Shell. Interestingly enough, there isn’t a gnome-shell package in the official natty repositories yet 🙁

It was especially nice to see and talk to old GNOME farts. And I enjoyed socialising with all the other GNOME and non-GNOME people as well. Sadly, I didn’t like the GNOME Beer Event very much because it was very hot in the bar so I left very quickly.

So FOSDEM was a success for GNOME I’d say. Let’s hope that future events will work at least as well and that we’ll have a strong GNOME representation even after the GNOME 3 release.

DFN Workshop 2011

I had the opportunity to attend the 18th DFN Workshop (I wonder what that link will look like next year) and since it’s a great event I don’t want you to miss out. Hence I’ll try to sum up the talks and the happenings.

It was the second year for the conference to take place in Hotel Grand Elysee in Hamburg, Germany. I was unable to attend last year, so I didn’t know the venue. But I am impressed. It is very spacious, friendly and well maintained. The technical equipment seems to be great and everything worked really well. I am not too sure whether this is the work of the Hotel or the Linux Magazin though.

After a welcome reception which provided a stock of caffeine that should last all day long, the first talk was given by Dirk Kollberg from Sophos. Actually his boss was supposed to give the talk but cancelled it on short notice so he had to jump in. He basically talked about Scareware and that it was a big business.

He claimed that it used to be cyber graffiti but nowadays it turned into cyber war and Stuxnet would be a good indicator for that. The newest trend, he said, was that a binary would not only be compressed or encrypted by a packer, but that the packer itself used special techniques like OpenGL functions. That was a problem for simulators which were commonly used in Antivirus products.

He investigated a big Ukrainian company (Innovative Marketing) that produced a lot of scareware and was in fact very well organised. But apparently not from a security point of view because he claimed to have retrieved a lot of information via unauthenticated HTTP. And I mean a lot. From the company’s employee address book, over ERM diagrams of internal databases to holiday pictures of the employees. Almost unbelievable. He also discovered a server that malware was distributed from and was able to retrieve the statistics page which showed how much traffic the page made and which clients with which IPs were connecting. He claimed to have periodically scraped the page to then compile a map with IPs per country. The animation was shown for about 90 scraped days. I was really wondering why he didn’t contact the ISP to shut that thing down. So I asked during Q&A and he answered that it would have been bad for Sophos because they wouldn’t have been able to gain more insight. That is obviously very selfish: instead of doing good for the whole Internet community, they only care about themselves.

The presentation style was a bit weird indeed. He showed and commented on a pre-made video which lasted for 30 minutes of his 50 minutes of presentation time. I found that rather bold. What’s next? A pre-spoken video which he’ll just play while standing on the stage? Really sad. But the worst part was when he showed private photos of the guy from that Ukrainian company which he had found by accident. I also told him that I found it disgusting that he pilloried that guy in public and showed off his private life. The people in the audience applauded.

A coffee break made us calm down.

The second talk about Smart Grid was done by Klaus Mueller. Apparently Smart Grids are supposed to be the new big thing in urban power networks. It’s supposed to be a power *and* communications network and the household or every device in it would be able to communicate, i.e. to tell or adapt its power consumption.

He depicted several attack scenarios and drew multiple catastrophic scenarios, i.e. what happens if that Smart Grid system was remotely controllable (which it is by design) and also remotely exploitable so that you could turn off power supply for a home or a house?
The heart of the Smart Grid system seemed to be so called Smart Meters which would ultimately replace traditional, mechanical power consumption measuring devices. These Smart Meters would of course be designed to be remotely controllable because you will have an electrified car which you only want to be charged when the power is at its cheapest price, i.e. in the night. Hence, the power supplier would need to tell you when to turn the car charging, dish or clothes washing machine on.

Very scary if you ask me. And even worse: Apparently you can already get Smart Meters right now! For some weird reason, he didn’t look into them. I would have thought that if he was interested in that, he would buy such a device and open it. He didn’t even have a good excuse, i.e. no time or legal reasons. He gave a talk about attack scenarios on a system which is already partly deployed but without actually having a look at the rolled out thing. That’s funny…

The next guy talked about Smart Grids as well, but this time more from a privacy point of view. I was not really convinced, though. He proposed a scheme to anonymously submit power consumption data. The problem was that the Smart Meter submitted power consumption data *very* regularly, i.e. every 15 minutes, and that the power supplier must not know exactly how much power was consumed in each and every interval. I follow and highly appreciate that. After all, you can tell exactly when somebody comes back home, turns the TV on, puts something in the fridge, makes food, turns the computer on and off and goes to bed. Such profiles are dangerous, albeit very useful for the supplier. Anyway, he settled on submitting aggregated usage data to the supplier and came up with self-made protocols instead of looking into the huge pool of cryptographic protocols which were designed for anonymous or pseudonymous communication. During Q&A I told him that I had the impression that the proposed protocols and the crypto had been designed on a Sunday evening in front of the telly, and asked whether he had actually had a look at any well reviewed cryptographic protocols. He hadn’t. Not at all. Instead he pulled some random protocols out of his hat which he thought were sufficient. But of course they were not, which became clear during the Q&A. How can you submit a talk about privacy and propose a protocol without actually looking at existing crypto protocols beforehand?! Weird dude.

The second-to-last man talking to the crowd was a bit off, too. He had interesting ideas though and I think he was technically competent. But he first talked about home routers being liable to get hacked and become part of a botnet, then switched to PCs behind the router becoming part of a botnet, and then talked about installing an IDS on every home router which not only tells the ISP about potential intrusions but is also controllable by the ISP, i.e. “you look like you’re infected with a bot, let’s throttle your bandwidth”. I didn’t really get the connection between those topics.

But both ideas are a bit weird anyway: Firstly, your ISP will see the exact traffic it’s routing to you anyway. Hence there is no need to install an IDS on your home router because the ISP will have the information already. Plus their IDS will be much more reliable than some crap IDS that will be deployed on a crap Linux which will run on crappy hardware. Secondly, having an ISP which is able to control your home router to shape, shut down or otherwise influence your traffic is really off the wall. At least it is today. If he assumes the home router and the PCs behind it to be vulnerable, he can’t trust the home router to deliver proper IDS results anyway. Why would we want the ISP then to act upon that potentially malicious data coming from a potentially compromised home router? And well, at least in the paper he submitted he tried to do an authenticated boot (in userspace?!) so that no hacked firmware could be booted, but that would require the software in the firmware to be secure in the first place, otherwise the brilliantly booted device would be hacked during runtime as per the first assumption.

But I was so confused about him talking about different things that the best question I could have asked would have been what he was talking about.

Finally somebody with practical experience talked: Stefan Metzger presented how they do things at Leibniz Rechenzentrum. He showed us their formal steps and how they were implemented. At the heart of their system was OSSIM which aggregated several IDSs and provided a neat interface to search and filter. It wasn’t all too interesting though, mainly because he talked very sleepily.

The day ended with a lot of food, beer and interesting conversations 🙂

The next day started with Joerg Voelker talking about iPhone security. Being interested in mobile security myself, I really looked forward to that talk. However, I was really disappointed. He showed what more or less cool stuff he could do with his phone, e.g. setting an alarm or reading email… Since it was so cool, everybody had it. Also, he told us what important data was on such a phone. After he had built up his motivation, which took very long and involved many pictures of supposedly cool applications, he showed us which security features the iPhone allegedly had, e.g. Code Signing, Hardware and File encryption or a Sandbox for the processes. He read the list without indicating any problems with those technologies, but he eventually said that pretty much everything was broken. It appears that you can jailbreak the thing to make it run unsigned binaries, get a dump of the disk with dd without having to provide the encryption key, or use other methods that render the protection mechanisms useless. But he suffered from massive cognitive dissonance because he kept praising the iPhone and how cool it was.
When he mentioned the sandbox, I got suspicious, because I’ve never heard of such a thing on the iPhone. So I asked him whether he could provide details on that. But he couldn’t. It appears that it’s a policy thing and that your application can very well read and write data outside of the directory it is supposed to stay in. Apple just rejects applications when they see them accessing files they shouldn’t.
Also I asked him which protection mechanisms on the iPhone that were shipped by Apple do actually work. He claimed that with the exception of the File encryption, none was working. I told him that the File encryption is proprietary code and that it appears to be a designed User Experience that the user does not need to provide a password for syncing files, hence a master key would decrypt files while syncing.

That leaves me with the impression that an enthusiastic Apple fanboy needed to justify his iPhone usage (hey, it’s cool) without actually having had a deeper look at how stuff works.

A refreshing talk was given by Liebchen on Physical Security. He presented ways and methods to get into buildings using very simple tools. He is part of the Redteam Pentesting team and apparently was ordered to break into buildings in order to get hold of machines, data or the network. He told funny stories about how they broke in. Their tools included a “Keilformgleiter“, “Tuerfallennadeln” or “Tuerklinkenangel“.
Once you’re in, you might encounter glass offices which have the advantage that, since passwords are commonly written on PostIts and stuck to the monitor, you can snoop the passwords by using a big lens!

Peter Sakal presented a so called “Rapid in-Depth Security Framework” which he developed (or so). He gave an introduction to secure software development and what steps to take in order to have a reasonably secure product. But all of that was very high level and wasn’t really useful in real life. I think his main point was that he had classified around 300 fuzzers and if you needed one, you could call him and ask him. I expected way more, because he teased us with a framework and introduced the whole fuzzing thing, but didn’t actually deliver any framework. I really wonder how the term “framework” even made it into the title of his talk. Poor guy. He also presented softscheck.com on every slide which now makes a good entry in my AdBlock list…

Fortunately, Christoph Wegener was a good speaker. He talked about “Cloud Security 2.0” and started off with an introduction to Cloud Computing. He claimed that several different types exist, e.g. “Infrastructure as a Service” (IaaS), such as EC2 or Dropbox, “Platform as a Service” (PaaS), such as AppEngine, or “Software as a Service” (SaaS), such as GMail or Twitter. He drew several attack scenarios and kept claiming that you needed to trust the provider if you wanted to do serious stuff. Hence, that was the unspoken conclusion, you must not use Cloud Services.

Lastly, Sven Gabriel gave a presentation about Grid Security. Apparently, he supervises boatloads of nodes in a grid and showed how he and his team manage to do so. Since I don’t operate 200k nodes myself, I didn’t think it was relevant, although it was interesting.

To conclude the DFN Workshop: It’s a nice conference with a lot of nice people but it needs to improve content-wise.

OCRing a scanned book

I had the pleasure to use a “Bookeye” book scanner. It’s a huge device which helps scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of my good 100 pages that I’ve scanned.

Unfortunately the light was very bright and so the scanner scanned “through” the open pages, revealing the back sides of the pages. That’s not very cool and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point in digitising this book was to actually have it searchable and copy&pastable.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids as it can recognise text on paper and different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences hoping that it will become useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.


cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install


cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install


cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
cd ocropus/
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
cat > ~/bin/leptheaders <<EOF
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything because I can’t make Lua load the scripts from the share/ directory of the prefix. Too sad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD licensed Russian OCR software that was once one of the leading tools to do OCR.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.

bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

Apparently takes “HTML annotated OCR data” and bundles that, together with the image, to a PDF.


cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁
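
To show what that position information looks like, here is a minimal Python 3 sketch that pulls word bounding boxes out of hOCR markup. hOCR is plain HTML in which elements carry a class such as ocr_line or ocrx_word and a title attribute containing "bbox x0 y0 x1 y1". The sample markup is hand-written for this example and is not the output of any of the tools above. Which classes a given engine actually emits is an assumption to check against its real output, hence the word_class parameter.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract 'bbox x0 y0 x1 y1' from an hOCR title attribute."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWords(HTMLParser):
    """Collect (text, bbox) pairs from elements of one hOCR class."""

    def __init__(self, word_class="ocrx_word"):
        super().__init__()
        self.word_class = word_class
        self.words = []
        self._bbox = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if self.word_class in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        # The first non-empty text after a matching start tag is the word.
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        # Forget a pending box so that it is not attached to later text.
        self._bbox = None


# Hand-written sample markup (hypothetical page and coordinates).
sample = """
<div class='ocr_page' title='bbox 0 0 2480 3508'>
 <span class='ocr_line' title='bbox 300 400 1200 460'>
  <span class='ocrx_word' title='bbox 300 400 520 460'>Bonjour</span>
  <span class='ocrx_word' title='bbox 540 400 760 460; x_wconf 93'>monde</span>
 </span>
</div>
"""
parser = HocrWords()
parser.feed(sample)
for text, bbox in parser.words:
    print(text, bbox)
```

A tool like hocr2pdf needs exactly this kind of information to place the invisible text layer on top of the scanned image.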

gscan2pdf

As a full suite, it can import pictures or PDFs and use an OCR program mentioned above (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work very well:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straightforward:

cd /tmp/
wget 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up a firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF probably because it’s a web based thing and either the server rejects big uploads or the CGI just times out or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything to a CD and packaged the software mentioned above. Unfortunately, very old versions were used, e.g. cuneiform is in version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it to be very useful to poke around that thing.

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.

cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a webbrowser which showed a webinterface to the WebOCR functionality.
I extracted the necessary scripts which wrap tools like cuneiform, ghostscript and friends. Compared to the archivista box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0 which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my good 100 pages PDF.

Some custom script

A custom script that tries to do the job did in fact do quite well, although it’s quite slow as well. It lacks proper support for spawning multiple commands in parallel.
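
Running per-page jobs in parallel is not much code. Here is a minimal Python 3 sketch using a thread pool around subprocess. It is not taken from the script in question. The helper name run_all and the page names are invented, and the "jobs" are tiny stand-in Python processes rather than real OCR calls.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_all(commands, workers=4):
    """Run each command (an argv list) and return the exit codes in order."""
    def run(cmd):
        return subprocess.run(cmd, stdout=subprocess.DEVNULL).returncode

    # Threads are enough here: each worker just waits for its child process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run, commands))


# Stand-in commands: one trivial process per (hypothetical) page image.
pages = ["pg_0001.png", "pg_0002.png", "pg_0003.png"]
commands = [[sys.executable, "-c", "import sys; sys.exit(0)", page]
            for page in pages]
print(run_all(commands))
```

Since the pages are independent of each other, the only shared state is the list of exit codes, which pool.map returns in input order.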

After you have installed the dependencies as mentioned above, you can run it:

wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found

Having overcome that problem, the following pdfjoin doesn’t work for an unknown reason. After having replaced pdfjoin manually, I realised that the script sampled the pages down, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.
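
The log above suggests that the pattern reached pdfjoin literally instead of being expanded by the shell. One way around that class of problem, sketched here in Python 3 under that assumption, is to expand the pattern yourself and pass every page as its own argument. The pdfjoin options are copied from the log. The function names and the file names in the demo are invented.

```python
import glob
import os
import tempfile


def expand_pages(pattern):
    """Expand a shell-style pattern ourselves and return sorted paths."""
    matches = sorted(glob.glob(pattern))
    if not matches:
        raise FileNotFoundError("no files match %r" % pattern)
    return matches


def pdfjoin_command(pattern, outfile):
    """Build the argv list; every page becomes its own argument."""
    options = ["pdfjoin", "--fitpaper", "--tidy", "--outfile", outfile]
    return options + expand_pages(pattern)


# Demo in a throwaway directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for n in (2, 1, 3):
        open(os.path.join(tmp, "pg_%04d.png.pdf" % n), "w").close()
    cmd = pdfjoin_command(os.path.join(tmp, "pg_*.png.pdf"), "book.ocr1.pdf")
    print([os.path.basename(arg) for arg in cmd[5:]])
```

The command is only built, not executed. Sorting matters because the zero-padded page numbers determine the page order in the joined PDF.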

It’s a mess.

To conclude…

I still don’t have a properly OCRd version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam do their job very well. But it appears that they are not very well knit together to form a useful tool to OCR a given PDF. Requirements would be, for example, that there is no loss of quality of the scanned images, that the number of programs to be called is reduced to a minimum and that everything is able to do batch processing. So far, I couldn’t find anything that fulfills those requirements. If you know anything or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.
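
As a starting point for such glue, here is a rough Python 3 sketch of the per-page step: cuneiform writes hOCR and hocr2pdf merges it with the untouched page image. The command-line options (-l, -f hocr, -o for cuneiform; -i, -o and hOCR on stdin for hocr2pdf) are written from memory and should be treated as assumptions to verify against each tool’s help output. The function names and file names are invented.

```python
import os
import subprocess


def ocr_commands(image, lang="fra"):
    """Return (cuneiform argv, hocr2pdf argv, hOCR path, PDF path)."""
    base, _ = os.path.splitext(image)
    hocr = base + ".hocr.html"
    pdf = base + ".pdf"
    # Assumed options, see the note above; check them before relying on this.
    cuneiform = ["cuneiform", "-l", lang, "-f", "hocr", "-o", hocr, image]
    hocr2pdf = ["hocr2pdf", "-i", image, "-o", pdf]
    return cuneiform, hocr2pdf, hocr, pdf


def ocr_page(image, lang="fra"):
    """Run both tools for one page image and return the PDF path."""
    cuneiform, hocr2pdf, hocr, pdf = ocr_commands(image, lang)
    subprocess.run(cuneiform, check=True)
    with open(hocr, "rb") as markup:
        # hocr2pdf is assumed to read the hOCR markup from standard input.
        subprocess.run(hocr2pdf, stdin=markup, check=True)
    return pdf


# Only print the commands here; ocr_page() needs both tools installed.
for argv in ocr_commands("pg_0001.png")[:2]:
    print(" ".join(argv))
```

The page image is handed to hocr2pdf unchanged, which matches the requirement that the scans are not downsampled or converted along the way.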

LaTeX leaflet and background colours

I was playing around with LaTeX’s leaflet class to produce brochures, leaflets or flyers, whatever you’d like to call them. Basically a DIN A4 in portrait mode and three “columns” which I wanted to feel like pages. The backside needs to be upside down and the “pages” need to be properly ordered for the whole thing to be printed and folded properly.

So I had a look at the manual and noticed that it uses background colours for pages. I wanted that, too.

According to the manual, you can use \AddToBackground to add stuff to the background. But what do you add if you want a page to have a background colour? Well, Wikibooks says to use \pagecolor. But that colours the whole DIN A4 paper and not just one virtual page in a column on the DIN A4 sheet.

I browsed around and didn’t find any real explanation but an example. At least the code uses different colours for different virtual pages and it just works. Nice.


So whenever you want to have a background colour on a single column with the leaflet class, use

\usepackage[usenames,dvipsnames]{color}
 
\AddToBackground{1}{
    \put(0,0){\textcolor{green}{\rule{\paperwidth}{\paperheight}}}}
\AddToBackground{2}{
    \put(0,0){\textcolor{red}{\rule{\paperwidth}{\paperheight}}}}
\AddToBackground{3}{
    \put(0,0){\textcolor{blue}{\rule{\paperwidth}{\paperheight}}}}
\AddToBackground{4}{
    \put(0,0){\textcolor{Magenta}{\rule{\paperwidth}{\paperheight}}}}
\AddToBackground{5}{
    \put(0,0){\textcolor{Orange}{\rule{\paperwidth}{\paperheight}}}}
\AddToBackground{6}{
    \put(0,0){\textcolor{Fuchsia}{\rule{\paperwidth}{\paperheight}}}}

It doesn’t seem to be possible to have coloured virtual pages *and* a background picture spanning over the whole DIN A4 page. I tried several things, including playing around with the wallpaper package, but I didn’t have any success so far. One could split the background up in three pieces and include one of those on each page, but that’s really ugly and hacky. I don’t like that.

I kinda got it working using eso-pic and transparent, but the result is messy, because the image, which is supposed to be in the background, ends up in the foreground. And even with transparency, it looks bad. Just like a stamp, not a watermark.

I also tried to make the pages’ background colour transparent, but placing the background image that way is very awkward: I would have to put \AddToShipoutPicture at exactly the right place in the TeX file instead of defining it somewhere in the headers *sigh*
But anyway, it still wouldn’t work correctly, as the image, which is supposed to be in the background, would be rendered *on top* of the first virtual page on each physical page, making the colours look very weird.

So I stepped back and didn’t really want to use LaTeX anymore. Instead, I had a look at pdftk. It is able to put a watermark behind a given PDF, provided the PDF has transparent background colours. I changed my Makefile to read like this (which is not necessarily beautiful, but I still want to share my experience):

Logo390BG-DINA4-180.pdf: Logo390BG-DINA4.pdf
        # Expand background to two pages and rotate second page by 180 deg
        pdftk I=$< cat I1 I1D output $@
 
broschuere-print.pdf: broschuere.pdf Logo390BG-DINA4-180.pdf *.tex
        # Doesn't work with pdftk 1.41, but with pdftk 1.44.
        pdftk broschuere.pdf multibackground Logo390BG-DINA4-180.pdf output $@

That worked quite well. That's how it's supposed to be.

But I wasn't quite happy having to use external tools. I want my LaTeX to do as much as possible so that I don't have to rely on external circumstances. Also, my Fedora doesn't ship a pdftk version that supports multibackground. So I had another look and by now it is almost obvious: just put the background picture at (0,0), and *then* draw the background colour. Note that virtual pages 2 and 5 make up the first column of a physical page. Hence, we draw the background picture there and scale it to three times the width of a virtual page to make it span the whole physical page.

% Needs \usepackage{graphicx} and \usepackage{transparent} in the preamble.
% The trailing % signs keep line ends from adding stray spaces inside \put.
\AddToBackground{1}{%
    \put(0,0){\transparent{0.5}{\textcolor{green}{\rule{\paperwidth}{\paperheight}}}}%
}
\AddToBackground{2}{%
    % First the picture, three virtual pages wide ...
    \put(0,0){\includegraphics[width=3\paperwidth]{Logo390BG}}%
    % ... then the half-transparent page colour on top of it.
    \put(0,0){\transparent{0.5}{\textcolor{red}{\rule{\paperwidth}{\paperheight}}}}%
}
\AddToBackground{3}{%
    \put(0,0){\transparent{0.5}{\textcolor{blue}{\rule{\paperwidth}{\paperheight}}}}%
}
\AddToBackground{4}{%
    \put(0,0){\transparent{0.5}{\textcolor{Magenta}{\rule{\paperwidth}{\paperheight}}}}%
}
\AddToBackground{5}{%
    \put(0,0){\includegraphics[width=3\paperwidth]{Logo390BG}}%
    \put(0,0){\transparent{0.5}{\textcolor{Orange}{\rule{\paperwidth}{\paperheight}}}}%
}
\AddToBackground{6}{%
    \put(0,0){\transparent{0.5}{\textcolor{Fuchsia}{\rule{\paperwidth}{\paperheight}}}}%
}

FOSS.in last edition 2010

I had the pleasure to be invited to FOSS.in 2010. As I was there to represent parts of GNOME I feel obliged to report what actually happened.

The first day was really interesting. It was very nice to see so many people with a real interest in Free Software. It was mostly students that I talked to, and they said that Free Software was far from being a topic at colleges in India.

Many people queued up to register for the conference. That’s very good to see. Apparently, around 500 people showed up to share the Free Software love. The usual delays in the conference setup were there as expected 😉 So the opening ceremony began quite late and started, as usual, with lighting the lamp.

Danese from the Wikimedia Foundation started the conference with her keynote on the technical aspects of Wikipedia.

She showed that there is a lot of potential for Wikipedia in India, because so far, there was a technical language barrier in Wikipedia’s software. Also, companies like Microsoft have spent loads of time and money on wiping out a free (software) culture, hence not so many Indians got the idea of free software or free content and were simply not aware of the free availability of Wikipedia.

According to Danese, Wikipedia is a top-5 website, ranking right behind companies like Google or Facebook. And compared to the other top websites, the Wikimedia Foundation has by far the fewest employees. It’s around 50, compared to the multiple tens of thousands of employees that the other companies employ. She also described the openness of Wikipedia in almost every aspect. Even the NOC is quite open to the outside world; you can supposedly see the network status. Also, all the documentation about the internal processes is on the web, so you could learn a lot about the Foundation if you wanted to.

She presented several methods and technologies that help them scale the way Wikipedia does, as well as some very nerdy details like the Squid proxy setup or customisations they made to MySQL. They are also working on offline delivery methods, because many people in the world do not have continuous internet access, which makes browsing the web pretty hard.

After the lunch break, Balbir Singh told us about caching in virtualised environments. He introduced a range of problems that come with virtualisation, for example the lack of memory and the fact that all the assumptions about caching that Linux makes break down when virtualising.
Basically, the problem was that if a Linux guest runs on a Linux host, both of them would cache, say, the hard disk. This is, of course, not necessary, and he proposed two strategies to mitigate that problem. One of them was to use a memory balloon driver and give the kernel a hint that the pages allocated for caching should be dropped earlier.

Lenny then talked about systemd and claimed that it was socket-based activation that made it so damn fast. It was inspired by Apple’s launchd and performs quite well.

Afterwards, I went to the MeeGo room where they gave away t-shirts and Rubik’s cubes. I was shown a technique for solving the Rubik’s cube and I tried to do it. I wasn’t too successful, though, but it’s still very interesting. I can’t recite the methods and ways to solve the cube, but there are tutorials on the internet.

Rahul talked about failures he has seen in Fedora. He claimed that Fedora was the first project to adopt a six-month release cycle. He questioned whether six months is actually a good time frame. The governance modalities were questioned, too: the veto right in the Fedora Board was prone to misuse. Early websites were ugly and not very inviting. By now, the website is more appealing and should invite the audience to contribute. MoinMoin was accused of not being as good as MediaWiki, simply because Wikipedia uses MediaWiki. Not very good reasoning in my opinion.

I was invited to do a talk about Security and Mobile Devices (again). I had a very interested audience which pulled off an interesting Q&A Session. People still come with questions and ideas. I just love that. You can find the slides here.

As we are on mobile security, I wrote a tiny program for my N900 to sidejack Twitter accounts. It’s a bit like Firesheep, but does Twitter only (for now) and it actually posts a nice message. But I’ve also been pwned… 😉

But more on that in a separate post.


Unfortunately, the FOSS.in team announced that this will be the last FOSS.in they organise. That’s very sad because it was a lot of fun with a very interesting set of people. They claim that they are burnt out and that if one person is missing, nothing will work, because everyone knew exactly what role to take and what to do. I don’t really like this reasoning, because it reveals that the bus factor is extremely low. This, however, should be one of the main concerns when doing community work. Hence, the team is to blame for not having taken care of increasing the bus factor and thus leading FOSS.in to a dead end. Very sad. But thanks anyway for the last FOSS.in. I am very proud of having attended it.

BAföG, PDF and Evince – Decrypted PDF documents

In Germany, students may apply for BAföG, which basically means they receive money for their studies. In order to apply, you have to fill out lots of forms. They provide PDFs with forms that you can (at least in theory) fill out. Well, filling them out with Evince works quite well, but saving doesn’t. It complains that the document is encrypted. WTF?

It’s a form provided by the government. You wouldn’t think that there is anything subject to DRM and that they stop you from actually saving a filled document. Producing the document in the first place was paid for by us citizens, so I’d fully expect to be at least allowed to save the filled form. I don’t request the sources of that document (well, I like the idea but I probably couldn’t do anything with it anyway) but only that my government helps me with filling out all those forms and that it doesn’t unnecessarily restrict me.

So I wrote to those folks at the office, stating that they’ve accidentally restricted me from saving the form. I received an answer quite quickly:

Unfortunately, this is not an oversight. Whether the forms can be saved is governed by a rights concept of the program’s vendor, according to which saving the forms is not possible free of charge once a certain number of downloads has been reached.

However, various freeware programs offer the possibility of saving the existing forms on your own PC. As an example, a suitable software package is named on the website for free download.

The reply was in German; the above is my rough translation. So: it’s not an accident. The “program vendor’s rights management” is responsible for that. And if many people actually download the PDF file, that Digital Restrictions Management requires the office to not allow people to save the forms. Erm. Yes. I haven’t verified this, but I fully expect the authoring software “Adobe LiveCycle Designer ES 8.2” to have a very weird license that makes us citizens suffer from those stupid restrictions. This, ladies and gentlemen, is why we need Free Software. And we need governments to stop using proprietary software with such ridiculous licenses.

Apparently, there are a few DRM technologies within PDF. One of them is a set of stupid flags inside the document that tell the viewer whether you are allowed to, say, print or fill in forms in the document. It has been heavily discussed what to do about those, because they can be silently ignored.

Anyway, I came across Ubuntu bug 477644, which mentions QPDF, a tool to manipulate PDFs while preserving their content. So if you go and download all those PDFs with forms, and do a “qpdf --decrypt input.pdf output.pdf” on them, you can save your filled form.

pushd /tmp/
for f in 1 1_anlage_1 1_anlage_2 2 3 4 5 6 7 8; do
wget --continue "http://www.das-neue-bafoeg.de/intern/upload/formblaetter/nbb_fbl_${f}.pdf"
qpdf --decrypt "/tmp/nbb_fbl_${f}.pdf" "/tmp/nbb_fbl_${f}_decrypted.pdf"
done
popd
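
If you already have the PDFs lying around, the loop above can be generalised a bit. The following is only a sketch: the helper names decrypted_name and decrypt_if_needed are made up, and it assumes that "qpdf --show-encryption" prints "File is not encrypted" for an unrestricted file, which is what the qpdf versions I know of do.

```shell
#!/bin/sh
# Hypothetical helper: build the output name used in the loop above,
# e.g. nbb_fbl_1.pdf -> nbb_fbl_1_decrypted.pdf
decrypted_name() {
    printf '%s\n' "${1%.pdf}_decrypted.pdf"
}

# Hypothetical helper: decrypt one PDF unless qpdf says there is nothing to do.
decrypt_if_needed() {
    input=$1
    output=$(decrypted_name "$input")
    # Assumption: unrestricted files make qpdf print "File is not encrypted".
    if qpdf --show-encryption "$input" | grep -q 'File is not encrypted'; then
        echo "$input: not encrypted, nothing to do"
    else
        qpdf --decrypt "$input" "$output" && echo "$input -> $output"
    fi
}

# Process the files given on the command line; without arguments nothing happens.
for pdf in "$@"; do
    case $pdf in
        *_decrypted.pdf) continue ;;   # skip results of an earlier run
    esac
    decrypt_if_needed "$pdf"
done
```

Saved as, say, decrypt-forms.sh (again a made-up name), it could be called like "sh decrypt-forms.sh /tmp/nbb_fbl_*.pdf".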

I’ve prepared that and you can download the fillable and savable decrypted BAfoeG Forms from here:

Hope you can use it.

This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.