February 2011 – muellis blog

GNOME @ FOSDEM 2011

I am very excited about having attended this years FOSDEM. Unfortunately, times were a bit busy so I am a bit late reporting about it, but I still want to state a couple of things.

(I wonder how that image will look in 2012 😉 )

First of all, I am very happy that our GNOME booth went very well. Thanks to Frederic Peters and Frederic Crozat for manning to booth almost all the time. I tried to organise everything remotely and I’d say I partly succeeded. We got stickers, t-shirts and staff for the booth. We lacked presentation material and instructions for the booth though. But it still worked out quite well. For the next time, I’d try to be communicate more clearly who is doing what to prevent duplicate work and ensure that people know who is responsible for what.

Secondly, I’d like to thank Canonical for their generosity to sponsor a GNOME Event Box. After the orginal one went missing, Canocical put stuff like a PC, a projector, a monitor and lots of other stuff together for us to be able to show off GNOME-3. The old Box, however, turns out to be back again *yay*!

Sadly, we will not represent GNOME at upcoming CeBIT. But we will at LinuxTag. Latest.

Anyway, during FOSDEM, we got a lot of questions about GNOME 3 and Ubuntu, i.e. will it be easily possible to run GNOME 3 on Ubuntu. I hope we can make it possible to have a smooth transition from Unity to GNOME Shell. Interestingly enough, there isn’t a gnome-shell package in the official natty repositories yet 🙁

It was especially nice to see and talk to old GNOME farts. And I enjoyed socialising with all the other GNOME and non-GNOME people as well. Sadly, I didn’t like the GNOME Beer Event very much because it was very hot in the bar so I left very quickly.

So FOSDEM was a success for GNOME I’d say. Let’s hope that future events will work at least as well and that we’ll have a strong GNOME representation even after the GNOME 3 release.

DFN Workshop 2011

I had the opportunity to attend the 18th DFN Workshop (I wonder how that link will look like next year) and since it’s a great event I don’t want you to miss out. Hence I’ll try to sum the talks and the happenings up.

It was the second year for the conference to take place in Hotel Grand Elysee in Hamburg, Germany. I was unable to attend last year, so I didn’t know the venue. But I am impressed. It is very spacious, friendly and well maintained. The technical equipment seems to be great and everything worked really well. I am not too sure whether this is the work of the Hotel or the Linux Magazin though.

After a welcome reception which provided a stock of caffeine that should last all day long, the first talk was given by Dirk Kollberg from Sophos. Actually his boss was supposed to give the talk but cancelled it on short notice so he had to jump in. He basically talked about Scareware and that it was a big business.

He claimed that it used to be cyber graffiti but nowadays it turned into cyber war and Stuxnet would be a good indicator for that. The newest trend, he said, was that a binary would not only be compressed or encrypted by a packer, but that the packer itself used special techniques like OpenGL functions. That was a problem for simulators which were commonly used in Antivirus products.

He investigated a big Ukrainian company (Innovative Marketing) that produced a lot of scareware and was in fact very well organised. But apparently not from a security point of view because he claimed to have retrieved a lot of information via unauthenticated HTTP. And I mean a lot. From the company’s employees address book, over ERM diagrams of internal databases to holiday pictures of the employees. Almost unbelievable. He also discovered a server that malware was distributed from and was able to retrieve the statistics page which showed how much traffic the page made and which clients with which IPs were connecting. He claimed to have periodically scraped the page to then compile a map with IPs per country. The animation was shown for about 90 scraped days. I was really wondering why he didn’t contact the ISP to shut that thing down. So I asked during Q&A and he answered that it would have been for Sophos because they wouldn’t have been able to gain more insight. That is obviously very selfish and instead of providing good to the whole Internet community, they only care about themselves.

The presentation style was a bit weird indeed. He showed and commented a pre-made video which lasted for 30 minutes out of his 50 minutes presentation time. I found that rather bold. What’s next? A pre-spoken video which he’ll just play while standing on the stage? Really sad. But the worst part was as he showed private photos of the guy of that Ukrainian company which he found by accident. I also told him that I found it disgusting that he pillared that guy in public and showed off his private life. The people in the audience applauded.

A coffee break made us calm down.

The second talk about Smart Grid was done by Klaus Mueller. Apparently Smart Grids are supposed to be the new big thing in urban power networks. It’s supposed to be a power *and* communications network and the household or every device in it would be able to communicate, i.e. to tell or adapt its power consumption.

He depicted several attack scenarios and drew multiple catastrophic scenarios, i.e. what happens if that Smart Grid system was remotely controllable (which it is by design) and also remotely exploitable so that you could turn off power supply for a home or a house?
The heart of the Smart Grid system seemed to be so called Smart Meters which would ultimately replace traditional, mechanical power consumption measuring devices. These Smart Meters would of course be designed to be remotely controllable because you will have an electrified car which you only want to be charged when the power is at its cheapest price, i.e. in the night. Hence, the power supplier would need to tell you when to turn the car charging, dish or clothes washing machine on.

Very scary if you ask me. And even worse: Apparently you can already get Smart Meters right now! For some weird reason, he didn’t look into them. I would have thought that if he was interested in that, he would buy such a device and open it. He didn’t even have a good excuse, i.e. no time or legal reasons. He gave a talk about attack scenarios on a system which is already partly deployed but without actually having a look at the rolled out thing. That’s funny…

The next guy talked about Smart Grids as well, but this time more from a privacy point of view. Although I was not really convinced. He proposed a scheme to anonymously submit power consumption data. Because the problem was that the Smart Meter submitted power consumption data *very* regularly, i.e. every 15 minutes and that the power supplier must not know exactly how much power was consumed in each and every interval. I follow and highly appreciate that. After all, you can tell exactly when somebody comes back home, turns the TV on, puts something in the fridge, makes food, turns the computer on and off and goes to bed. That kind of profiles are dangerous albeit very useful for the supplier. Anyway, he committed to submitting aggregated usage data to the supplier and pulled off self-made protocols instead of looking into the huge fundus of cryptographic protocols which were designed for anonymous or pseudonymous encryption. During Q&A I told him that I had the impression of the proposed protocols and the crypto being designed on a Sunday evening in front of the telly and whether he actually had a look at any well reviewed cryptographic protocols. He didn’t. Not at all. Instead he pulled some random protocols off his nose which he thought was sufficient. But of course it was not, which was clearly understood during the Q&A. How can you submit a talk about privacy and propose a protocol without actually looking at existing crypto protocols beforehand?! Weird dude.

The second last man talking to the crowd was a bit off, too. He had interesting ideas though and I think he was technically competent. But he first talked about home routers being able of getting hacked and becoming part of a botnet and then switched to PCs behind the router being able to become part of a botnet to then talk about installing an IDS on every home router which not only tells the ISP about potential intrusions but also is controllable by the ISP, i.e. “you look like you’re infected with a bot, let’s throttle your bandwidth”. I didn’t really get the connection between those topics.

But both ideas are a bit weird anyway: Firstly, your ISP will see the exact traffic it’s routing to you whatsoever. Hence there is no need to install an IDS on your home router because the ISP will have the information anyway. Plus their IDS will be much more reliable than some crap IDS that will be deployed on a crap Linux which will run on crappy hardware. Secondly, having an ISP which is able to control your home router to shape, shut down or otherwise influence your traffic is really off the wall. At least it is today. If he assumes the home router and the PCs behind it to be vulnerable, he can’t trust the home router to deliver proper IDS results anyway. Why would we want the ISP then to act upon that potentially malicious data coming from a potentially compromised home router? And well, at least in the paper he submitted he tried to do an authenticated boot (in userspace?!) so that no hacked firmware could be booted, but that would require the software in the firmware to be secure in first place, otherwise the brilliantly booted device would be hacked during runtime as per the first assumption.

But I was so confused about him talking about different things that the best question I could have asked would have been what he was talking about.

Finally somebody with practical experience talked and he presented us how they at Leibniz Rechenzentrum. Stefan Metzger showed us their formal steps and how they were implemented. At the heart of their system was OSSIM which aggregated several IDSs and provided a neat interface to search and filter. It wasn’t all too interesting though, mainly because he talked very sleepily.

The day ended with a lot of food, beer and interesting conversations 🙂

The next day started with Joerg Voelker talking about iPhone security. Being interested in mobile security myself, I really looked forward to that talk. However, I was really disappointed. He showed what more or less cool stuff he could do with his phone, i.e. setting an alarm or reading email… Since it was so cool, everybody had it. Also, he told us what important data was on such a phone. After he built his motivation, which lasted very long and showed many pictures of supposed to be cool applications, he showed us which security features the iPhone allegedly had, i.e. Code Signing, Hardware and File encryption or a Sandbox for the processes. He read the list without indicating any problems with those technologies, but he eventually said that pretty much everything was broken. It appears that you can jailbreak the thing to make it run unsigned binaries, get a dump of the disk with dd without having to provide the encryption key or other methods that render the protection mechanisms useless. But he suffered a massive cognitive dissonance because he kept praising the iPhone and how cool it was.
When he mentioned the sandbox, I got suspicious, because I’ve never heard of such a thing on the iPhone. So I asked him whether he could provide details on that. But he couldn’t. I appears that it’s a policy thing and that your application can very well read and write data out of the directory it is supposed to. Apple just rejects applications when they see it accessing files it shouldn’t.
Also I asked him which protection mechanisms on the iPhone that were shipped by Apple do actually work. He claimed that with the exception of the File encryption, none was working. I told him that the File encryption is proprietary code and that it appears to be a designed User Experience that the user does not need to provide a password for syncing files, hence a master key would decrypt files while syncing.

That leaves me with the impression that an enthusiastic Apple fanboy needed to justify his iPhone usage (hey, it’s cool) without actually having had a deeper look at how stuff works.

A refreshing talk was given by Liebchen on Physical Security. He presented ways and methods to get into buildings using very simple tools. He is part of the Redteam Pentesting team and apparently was ordered to break into buildings in order to get hold of machines, data or the network. He told funny stories about how they broke in. Their tools included a “Keilformgleiter“, “Tuerfallennadeln” or “Tuerklinkenangel“.
Once you’re in you might encounter glass offices which have the advantage that, since passwords are commonly written on PostIts and sticked to the monitor, you can snoop the passwords by using a big lens!

Peter Sakal presented a so called “Rapid in-Depth Security Framework” which he developed (or so). He introduced to secure software development and what steps to take in order to have a reasonably secure product. But all of that was very high level and wasn’t really useful in real life. I think his main point was that he classified around 300 fuzzers and if you needed one, you could call him and ask him. I expected way more, because he teased us with a framework and introduced into the whole fuzzing thing, but didn’t actually deliver any framework. I really wonder how the term “framework” even made it into the title of his talk. Poor guy. He also presented softscheck.com on every slide which now makes a good entry in my AdBlock list…

Fortunately, Chritoph Wegener was a good speaker. He talked about “Cloud Security 2.0” and started off with an introduction about Cloud Computing. He claimed that several different types exist, i.e. “Infrastructure as a Service” (IaaS), i.e. EC2 or Dropbox, “Platform as a Service” (PaaS), i.e. AppEngine or “Software as a Service (SaaS), i.e. GMail or Twitter. He drew several attack scenarios and kept claiming that you needed to trust the provider if you wanted to do serious stuff. Hence, that was the unspoken conclusion, you must not use Cloud Services.

Lastly, Sven Gabriel gave a presentation about Grid Security. Apparently, he supervises boatloads of nodes in a grid and showed how he and his team manage to do so. Since I don’t operate 200k nodes myself, I didn’t think it was relevant albeit it was interesting.

To conclude the DFN Workshop: It’s a nice conference with a lot of nice people but it needs to improve content wise.

OCRing a scanned book

I had the pleasure to use a “Bookeye” book scanner. It’s a huge device which helps scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of my good 100 pages that I’ve scanned.

Unfortunately the light was very bright and so the scanner scanned “through” the open pages revealing the back sides of the pages. That’s not very cool and I couldn’t really dim the light or put a sheet between the Pages.
Also, it doesn’t do OCR but my main point of digitalising this book was to actually have it searchable and copy&pastable.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids as it can recognise text on paper and different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences hoping that it will become useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.

cd /tmp/ svn checkout http://iulib.googlecode.com/svn/trunk/ iulib cd iulib/ ./configure --prefix=/tmp/libiu-install make && make install

cd /tmp/ wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf - cd leptonlib*/ ./configure --prefix=/tmp/leptonica make && make install

cd /tmp/ svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus # This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283 cat > ~/bin/leptheaders < #!/bin/sh echo /tmp/leptonica/include/leptonica/ EOF chmod a+x ~/bin/leptheaders ./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/ make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything because I can’t make LUA load the scripts from the share/ directory of the prefix. Too sad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD licensed russian OCR software that was once one the leading tools to do OCR.
Interestingly, it’s the most straight forward thing to install, compared to the other things listed here.
bzr branch lp:cuneiform-linux cd cuneiform-linux/ mkdir build cd build/ cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform make make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

Apparently takes “HTML annotated OCR data” and bundles that, together with the image, to a PDF.

cd /tmp/ svn co http://svn.exactcode.de/exact-image/trunk ei cd ei/ ./configure --prefix=/tmp/exactimage make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁

gscan2pdf

as a full suite it can import pictures or PDFs and use a OCR program mentioned above (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work quite well:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seems to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straight forward:
cd /tmp/ wget 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh' qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up a firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF probably because it’s a web based thing and either the server rejects big uploads or the CGI just times out or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything to a CD and packaged software mentioned above. Unfortunately, very old versions were used, i.e. cuneiform is in version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it to be very useful to poke around that thing.

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.
cd /tmp/ wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a webbrowser which showed a webinterface to the WebOCR functionality.
I extraced the necessary scripts which wraps tools like cuniform, ghostscript and friends. Compared to the archivista box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0 which is older than the version from Launchpad.

However, in my QEMU instance, the watchocr box took a very long time to process my good 100 pages PDF.

Some custom script

That tries to do the job did in fact quite well, although it’s quite slow as well. It lacks proper support for spawning multiple commands in parallel.

After you have installed the dependencies like mentioned above, you can run it:
wget http://www.konradvoelkel.de/download/pdfocr.sh PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found

Having overcome that problem, the following pdfjoin doesn’t work for an unknown reason. After having replaced pdfjoin manually, I realised, that the script sampled the pages down, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.

It’s a mess.

To conclude…

I still don’t have a properly OCRd version of my scanned book, because of not very well integrated tools. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf, pdfjam do their job very well. But it appears that they are not very well knit together to form a useful tools to OCR a given PDF. Requirements would be, for example, that there is no loss of quality of the scanned images, that the number of programs to be called is reduced to a minimum and that everything needs to be able to do batch processing. So far, I couldn’t find anything that fulfills that requirements. If you know anything or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.