Bossa Conference 2010

I’ve just attended Bossa Conference 2010 in Manaus, Amazonas, Brazil. Thanks again to the Instituto Nokia de Tecnologia (INdT) for holding this amazing conference. I’d say it’s somewhat like FOSS.in, but with fewer people and a more relaxed atmosphere.

I gave a talk about “Security in Mobile Devices” and it went very well, although I refactored my slides shortly before giving it and expected more fuckups. But the people apparently enjoyed it and I got lots of interesting feedback. You can find my slides here.

If you’ve been there and want to follow up, you might find the Maemo Wiki on Security interesting. I recommend reading through the stuff that Collin Mulliner did, e.g. on NFC or the iPhone. The things that he did together with Charlie Miller are also worth reading: basically, they fuzzed the operating system by pretending to be the modem, which produced interesting results. But there is more work to be done, which I am convinced will give more interesting results in the future. Maemo on the N900 apparently doesn’t talk to the modem via a serial line but rather via PhoNet, making it even more interesting to fiddle around with the low-level GSM stack.

As for policies and statistics, Symantec’s Ollie Whitehouse wrote some interesting articles such as this or that. Other, more technical papers include Yves Younan’s Filter Resistant ARM Shellcode or some guys proposing Kirin to extend the Android security model. For a more general overview, have a look at a good Android link list.

As for the rest of the conference, I felt that it was a bit shallow content-wise, probably because of all the Qt stuff that was presented. But in fairness, they had to bring it since it’s going to be used by Maemo/MeeGo. Anyway, I enjoyed it quite a lot, because the people were all open and interested and I had good conversations. And good food 😉

MSN Shutdown in 2003

During CA640 I was made to write an ethical review which I was supposed to hand in using a dodgy webservice. Since it got 90%, people mugged me to make it available 😉 Of course, I don’t have a problem with that, so people now have a reference or know what to expect when they enter the course.

You can find the PDF here and its abstract reads:

At the end of 2003 Microsoft closed the public chat-rooms of its Internet service called MSN.
MSN was pushed by Children’s Charities because they feared an abuse of these chat-rooms.
In some countries, however, the service was still available but subject to a charge.
This review raises ethical questions about Microsoft’s and the Children’s Charities’ behaviour because making the people pay with the excuse of protecting children is considered ethically questionable.
Also the Children’s Charities pushed for closure of a heavily used service although there is absolutely no evidence that children would be safer after closing down a chat-room.

If you are not interested in the non-technical details, you might be interested to know that I use a Mercurial hook on the server side to automatically compile the LaTeX sources once I push changes to the server:

$ cat .hg/hgrc
[hooks]
changegroup.compile = export FILE=paper && hg up -C && pdflatex --interaction=batchmode $FILE && bibtex $FILE && pdflatex --interaction=batchmode $FILE && pdflatex --interaction=batchmode $FILE

And then I just symlink that resulting PDF file to my public_html directory.
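The one-liner in the hook packs the usual LaTeX build cycle into a single chain: the first pdflatex run writes the citation keys to the .aux file, bibtex turns them into a bibliography, and the two remaining pdflatex runs pull the bibliography in and settle the cross-references. A minimal Python sketch of the same sequence, in case the chain gets too long for hgrc; the function names build_commands and run_build are made up for this illustration and are not part of Mercurial:

```python
import subprocess


def build_commands(name="paper"):
    """Return the commands of the hook above, in order, as argument lists."""
    latex = ["pdflatex", "-interaction=batchmode", name]
    return [
        ["hg", "update", "--clean"],  # same as "hg up -C"
        latex,                        # first pass: writes citations to the .aux file
        ["bibtex", name],             # builds the bibliography from the .aux file
        latex,                        # second pass: pulls in the bibliography
        latex,                        # third pass: settles cross-references
    ]


def run_build(name="paper", run=subprocess.check_call):
    """Run every step and stop at the first failure, like the && chain does."""
    for command in build_commands(name):
        run(command)


if __name__ == "__main__":
    run_build()
```

The run parameter only exists so that the sequence can be checked without having LaTeX installed; by default each step is executed with subprocess.check_call, which raises as soon as a command fails.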

Subverting (Soft) Quota

My home directory at my university has some restrictions, one of them being a ridiculously small (soft) quota of 100 megabytes and 5000 files… How could you ever study with that?! My Firefox instance (with e.g. Zotero) uses 4349 files already:

$ find  ~/.mozilla/firefox/*.default/ -type f | wc -l
4349
$ du -hs ~/.mozilla/firefox/*.default/  | awk '{print $1}'
90M
$

So these restrictions don’t even allow me to run my research tools. Let alone checking out stuff from a Git/Mercurial repository and working on anything.
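The same numbers can be gathered without find and du. This is only a rough Python sketch: the names usage and over_quota are made up, the default limits are the 5000 files and 100 megabytes mentioned above (assuming a megabyte of 1024 * 1024 bytes), and it sums apparent file sizes, whereas du and the quota system count allocated blocks, so the result will differ slightly.

```python
import os


def usage(root):
    """Return (number of regular files, sum of their sizes in bytes) below root."""
    files = 0
    size = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Like "find -type f": count regular files only, skip symlinks
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            files += 1
            size += os.lstat(path).st_size
    return files, size


def over_quota(root, max_files=5000, max_bytes=100 * 1024 * 1024):
    """True if the tree below root alone exceeds one of the assumed limits."""
    files, size = usage(root)
    return files > max_files or size > max_bytes


if __name__ == "__main__":
    print(usage(os.path.expanduser("~/.mozilla")))
```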

Needless to say, I am pretty annoyed by these restrictions. Fortunately, the quota forgets about you as soon as you fall below the limit, so you only need to fall below it every now and then. So let’s do this automatically then:

#!/bin/sh
 
RAND=$$
BACKUP=~/.mozilla
TARGET=/tmp/.mozilla.$RAND
 
cp -ar "$BACKUP" "$TARGET" && rm -rf "$BACKUP" && cp -ar "$TARGET" "$BACKUP" && rm -rf "$TARGET" && echo "Finished successfully" || echo "Failure :("

And let cron run it once a week:

42 23 * * Sun       ~/bin/sneak-quota.sh

*yay*

WTFOTM: Hotels warming your bed

My favourite service, in the series WTFOTM, of this month is *drumroll* a Hotel that sends its employees, wearing an electric blanket, to your bed to warm it up for you.

A hotel chain is employing human bed warmers to help guests get a good night’s sleep.

There’s nothing wrong with having a warm bed, but having hotel employees warming that up for you?! That just feels a bit weird and thus: WTF?!

FOSDEM 2010

This year’s FOSDEM involved meeting familiar and new people as well as a lot of beer 😉 I can’t understand why the Belgians are so proud of their beer though :> Anyway, I got way too little sleep and spent too much money…
I wish I had connected with more new people, but I was terribly busy catching up with all the faces that I hadn’t seen in a while. Hopefully, I can meet more new people next time.

Although I was scheduled as the very first in the morning after the official Beer-Event (thx teuf…), my talk in the GNOME devroom went well and I hope I represented GNOME’s Bugsquad well. At least two people wanted to help out 🙂 I hope I was inviting and clear enough. I definitely need to try to hold on to these people by at least writing to bugsquad-list. I hope I get around to doing that, but I also have a huge backlog that wants to be processed. On the todo list is a new bugsquad meeting as well as a membership-committee meeting, so if you are interested, watch out for mails 🙂

If you happen to have seen my talk at FOSDEM and want to look over the slides, please find them here. If you have been there and want to join the bugsquad fun: Awesome! Join the mailing list now and wait for the next meeting to be organized. Don’t hesitate to push for it 😉
If you haven’t been there but you want to help the Free Software movement or GNOME in particular: Awesome! Consider subscribing to the mailing list or joining the IRC channel, and make sure that you’ve read our awesome TriageGuide 🙂

Talks that I enjoyed at FOSDEM include Maemo6 Platform Security by Elena, because Nokia is about to build yet another security framework for Linux to meet their needs. Apparently the new Maemo devices will come with a TPM to allow DRM-like scenarios. Encrypting data on the device will also be possible using an API which in turn uses the built-in keys. These turn out to be recoverable nowadays. If I read this correctly, the “Open Mode” will not make use of the TPM keys. This means that if your contacts, images, texts, etc. were encrypted using the above-mentioned API, you couldn’t get hold of this data in Open Mode 🙁 I thus reckon that stuff like Contacts will not be stored encrypted. Hence you would leak all your data when losing the device. So I don’t expect a real advantage, but we’ll see.
Another not very informative yet entertaining talk was given by Greg Kroah-Hartman and dealt with creating a patch for Linux. It actually motivated me enough to put “fixing some random driver in staging” on my Todo-List 😉

Note to self for the next FOSDEM: Book accommodation early. Very early! Also, Charleroi might not be worth it, because the bus from Brussels to CRL is 13 Euro, return 21.

WTFOTM: Email validating RegExp

I think I’ll start a new series: My wtf of the month. This time, it’s a regular expression I found.

How much does it take to validate an email address, you might ask. Well, can’t be that hard, right? If you read the corresponding RFC 5322, you’ll notice that the local part of an email address (that is the part in front of the “@”) contains “dot-atoms”. Section 3.4.1 writes:

local-part      =   dot-atom / quoted-string / obs-local-part

At the end of the day, a “dot-atom” is a “dot-atom-text”, which is made of runs of “atext” separated by single dots. According to section 3.2.3, “atext” is:

atext           =   ALPHA / DIGIT /    ; Printable US-ASCII
                    "!" / "#" /        ;  characters not including
                    "$" / "%" /        ;  specials.  Used for atoms.
                    "&" / "'" /
                    "*" / "+" /
                    "-" / "/" /
                    "=" / "?" /
                    "^" / "_" /
                    "`" / "{" /
                    "|" / "}" /
                    "~"

That effectively allows you to have email addresses like !foo$bar/baz=qux@example.com, "#~foo@bar^^"@example.com, `echo${LFS}ssh-rsa${LFS}AAA...|tee${LFS}~/.ssh/authorized_keys`@example.com. I am more than curious to see how servers and MUAs (especially on mobile devices) handle these cases.
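To see how little it takes to check the “dot-atom” form alone, here is a minimal Python sketch built straight from the atext rule quoted above. The names ATEXT and is_dot_atom are made up for this example, and it deliberately ignores the other two alternatives of local-part (quoted-string and obs-local-part) as well as the optional comments and folding whitespace that the RFC allows around a dot-atom, so a quoted local part is rejected by it even though the RFC allows it.

```python
import string

# atext according to RFC 5322, section 3.2.3
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_dot_atom(local_part):
    """True if local_part consists of non-empty atext runs joined by single dots."""
    atoms = local_part.split(".")
    return all(atom and set(atom) <= ATEXT for atom in atoms)


if __name__ == "__main__":
    for candidate in ("!foo$bar/baz=qux", "john.doe", "foo..bar", '"#~foo@bar^^"'):
        print(candidate, is_dot_atom(candidate))
```

Splitting on the dot and requiring every piece to be non-empty is what rules out leading, trailing and doubled dots.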

I got around to looking into this because some poor guy wanted to implement email address validation in Evolution. I found the as yet untested but obviously correct way in a Perl module:

$RFC822PAT = <<'EOF';
[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\
xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xf
f\n\015()]*)*\)[\040\t]*)*(?:(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\x
ff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n\015
"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\
xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80
-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*
)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\
\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\
x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|"[^\\\x80-\xff\n
\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*)*@[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([
^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\
\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\
x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-
\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()
]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\
x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\04
0\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\
n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\
015()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?!
[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\
]]|\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\
x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\01
5()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*|(?:[^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]
)|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^
()<>@,;:".\\\[\]\x80-\xff\000-\010\012-\037]*(?:(?:\([^\\\x80-\xff\n\0
15()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][
^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)|"[^\\\x80-\xff\
n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015"]*)*")[^()<>@,;:".\\\[\]\
x80-\xff\000-\010\012-\037]*)*<[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?
:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-
\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:@[\040\t]*
(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015
()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()
]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\0
40)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\
[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\
xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*
)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80
-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x
80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t
]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\
\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])
*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x
80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80
-\xff\n\015()]*)*\)[\040\t]*)*)*(?:,[\040\t]*(?:\([^\\\x80-\xff\n\015(
)]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\
\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*@[\040\t
]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\0
15()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015
()]*)*\)[\040\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(
\040)<>@,;:".\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|
\\[^\x80-\xff])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80
-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()
]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x
80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^
\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040
\t]*)*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".
\\\[\]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff
])*\])[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\
\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x
80-\xff\n\015()]*)*\)[\040\t]*)*)*)*:[\040\t]*(?:\([^\\\x80-\xff\n\015
()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\
\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)?(?:[^
(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-
\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\xff\
n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|
\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))
[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff
\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\x
ff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(
?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\
000-\037\x80-\xff])|"[^\\\x80-\xff\n\015"]*(?:\\[^\x80-\xff][^\\\x80-\
xff\n\015"]*)*")[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\x
ff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)
*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*)*@[\040\t]*(?:\([^\\\x80-\x
ff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-
\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)
*(?:[^(\040)<>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\
]\000-\037\x80-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\]
)[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-
\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\x
ff\n\015()]*)*\)[\040\t]*)*(?:\.[\040\t]*(?:\([^\\\x80-\xff\n\015()]*(
?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]*(?:\\[^\x80-\xff][^\\\x80
-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)*\)[\040\t]*)*(?:[^(\040)<
>@,;:".\\\[\]\000-\037\x80-\xff]+(?![^(\040)<>@,;:".\\\[\]\000-\037\x8
0-\xff])|\[(?:[^\\\x80-\xff\n\015\[\]]|\\[^\x80-\xff])*\])[\040\t]*(?:
\([^\\\x80-\xff\n\015()]*(?:(?:\\[^\x80-\xff]|\([^\\\x80-\xff\n\015()]
*(?:\\[^\x80-\xff][^\\\x80-\xff\n\015()]*)*\))[^\\\x80-\xff\n\015()]*)
*\)[\040\t]*)*)*>)
EOF

This is a handy 6.5 kB regular expression that validates an email address. I wonder how long it takes to compile and to actually match an email address against… (Arr, stupid WordPress escapes all those fancy characters every time I have the edit widget open 🙁 )

So, now go and fix your email address validating script.

26C3 Review

Attending last year’s CCCongress was a great pleasure. Although there were great lectures, it’s the spirit that’s the best part of the conference. Meeting all these nice hacker people, hanging around, talking, discussing, hacking is just brilliant. You’ve got all those smart hackers around you and it just can’t get boring.

A good way of socialising is, of course, visiting the various parties that take place. The Phenoelit party was awesome. Thanks FX for the invites 🙂

Besides drinking, I spent time on some crypto problems and tried to investigate the magnetic-stripe-card authentication in hotels and hostels. I found out that all our cards for one room were identical, except for one card that was obtained later. The data on the card is just ~100 bits and I tried to find timestamps and room numbers in it, but I failed. I blame my dataset for being too small. I’ll launch more advanced experiments next year. If you happen to have insider knowledge of magnetic-stripe locks, drop me a line.

I want to highlight two things about the last CCCongress. Firstly, Friend Tickets were available and the concept is just awesome: Basically you can propose a friend of yours who you think would benefit from attending the CCCongress but has no way to cover the expenses. The organisers then decide whether they get a discount (which will, of course, be apportioned to every regularly paying attendee). I like to see this solidarity among hackers. Unfortunately, no stats are available to see how many people were enabled to come through this method. I hope that these friend tickets will be considered again next year. So if you wanted to come to the CCCongress but feared the expenses, consider asking for a discount. Just for the record: The prices are at rock bottom anyway: 80 Euros for a 4 day conference of this kind is amazingly cheap. Thanks to all the angels! 🙂

The second noteworthy concept is to distribute the CCCongress as much as possible (called Dragons Everywhere). The idea is fantastic: Increase the number of attendees as much as possible by building mini conferences and streaming the most important things. It would be even better if the gatherings had a feedback channel, e.g. a webcam. Hopefully, it’ll be better next year, i.e. better and more reliable streaming services and more places, especially in Berlin, because many people were sent away because the conference was already sold out 🙁

If you want to get a feeling of what the CCCongress is like, you might want to have a look at the recordings. If you organize a public viewing, make sure you show these videos 🙂 Based on the feedback, the best talks were:

And for entertainment, the following German talks are very good:

I hope you enjoy watching the CCCongress and consider coming next year!

Adding Linux Syscall

In a course (CA644) we were asked to add a new syscall to the Linux kernel.

As I believe that knowledge should be as free and as accessible as possible, I thought I had to at least publish our results. Another (though minor) reason is that society (to some extent) pays for me doing science, so I believe that society deserves to at least see the results.

The need to actually publish that is not very big, since a lot of information on how to do that exists already. However, that is mostly outdated. A good article is from macboy, but it fails to mention a minor fact: The syscall() function is variadic, so it takes as many arguments as you give it.

So the abstract of the paper, that I’ve written together with Nosmo, reads:

This paper shows how to build a recent Linux kernel from scratch, how to add a new system call to it and how to implement new functionality easily.
The chosen functionality is to retrieve the stack protecting canary so that mitigation of buffer overflow attacks can be circumvented.

And you can download the PDF here.

If it’s not interesting for you content-wise, it might be interesting from a technical point of view: The PDF has files attached, so that you don’t need to do the boring stuff yourself but can rather save working files and modify them. That is achieved using the embedfile package.

\usepackage{embedfile}        % Provides \embedfile[filespec=foo, desc={bar}]{file}
[...]
\embedfile[filespec=writetest.c, mimetype=text/x-c,desc={Program which uses the new systemcall}]{../code/userland/writetest.c}%

If your PDF client doesn’t allow you to save the files (Evince does 🙂 ), you might want to use pdftk $PDF unpack_files in some empty directory.

Why I cannot use turnitin.com

We were supposed to use a proprietary webservice to hand in a paper:

You should also upload the essay to turnitin.com using the password key:

5vu0h5fw and id: 2998602

Late entries will suffer a penalty.

I cannot use this service. The simplest reason being that I cannot agree to their ToS.

Let me clarify just by picking some of their points off their ToS:

By clicking the “I agree — create profile” button below You: (1) represent that You have read and understand

As I am a native speaker of neither English nor law-speak, I cannot agree that I fully understand those ToS.

With the exception of the limited license granted below, nothing contained herein shall be construed as granting You any right, […]

Whatever that means, it sounds scary to me.

You further represent that You are not barred from receiving the Services or using the Site under the laws of the United States or other applicable jurisdiction.

I am sorry but I do not know whether this holds for me.

You may not modify, copy, distribute, transmit, display, perform, reproduce, publish, license, create derivative works from, transfer, or sell any information, Licensed Programs or Services from the Site without the prior written consent of iParadigms,

Lucky me that I have not agreed to their ToS yet, so that I can copy them and bring them up here…

You further agree not to cause or permit the disassembly, decompilation, recompilation, or reverse engineering of any Licensed Program or technology underlying the Site. In jurisdictions where a right to reverse engineer is provided by law unless information is available about products in order to achieve interoperability, functional compatibility, or similar objectives, You agree to submit a detailed written proposal to iParadigms concerning any information You need for such purposes before engaging in reverse engineering.

I seriously do not want to write a proposal to this company for every new website I will build just because they use a <form> or some AJAX.

You are entirely responsible for maintaining the confidentiality of Your password

I cannot do that because I do not even know how they store my password (we are talking about an ASP program after all…).

You agree to use reasonable efforts to retain the confidentiality of class identification numbers and passwords. In no circumstance shall You transmit or make Your password or class identification number or any other passwords for the Site or class identification numbers available in any public forum, including, but not limited to any web page, blog, advertisement or other posting on the Internet, any public bulletin board, and any file that is accessible in a peer-to-peer network.

Yeah, sure. Nobody will find it on the page itself anyway.

This User Agreement is governed by the laws of the State of California, U.S.A. You hereby consent to the exclusive jurisdiction and venue of state and federal courts in Alameda County, California, U.S.A., in all disputes arising out of or relating to the use of the Site or the Services.

Might sound weird, but I do not want to be arraigned in the USA.

You agree not to use the Site in any jurisdiction that does not give effect to all provisions of these terms and conditions, including without limitation this paragraph.

Of course, I do not know enough about this jurisdiction to agree to those ToS.

Needless to say, I do not want my data to fall under the American 9-11 Patriot Act.

Besides the above-mentioned legal issues, I also have ethical concerns about contributing to the profit of a dodgy company by providing them with my written essay so that they can use it to check other works against mine. If I believed in copyright, I could probably claim infringement as well.

Other topics, such as the violation of the presumption of innocence, are covered by resources on the web. And there is plenty of it. The most interesting ones include this and this.

Admittedly, I could care less than I do, but being an academic also means thinking critically.

I more or less sent this email to the lecturer and it turned out that it’s not compulsory to use this dodgy service! *yay*

The future, however, is not safe yet, so more action is needed…

Wikify Pages

In one of our modules, “System Software”, we were asked to write a bash script which wikifies a page. That means to identify all nouns and replace them with a link to the Wikipedia.

I managed to write that up in two hours or so and I think I have a not so ugly solution (*cough* it’s still bash… *cough*). It has (major) drawbacks though. Valid X(HT)ML, e.g.

<![CDATA[ <body>

before the actual body will be recognized as the beginning. But parsing XML with plain bash is not that easy.

Also, my script does not cut out the payload correctly: it takes the last lines of the file instead of stopping when </body> is reached, because the PAYLOAD pipeline ends in a second tail where a head would be needed.

Anyway, here’s the script:

#!/bin/bash
### A script which tries to Wikipediafy a given HTML page
### That means that every proper noun is linked against the Wikipedia
### but only if it's not already linked against something.
### Assumptions are, that the HTML file has a "<body>" Tag on a separate
### line and that "<a>" Tags don't span multiple lines.
### Also, Capitalised words in XML Tags are treated as payload words, just
### because parsing XML properly is a matter of 100k Shellscript (consider
### parsing <![CDATA[). Also, this assumption is not too far off, because
### Capitalised words in XML Tags are rare anyway.
### As this is written in Bash, it is horribly slow. You'd rather want to do
### this in a language that actually understands {X,HT}ML like Python
 
# You might want to change this
BASEURL="http://en.wikipedia.org/wiki/"
 
set -e
 
### Better not change anything below, it might kill kittens
# To break only on newlines, set the IFS to that
OLD_IFS=$IFS
IFS='
'
HTML=$(cat "$1") # Read the file for performance reasons
# Find the beginning and end of Document and try to be permissive with HTML
# Errors by only stopping after hitting one <body> Tag
START_BODY=$(grep --max-count=1 -ni '<body>'<<<"$HTML" | cut -d: -f1)
END_BODY=$(grep --max-count=1 -ni '</body>'<<<"$HTML" | cut -d: -f1)
 
HEAD=$(head -n $START_BODY<<<"$HTML") # Extract the Head for later use
# $(( )) arithmetic expansion is defined by POSIX, so this part is portable
RANGE_BODY=$(($END_BODY-$START_BODY))
 
# And then extract the body
PAYLOAD=$(tail -n +${START_BODY} <<<"$HTML" | tail -n ${RANGE_BODY})
 
### This is the main part
### Basically search for all words beginning with a capital letter
### and match that. We can use that later with \1.
 
### Try to find already linked words, replace them by their MD5 hash,
### Run generic Word finding mechanism and replace back later
 
# We simply assume that a link doesn't span multiple lines
LINKMATCHES=$(grep -i -E --only-matching '<a .*>.*</a>' $1 || true)
 
LINKMATCH_ARRAY=()
MD5_ARRAY=()
CLEANEDPAYLOAD=$PAYLOAD
if [[ -n $LINKMATCHES ]]; then
    # We have found already linked words, put them into an array
    LINKMATCH_ARRAY=( $LINKMATCHES )
    index=0 # iterate over array
    for MATCH in $LINKMATCHES; do
        # Uniquely hash the found link and replace it, saving its origin
        MATCHMD5=$(md5sum <<<$MATCH | awk '{print $1}')
        MD5_ARRAY[$index]=$MATCHMD5
        # We simply assume that there's no "," in the match
        # Use Bash internals string replacement facilities
        CLEANEDPAYLOAD=${CLEANEDPAYLOAD//${MATCH}/${MATCHMD5}}
        let "index = $index + 1"
    done
fi
 
 
# Find the matches
WORDMATCHES=$(grep --only-matching '[A-Z][a-z][a-z]*'<<<$CLEANEDPAYLOAD | sort | uniq)
WORDMATCHES_ARRAY=( $WORDMATCHES )
index=0
WIKIFIED=$CLEANEDPAYLOAD
while [[ "$index" -lt ${#WORDMATCHES_ARRAY[@]} ]]; do
    # Yeah, iterating over an array with 300+ entries is fun *sigh*
    # You could now ask Wikipedia and only continue if the page exists
    # if wget -q "${BASEURL}${SEARCH}"; then ...; else ...; fi
    SEARCH="${WORDMATCHES_ARRAY[$index]}"
    REPLACE="<a href=\"${BASEURL}${SEARCH}\">\2</a>"
    # Note that we replace the first occurrence only
    #WIKIFIED=${WIKIFIED/$SEARCH/$REPLACE} ## That's horribly slow, so use sed
    # Watch out for a problem: "<p>King" shall match as well as "King</p>"
    # or "King." but not eBook.
    # We thus match the needle plus the previous/following char,
    # iff it's not [A-Za-z]
    WIKIFIED=$(sed -e "s,\([^A-Za-z]\)\($SEARCH\)\([^A-Za-z]\),\1$REPLACE\3,"<<<$WIKIFIED) # so use sed
    let "index += 1"
done
 
# Replace hashed links with their original, same as above, only reverse.
# One could apply this technique to other tags besides <a>, but you really
# want to write that in a proper language :P
index=0
NOLINKWIKIPEDIAFIED=$WIKIFIED
while [[ "$index" -lt ${#MD5_ARRAY[@]} ]]; do
    SEARCH=${MD5_ARRAY[$index]}
    REPLACE=${LINKMATCH_ARRAY[$index]}
    NOLINKWIKIPEDIAFIED=${NOLINKWIKIPEDIAFIED//$SEARCH/$REPLACE}
    let "index += 1"
done
 
### Since we have the head and the payload separate, echo both
# Quote the variables, otherwise all line breaks get replaced by spaces
echo "$HEAD"
echo "$NOLINKWIKIPEDIAFIED"
 
# Reset the IFS, e.g. for following scripts
IFS=$OLD_IFS
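
As the comments in the script already say, this is easier in a language like Python. The following is only a rough sketch of the same idea and not the script that was handed in: the names wikify, PROTECTED and WORD are made up, it works on whatever string it is given instead of cutting out the body first, and it still uses regular expressions rather than a real parser, so CDATA sections and comments are not handled either. Like the bash version, it leaves existing links alone, does not touch anything inside tags, and links only the first occurrence of each capitalised word.

```python
import re

BASEURL = "http://en.wikipedia.org/wiki/"

# An existing link (kept whole) or any other tag. The capture group makes
# re.split() return these pieces in between the plain text pieces.
PROTECTED = re.compile(r"(<a\b[^>]*>.*?</a>|<[^>]+>)", re.IGNORECASE | re.DOTALL)

# A capital letter followed by lowercase letters, not glued to other letters,
# so "King." and "<p>King" match but the "Book" in "eBook" does not.
WORD = re.compile(r"(?<![A-Za-z])[A-Z][a-z]+(?![A-Za-z])")


def wikify(html, baseurl=BASEURL):
    """Link the first occurrence of every capitalised word to Wikipedia."""
    seen = set()

    def link(match):
        word = match.group(0)
        if word in seen:
            return word  # only the first occurrence gets a link
        seen.add(word)
        return '<a href="%s%s">%s</a>' % (baseurl, word, word)

    pieces = PROTECTED.split(html)
    # Even indices are plain text, odd indices are the protected tags and links
    for i in range(0, len(pieces), 2):
        pieces[i] = WORD.sub(link, pieces[i])
    return "".join(pieces)


if __name__ == "__main__":
    print(wikify('<p>King met <a href="x">Dublin</a> and King read an eBook.</p>'))
```

Splitting the text at tags and links replaces the MD5 placeholder trick from the bash script: the protected pieces are simply never handed to the word matcher.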
This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.