Talking at ARES 2019 in Canterbury, UK

It’s conference season and I attended the International Conference on Availability, Reliability, and Security (ARES) in Canterbury, UK. (note that in the future, the link might change to something more sustainable)

A representative of the University of Kent opened the event. It is the UK’s European University, he said, with 20,000 students, many of them from other countries. He attributed that to the proximity to mainland Europe. Indeed it’s only an hour away (if you don’t have to go back to London to catch a direct Eurostar rather than one that stops in, say, Ashford). The conference was fairly international, too, with 230 participants from 33 countries. As an academic conference, they care about the “acceptance rate”, which in this case was 20.75%. Of course, he could have mentioned any number, because it’s impossible to verify.

The opening keynote was given by Alistair MacWilson from Bletchley Park. Yeah, the same Bletchley Park that Alan Turing worked at. He talked about the importance of academia in closing the cybersecurity talent gap. The global deficit of cybersecurity professionals is 3.3M, he said, with 380k missing in Europe alone and APAC desperately short of 2.1M professionals. All that is good news for us youngsters in the business, but not so good, he said, if you rely on the security of your IT infrastructure… It’s not getting any better, he said, considering that the number of connected devices and the complexity of our infrastructure keep rising. You might think, he said, that highly technical skills are required to perform cybersecurity tasks. But he mentioned that 88% of the security problems that the global 5000 companies have stem from human factors. Inadequate and unfocussed training paired with insufficient resources contribute to that problem, he said. So if you don’t get continuous training, you will fall behind with your skill-set.

There were many remarkable talks and the papers can be found online, albeit behind a paywall. But I expect SciHub to have copies and authors to be willing to share their work if you ask. Anyway, one talk I remember was about delivering Value Added Services along with electric vehicle charging. They said that it is currently not very attractive for commercial operators to provide charging stations, because the margin is low. Hence, additional monetisation in the form of Value Added Services (VAS) could be added. They were thinking of updating the software of the vehicle while it is charging. I am not convinced that updating the car’s firmware makes a good VAS, but I’m not an economist, and what do I know about the world of electric vehicles. Anyway, their proposal to add VAS to the communication protocol might be justified, but their scenario of delivering software updates over that channel seems like a lost opportunity to me. Software updates are currently the most successful approach to protecting users, so it seems warranted to have a dedicated update protocol rather than a generic VAS protocol for electric vehicles.

My own talk was about using the context and provenance of USB-borne events (illegal public copy) to mitigate attacks via that channel. The general idea, known to readers of my blog, is to take the state of the session into account when dealing with events stemming from USB devices. More precisely: when your session is locked, don’t automatically load drivers for a new USB device. Your session is locked, after all. You’re not using your machine and cannot insert a new device, so the likelihood of someone else maliciously inserting a device is higher than when your session is unlocked. Of course, that’s only a heuristic and some will argue that they frequently plug devices into their machine while it’s locked. Fair enough. I argue that we need to be sensitive and change as little as possible about the user’s way of working with the machine to get high acceptance rates. Hence, we need to be careful when devices like keyboards are inserted. Another scenario is a new network card attached via USB. The system should be more suspicious about accepting the nameserver that came from the new network card’s DHCP server when it already has a perfectly working network configuration (and the DHCP response did not contain a default gateway). It turns out that those attacks are being mounted in real life right now, and we have yet to find defences that we can deploy on a large scale.
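To make that heuristic concrete, here is a minimal sketch of what an implementation could look like on a Linux host. It assumes the sysfs authorized_default switches are available and that the desktop environment exposes a lock/unlock signal (e.g. over D-Bus) that the callback can be hooked into; it is an illustration of the idea, not the implementation from the talk.

# Sketch: refuse new USB devices while the session is locked (needs root).
# Assumes Linux sysfs authorization switches; the lock/unlock notification
# would come from the desktop's screensaver, e.g. via D-Bus.
import glob

def set_usb_default_authorization(allow):
    """Set the default authorization policy for newly attached USB devices."""
    value = "1" if allow else "0"
    # Every USB host controller exposes an authorized_default switch.
    for switch in glob.glob("/sys/bus/usb/devices/usb*/authorized_default"):
        with open(switch, "w") as f:
            f.write(value)

def on_session_lock_changed(locked):
    # While locked, newly attached devices are not authorized and hence no
    # drivers are probed for them; the permissive default returns on unlock.
    set_usb_default_authorization(allow=not locked)

Devices that are already attached keep working, because only the default policy for newly attached devices is toggled; a real implementation would also have to decide what to do with devices that were plugged in during the lock once the session is unlocked again.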

It’s been a nice event, even though the sandwiches for lunch got boring after a few days 😉 I am happy to have met researchers from other areas and I hope to stay in touch.

Talking on PrivacyScore at DFN Security Conference 2018 in Hamburg, Germany

I seem to have skipped last year, but otherwise I have been to the DFN Workshop regularly. While I had a publication at this venue before, it’s only this year that I got to give a talk at the conference.

I cannot comment on the other talks so much, because I could not attend too many 🙁 But our talk (slides) was well attended and I think people appreciated the presentation being a bit lighter than the previous one about the upcoming GDPR.

I talked about PrivacyScore.org and how we’ve measured German universities. The paper is here. Our results were mixed. As for TLS deployment, with a lot of imagination we can see a line dividing Germany. The West seems to have fewer problems with their TLS deployment than the East. The more red an area is, the worse its TLS support is. That ranges from not offering TLS at all to having an invalid certificate or using broken parameters.

As for tracking their users, we had the hypothesis that privately run institutions have a higher interest in tracking their visitors than publicly run institutions. The following graphic reflects the geographic distribution of trackers on German universities’ Web sites.
That hypothesis can be confirmed by looking at the PrivacyScore list that distinguishes between those institutions.

We found data that was very likely not meant to be there, such as database dumps or Git repositories of the Web site’s code (including passwords for their staging environments, etc.). We tried to report these issues to the Web site operators, but it was difficult to get hold of the responsible people. For the 21 leaks we found I have 93 emails in my mailbox. Ideally, the 21 I sent off would have been enough. But even sending those emails is hard, because people don’t respect RFC 2142 and don’t have a security@ address. Eventually, we made the Internet a tiny bit more secure by having those Web site operators remove the leaks from their sites, but there are still some pages which have (supposedly) unwanted information such as their visitors’ IP addresses online. The graph below shows that most of the operators who reacted did so in the first few days. So management of security incidents seems to be an area for improvement.

I hope to be able to return next year, if only for the catering 😉 Then, I better attend some more talks and chat with the other guests.

Taint Tracking for Chromium

I forgot to blog about one of my projects. I had actually already talked about it more than one year ago and we had a paper at USENIX Security.

Essentially, we built a protection against DOM-based Cross-site Scripting (DOMXSS) into Chromium. We did that by detecting whenever potentially attacker-provided strings become JavaScript code. To that end, we made the HTML rendering engine (WebKit/Blink) and the JavaScript engine taint-aware. That is, we identified sources of values that an attacker could control (think window.name) and marked all strings coming from those sources as tainted. Then, during parsing of JavaScript, we check whether the string to be compiled is tainted. If that is indeed the case, then we abort the compilation.

That description is a bit simplified. For example, not compiling code because it contains some fragments of the URL would break a substantial number of Web sites. It’s an unfortunate fact that many Web sites either eval code containing parts of the URL or do a document.write with a string containing parts of the URL. The URL, in our attacker model, can be controlled by the attacker. So we must be more clever about aborting compilation. The idea was to only allow literals in JavaScript (like true, false, numbers, or strings) to be compiled, but not “code”. So if a tainted (sub)string compiles to a string: fine. If, however, we compile a tainted string to a function call or an operation, then we abort. Let me give an example of an allowed compilation and a disallowed one.


<HTML>

<TITLE>Welcome!</TITLE>
Hi

<SCRIPT>
var pos=document.URL.indexOf("name=")+5;
document.write(document.URL.substring(pos,document.URL.length));
</SCRIPT>

<BR>
Welcome to our system

</HTML>

This example is from the original report on DOM-based XSS. You see that nothing bad happens when you open http://www.vulnerable.site/welcome.html?name=Joe. However, opening http://www.vulnerable.site/welcome.html?name=alert(document.cookie) will lead to attacker-provided code being executed in the victim’s context. Even worse, when opening the page with a hash (#) instead of a question mark (?), the server will not even see the payload, because Web browsers do not transmit it as part of their request.

“Why does that happen?”, you may ask. We see that the document.write call got fed a string derived from the URL. The URL is assumed to be provided by the attacker. The string is then used to create new DOM elements. In the good case, it’s only a simple text node, representing text to be rendered. That’s a perfectly legit use case and we must, unfortunately, allow that sort of usage. I say unfortunate, because using these APIs is inherently insecure. The alternative is to use createElement and friends to properly inject DOM nodes. But that requires comparatively much more effort than using the document.write. Coming back to the security problem: In the bad case, a script element is created with attacker provided contents. That is very bad, because now the attacker controls your browser. So we must prevent the attacker provided code from execution.
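The real policy lives inside V8’s compiler, but the gist can be modelled in a few lines. Here is a rough sketch in Python (the function name and the use of Python’s ast module are mine, not Chromium’s): a tainted string may only be compiled if it parses to a bare literal, and anything that would actually do work, like a function call, is rejected.

import ast

def may_compile_tainted(source):
    """Toy model of the policy: a tainted string may only compile to a literal."""
    try:
        tree = ast.parse(source, mode="eval")
    except SyntaxError:
        return False  # not even valid code, so there is nothing to execute
    # Allow plain literals (strings, numbers, booleans); reject calls,
    # operators, attribute access and everything else that does work.
    return isinstance(tree.body, ast.Constant)

assert may_compile_tainted('"Joe"')                       # plain text: fine
assert not may_compile_tainted('alert(document.cookie)')  # function call: blocked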

You see, tracking the taint information is a non-trivial effort and must be done beyond newly created DOM nodes and across multiple passes of JavaScript (think eval(eval(eval(tainted_string)))). We must also track the taint information not on the full string, but on each character in order to not break existing Web applications. For example, if you first concatenate with a tainted string and then remove all tainted characters, the string should not be marked as tainted. This non-trivial effort manifests itself in the over 15000 lines of code we patched Chromium with to provide protection against DOM-based XSS. These patches, as indicated, create, track, propagate, and evaluate taint information. Also, the compilation of JavaScript has been modified to adhere to the policy that tainted strings must only compile to literals. Other policies are certainly possible and might actually increase protection (or increase compatibility without sacrificing security). So not only WebKit (Blink) needed to be patched, but also V8, the JavaScript engine. These patches add to the logic and must be executed in order to protect the user. Thus, they take time on the CPU and add to the memory consumption. Especially the way the taint information is stored could blow up the memory required to store a string by 100%. We found, however, that the overhead incurred was not as big as in other solutions proposed by academia. Actually, we measured that we are still faster than, say, Firefox or Opera. We measured the execution speed of various browsers under various benchmarks and concluded that our patched version added 23% runtime overhead compared to the unpatched version.
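To illustrate why the taint has to live on individual characters rather than on the whole string, here is a toy model in Python; the actual bookkeeping sits inside WebKit’s and V8’s string classes, so this only shows the idea, not the implementation.

class TaintedString:
    """Toy model of character-granular taint tracking."""
    def __init__(self, text, taint=None):
        self.text = text
        # One flag per character: True means attacker-controlled.
        self.taint = list(taint) if taint is not None else [False] * len(text)

    @classmethod
    def from_source(cls, text):
        # Strings read from e.g. window.name or document.URL start fully tainted.
        return cls(text, [True] * len(text))

    def __add__(self, other):
        return TaintedString(self.text + other.text, self.taint + other.taint)

    def slice(self, start, end):
        return TaintedString(self.text[start:end], self.taint[start:end])

    def is_tainted(self):
        return any(self.taint)

prefix = TaintedString("user=")                  # hard-coded, untainted
suffix = TaintedString.from_source("alert(1)")   # derived from the URL
combined = prefix + suffix
assert combined.is_tainted()
assert not combined.slice(0, 5).is_tainted()     # all tainted characters removed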

[Figure: xss-runtime (execution speed of various browsers under various benchmarks)]

As for compatibility, we crawled the Alexa Top 10000 and observed how often our protection mechanism stopped execution. Every blocked script counts towards incompatibility, because we assume that our browser was not under attack while crawling. That methodology is certainly not perfect, because only shallowly crawling front pages does not actually indicate how broken the actual Web app is. To compensate, we used the WebKit rendering tests, hoping that they cover most of the important functionality. Our results indicate that scripts from 26 of the 10000 domains were blocked. Out of those, 18 were actually vulnerable to DOM-based XSS, so blocking their execution happened because a code fragment like the following is actually indistinguishable from a real attack. Unfortunately, such scripts are quite common 🙁 They are used mostly by ad distribution networks and are really dangerous. So using an AdBlocker is certainly an increase in security.


var location_parts = window.location.hash.substring(1).split('|');
var rand = location_parts[0];
var scriptsrc = decodeURIComponent(location_parts[1]);
document.write("<scr"+"ipt src='" + scriptsrc + "'></scr"+"ipt>");

Modifying WebKit for the Web parts and V8 for the JavaScript parts to be taint-aware was certainly a challenge. I had neither seriously programmed C++ before nor looked much into compilers. So modifying Chromium, the big beast, was not an easy task for me. Besides those handicaps, there were technical challenges, too, which I didn’t think of when I started to work on a solution. For example, hash tables (or hash sets) with tainted strings as keys behave differently from untainted strings. At least they should. Except when they should not! They should not behave differently when it’s about querying for DOM elements. If you create a DOM element from a tainted string, you should be able to find it again with an untainted string. But when it comes to looking up a string in a cache, we certainly want to have the taint information preserved. I hence needed to inspect each and every hash table for its usage of tainted or untainted strings. I haven’t found them all, as WebKit’s (extensive) Layout tests still showed some minor rendering differences. But it seems to work well enough.
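Here is a small sketch of those two conflicting requirements, again as a Python toy model rather than the actual C++: for element lookups the taint must be invisible to hashing and equality, while a cache that is supposed to preserve taint has to key on the taint bit as well.

class TaintStr(str):
    """A str subclass carrying a taint flag that hashing deliberately ignores."""
    def __new__(cls, value, tainted=False):
        obj = super().__new__(cls, value)
        obj.tainted = tainted
        return obj

elements = {"login-box": "<div>"}          # getElementById-style lookup table
key = TaintStr("login-box", tainted=True)
assert elements[key] == "<div>"            # taint ignored: the element is found

compiled_cache = {}                        # a cache that must preserve taint
compiled_cache[(str(key), key.tainted)] = "compiled with taint"
assert ("login-box", False) not in compiled_cache  # untainted lookup misses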

As for the protection capabilities of our approach, we measured 100% protection against DOM-based XSS. That sounds impressive, right? Our measurements were two-fold. We used the already mentioned Layout Tests, extended with some more DOM-XSS test cases, as well as real-life vulnerabilities. To find those, we used the reports that the patched Chromium generated while crawling the Web (as mentioned above for the compatibility scan) to automatically craft exploits. We then verified that the exploits do indeed work. At 757 of the top 10000 domains, the number of exploitable domains was quite high. But our approach might not add much protection if the already existing built-in mechanism, the XSS Auditor, protects against those attacks anyway. So we ran the stock browser against the exploits and checked how many of them were successful. The XSS Auditor protected about 28% of the exploitable domains. Our taint-tracking-based solution, as already mentioned, protected against 100%. That number is not very surprising, because we used the very same codebase to find the vulnerabilities. But we couldn’t do any better, because there is no other source of DOM-based XSS vulnerabilities…

You could, however, trick the mechanism by using indirect flows. An example of such an indirect data flow is the following piece of code:


// Explicit flow: Taint propagates
var value1 = tainted_value === "shibboleth" ? tainted_value : "";
// Implicit flow: Taint does not propagate
var value2 = tainted_value === "shibboleth" ? "shibboleth" : "";

If you have such code, then we cannot protect against exploitation. At least not easily.

For future work in the Web context, the approach presented here can be made compatible with server-side taint tracking to persist taint information beyond the lifetime of a Web page. A server-side Web application could transmit taint information for the strings it sends so that the client could mark those strings as tainted. Following that idea it should be possible to defeat other types of XSS. Other areas of work are the representation of information about the data flows in order to help developers to secure their applications. We already receive a report in the form of structured information about the blocked code generation. If that information was enriched and presented in an appealing way, application developers could use that to understand why their application is vulnerable and when it is secure. In a similar vein, witness inputs need to be generated for a malicious data flow in order to assert that code is vulnerable. If these witness inputs were generated live while browsing a Web site, a developer could more easily assess the severity and address the issues arising from DOM-based XSS.

AMCIS: Towards inter-organizational Enterprise Architecture Management – Applicability of TOGAF 9.1 for Network Organizations

First of all, there is a LaTeX template for the AMCIS conference now. I couldn’t believe that those academics use Word to typeset their papers. I am way too lazy to use Word, so I decided to implement their (incomplete and somewhat incoherent) style guide as a LaTeX class. I guess it was an investment, but it paid off the moment we needed to compile our list of references. Because, well, we didn’t have to do it… Our colleagues used Word and they spent at least a day double-checking whether their references were consistent. Not fun. On the technical side: Writing LaTeX classes is surprisingly annoying. The infrastructure is very limited. Everything feels like a big hack. Managing control flow, implementing data structures, de-duplicating code… How did people manage to write all these awesome LaTeX packages without having even the very basic infrastructure?!

As I promised in a recent post, I am coming back to literature databases. We wrote a literature review and thus needed to query databases. While doing the research I took note of some features and oddities and to save some souls from having to find out all that manually, I want to provide my list of these databases. One of my requirements was to export to a sane format. Something text based, well defined, easy to parse. The export shall include as much meta-data as possible, like keywords, citations, and other simple bibliographic data. Another requirement was the ability to deep link to a search. Something simple, you would guess. But many fall short. Not only do I want the convenience of not having to enter rather complex search queries manually (again), I also want to collaborate. And sending a link to results is much easier than exchanging instructions as to where to click.

  • Proquest
    • Export to RIS with keywords
    • Deeplink is hidden, after “My Searches” and “actions”
  • Palgrave
    • Export as CSV: Title, Subtitle, Authors/Editors, Publication Date, Online Date, Ebook Collection, Journal Title, ISBN13, ISSN, Content Type, URL
    • No ability to link to a search
  • Wiley
    • Export possible (BibTex, others), with keywords, but limited to 20 at a time
    • Link to Search not possible
  • JSTOR
    • Deeplinks to a search are possible (just copy the URL)
    • Export works (BibTeX, RIS), but not with keywords
  • EBSCO
    • Link to search a bit hidden via “Share”
    • No mass export of search results. Individual records can be exported.
  • bepress
    • Linking to a search is possible
    • Export not possible directly, but via other bepress services, such as AISNet. But then it’s hidden behind “show search”, then “advanced search”, and then you can select “Bibliography Export” (EndNote)
  • Science Direct
    • Not possible to link to a search. But one can create an RSS feed.
    • But export works, and it includes keywords
  • Some custom web interface

On the paper (pdf link) itself: It’s called “Towards inter-organizational Enterprise Architecture Management – Applicability of TOGAF 9.1 for Network Organizations” and we investigated what problems the research community identified for modern enterprises and how well an EAM framework catered for those needs.

The abstract is as follows:

Network organizations and inter-organizational systems (IOS) have recently been the subjects of extensive research and practice.
Various papers discuss technical issues as well as several complex business considerations and cultural issues. However, one interesting aspect of this context has not received adequate coverage so far, namely the ability of existing Enterprise Architecture Management (EAM) frameworks to address the diverse challenges of inter-organizational collaboration. The relevance of this question is grounded in the increasing significance of IOS and the insight that many organizations model their architecture using such frameworks. This paper addresses the question by firstly conducting a conceptual literature review in order to identify a set of challenges. An EAM framework was then chosen and its ability to address the challenges was evaluated. The chosen framework is The Open Group Architecture Framework (TOGAF) 9.1 and the analysis conducted with regard to the support of network organizations highlights which issues it deals with. TOGAF serves as a good basis to solve the challenges of “Process and Data Integration” and “Infrastructure and Application Integration”. Other areas such as the “Organization of the Network Organization” need further support. Both the identification of challenges and the analysis of TOGAF assist academics and practitioners alike to identify further research topics as well as to find documentation related to inter-organizational problems in EAM.

FTR: The permissions I needed to give away were surprisingly relaxed:

By checking the box below, I grant AMCIS 2013 Manuscript Submission on behalf of AMCIS 2013 the non-exclusive right to distribute my submission (“the Work”) over the Internet and make it part of the AIS Electronic Library (AISeL).
I warrant as follows:

    • that I have the full power and authority to make this agreement;
    • that the Work does not infringe any copyright, nor violate any proprietary rights, nor contain any libelous matter, nor invade the privacy of any person or third party;
    • that the Work has not been published elsewhere with the same content or in the same format; and
    • that no right in the Work has in any way been sold, mortgaged, or otherwise disposed of, and that the Work is free from all liens and claims.

I understand that once a peer-reviewed Work is deposited in the repository, it may not be removed.

DFN Workshop 2015

As in the last few years, the DFN Workshop happened in Hamburg, Germany.

The conference was keynoted by Steven Le Blond, who talked about targeted attacks, e.g. against dissidents. He mentioned that he had already presented the content at the USENIX Security conference, which some people consider to be excellent. He first showed how he used Skype to look up IP addresses of his boss and how similar targeted attacks were executed in the past. Think Stuxnet. His main focus, though, was attacks on NGOs, in particular an attacker sending malicious emails to the victim.

In order to find out what attack vectors were used, they contacted over 100 NGOs to ask whether they had been attacked. Two NGOs affiliated with the WUC, which represents the Uyghur minority in China, had received 1500 malicious emails, 1100 of which carried malware. He showed examples of those emails and some of them were indeed very targeted. They contained a personalised message with enough context to look genuine. However, the mail also had a malicious DOC file attached. Interestingly enough, the infrastructure used by the attacker for the targeted attacks was re-used for several victims. You would have expected the attacker to keep their infrastructure separated for the various victims, especially when carrying out targeted attacks.

They also investigated how quickly the attacker exploited publicly known vulnerabilities. They measured the time between the public release of a vulnerability and the malicious email exploiting it. They found that some of the attacks were launched on day 0, meaning that as soon as a vulnerability was publicly disclosed, an NGO was attacked with a relevant exploit. Perhaps interestingly, they did not find any 0-day exploits being launched. They also measured how the security precautions taken by Adobe for their Acrobat Reader and by Microsoft for their Office product (think sandboxing) affected the frequency of attacks. It turned out that it does help to make your software more secure!

To defend against targeted attacks based on spoofed emails he proposed to detect whether the writing style of an email corresponds to that of previously seen emails of the presumed contact. In fact, their research shows that they are able to tell whether the writing style matches that of previous emails with very high probability.

The following talk assessed end-to-end email solutions. It was interesting, because they created a taxonomy for 36 existing projects and assessed qualities such as their compatibility, the trust model used, or the platform they run on.
The 36 solutions they identified were (don’t hold your breath, wall of links coming): Neomailbox, Countermail, salusafe, Tutanota, Shazzlemail, Safe-Mail, Enlocked, Lockbin, virtru, APG, gpg4o, gpg4win, Enigmail, Jumble Mail, opaqueMail, Scramble.io, whiteout.io, Mailpile, Bitmail, Mailvelope, pEp, openKeychain, Shwyz, Lavaboom, ProtonMail, StartMail, PrivateSky, Lavabit, FreedomBox, Parley, Mega, Dark Mail, opencom, okTurtles, End-to-End, kinko.me, and LEAP (Bitmask).

Many of them could be discarded right away, because they were not production ready. The list could be further reduced by discarding solutions which do not use open standards such as OpenPGP, but rather proprietary message formats. After applying more filters, such as that the private key must not leave the realm of the user, the list could be condensed to seven projects. Those were: APG, Enigmail, gpg4o, Mailvelope, pEp, Scramble.io, and whiteout.io.

Interestingly, the latter two were not compatible with the rest. The speakers attributed that to the use of PGP/MIME vs. PGP/Inline, and they favoured the latter. I don’t think that’s a good idea, though. The authors attest that pEp has a lot of potential and they seem to have indeed interesting ideas. For example, they offer to sign another person’s key by reading “safe words” over a secure channel. While this is not a silver bullet to the keysigning problem, it appears to be much easier to use.

While we are on the subject of keysigning: I have placed an article in the conference proceedings. It’s about GNOME Keysign. The paper’s title is “Welcome to the 2000s: Enabling casual two-party key signing”, which I think reflects in what era the current OpenPGP infrastructure is stuck. The mindsets of the people involved are still a bit stuck in the old days, when dealing with computation machines was a thing for those with long and white beards. The target group of users for secure communication protocols has inevitably grown much larger than it used to be. While this sounds trivial, the interface to GnuPG has not significantly changed since. It also still makes it hard for others to build higher-level tools by making bad default decisions, demanding to be in control of “trust” decisions, and by requiring certain environmental conditions (i.e. the filesystem to be used). GnuPG is not a mere library. It seems to understand itself as a complete crypto suite. Anyway, in the paper, I explain how I think contemporary keysigning protocols work, why that is not a good thing, and how to make it better.

I propose to further decentralise OpenPGP by enabling people to have very small keysigning “parties”. Currently, the setup cost of a keysigning party is very high. This is, amongst other things, due to the fact that an organiser is required to collect all the keys, to compile a list of participants, and to make the keys available for download. Then, depending on the size of the event, the participants queue up for several hours, only to then tick checkboxes on pieces of paper. A gigantic secops fail. The smarter people sign every box they tick so that an attacker cannot “inject” a maliciously ticked box onto the paper sheet. That’s not fun. The not-so-smart people don’t even bring their sheets of paper, or have them printed by a random person who happens to also be at the conference and, surprise, has access to a printer. What a gigantic attack surface. I think this is bad. Let’s try to reduce that surface by reducing the size of the events.

In order to enable people to have very small events, i.e. two people keysigning, I propose to make most of the actions of a keysigning protocol automatic. So instead of requiring the user to manually compare the fingerprint, I propose that we securely transfer the key to be signed. You might rightfully ask how to do that. My answer is that we’ve passed the 2000s and that we carry devices which are capable of opening a TCP connection on a link-local network, e.g. over WiFi. I know, this is not necessarily a given, but let’s just assume for the sake of simplicity that one of the devices we carry along can actually do WiFi (and that the network does not block connections between machines). This also prevents certain attacks that users of the current Best Practises are still vulnerable to, namely using short key IDs or leaking who you are communicating with.
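To give an idea of the transfer step, here is a minimal sketch, assuming both machines sit on the same link-local network: one side serves the previously exported public key over a plain TCP/HTTP socket, and the other side downloads it and checks it against a short digest communicated out of band (read aloud, or scanned as a barcode). This is not the GNOME Keysign wire protocol, only the shape of the idea; the file name and the port are made up.

import hashlib
from http.server import BaseHTTPRequestHandler, HTTPServer

KEY_BYTES = open("my-public-key.asc", "rb").read()  # exported beforehand

class KeyHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "application/pgp-keys")
        self.end_headers()
        self.wfile.write(KEY_BYTES)

# The sender shows this digest out of band (e.g. as a barcode); the receiver
# downloads http://<sender>:8080/ and compares the digest before importing
# and signing the key.
print("transfer digest:", hashlib.sha256(KEY_BYTES).hexdigest()[:16])
HTTPServer(("", 8080), KeyHandler).serve_forever()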

Another step that needs to be automated is signing the key. It sounds easy, right? But it’s not just a mere gpg --sign-key. The first problem is that you don’t want the key that is to be signed to pollute your keyring. That can be fixed by using --homedir or the GNUPGHOME environment variable. But then you also want to sign each UID on the key separately. And this is where things get a bit more interesting. Anyway, to make a long story short: we’re not able to do that with plain GnuPG (as of now) in a sane manner. And I think it’s a shame.
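What we need looks roughly like the sketch below: drive the gpg binary with a throwaway home directory so that the key to be signed never touches the real keyring. Note the assumptions baked in: the signing (secret) key has to be made available in that temporary home as well, --quick-sign-key only exists in newer GnuPG versions, and passphrase handling is ignored entirely, which is precisely why I say plain GnuPG offers no sane way to do this.

import subprocess
import tempfile

def sign_key_isolated(secret_key_file, key_file, fingerprint):
    """Sign a key inside a throwaway keyring and return the signed export."""
    with tempfile.TemporaryDirectory() as homedir:
        base = ["gpg", "--homedir", homedir, "--batch", "--yes"]
        # The signing (secret) key must be available in the temporary home,
        # too (one of the reasons this is not just a plain gpg --sign-key).
        subprocess.run(base + ["--import", secret_key_file], check=True)
        # Import the key we were handed over the local network.
        subprocess.run(base + ["--import", key_file], check=True)
        # Certify it; signing each UID separately needs yet more work.
        subprocess.run(base + ["--quick-sign-key", fingerprint], check=True)
        # Export the signed key so it can be sent back to its owner.
        done = subprocess.run(base + ["--armor", "--export", fingerprint],
                              check=True, capture_output=True)
        return done.stdout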

Lastly, sending the key needs to be as “zero-click” as possible, too. I propose to simply reuse the current MUA of the user. That sounds easy, but unfortunately, it’s only 2015 and we cannot interact with, say, Evolution and Thunderbird in a standardised manner. There is xdg-email, but it has annoying bugs and doesn’t seem to be maintained. I’m waiting for a sane Email API. I mean, Email has been around for some time now, so let’s try to actually use it. I hope to be able to make another, more formal announcement on GNOME Keysign soon.

the userbase for strong cryptography declines by half with every additional keystroke or mouseclick required to make it work

— attributed to Ellison.

Anyway, the event was good, I am happy to have attended. I hope to be able to make it there next year again.

On Academia…

A paper that I authored was published a while ago, but I’ve put this post off for a long time now. Before talking about the paper itself, I want to talk about Academia, as I have the feeling that I need to defend myself for playing their game™. The following may sound overly pessimistic, and while a few bright spots are going to be mentioned, many have been left out for ranting reasons. Keep that in mind when reading this somewhat unstructured rant…

Published papers are the currency in Academia. The more you have, the more respected you are. The quantity is the main metric. No wonder, given that quality control measures are not very well deployed. Pretty much the only mechanism to ensure quality is peer review. The holy grail.

Although the more papers at “better” conferences or journals you have, the better you are, the quality of the conference or journal and the quality of the paper are rarely questioned after publication. Again, I don’t have proper proof for the statements I make, as this is supposed to be a more general rant on current practises in Academia. I can only tell from experience: from listening to people talking about fellow academics, from observing key metrics in various web portals, or from seeing people apply for academic positions. Those people usually have an enumeration of their publications. Maybe it’s a “selection”. But I’ve never seen people put up a “ranking” of the quality of the publisher or of the publication itself. And it wouldn’t make sense, because we don’t have metrics for that, anyway. Sure, there are some people or companies trying to come up with something meaningful. But metrics such as “rejection rate”, “number of citations”, or “h-index” are inherently flawed. For many reasons. Mainly because the data is proprietary. You have to rely on the conference or the journal providing you with correct data. You cannot know whether it is correct, as there is no right for you to know. Secondly, the metrics might suffer from chilling effects: people may think the quality of their prospective publication is too weak to be published at a “highly ranked” conference, so they don’t even bother to submit. Other metrics, like the average citation count after five years, resemble a stochastic experiment more than they reflect the quality of the publications (Ike Antkare anyone?). Again, you have the effect of people wanting to cite some paper from a “highly ranked” conference, because that is what people will cite in the future. And in order to be found more easily in the future via backwards citation searches, you’d rather cite publications you think will be cited more often in the future (cf.).

Talking about quality…

You have to trust the peer review of the conference or journal but you actually cannot because you don’t even know who the peers were. It’s good to have an informed opinion and it’s a good thing to be able to rely on an informed judgement. But it’s not good having to rely on that. If, for whatever reason, a peer fails to provide appropriate reviews, one should be able to make a decision oneself. Some studies have indeed shown that the peer review process is no better than flipping a coin. So there seems to be some need to review the peer review.

Once again, to be clear: I don’t mind peer review. I think it’s good. Blindly publishing without ensuring that there is indeed an advancement of the world’s knowledge wouldn’t be good. And peer review could be a tool to control that. But it doesn’t do that right now. I don’t have any concrete proposal. But I think if the reviews themselves and the reviewers were known, then we could make better decisions as to whether to “trust” a publication or not.

Another proposal is to not have “journals” as physical hard copies anymore. It is 2015, we have the Web, we have some cool technologies. But we don’t make use of any of that. Instead, we are maintaining the status from 20, or rather 200, years ago. We still subscribe to one-off bundles of printed and stapled paper. And we pay loads for that. And not only do we pay loads for receiving that: if you want to publish in one of those journals (or conferences), you have to pay, too. In fairness, it’s not only the printing and stapling that costs money, but the services around that. Things like proof-reading (has anyone ever actually gotten copy-editing?), the peer review (has any peer ever gotten any reimbursement?), or the maintenance of an online database (why is it so damn hard to use any of these web databases?) are things we pay money for. I doubt that we need journals in their current form. We probably do need entities (call them “publishers”), who in turn will need to earn some money, to make sure everything is going smoothly. But we don’t need print-and-forget style publishing. If we could add things like comments, annotations, links, reviews, supplementary material, or a varying level of detail to a paper, even after a few years or even decades, we could move to a “permanently peer reviewed” model. A publication would be reviewed all the time. Ideally by the general public. We could model our current workflow by delegating some form of trust to a group of people, say “reviewers of Journal X”, and only see what these people have vouched for. We could then selectively exclude people from that group of trustees, much like the web of trust. We could, if a paper makes an assumption which is falsified in the future, render a warning when opening the publication. We could decentralise the data such that everyone could build their own index, search mechanism, or interface.

On interfaces

Right now, if you wanted to, say, re-conduct the experiments done in published papers and share your results, you would have to create a publication (which is expected, but right now you would likely have to pay for that) and cite the papers whose results you are trying to reproduce. That’s okay. But if I then wanted to see when and how successfully people tried to redo the experiments, I’d have to rely on the database I’m using to provide a reverse citation search and to have the correct data (which, for some databases, seems to be the ability to do OCR on the PDF…). That’s not how things should work nowadays, right? We’d expect something more interactive, with tags, open data, something wiki-esque. While the ability to reverse-search citations, to highlight some key references, or to link to a key contribution that followed a paper at hand would be nice indeed, we probably have to step back and make existing functionality somewhat usable first. I’m not talking about advanced stuff like exporting search results in a standardised format or deep linking to a result set from a query. That would need treatment after we’ve solved actually searching for multiple keywords, excluding some conferences or journals, or joining or intersecting queries. All of that only works to some extent, and it’s depressing that we cannot do anything about it, because we don’t have the relevant access or data. Don’t believe me? Well, you shouldn’t. But I’ll provide a table, probably in another post, showing what works with which database and what does not.

On experiments

As I was referring to reproducing results: it is pretty much impossible to reproduce any result, at least in my field, computer science. You don’t get the raw data, let alone the programs to run to get the results. You could argue that it is too complicated to maintain a program that can be run on any platform. Fair enough. I don’t have a solution. But the situation right now is not a good status quo. Right now you don’t get anything. So even if you had the very same setup as the authors of some publication, you would not be able to redo the experiments. It’s likely to be similar in other disciplines. I imagine that rocket scientists do experiments with self-made devices or with some utterly expensive appliance (think LHC). Nobody will be able to reproduce the results, simply because there is just that one LHC out there… But… fortunately we have many digital things which are easy to archive and distribute. We, computer scientists, should make use of that. Why not require authors to submit a virtual appliance in some openly specified format? Obviously, source code would be nice, but even in academia there doesn’t seem to be a culture of sharing code freely, so I’m not even suggesting that.

Phew. After having criticised Academia and having made some half-baked proposals, I forgot what I actually wanted to do: be a good academic (not caring about the public perception of “good” in terms of quantity of publications) and discuss a few things around the paper that we paid a couple of hundred dollars to get published. But I leave that for another rant post.

In what ways do you think is Academia broken?

Critical Review of Tesseract

For CA640 we were supposed to pick a paper from the International Conference on Software Engineering 2009 (ICSE 2009) and critically review it.

I chose to review Tesseract: Interactive Visual Exploration of Socio-Technical Relationships in Software Development.

You can find the review in PDF here. Its abstract reads:

This critical review of a paper, which presents Tesseract and was handed in for ICSE 2009, focuses on the strengths and weaknesses of the idea behind Tesseract: visualising and exploring freely available and loosely coupled fragments (mailing lists, bug tracker or commits) of Free Software development.
Tesseract is thus a powerful data miner as well as a GUI to browse the obtained data.

This critique evaluates the usefulness of Tesseract by questioning the fundamental motivation it was built on, the data which it analyses and its general applicability.

Existing gaps in the original research are filled by conducting interviews with relevant developers as well as providing information about the internal structure of a Free Software project.

Tesseract is a program that builds and visualises a social network based on freely available data from a software project, such as mailing lists, the bug tracker, or commits to a software repository. This network can be interactively explored with the Tesseract tool, which shows how communication among developers relates to changes in the actual code. The authors used a project under the GNOME umbrella named Rhythmbox to show their data mining and the program in operation. GNOME is a Free/Libre Software desktop used as the default by many Linux distributions, including the most popular ones, i.e. Ubuntu and Fedora. To assess Tesseract’s usability and usefulness, the authors interviewed people not related to Rhythmbox, asking whether Tesseract was usable and provided useful information.

The paper was particularly interesting for me because the authors analysed data from the GNOME project. As I am a member of that development community, I wanted to see how their approach can or cannot increase the quality of the project. Another focus was to help their attempt to improve GNOME by highlighting where they may have gaps in their knowledge of its internals.

During this critique, I will show that some assumptions were made that do not hold for Free/Libre and Open Source Software (FLOSS) in general and for GNOME in particular either because the authors simply did not have the internal knowledge or did not research carefully enough. Also I will show that the used data is not necessarily meaningful and I will attempt to complement the lacking data by presenting the results of interviews I conducted with actual GNOME developers. This will show how to further improve Tesseract by identifying new usage scenarios. Lastly, this text will question the general usefulness of Tesseract for the majority of Free Software projects.

MSN Shutdown in 2003

During CA640 I was made to write an ethical review, which I was supposed to hand in using a dodgy webservice. Since it got 90%, people bugged me to make it available 😉 Of course, I don’t have a problem with that, so people now have a reference or know what to expect when they enter the course.

You can find the PDF here and its abstract reads:

At the end of 2003, Microsoft closed the public chat-rooms of its Internet service called MSN.
MSN was pushed to do so by children’s charities because they feared abuse of these chat-rooms.
In some countries, however, the service remained available, but subject to a charge.
This review raises ethical questions about Microsoft’s and the children’s charities’ behaviour, because making people pay with the excuse of protecting children is considered ethically questionable.
Also, the children’s charities pushed for the closure of a heavily used service although there is absolutely no evidence that children would be safer after closing down a chat-room.

If you are not interested in the non-technical details, you might be interested to know that I use a Mercurial hook on the server side to automatically compile the LaTeX sources once I push changes to the server:

$ cat .hg/hgrc
[hooks]
changegroup.compile = export FILE=paper && hg up -C && pdflatex --interaction=batchmode $FILE && bibtex $FILE && pdflatex --interaction=batchmode $FILE && pdflatex --interaction=batchmode $FILE

And then I just symlink that resulting PDF file to my public_html directory.

Digital Divide

As a student, I have to write a term paper every now and then. Since I firmly believe that university, science, and knowledge should be as free as possible, and since everybody potentially finances studying by paying taxes, I think everybody has the right to at least see what I actually do all day long.

Thanks to the Internet, it is rather easy these days to publish things and pass knowledge on. So here is a term paper I wrote last semester in Gender Studies.

[Image: Alien Toilet Sign]

The paper is called “Weiblicher Zugang zu Technik und feministische Politiken” (“Female access to technology and feminist politics”) and the abstract reads as follows:

The reasons leading to the digital divide are manifold, and gender is one of them.
Women’s groups, too, have the goal of increasing the share of female participants in the digital sphere.
This paper analyses how that goal is supposed to be achieved, why it does not succeed, and how it might be achieved after all.

The PDF is available here and is licensed for everybody under “Namensnennung-Keine kommerzielle Nutzung-Weitergabe unter gleichen Bedingungen 3.0 Deutschland” (Attribution-NonCommercial-ShareAlike 3.0 Germany). That does not mean, however, that I cannot license it differently upon request.

The paper reads a bit bumpily in places, which is owed to its genesis: essentially, two and a half papers were merged into one. I hope it is not too bad nonetheless.

Should the PDF not be that exciting content-wise, it is still worth paying attention to the technical details. The PDF knows how its content is licensed. For that, it uses XMP streams which are embedded into the PDF. They got into the PDF via LaTeX with the hyperxmp package. Officially, xmpincl is still recommended, but that one is really nasty to use, because you have to create the XMP stream yourself.

\usepackage{hyperxmp}         % To have an XMP data stream, e.g. to include the license
[...]
\hypersetup{
        pdftitle={Weiblicher Zugang zu Technik und feministische Politiken},
        pdfauthor={Tobias Mueller},
        [...]
        pdfcopyright={This work is licensed to the public under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Germany License.},
        pdflicenseurl={http://creativecommons.org/licenses/by-nc-sa/3.0/de/}
}

My Evince 2.29.1 (built with JHBuild) happily displays the license information; Okular 0.9.2 does not. I do not know how else to view XMP data embedded in a PDF. It would certainly be interesting for automated processing.
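For automated processing one does not even need a PDF library: the XMP packet is normally stored uncompressed precisely so that non-PDF-aware tools can find it. A crude sketch in Python, which simply scans the raw bytes and will of course miss PDFs whose metadata stream is compressed:

import re
import sys

def extract_xmp(pdf_path):
    """Pull the first XMP packet out of a PDF by scanning its raw bytes."""
    data = open(pdf_path, "rb").read()
    match = re.search(rb"<x:xmpmeta.*?</x:xmpmeta>", data, re.DOTALL)
    return match.group(0).decode("utf-8", errors="replace") if match else None

if __name__ == "__main__":
    print(extract_xmp(sys.argv[1]) or "no uncompressed XMP packet found")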

Many thanks to Chillum and Sourci, who supported me with advice and patches and who probably have the text coming out of their ears by now 😉

For a discussion of the content, the comment function is probably not well suited, but in the absence of alternatives it is available for that purpose. I like the solution the Djangobook uses: in the margin of every paragraph there is a comment function, and it works very well.

Adding Linux Syscall

In a course (CA644) we were asked to add a new syscall to the Linux kernel.

As I believe that knowledge should be as free and as accessible as possible, I thought I had to at least publish our results. Another (though minor) reason is that society, to some extent, pays for me doing science, so I believe that society deserves to at least see the results.

The need to actually publish this is not very big, since a lot of information on how to do it exists already. However, much of that is outdated. A good article is the one by macboy, but it fails to mention a minor fact: the syscall() function is variadic, so it takes as many arguments as you give it.
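To see the variadic syscall() wrapper in action from userland without writing any C, here is a small Python sketch using ctypes. The syscall number 39 (getpid on x86-64) is real; NR_NEW_SYSCALL is a placeholder for whatever number the new syscall was assigned in the modified kernel.

import ctypes

libc = ctypes.CDLL(None, use_errno=True)  # the running process' libc
libc.syscall.restype = ctypes.c_long

# syscall() is variadic: pass the syscall number plus however many
# arguments that particular syscall expects.
SYS_getpid = 39                # x86-64 syscall number for getpid
print("pid:", libc.syscall(SYS_getpid))

# Hypothetical: the syscall added in the paper, taking one buffer argument.
NR_NEW_SYSCALL = 337           # assumption; depends on the modified kernel
buf = ctypes.create_string_buffer(8)
ret = libc.syscall(NR_NEW_SYSCALL, buf)
print("new syscall returned:", ret, "errno:", ctypes.get_errno())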

So the abstract of the paper, that I’ve written together with Nosmo, reads:

This paper shows how to build a recent Linux kernel from scratch, how to add a new system call to it and how to implement new functionality easily.
The chosen functionality is to retrieve the stack protecting canary so that mitigation of buffer overflow attacks can be circumvented.

And you can download the PDF here.

If it’s not interesting for you content wise, it might be interesting from a technical point of view: The PDF has files attached, so that you don’t need to do the boring stuff yourself but rather save working files and modify them. That is achieved using the embedfile package.

\usepackage{embedfile}        % Provides \embedfile[filename=foo, desc={bar}]{file}
[...]
\embedfile[filespec=writetest.c, mimetype=text/x-c,desc={Program which uses the new systemcall}]{../code/userland/writetest.c}%

If your PDF client doesn’t allow you to save the files (Evince does 🙂), you might want to run pdftk $PDF unpack_files in some empty directory.

Creative Commons Attribution-ShareAlike 3.0 Unported
This work by Muelli is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported.