Talking at GPN 2018 in Karlsruhe, Germany

Similar to last year I managed to attend the Gulasch Programmier-Nacht (GPN) in Karlsruhe, Germany. Not only did I attend, I also managed to squeeze in a talk about PrivacyScore. We got the prime time slot on the opening day along with all the other relevant talks, including the Eurovision Song Contest, so we were not overly surprised that the audience had a hard time deciding where to go and eventually decided to attend talks which were not recorded. Our talk was recorded and is available here.

Given that tough selection by the competing talks, the people who did come were genuinely interested. And that showed during the official Q&A as well as in the hallway track. We exchanged contacts with other interested parties and got a few excellent comments on the project.

Another excellent part of this year’s GPN was the exhibition in the museum. As GPN takes place in a joint building belonging to the local media university as well as the superb art and media museum, the proximity to the artsy things allows for an interesting combination. This year, the open codes exhibition was not only hosted in the ZKM, but GPN also took place within that exhibition. A fantastic setup. Especially with the GPN’s motto being “digital naïves”. One of the exhibition’s pieces is an assembly robot’s hand doing nothing else but writing a manifesto. Much like a disciplinary action for a school child. Except that the robot doesn’t care so much. Yet its usefulness extends only to writing these manifestos. And the robot doesn’t learn anything from it. I like this piece, because it makes me think about the actions we take hoping that they have a desired effect on something or someone while we actually don’t know whether this is indeed the case.

I also like the Critical Engineering Manifesto being exhibited. I like to think about how the people who actually implement certain technologies can be held responsible for their effects on individuals or society. Especially with more and more “IoT” deployments, where the “S” stands for security. It’s easy to blame Facebook for “leaking” user profiles although it’s in their Terms of Service, but it’s harder to shift the blame for the smart milk sensor in my fridge invading my privacy by reporting how much I consume. We will have interesting times ahead of us.

An exhibit pointing out the beauty of algorithms and computation is a board that renders a Julia set. That wouldn’t be so impressive in itself, but you can watch the machine actually compute the values. The exhibit has a user-controllable speed regulator and gives insight into the CPU as well as the higher-level code. I think it’s an ingenious idea to let the user go full speed and see the captivating movements of the beautiful Julia set, while also allowing them to go super slow and investigate how this beauty is composed of relatively simple operations. Then again, the slow execution itself is relatively boring. We get to see that we have to go very fast in order to be entertained. So fast that we cannot really comprehend what is going on.

I wholeheartedly recommend visiting this exhibition. And the GPN, of course, too. It’s a nice chaotic event with a particular flair. It’s getting more and more crowded though, so better go while the feeling lasts and hasn’t been drowned out by all the tourists.

Talking on PrivacyScore at DFN Security Conference 2018 in Hamburg, Germany

I seem to have skipped last year, but otherwise I have been to the DFN Workshop regularly. While I had a publication at this venue before, it’s only this year that I got to present at the conference myself.

I cannot comment much on the other talks, because I could not attend too many :( But our talk (slides) was well attended and I think people appreciated the presentation being a bit lighter than the previous one about the upcoming GDPR.

I talked about PrivacyScore.org and how we’ve measured German universities. The paper is here. Our results were mixed. As for TLS deployment, with a lot of imagination we can see a line dividing Germany. The West seems to have fewer problems with its TLS deployment than the East. The more red an area is, the worse its TLS support is. That ranges from not offering TLS at all to having an invalid certificate or using broken parameters.
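To give a rough idea of the kind of check behind those categories, here is a minimal sketch in Node.js; it is not PrivacyScore’s actual scanner, and the hostname is a placeholder:

// Minimal sketch of a TLS check, not PrivacyScore's actual implementation.
// The hostname below is a placeholder.
const tls = require("tls");
const host = "www.example-university.de";
const socket = tls.connect(443, host, {
  servername: host,
  rejectUnauthorized: false, // we want to inspect invalid certificates, not bail out
}, () => {
  console.log("negotiated protocol:", socket.getProtocol()); // e.g. "TLSv1.2"
  console.log("certificate valid:", socket.authorized);      // false for broken or invalid certs
  socket.end();
});
socket.on("error", () => console.log("no (working) TLS offered at all"));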

As for tracking their users, we had the hypothesis that privately run institutions have a higher interest in tracking their users than publicly run institutions. The following graphic reflects the geographic distribution of trackers on German universities’ Web sites.
That hypothesis can be confirmed by looking at the PrivacyScore list that distinguishes between those types of institutions.

We found data that was very likely not meant to be there, such as database dumps or Git repositories of the Web site’s code (including passwords for their staging environments, etc.). We tried to report these issues to the Web site operators, but it was difficult to get hold of the responsible people. For the 21 leaks we found, I have 93 emails in my mailbox. Ideally, the 21 I sent off would have been enough. But even sending those emails is hard, because people don’t respect RFC 2142 and don’t have a security@ address. Eventually, we made the Internet a tiny bit more secure by having those Web site operators remove the leaks from their sites, but there are still some pages which have (supposedly) unwanted information such as their visitors’ IP addresses online. The graph below shows that most of the operators who reacted did so in the first few days. So management of security incidents seems to be an area for improvement.

I hope to be able to return next year, if only for the catering ;-) Then I’d better attend some more talks and chat with the other guests.

Speaking at FOSDEM 2018 in Brussels, Belgium

As in the last ten (or so) years, I attended FOSDEM, the biggest European Free Software event. This year, though, I went a day earlier to attend one of the fringe events, the CHAOSSCon.

I hadn’t taken notice of the Linux Foundation announcing CHAOSS, an attempt to bundle various efforts regarding measuring and creating metrics of Open Source projects. The CHAOSS community is thus a bunch of formerly separate projects now under one umbrella.

OpenStack’s Ildiko Vancsa opened the conference by saying that metrics are what drive our understanding of communities and that we’re all interested in numbers. That helps us to understand how projects work and to make a more educated guess how healthy a project currently is, and, more importantly, what needs to be done in order to make it more sustainable. She also said that two communities within the CHAOSS project exist: the Metrics and the Software team. The Metrics team cares about what information should be extracted and how it can be presented in an informative manner. The Software team implements the extraction parts and builds the analytics. She pointed the audience to the Wiki, which hosts more information.

Georg Link from the metrics team then continued saying that health cannot universally be determined as every project is different and needs a different perspective. The metrics team does not work at answering the health question for each and every project, but rather enables such conclusions to be drawn by providing the necessary infrastructure. They want to provide facts, not opinions.

Jesus from Bitergia and Harish from Red Hat were talking on behalf of the technical team. Their idea is to build a platform to understand how software is developed. The core projects are prospector, cregit, ghdata, and grimoire, they said.

I think that we in the GNOME community can use data to make more informed decisions. For example, right now we’re fading out our Bugzilla instance and we don’t really have any way to measure how successful we are. In fact, we don’t even know what it would mean to be successful. But by looking at data we might get a better feeling of what we are interested in and which metrics we need to refine to better express what we want to know. Then we can evaluate measures by looking at the development of the metrics over time. Spontaneously, I can think of these relatively simple questions: How much review do our patches get? How many stale wiki links do we have? How soon are security issues being dealt with? Do people contribute to the wiki, documentation, or translations before creating code? Where do people contribute when coding stalls?
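As a toy example of how one of those questions could be approached once the data lives in a forge with an API, here is a sketch that uses the comment count on merged merge requests as a crude proxy for “how much review do our patches get?”. It assumes a GitLab instance with API v4; the instance URL and project ID are placeholders, and this is not an official GNOME metric:

// Rough sketch: average number of comments on recently merged merge requests,
// as a crude proxy for review activity. URL and project ID are placeholders.
// Run inside an async context (or a module with top-level await).
const api = "https://gitlab.example.org/api/v4";
const projectId = 1234;
const res = await fetch(`${api}/projects/${projectId}/merge_requests?state=merged&per_page=100`);
const mrs = await res.json();
const avgComments = mrs.reduce((sum, mr) => sum + mr.user_notes_count, 0) / mrs.length;
console.log(`average comments per merged MR (last ${mrs.length}):`, avgComments.toFixed(1));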

Bitergia’s Daniel reported on Diversity and Inclusion in CHAOSS and said he is building a bridge between the Metrics and the Software team. He tried to produce data on how many women were contributing, and what. Especially whether they would do any technical work. Questions they want to answer include whether minorities take more time to contribute or what impact programs like the GNOME Outreach Program for Women have. They still need to code up the relevant metrics but intend to be ready for the next OpenStack Gender diversity report.

Bitergia’s CEO talked about the state of the GrimoireLab suite.
It’s a software development analytics toolkit built largely on Python, ElasticSearch, and Kibana. One year ago it was still complicated to run the stack, he said. Now it’s easy and organisations like the Document Foundation run a public instance. Also because they want to be as transparent as possible, he said.

Yousef from Mozilla’s Open Innovation team then showed how they make use of Grimoire to investigate the state of their community. They ingest data from GitHub, Bugzilla, newsgroups, meetups, Discourse, IRC, Stack Overflow, their wiki, Rust crates, and a few other things, reaching back as far as 20 years. Quite impressive. One of the graphs he found interesting was one showing commits by time zone. He commented that it was not as diverse as he had hoped, as there were still many US time zones and far fewer Asian ones.

Raymond from the Linux Foundation talked about Metrics in Open Source Communities: what are they measuring and what do they do with the data? Measuring things is not too complicated, he said. But then you actually need to do something with it. Certain things are simply hard to measure, he said; as an example he gave the level of user or community support people provide. Another interesting aspect he mentioned is that it may be a very good thing when numbers go down, because projects may follow a hype cycle, too, and when your numbers drop, the project may simply be entering a more mature phase, he said. He closed with a quote he liked and noted that he’s not necessarily making fun of senior management: not everything that can be counted counts and not everything that counts can be counted.

Boris then talked about Crossminer, which is a European-funded research project. They aim to improve the management of software projects by providing in-context recommendations and analytics. It’s a continuation of the Ossmeter project. He said that such projects usually die after the funding runs out. The Crossminer project wants to be sustainable and survive the post-funding phase by building an actual community around the software the project is developing. He presented a rather high-level overview of what they are doing and what their software tries to achieve. Essentially, it’s an Eclipse plugin which gives you recommendations. The time was too short for going into the details of how they actually do it, I suppose.

Eleni talked about merging identities. When tapping various data sources, you have to deal with people having identities in different domains. You may want to merge the identities belonging to the same person, she said. She gave a few examples of what can go wrong when trying to merge identities. One of them is that some identities do not represent humans but rather bots. Commonly used labels are another problem, she said. She referred to email address prefixes which may very well be the same for different people, think j.wright@apple.com, j.wright@gmail.com, j.wright@amazon.com. They have at least 13 different problems, she said, and the impact of wrongly merging identities can be to either underestimate or overestimate the number of community members. Manual inspection is required, at least so far, she said.
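A minimal sketch of that failure mode, using the made-up addresses from above: merging purely on the local part of the address happily conflates what are (presumably) three different people:

// Naive identity merging on the local part of an email address.
// The addresses are the made-up examples from above, not real data.
const identities = ["j.wright@apple.com", "j.wright@gmail.com", "j.wright@amazon.com"];
const merged = new Map();
for (const email of identities) {
  const localPart = email.split("@")[0];            // "j.wright" for all three
  if (!merged.has(localPart)) merged.set(localPart, []);
  merged.get(localPart).push(email);
}
console.log(merged.size);  // 1 -- three people collapsed into one identity,
                           // so the community size would be underestimated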

The next two days were then dedicated to FOSDEM, which had a Privacy Devroom. There I had a talk on PrivacyScore.org (slides). I had 25 minutes, which I overran a little bit. I’m not used to these rather short slots. You just warm up talking and then the time is already up. Anyway, we had very interesting discussions afterwards with a few suggestions regarding new tests. For example, someone mentioned that detecting a CDN might be worthwhile given that CloudFlare allegedly terminates 10% of today’s Web traffic.

When sitting with friends we noticed that FOSDEM felt a bit like Christmas for us: Nobody really cares a lot about Christmas itself, but rather about the people coming together to spend time with each other. The younger people are excited about the presents (or the talks, in this case), but it’s just a matter of time for that to change.

It’s been an intense yet refreshing weekend and I’m looking very much forward to coming back next time. For some reason it feels really good to see so many people caring about Free Software.

Talking at GI Tracking Workshop in Darmstadt, Germany

Uh, I almost forgot about blogging about having talked at the GI Tracking Workshop in Darmstadt, Germany. The GI is, literally translated, the “informatics society” and sort of a union of academics in the field of computer science (oh boy, I’ll probably get beaten up for that description). Within that body several working groups exist, and one of these groups, working on privacy, organised this workshop about tracking on the Web.

I consider “workshop” a bit of a misnomer for this event, because it was mainly talks with a panel at the end. I was an invited panellist representing the Free Software movement, contrasting with a guy from affili.net, someone from eTracker.com, a lady from eyeo (the AdBlock Plus people), and professors representing academia. During the panel discussion I tried to focus on Free Software being the only tool that enables the user to exercise control over what data is being sent in order to control tracking. Nobody really disagreed, which made the discussion a bit boring for me. Maybe I should have tried to find another, more controversial argument to make people say more interesting things. Then again, it’s probably more the job of the moderator to make the participants discuss heatedly. Anyway, we had a nice hour or so of talking about the future of tracking, not only on the Web, but in our lives.

One of the speakers was Lars Konzelmann, who works at Saxony’s data protection office. He talked about the legislative nature of data protection issues. The GDPR, although almost two years old, is only now becoming a thing. Several types of EU-wide regulations exist, he said. One is the “Regulation” and the other is the “Directive”. The GDPR has been designed as a Regulation, because the EU wanted to keep a minimum level of quality across the EU and prevent countries from implementing their own legislation with rather lax rules, he said. The GDPR favours “privacy by design”, but that has issues, he said, because the usability implications are severe. So far, companies can get the user’s “informed consent” in order to do pretty much anything they want, although its usefulness is limited, he said, because people generally don’t understand what they are consenting to. With the GDPR, companies should implement privacy by design, which will probably obsolete the option for users to simply click “agree”, he said. So things will somehow get harder to agree to. That, in turn, may cause people to be unhappy and feel that they are being patronised and told what they should do, rather than expressing their free will with a simple click of a button.

Next up was a guy presenting their solution against tracking on the Web. They sell a little box which you use to surf the Web with, similar to what Pi-hole provides. It’s a Raspberry Pi with a modified (and likely GPL-infringing) version of Raspbian which you plug into your network and use as a gateway. I assume that the device then filters your network traffic to exclude known bad trackers. Anyway, he said that ads are only the tip of the iceberg. Below that is your more private, intimate sphere, which is being pried into via real-time bidding for your screen real estate by advertising companies. Why would that be a problem, you ask. He said that companies apply dynamic pricing depending on your profile and that you might well be interested in knowing that you are being treated worse than other people. Other examples include a worse credit or health rating depending on where you browse, or because your bank knows that you’re a gambler. In fact, micro-targeting allows for building up a political profile of you or making identity theft much easier. He then went on to explain how Web tracking actually works. He mentioned third-party cookies, “social” plugins (think: Like button), advertisements, and content providers like Google Maps, Twitter, or YouTube as means to track you. And that it’s possible to do non-invasive customer recognition which does not involve writing anything to the user’s disk, e.g. no cookies. In fact, such fingerprinting of the user’s browser is the new thing, he said. He probably knows, because he is also in the business of providing a tracker. That’s probably how he knows that “data management providers” (DMP) merge data sets of different trackers to get a more complete picture of the entity behind a tracking code. DMPs enrich their profiles by trading them with other DMPs. In order to match IDs, the tracker sends some code that makes the user’s browser merge the tracking IDs, e.g. make it send all IDs to all the trackers. He wasn’t really advertising his product, but during Q&A he was asked what we can do against that tracking behaviour and then he was forced to praise his product…

Eyeo’s legal counsel Judith Nink then talked about the juristic aspects of blocking advertisements. She explained why people use adblockers in the first place. I commented on that before, claiming that using an adblocker improves your security. She did indeed mention privacy and security being reasons for people to run adblockers and explicitly mentioned malvertising. She said that the Jerusalem Post had ads which were actually malware. That in turn caused some stir-up in Germany, because it was portrayed as an attack on the German parliament… Other reasons for running an adblocker were data consumption and the speed of loading Web pages, she said. And, of course, the simple annoyance of certain advertisements. She presented some studies which showed that the typical Web site has 50+ trackers and that the costs of downloading advertising were significant compared to downloading the actual content. She then showed a statement by Edward Snowden saying that using an ad-blocker was not only a right but a duty.

Everybody should be running adblock software, if only from a safety perspective

Browser-based ad blockers need external filter lists, she said. The discussion then turned towards the legality of blocking ads. I wasn’t aware that this is a thing that law people discuss. How can it possibly not be legal to control what my client does when being fed a bunch of HTML and JavaScript..? It turns out that it’s more about the entity offering these lists and a program to evaluate them *shrug*. Anyway, ad-blockers use either blocking or hiding of elements, she said, where “blocking” is to stop the browser from issuing the request in the first place while “hiding” is to issue the request, but then hide the DOM element. Yeah, law people make exactly this distinction. She then turned to the question of how legal either of these behaviours is. To the non-German folks that question may seem silly. And I tend to agree. But apparently, you cannot simply distribute software which modifies a browser to either block requests or hide DOM elements without getting sued by publishers. Those, she said, argue that gratis content can only be delivered along with ads and that this is part of the deal with the customer: they deliver the ads along with the actual content. If you think that this is an insane argument, especially in light of the customer not having had the ability to review that deal before loading the page, you’re in good company. She argued that the simple act of loading a page cannot be a statement of consent, let alone a deal of some sort. In order to make it a deal, the publishers would have to show their terms of service first, before showing anything else, she said. Anyway, eyeo’s business is to provide those filter lists and a browser plugin to make use of those lists. If you pay them, however, they think twice before blocking your content and make exceptions. That feels a bit mafia-esque, and so they were sued for “aggressive geschäftliche Handlung”, an “aggressive commercial behaviour”. I found the history of cases interesting, but I’ll spare the reader the details here. You can follow that, and other cases, by looking at OLG Koeln 6U149/15.
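To make that blocking-versus-hiding distinction concrete, here is a minimal sketch of both mechanisms as a WebExtension might implement them; the filter pattern and the CSS selector are made-up examples, and this is not eyeo’s actual code:

// "Blocking": stop the browser from issuing the request in the first place.
// (webRequest API in a background script, Manifest V2 style; the pattern is made up.)
chrome.webRequest.onBeforeRequest.addListener(
  () => ({ cancel: true }),              // the resource is never fetched
  { urls: ["*://ads.example.com/*"] },
  ["blocking"]
);

// "Hiding": the request goes out, but the element is not displayed.
// (Content script; the selector is made up.)
for (const el of document.querySelectorAll(".ad-banner")) {
  el.style.display = "none";
}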

Next up was Dominik Herrmann to present on PrivacyScore.org, a Web portal for scanning Web sites for security and privacy issues. It is similar to other scanners, he said, but the focus of PrivacyScore is publicity. By making results public, he hopes that a race to the top will occur. Web site operators might feel more inclined to implement certain privacy or security mechanisms if they know that they are the only Web site which doesn’t protect the privacy of its users. Similarly, users might opt to use a Web site providing a more privacy-friendly service. With the public portal you can create lists in order to create public benchmarks. I took the liberty of creating a list of Free Desktop environments. At the time of creation, GNOME fell behind many others, because the mail server did not implement TLS 1.2. I hope that is being taken as a motivational factor to make things more secure.

Talking at PET-CON 2017.2 in Hamburg, Germany

A few weeks ago, I was fortunate enough to talk at the 7th Privacy Enhancing Techniques Conference (PET-CON 2017.2) in Hamburg, Germany. It’s a teeny tiny academic event with a dozen or so experts in the field of privacy.

The talks were quite technical, involving things like machine learning over logs or secure multi-party computation. I talked about how I think that the best technical solution does not necessarily enable people to be more private, simply because they might not be able to use the tool properly. A concern that’s generally shared in the academic community. Yet the methodology for creating and assessing the effectiveness of a design is not very well developed. I guess we need to invest more brain power into creating models, metrics, and tools for enabling people to do safer computing.

So I’m happy to have gone and to have had the opportunity to discuss the issues I’m seeing. Likewise, I find it very interesting to see where people are currently heading.

Taint Tracking for Chromium

I forgot to blog about one of my projects. I had actually already talked about it more than one year ago and we had a paper at USENIX Security.

Essentially, we built a protection against DOM-based Cross-site Scripting (DOMXSS) into Chromium. We did that by detecting whenever potentially attacker-provided strings become JavaScript code. To that end, we made the HTML rendering engine (WebKit/Blink) and the JavaScript engine taint-aware. That is, we identified sources of values that an attacker could control (think window.name) and marked all strings coming from those sources as tainted. Then, during parsing of JavaScript, we check whether the string to be compiled is actually tainted. If that is indeed the case, then we abort the compilation.

That description is a bit simplified. For example, not compiling code because it contains some fragments of the URL would break a substantial number of Web sites. It’s an unfortunate fact that many Web sites either eval code containing parts of the URL or do a document.write with a string containing parts of the URL. The URL, in our attacker model, can be controlled by the attacker. So we must be more clever about aborting compilation. The idea was to only allow literals in JavaScript (like true, false, numbers, or strings) to be compiled, but not “code”. So if a tainted (sub)string compiles to a string: fine. If, however, we compile a tainted string to a function call or an operation, then we abort. Let me give an example of an allowed compilation and a disallowed one.


<HTML>

<TITLE>Welcome!</TITLE>
Hi

<SCRIPT>
var pos=document.URL.indexOf("name=")+5;
document.write(document.URL.substring(pos,document.URL.length));
</SCRIPT>

<BR>
Welcome to our system

</HTML>

This is from the original report on DOM-based XSS. You see that nothing bad will happen when you open http://www.vulnerable.site/welcome.html?name=Joe. However, opening http://www.vulnerable.site/welcome.html?name=alert(document.cookie) will lead to attacker-provided code being executed in the victim’s context. Even worse, when the payload is provided after a hash (#) instead of a question mark (?), the server will not even see it, because Web browsers do not transmit the fragment as part of their request.

“Why does that happen?”, you may ask. We see that the document.write call got fed a string derived from the URL. The URL is assumed to be provided by the attacker. The string is then used to create new DOM elements. In the good case, it’s only a simple text node, representing text to be rendered. That’s a perfectly legit use case and we must, unfortunately, allow that sort of usage. I say unfortunately, because using these APIs is inherently insecure. The alternative is to use createElement and friends to properly inject DOM nodes. But that requires comparatively more effort than using document.write. Coming back to the security problem: in the bad case, a script element is created with attacker-provided contents. That is very bad, because now the attacker controls your browser. So we must prevent the attacker-provided code from executing.
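As an aside, the safer alternative hinted at above (createElement and friends) would look roughly like this for the welcome page; this is my illustration, not code from the original report:

// Safer variant: the URL-derived value ends up in a text node, so it is
// rendered as text and never parsed as HTML or executed as script.
var pos = document.URL.indexOf("name=") + 5;
var name = decodeURIComponent(document.URL.substring(pos));
document.body.appendChild(document.createTextNode(name));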

You see, tracking the taint information is a non-trivial effort and must be done across newly created DOM nodes and multiple passes of JavaScript (think eval(eval(eval(tainted_string)))). We must also track the taint information not on the full string, but on each character, in order to not break existing Web applications. For example, if you first concatenate with a tainted string and then remove all tainted characters, the string should not be marked as tainted. This non-trivial effort manifests itself in the more than 15000 lines of code we patched Chromium with to provide protection against DOM-based XSS. These patches, as indicated, create, track, propagate, and evaluate taint information. Also, the compilation of JavaScript has been modified to adhere to the policy that tainted strings must only compile to literals. Other policies are certainly possible and might actually increase protection (or increase compatibility without sacrificing security). So not only WebKit (Blink) needed to be patched, but also V8, the JavaScript engine. These patches add to the logic and must be executed in order to protect the user. Thus, they take time on the CPU and add to the memory consumption. Especially the way the taint information is stored could blow up the memory required to store a string by 100%. We found, however, that the overhead incurred was not as big as in other solutions proposed by academia. Actually, we measured that we are still faster than, say, Firefox or Opera. We measured the execution speed of various browsers under various benchmarks and concluded that our patched version added 23% runtime overhead compared to the unpatched version.

[Figure: xss-runtime — measured execution times of the various browsers and benchmarks]
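Stepping back to the policy that tainted strings must only compile to literals: it boils down to a check along the following lines. The real check lives in V8’s C++ parser, so this is only a JavaScript-flavoured sketch with made-up names:

// Sketch of the "tainted data may only become literals" policy; the actual
// implementation is in V8's parser and the names here are made up.
function mayGenerateToken(tokenType, stringIsTainted) {
  if (!stringIsTainted) return true;                   // untainted input compiles as usual
  var literalTokens = ["STRING", "NUMBER", "TRUE", "FALSE", "NULL"];
  return literalTokens.indexOf(tokenType) !== -1;      // anything else aborts compilation
}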

As for compatibility, we crawled the Alexa Top 10000 and observed how often our protection mechanism stopped execution. Every blocked script would count towards the incompatibility, because we assume that our browser was not under attack when crawling. That methodology is certainly not perfect, because only shallowly crawling front pages does not actually indicate how broken the actual Web app is. To compensate, we used the WebKit rendering tests, hoping that they cover most of the important functionality. Our results indicate that scripts from 26 of the 10000 domains were blocked. Out of those, 18 were actually vulnerable to DOM-based XSS, so blocking their execution happened because a code fragment like the following is actually indistinguishable from a real attack. Unfortunately, such scripts are quite common :( This pattern is used mostly by ad distribution networks and is really dangerous. So using an AdBlocker is certainly an increase in security.


var location_parts = window.location.hash.substring(1).split('|');
var rand = location_parts[0];
var scriptsrc = decodeURIComponent(location_parts[1]);
document.write("<scr"+"ipt src='" + scriptsrc + "'></scr"+"ipt>");

Modifying WebKit for the Web parts and V8 for the JavaScript parts to be taint-aware was certainly a challenge. I had neither seriously programmed C++ before nor looked much into compilers. So modifying Chromium, the big beast, was not an easy task for me. Besides those handicaps, there were technical challenges, too, which I didn’t think of when I started to work on a solution. For example, hash tables (or hash sets) with tainted strings as keys behave differently from untainted strings. At least they should. Except when they should not! They should not behave differently when it’s about querying for DOM elements. If you create a DOM element from a tainted string, you should be able to find it again with an untainted string. But when it comes to looking up a string in a cache, we certainly want to have the taint information preserved. I hence needed to inspect each and every hash table for its usage of tainted or untainted strings. I haven’t found them all, as WebKit’s (extensive) Layout tests still showed some minor rendering differences. But it seems to work well enough.

As for the protection capabilities of our approach, we measured 100% protection against DOM-based XSS. That sounds impressive, right? Our measurements were two-fold. We used the already mentioned Layout Tests, extended with some more DOM-XSS test cases, as well as real-life vulnerabilities. To find those, we used the reports the patched Chromium generated while crawling the Web for compatibility problems, as mentioned above, to automatically craft exploits. We then verified that the exploits do indeed work. At 757 of the top 10000 domains, the number of exploitable domains was quite high. But that might not add much protection over the existing built-in mechanism, the XSS Auditor, which might already protect against those attacks. So we ran the stock browser against the exploits and checked how many of those were successful. The XSS Auditor protected about 28% of the exploitable domains. Our taint-tracking-based solution, as already mentioned, protected against 100%. That number is not very surprising, because we used the very same codebase to find the vulnerabilities. But we couldn’t do any better, because there is no ground-truth corpus of DOM-based XSS vulnerabilities…

You could, however, trick the mechanism by using indirect flows. An example of such an indirect data flow is the following piece of code:


// Explicit flow: Taint propagates
var value1 = tainted_value === "shibboleth" ? tainted_value : "";
// Implicit flow: Taint does not propagate
var value2 = tainted_value === "shibboleth" ? "shibboleth" : "";

If you have such code, then we cannot protect against exploitation. At least not easily.

For future work in the Web context, the approach presented here can be made compatible with server-side taint tracking to persist taint information beyond the lifetime of a Web page. A server-side Web application could transmit taint information for the strings it sends so that the client could mark those strings as tainted. Following that idea it should be possible to defeat other types of XSS. Other areas of work are the representation of information about the data flows in order to help developers to secure their applications. We already receive a report in the form of structured information about the blocked code generation. If that information was enriched and presented in an appealing way, application developers could use that to understand why their application is vulnerable and when it is secure. In a similar vein, witness inputs need to be generated for a malicious data flow in order to assert that code is vulnerable. If these witness inputs were generated live while browsing a Web site, a developer could more easily assess the severity and address the issues arising from DOM-based XSS.

AMCIS: Towards inter-organizational Enterprise Architecture Management – Applicability of TOGAF 9.1 for Network Organizations

First of all, there is a LaTeX template for the AMCIS conference now. I couldn’t believe that those academics use Word to typeset their papers. I am way too lazy to use Word, so I decided to implement their (incomplete and somewhat incoherent) style guide as a LaTeX class. I guess it was an investment, but it paid off the moment we needed to compile our list of references. Because, well, we didn’t have to do it… Our colleagues used Word and they spent at least a day double-checking whether their references were coherent. Not fun. On the technical side: writing LaTeX classes is surprisingly annoying. The infrastructure is very limited. Everything feels like a big hack. Managing control flow, implementing data structures, de-duplicating code… How did people manage to write all these awesome LaTeX packages without having even the most basic infrastructure?!

As I promised in a recent post, I am coming back to literature databases. We wrote a literature review and thus needed to query databases. While doing the research, I took note of some features and oddities, and to save some souls from having to find all that out manually, I want to provide my list of these databases. One of my requirements was the ability to export to a sane format. Something text-based, well defined, easy to parse. The export shall include as much meta-data as possible, like keywords, citations, and other simple bibliographic data. Another requirement was the ability to deep link to a search. Something simple, you would guess. But many fall short. Not only do I want the convenience of not having to enter rather complex search queries manually (again), I also want to collaborate. And sending a link to results is much easier than exchanging instructions as to where to click.

  • Proquest
    • Export to RIS with keywords
    • Deeplink is hidden, after “My Searches” and “actions”
  • Palgrave
    • Export as CSV: Title, Subtitle, Authors/Editors, Publication Date, Online Date, Ebook Collection, Journal Title, ISBN13, ISSN, Content Type, URL
    • No ability to link to a search
  • Wiley
    • Export possible (BibTeX, others), with keywords, but limited to 20 at a time
    • Link to Search not possible
  • JSTOR
    • Deeplinks to a search are possible (just copy the URL)
    • Export works (BibTeX, RIS), but not with keywords
  • EBSCO
    • Link to search a bit hidden via “Share”
    • No mass export of search results. Individual records can be exported.
  • bepress
    • Linking to a search is possible
    • Export not possible directly, but via other bepress services, such as AISNet. But then it’s hidden behind “show search”, then “advanced search”, and then you can select “Bibliography Export” (EndNote)
  • Science Direct
    • Not possible to link to a search. But one can create an RSS feed.
    • Export does work, with keywords
  • Some custom web interface

On the paper (pdf link) itself: it’s called “Towards inter-organizational Enterprise Architecture Management – Applicability of TOGAF 9.1 for Network Organizations” and we investigated what problems the research community has identified for modern enterprises and how well an EAM framework caters to those needs.

The abstract is as follows:

Network organizations and inter-organizational systems (IOS) have recently been the subjects of extensive research and practice.
Various papers discuss technical issues as well as several complex business considerations and cultural issues. However, one interesting aspect of this context has not received adequate coverage so far, namely the ability of existing Enterprise Architecture Management (EAM) frameworks to address the diverse challenges of inter-organizational collaboration. The relevance of this question is grounded in the increasing significance of IOS and the insight that many organizations model their architecture using such frameworks. This paper addresses the question by firstly conducting a conceptual literature review in order to identify a set of challenges. An EAM framework was then chosen and its ability to address the challenges was evaluated. The chosen framework is The Open Group Architecture Framework (TOGAF) 9.1 and the analysis conducted with regard to the support of network organizations highlights which issues it deals with. TOGAF serves as a good basis to solve the challenges of “Process and Data Integration” and “Infrastructure and Application Integration”. Other areas such as the “Organization of the Network Organization” need further support. Both the identification of challenges and the analysis of TOGAF assist academics and practitioners alike to identify further research topics as well as to find documentation related to inter-organizational problems in EAM.

FTR: The permissions I needed to give away were surprisingly relaxed:

By checking the box below, I grant AMCIS 2013 Manuscript Submission on behalf of AMCIS 2013 the non-exclusive right to distribute my submission (“the Work”) over the Internet and make it part of the AIS Electronic Library (AISeL).
I warrant as follows:

    • that I have the full power and authority to make this agreement;
    • that the Work does not infringe any copyright, nor violate any proprietary rights, nor contain any libelous matter, nor invade the privacy of any person or third party;

    • that the Work has not been published elsewhere with the same content or in the same format; and
    • that no right in the Work has in any way been sold, mortgaged, or otherwise disposed of, and that the Work is free from all liens and claims.

I understand that once a peer-reviewed Work is deposited in the repository, it may not be removed.

On Academia…

A paper that I authored was published a while ago, but I’ve put this post off for a long time now. Before talking about the paper itself, I want to talk about Academia, as I have the feeling that I need to defend myself for playing their game™. The following may sound overly pessimistic, and while a few bright spots are going to be mentioned, many have been left out for ranting reasons. Keep that in mind when reading this somewhat unstructured rant…

Published papers are the currency in Academia. The more you have, the more respected you are. The quantity is the main metric. No wonder, given that quality control measures are not very well deployed. Pretty much the only mechanism to ensure quality is peer review. The holy grail.

Although the more papers you have at “better” conferences or journals, the better you are, the quality of the conference or journal and the quality of the paper are rarely questioned after publication. Again, I don’t have proper proof for the statements I make, as this is supposed to be a more general rant on current practices in Academia. I can only tell from experience. From me listening to people talking about fellow academics, from observing key metrics in various web portals, or from seeing people applying for academic positions. Those people usually have an enumeration of their publications. Maybe it’s a “selection”. But I’ve never seen people put a “ranking” of the quality of the publisher nor of the publication itself. And it wouldn’t make sense, because we don’t have metrics for that, anyway. Sure, there are some people or companies trying to come up with something meaningful. But metrics such as “rejection rate”, “number of citations”, or “h-index” are inherently flawed. For many reasons. Mainly because the data is proprietary. You have to rely on the conference or the journal providing you with correct data. You cannot know whether it is correct, as there is no right for you to know. Secondly, the metric might suffer from chilling effects, such that people think the quality of their prospective publication is too weak to be published at a “highly ranked” conference. So they don’t even bother to submit. Other metrics like the average citation count after five years resemble much more a stochastic experiment than a reflection of the quality of the publications (Ike Antkare anyone?). Again, you have the effect of people wanting to cite some paper from a “high ranked” conference, because that is what people will cite in the future. And in order to be found more easily in the future via backwards citation searches, you’d rather cite publications you think will be cited more often in the future (cf.).

Talking about quality…

You have to trust the peer review of the conference or journal but you actually cannot because you don’t even know who the peers were. It’s good to have an informed opinion and it’s a good thing to be able to rely on an informed judgement. But it’s not good having to rely on that. If, for whatever reason, a peer fails to provide appropriate reviews, one should be able to make a decision oneself. Some studies have indeed shown that the peer review process is no better than flipping a coin. So there seems to be some need to review the peer review.

Once again to be clear: I don’t mind peer review. I think it’s good. Blindly publishing without ensuring that there is indeed an advancement of world’s knowledge wouldn’t be good. And peer review could be a tool to control that. But it doesn’t do it right now. I don’t have any concrete proposal. But I think if the reviews themselves and the reviewers were known, then we could make better decisions as to whether to “trust” a publication or not.

Another proposal is to not have “journals” as physical hard copies anymore. It is 2015, we have the Web, we have some cool technologies. But we don’t make use of any of that. Instead, we are maintaining the status from 20, or rather 200, years ago. We still subscribe to one-off bundles of printed and stapled paper. And we pay loads for that. And not only do we pay loads for receiving that: if you want to publish in one of those journals (or conferences), you have to pay, too. In fairness, it’s not only the printing and stapling that costs money, but the services around that. Things like proofreading (has anyone ever actually received any?), the peer review (has any peer ever gotten any reimbursement?), or the maintenance of an online database (why is it so damn hard to use any of these web databases?) are things we pay money for. I doubt that we need journals in their current form. We probably do need entities (call them “publishers”), who in turn will need to earn some money, to make sure everything is going smoothly. But we don’t need print-and-forget style publishing. If we could add things like comments, annotations, links, reviews, supplementary material, or a varying level of detail to a paper, even after a few years or even decades, we could move to a “permanently peer reviewed” model. A publication is then being reviewed all the time. Ideally by the general public. We could model our current workflow by delegating some form of trust to a group of people, say “reviewers of Journal X”, and only see what these people have vouched for. We could then selectively exclude people from that group of trustees, much like the web of trust. We could, if a paper makes an assumption which is falsified in the future, render a warning when opening the publication. We could decentralise the data such that everyone could build their own index, search mechanism, or interface.

On interfaces

Right now, if you wanted to, say, re-conduct the experiments done in published papers and share your results, you would have to create a publication (which is expected, but right now you would likely have to pay for that) and cite the papers whose results you are trying to reproduce. That’s okay. But if I then wanted to see when and how successfully people tried to redo the experiments, I’d have to rely on the database I’m using to provide a reverse citation search and to have the correct data (which, for some databases, seems to be the ability to do OCR on the PDF…). That’s not how things should work nowadays, right? We’d expect something more interactive, with tags, open data, something wiki-esque. While the ability to reverse-search citations, to highlight some key references, or to link to a key contribution that followed a paper at hand would be nice indeed, we probably have to step back and make existing functionality somewhat usable. I’m not talking about advanced stuff like exporting search results in a standardised format or about deep linking to a result set from a query. That would need treatment after we’ve solved actually searching for multiple keywords, excluding some conferences or journals, or joining or intersecting queries. All that only works to some extent and it’s depressing that we cannot do anything about it, because we don’t have the relevant access or data. Don’t believe me? Well, you shouldn’t. But I’ll provide a table, probably in another post, showing what works with which database and what does not.

On experiments

As I was referring to reproducing results: it is pretty much impossible to reproduce any result, at least in my field, computer science. You don’t get the raw data, let alone the programs to run to get the results. You could argue that it is too complicated to maintain a program that can be run on any platform. Fair enough. I don’t have a solution. But the situation right now is not a good status quo. Right now you don’t get anything. So even if you had the very same setup as the authors of some publication, you would not be able to redo the experiments. It’s likely to be similar in other disciplines. I imagine that rocket scientists do experiments with self-made devices or with some utterly expensive appliance (think LHC). Nobody will be able to reproduce the results, simply because there is just that one LHC out there… But… fortunately we have many digital things which are easy to archive and distribute. We, computer scientists, should make use of that. Why not require authors to submit a virtual appliance in some openly specified format? Obviously, source code would be nice, but even in academia there doesn’t seem to be a culture of sharing code freely, so I’m not even suggesting that.

Phew. After having criticised Academia and having made some half-baked proposals, I forgot what I actually wanted to do: be a good academic (not caring about the public perception of “good” in terms of quantity of publications) and discuss a few things around the paper that we paid a couple of hundred dollars for to get published. But I’ll leave that for another rant post.

In what ways do you think is Academia broken?

mrmcd14 in Darmstadt – DOM-based XSS

After last year’s fabulous event, I was really looking forward to this year’s mrmcd in Darmstadt, Germany. It outgrew last year’s edition and had probably around 250 to 300 people attending. Maybe even more. In fact, 450 clients generated 423 GB of traffic during the conference, which lasted 60 hours or so. That’s around 2 MB/s. That’s megabytes. Per second. Every second. I find that quite impressive. Especially as the outdoor area was very inviting to just hang around, grab a beer, and chat with your fellow hackers. So some people must have had an amazing demand for … updates…

This year’s theme was construction sites, as IT, and especially security, is a major, never ending, and dangerous construction site. It was well done, with a lot of warning tape, people wearing helmets, hi-vis vests, some safety boots, etc. It couldn’t top last year’s aviation theme, but then that had set the bar extremely high. Anyway, the speakers received cool gadgets, like a tool set, a level, and other very well made gadgets. The talks were opened by Unicorn who, as you can see, was wearing proper safety gear. We were given instructions as to how to behave in case of fire, flood, or lack of alcohol. A nifty feature of this event is the availability of carbohydrates in the form of various foodstuffs. It’s very cool to always be able to walk up to the buffet and fill up the energy reserves.

The keynote was involuntarily given by dodger, who did not miss the opportunity to show us various construction sites, such as the Utah Data Center. Ultimately (and now I am maybe over-interpreting things), it’s also hackers like us who make those possible. We usually decide for ourselves where to go and what to do. It was a good round-up of how we as a community work or should work. Also with some political references, which I think are important, as I have the feeling that many people lose that focus too easily.

An interesting series of talks was given by Ange Albertini, who first presented the PDF file format. It was interesting to see what the format actually looks like. I knew a little already, but I’ve never really cared about the details. This was a very interesting and visually appealing talk. Pretty much like his other presentations, which were again on file formats and on crypto.

My own talk was scheduled after the second night. I was positively surprised to see a half-filled room on a Sunday morning, after two nights of demanding partying… Anyway, I had an interested crowd which I think I could entertain. You can find my slides here. I was talking on DOM-based Cross-site Scripting. I presented a modified Chrome browser which is able to stop all identified DOM-based XSSs. I will need a separate post to cover the details. As a brief summary: both WebKit and V8 were modified to track taint, that is, to annotate strings with information about their source. Such a source could be document.URL or window.name. This taint information is evaluated whenever a string is about to be compiled to code. The simple approach of blocking every tainted string from compiling is not followed, as it breaks the Web. Instead, the compiler notices which token is about to be generated and only allows generation if the string is untainted or the token is a data literal (String, Boolean, Number). If the tainted token is, for example, a function call, an assignment, or pretty much anything else, it is replaced with an illegal token in order to abort compilation. There is a video of the talk here:

While we are on the subject of videos: the video team was just plain amazing. They released videos of the event pretty much right after the talks finished. And in a quality that is hard to beat. You should check out the videos of this conference, but also of others. You may find some gems that are well worth watching. Be aware though, some talks are also very much on the vapor-ware side of things… I guess I don’t need to point to specific talks as it should be easy to identify them…

I am already looking forward to next year’s event. The bar has, again, been set high and I expect next year’s edition to be able to raise it. But I hope it will be able to stay small enough to not lose the cosy and comfy feeling. Maybe I shouldn’t blog about that fantastic event so as not to generate too much attention ;-)

Benchmarking raw user mode disk access for VirtualBox

I am working with a virtual GNU/Linux system, because the machine I’m working with must run a Windows on its bare metal.

I thought I wanted to give raw disk access to the guest, but it turns out that it is not very easy in Windows to give permanent permissions to a regular user. You can give permissions using the semi-official subinacl like so:

C:\WINDOWS\system32>"C:\Program Files (x86)\Windows Resource Kits\Tools\subinacl.exe" /noverbose /file \\.\physicaldrive1 /display=sddl

+File \\.\physicaldrive1
/sddl=O:BAG:SYD:(A;;FX;;;WD)(A;;FA;;;SY)(A;;FA;;;BA)(A;;FX;;;RC)

Elapsed Time: 00 00:00:00
Done: 1, Modified 0, Failed 0, Syntax errors 0
Last Done : \\.\physicaldrive1

C:\WINDOWS\system32> "C:\Program Files (x86)\Windows Resource Kits\Tools\subinacl.exe" /noverbose /file \\.\physicaldrive1 "/sddl=O:BUG:SYD:(A;;FA;;;WD)(A;;FA;;;SY)(A;;FA;;;BA)(A;;FX;;;RC)"

Elapsed Time: 00 00:00:00
Done: 1, Modified 1, Failed 0, Syntax errors 0
Last Done : \\.\physicaldrive1

C:\WINDOWS\system32> "C:\Program Files (x86)\Windows Resource Kits\Tools\subinacl.exe" /file \\.\physicaldrive1 /display=sddl /display=owner /display=primarygroup

+File \\.\physicaldrive1
/sddl=O:BUG:SYD:(A;;FA;;;WD)(A;;FA;;;SY)(A;;FA;;;BA)(A;;FX;;;RC)

+File \\.\physicaldrive1
/owner =builtin\users

+File \\.\physicaldrive1
/primary group =system

Elapsed Time: 00 00:00:00
Done: 1, Modified 0, Failed 0, Syntax errors 0
Last Done : \\.\physicaldrive1

I needed to figure out that the language the ACLs are written in is SDDL, and how to give my current user or group all permissions. I failed at doing that, so I opted for giving full access to every entity known to the system… You can see the relevant SDDL in the listing above: the ACE (A;;FX;;;WD), which grants Everyone (WD) execute access only, becomes (A;;FA;;;WD), which grants Everyone full file access, and the owner changes from the built-in Administrators group (BA) to the built-in Users group (BU).

But that change will only survive until the next reboot. To make it permanent, an unofficial tool called dskacl can theoretically be used. Apparently it tries to write special values to the Windows Registry. Although I found my way through the documentation, I couldn’t make it work. It actually failed so hard on me that even Windows could not see the drive itself anymore. So make a good contingency plan before even trying to make it work. It’s also not really meant for internally attached disks, but rather for external disks connected via USB. Anyway, I figured I’d have to redo the above-mentioned step on every boot.

The question I had was whether it’s actually worth it. I assumed that it would be a speed-up to write directly to the hard drive without having to go through Windows and VirtualBox’s VDI layer before hitting the disk.

So I measured my typical workload: compiling Chromium. As it is what I’m working on, I want compilations to happen as quickly as possible.

My technique was the following:


export CCACHE_DISABLE=1          # make sure ccache doesn't skew the numbers
rm -rf out                       # start from a clean build directory every run
./build/gyp_chromium             # regenerate the build files
time ninja -C out/Debug chrome   # measure a full Debug build of the chrome target

I did a handful of runs to compensate for some irregularities that might happen on the host (or, in fact, on the guest…).

When writing to an Ext4 filesystem straight on the disk, I get the following results:

real 61m59.895s
user 322m52.832s
sys 46m49.268s

real 61m25.565s
user 318m40.680s
sys 46m7.608s

real 58m59.939s
user 320m36.500s
sys 46m28.336s

Having an Ext4 filesystem in a VDI container on an NTFS partition yields the following results:

real 60m50.360s
user 322m18.184s
sys 47m3.588s

real 57m30.324s
user 318m48.752s
sys 46m52.016s

real 63m29.179s
user 328m55.004s
sys 48m4.692s

I couldn’t test shared folders, because either NTFS or, in fact, vboxfs doesn’t support operations needed for the compilation. I guess it’s VirtualBox’s fault, though…

My interpretation of the results:

Writing directly to the disk seems to be marginally slower than going through the VDI container; averaging the three runs gives roughly 60.8 minutes of wall-clock time for the raw disk versus roughly 60.6 minutes through the container. At best, there is no significant deviation from writing to the container. I decided that it’s not worth it to write straight to the disk. Going through the VDI container and through Windows is fine. Especially with all the risks involved, such as Windows not being able to see the drive at all.

I acknowledge that my data is a bit flawed. It is likely that you cannot generalise my findings to any other workload. The method is also questionable, as I didn’t flush caches or take care of anything else disturbing my measurements. If you do measure yourself, I’m interested in getting the results.