Posts Tagged ‘paper’

On Academia…

Saturday, January 31st, 2015

A paper that I authored was published a while ago, but I’ve put this post off for a long time now. Before talking about the paper itself, I want to talk about Academia, as I have the feeling that I need to defend myself for playing their game™. The following may sound overly pessimistic, and while a few bright spots are going to be mentioned, many have been left out for ranting reasons. Keep that in mind when reading this somewhat unstructured rant…

Published papers are the currency of Academia. The more you have, the more respected you are. Quantity is the main metric. No wonder, given that quality control measures are not very well deployed. Pretty much the only mechanism for ensuring quality is peer review. The holy grail.

Although the more papers you have at “better” conferences or journals, the better you are, the quality of the conference or journal and the quality of the paper are rarely questioned after publication. Again, I don’t have proper proof for the statements I make, as this is supposed to be a more general rant on current practices in Academia; I can only speak from experience: from listening to people talk about fellow academics, from observing key metrics in various web portals, and from seeing people apply for academic positions. Those people usually provide an enumeration of their publications. Maybe it’s a “selection”. But I’ve never seen anybody put a “ranking” of the quality of the publisher or of the publication itself. And it wouldn’t make sense anyway, because we don’t have metrics for that.

Sure, there are some people or companies trying to come up with something meaningful. But metrics such as “rejection rate”, “number of citations”, or “h-index” are inherently flawed, for many reasons. Mainly because the data is proprietary: you have to rely on the conference or journal providing you with correct data, and you cannot know whether it is correct, as you have no right to know. Secondly, the metrics may suffer from chilling effects: people may think their prospective publication is too weak to be publishable at a “high-ranked” conference, so they don’t even bother to submit. Other metrics, like the average citation count after five years, resemble a stochastic experiment much more than they reflect the quality of the publications (Ike Antkare, anyone?). Again, you have the effect of people wanting to cite papers from a “high-ranked” conference, because that is what people will cite in the future. And in order to be found more easily in the future via backwards citation searches, you’d rather cite publications you think will be cited more often in the future (cf.).

Talking about quality…

You have to trust the peer review of the conference or journal, but you actually cannot, because you don’t even know who the peers were. It’s good to have an informed opinion, and it’s a good thing to be able to rely on an informed judgement. But it’s not good to have to rely on it. If, for whatever reason, a peer fails to provide an appropriate review, one should be able to make a decision oneself. Some studies have indeed shown that the peer review process is no better than flipping a coin. So there seems to be some need to review the peer review.

Once again, to be clear: I don’t mind peer review. I think it’s good. Blindly publishing without ensuring that there is indeed an advancement of the world’s knowledge wouldn’t be good, and peer review could be a tool to control that. But it doesn’t do that right now. I don’t have any concrete proposal. But I think if the reviews themselves and the reviewers were known, then we could make better decisions as to whether to “trust” a publication or not.

Another proposal is to stop having “journals” as physical hard copies. It is 2015; we have the Web, we have some cool technologies. But we don’t make use of any of that. Instead, we maintain the status quo from 20, or rather 200, years ago. We still subscribe to one-off bundles of printed and stapled paper. And we pay loads for that. And not only do we pay loads for receiving it: if you want to publish in one of those journals (or conferences), you have to pay, too. In fairness, it’s not only the printing and stapling that costs money, but also the services around it. Things like proof-reading (has anyone ever actually gotten a lectorate, i.e. copy-editing?), the peer review (has any peer ever gotten any reimbursement?), or the maintenance of an online database (why is it so damn hard to use any of these web databases?) are things we pay money for.

I doubt that we need journals in their current form. We probably do need entities (call them “publishers”), who in turn will need to earn some money, to make sure everything runs smoothly. But we don’t need print-and-forget style publishing. If we could add things like comments, annotations, links, reviews, supplementary material, or a varying level of detail to a paper, even after a few years or decades, we could move to a “permanently peer reviewed” model: a publication is being reviewed all the time, ideally by the general public. We could model our current workflow by delegating some form of trust to a group of people, say the “reviewers of Journal X”, and only see what these people have vouched for. We could then selectively exclude people from that group of trustees, much like in the web of trust. We could, if a paper makes an assumption which is falsified in the future, render a warning when opening the publication. We could decentralise the data such that everyone could build their own index, search mechanism, or interface.

On interfaces

Right now, if you wanted to, say, re-run the experiments from published papers and share your results, you would have to create a publication (which is expected, but right now you would likely have to pay for it) and cite the papers whose results you are trying to reproduce. That’s okay. But if I then wanted to see when and how successfully people tried to redo the experiments, I’d have to rely on the database I’m using to provide a reverse citation search and to have the correct data (which, for some databases, seems to mean doing OCR on the PDF…). That’s not how things should work nowadays, right? We’d expect something more interactive, with tags, open data, something wiki-esque. While the ability to reverse-search citations, to highlight some key references, or to link to a key contribution that followed the paper at hand would indeed be nice, we probably have to step back and make existing functionality somewhat usable first. I’m not talking about advanced stuff like exporting search results in a standardised format or deep-linking to a result set from a query. That would need treatment after we’ve solved actually searching for multiple keywords, excluding some conferences or journals, or joining or intersecting queries. All of that only works to some extent, and it’s depressing that we cannot do anything about it, because we don’t have the relevant access or data. Don’t believe me? Well, you shouldn’t. But I’ll provide a table, probably in another post, showing what works with which database and what does not.

On experiments

As I was referring to reproducing results: it is pretty much impossible to reproduce any result, at least in my field, computer science. You don’t get the raw data, let alone the programs to run to get the results. You could argue that it is too complicated to maintain a program that can be run on any platform. Fair enough; I don’t have a solution. But the current situation is not a good status quo: right now you don’t get anything. So even if you had the very same setup as the authors of some publication, you would not be able to redo the experiments. It’s likely to be similar in other disciplines. I imagine that rocket scientists do experiments with self-made devices or with some utterly expensive appliance (think LHC). Nobody will be able to reproduce those results, simply because there is just that one LHC out there… But… fortunately we have many digital things which are easy to archive and distribute. We, computer scientists, should make use of that. Why not require authors to submit a virtual appliance in some openly specified format? Obviously, source code would be nice, but even in academia there doesn’t seem to be a culture of sharing code freely, so I’m not even suggesting that.

Phew. After having criticised Academia and having made some half-baked proposals, I forgot what I actually wanted to do: be a good academic (not caring about the public perception of “good” in terms of quantity of publications) and discuss a few things around the paper that we paid a couple of hundred dollars to get published. But I’ll leave that for another rant post.

In what ways do you think Academia is broken?

Critical Review of Tesseract

Tuesday, May 4th, 2010

For CA640 we were supposed to pick a paper from the 2009 International Conference on Software Engineering (ICSE 2009) and critically review it.

I chose to review Tesseract: Interactive Visual Exploration of Socio-Technical Relationships in Software Development.

You can find the review as a PDF here. Its abstract reads:

This critical review of a paper which presents Tesseract, and which was handed in to ICSE 2009, focusses on
the strengths and weaknesses of the idea behind Tesseract: visualising and exploring freely available and loosely coupled fragments (mailing lists, bug trackers, or commits) of Free Software development.
Tesseract is thus a powerful data miner as well as a GUI to browse the obtained data.

This critique evaluates the usefulness of Tesseract by questioning the fundamental motivation it was built on, the data which it analyses, and its general applicability.

Existing gaps in the original research are filled by conducting interviews with relevant developers as well as by providing information about the internal structure of a Free Software project.

Tesseract is a program that builds and visualises a social network based on freely available data from a software project, such as mailing lists, bug trackers, or commits to a software repository. This network can be interactively explored with the Tesseract tool, which shows how communication among developers relates to changes in the actual code. The authors used a project under the GNOME umbrella, named Rhythmbox, to show their data mining and the program in operation. GNOME is a Free/Libre Software desktop used by default in many Linux distributions, including the most popular ones, i.e. Ubuntu and Fedora. To assess Tesseract’s usability and usefulness, the authors interviewed people not related to Rhythmbox, asking whether Tesseract was usable and provided useful information.

The paper was particularly interesting for me because the authors analysed data from the GNOME project. As I am a member of that development community, I wanted to see how their approach might or might not increase the quality of the project. Another focus was to help their attempt to improve GNOME by highlighting where they may have gaps in their knowledge of its internals.

During this critique, I will show that some assumptions were made that do not hold for Free/Libre and Open Source Software (FLOSS) in general, nor for GNOME in particular, either because the authors simply did not have the internal knowledge or because they did not research carefully enough. I will also show that the data used is not necessarily meaningful, and I will attempt to complement the lacking data by presenting the results of interviews I conducted with actual GNOME developers. This will show how to further improve Tesseract by identifying new usage scenarios. Lastly, this text will question the general usefulness of Tesseract for the majority of Free Software projects.

MSN Shutdown in 2003

Monday, March 8th, 2010

During CA640 I was made to write an ethical review, which I was supposed to hand in using a dodgy webservice. Since it got 90%, people bugged me to make it available ;-) Of course, I don’t have a problem with that, so people now have a reference or know what to expect when they enter the course.

You can find the PDF here and its abstract reads:

At the end of 2003, Microsoft closed the public chat-rooms of its Internet service called MSN.
Microsoft was pushed by Children’s Charities, who feared abuse of these chat-rooms.
In some countries, however, the service remained available, but subject to a charge.
This review raises ethical questions about Microsoft’s and the Children’s Charities’ behaviour, because making people pay under the pretext of protecting children is considered ethically questionable.
Also, the Children’s Charities pushed for the closure of a heavily used service although there is no evidence whatsoever that children would be safer after closing down a chat-room.

If you are not interested in the non-technical details, you might be interested to know that I use a Mercurial hook on the server side to automatically compile the LaTeX sources once I push changes to the server:

$ cat .hg/hgrc
[hooks]
# After every push (changegroup), update the working copy and compile:
# pdflatex, then bibtex, then pdflatex twice more to resolve all references.
changegroup.compile = export FILE=paper && hg up -C && pdflatex --interaction=batchmode $FILE && bibtex $FILE && pdflatex --interaction=batchmode $FILE && pdflatex --interaction=batchmode $FILE

And then I just symlink the resulting PDF file to my public_html directory.

Digital Divide

Sunday, February 14th, 2010

As a student, it happens every now and then that I have to write a term paper. Since I am firmly convinced that university, science, and knowledge should be as free as possible, and since everybody potentially finances my studies by paying taxes, I think everybody has the right to at least see what I actually do all day long.

Thanks to the Internet, it is rather easy these days to publish things and carry knowledge onward. So here is a term paper I wrote last semester in Gender Studies.

Alien Toilet Sign

The paper is titled “Weiblicher Zugang zu Technik und feministische Politiken” (Female Access to Technology and Feminist Politics) and the abstract reads as follows:

The reasons that lead to the digital divide are manifold, and gender is one of them.
Women’s groups, too, pursue the goal of increasing the share of female participants in the digital realm.
This paper analyses how this goal is meant to be achieved, why that fails, and how it might be achieved after all.

The PDF is available here and is licensed for everybody under the Creative Commons “Namensnennung-Keine kommerzielle Nutzung-Weitergabe unter gleichen Bedingungen 3.0 Deutschland” (Attribution-NonCommercial-ShareAlike 3.0 Germany) licence. That doesn’t mean, though, that I can’t license it differently upon request.

The paper reads a bit bumpily in places, which is owed to how it came about: essentially, 2.5 papers were merged into one. I hope it is not too bad nonetheless.

Even if the content of the PDF is not that exciting for you, the technical details are worth a look. The PDF knows how its content is licensed: it uses XMP streams which are embedded into the PDF. They got into the PDF via LaTeX with the hyperxmp package. Officially, xmpincl is still the recommended package, but it is really nasty to use, because you have to create the XMP stream yourself.

\usepackage{hyperxmp}         % Provides an XMP data stream, e.g. to include the license
[...]
\hypersetup{
        pdftitle={Weiblicher Zugang zu Technik und feministische Politiken},
        pdfauthor={Tobias Mueller},
        [...]
        pdfcopyright={This work is licensed to the public under the Creative Commons Attribution-Non-Commercial-Share Alike 3.0 Germany License.},
        pdflicenseurl={http://creativecommons.org/licenses/by-nc-sa/3.0/de/}
}

My Evince 2.29.1 (built with JHBuild) happily displays the licence information; Okular 0.9.2 does not. I don’t know of any other way to view XMP data embedded in a PDF. It would certainly be interesting for automated processing.
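
If you wanted to start on such automated processing, one crude option (a sketch of mine, not a polished tool) exploits the fact that XMP packets are normally stored uncompressed precisely so that non-PDF-aware software can find them by scanning for the xpacket markers. Something like the following C program (assuming glibc for memmem()) dumps any packet it finds:

#define _GNU_SOURCE  /* for memmem(3), a GNU extension */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s file.pdf\n", argv[0]);
        return 1;
    }

    FILE *f = fopen(argv[1], "rb");
    if (!f) {
        perror("fopen");
        return 1;
    }
    fseek(f, 0, SEEK_END);
    long size = ftell(f);
    rewind(f);

    char *buf = malloc(size);
    if (!buf || fread(buf, 1, size, f) != (size_t)size) {
        fprintf(stderr, "could not read %s\n", argv[1]);
        return 1;
    }
    fclose(f);

    /* An XMP packet is delimited by <?xpacket begin=...?> and
     * <?xpacket end=...?>; print every packet we find. */
    const char *begin = "<?xpacket begin";
    const char *end = "<?xpacket end";
    char *p = buf;
    while ((p = memmem(p, buf + size - p, begin, strlen(begin)))) {
        char *q = memmem(p, buf + size - p, end, strlen(end));
        if (!q)
            break;
        char *close = memmem(q, buf + size - q, "?>", 2);
        if (!close)
            break;
        fwrite(p, 1, close + 2 - p, stdout);
        putchar('\n');
        p = close + 2;
    }

    free(buf);
    return 0;
}

A proper tool would of course parse the PDF object structure; this one just scans the raw bytes, so it will fail on encrypted documents or compressed metadata streams.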

Many thanks to Chillum and Sourci, who stood by me with advice and patches, and who probably have the text coming out of their ears by now ;-)

For a discussion of the content, the comment function is admittedly rather ill-suited, but for lack of alternatives it is available for that, too. I like the solution the Djangobook uses: next to every paragraph there is a comment function in the margin, which works very well.

Adding Linux Syscall

Thursday, January 7th, 2010

In a course (CA644) we were asked to add a new syscall to the Linux kernel.

Linux Oxi Power!

As I believe that knowledge should be as free and as accessible as possible, I thought I had to at least publish our results. Another (though minor) reason is that society, to some extent, pays for me doing science, so I believe that society deserves to at least see the results.

There is not a big need to actually publish this, since a lot of information on how to do it exists already. However, most of it is outdated. A good article is the one from macboy, but it fails to mention a minor fact: the syscall() function is variadic, so it takes as many arguments as you give it.
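
To illustrate that (a minimal sketch using well-known syscalls, not the new one from our paper):

#include <stdio.h>
#include <unistd.h>       /* the syscall() wrapper */
#include <sys/syscall.h>  /* the SYS_* numbers */

int main(void)
{
    /* syscall() takes the syscall number first, then as many arguments
     * as that particular call expects: none for getpid, three for write. */
    long pid = syscall(SYS_getpid);
    printf("pid via syscall(): %ld\n", pid);

    syscall(SYS_write, 1, "hello\n", (size_t)6);

    /* A freshly added syscall is invoked the same way, with whatever
     * number it got in the syscall table. */
    return 0;
}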

So the abstract of the paper, which I wrote together with Nosmo, reads:

This paper shows how to build a recent Linux kernel from scratch, how to add a new system call to it, and how to implement new functionality easily.
The chosen functionality is to retrieve the stack-protecting canary so that mitigations against buffer overflow attacks can be circumvented.

And you can download the PDF here.

If it’s not interesting for you content-wise, it might be interesting from a technical point of view: the PDF has files attached, so you don’t need to do the boring stuff yourself but can save the working files and modify them instead. That is achieved using the embedfile package.

\usepackage{embedfile}        % Provides \embedfile[filespec=foo, desc={bar}]{file}
[...]
\embedfile[filespec=writetest.c, mimetype=text/x-c,desc={Program which uses the new systemcall}]{../code/userland/writetest.c}%

If your PDF client doesn’t allow you to save the files (Evince does :) ), you might want to run pdftk $PDF unpack_files in some empty directory.