gaming the system

I’m sure it isn’t just me who has noticed that Google isn’t really as useful as it used to be. First there were the empty ‘wrapper’ sites that got into the AdWords box – you know, the ones that seemed to have ‘all about foo’ for every ‘foo’ search, but when you clicked on them just had the output of a search engine in them. AdWords are easy to ignore, but sometimes you do actually want to find companies selling stuff. These sites were occasionally in the main result area too. They seem to come up a little less often now – or maybe I’m just searching for different things.

Then we have cloaking, i.e. the web site serves different content to a search engine than it does to users. So when you do a search you get a nice summary of what looks like what you want, but you click on it and all you get is a payment gateway. It is particularly prominent when looking for technical articles. See ‘Summary of Academic Publishers Cloaking Discussion’ for some more information on this. It sucks big time.
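For the curious, the trick can be sketched in a few lines. This is only a toy model, assuming the site decides what to send by looking at the User-Agent header (real sites may well key off crawler IP ranges or something else entirely). The names `CRAWLER_TOKENS`, `serve_page` and `looks_cloaked` are made up for illustration, and the page bodies are placeholders.

```python
# Toy model of cloaking: the "server" picks its response from the User-Agent.
# Every name and string here is hypothetical; nothing touches the network.

CRAWLER_TOKENS = ("googlebot", "slurp", "msnbot")

FULL_TEXT = "placeholder: the full article text that gets indexed"
PAYWALL = "placeholder: payment gateway page"


def is_crawler(user_agent: str) -> bool:
    """Guess whether a request comes from a search engine crawler."""
    ua = user_agent.lower()
    return any(token in ua for token in CRAWLER_TOKENS)


def serve_page(user_agent: str) -> str:
    """A cloaking site: indexable text for crawlers, a payment page for people."""
    return FULL_TEXT if is_crawler(user_agent) else PAYWALL


def looks_cloaked(fetch, crawler_ua: str, browser_ua: str) -> bool:
    """Fetch the same page under two identities and compare the bodies."""
    return fetch(crawler_ua) != fetch(browser_ua)


if __name__ == "__main__":
    crawler = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    browser = "Mozilla/5.0 (X11; Linux x86_64) Firefox/3.0"
    print(looks_cloaked(serve_page, crawler, browser))
```

The check in `looks_cloaked` is the same comparison you make by hand when the search summary promises one thing and the click delivers another.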

Just as an example, let’s try something simple, oh I dunno, ‘efficient algorithm for sorting numbers external’ – a typical type of search for a software engineer.

Eight links down we have (I’m not putting the link in HTML, on purpose):

  A method for improving the efficiency of external sorting ...
    more efficient external sorting algorithms, based on a variety of
    distribution ... number of nodes), and an identical number of
    branches go from each node, ...
    www.springerlink.com/index/V3L0179J1801278L.pdf -

OK, this isn’t really that useful looking, but this is just an example, so let’s just take it as being what you’re after. A PDF and everything, let’s go look … oh no, it’s just a payment gateway. US$32 for a paper … Hmm, that seems a little steep, particularly if you look at the publishing date (go on, have a look, it might surprise you). I wonder how much of that the author gets, if he’s still alive.

Sometimes Google Scholar helps (but not in this particular case): given the title and author(s) you can often find free or draft versions of papers. But this is still a pain in the arse – why are these sites showing up at all in the main index when they are cloaking their information and intentionally gaming the system? I’m finding that searching for good quality coding and technical information is getting harder and harder, and Google being complicit in this cloaking (see the linked article above, or search for ‘springerlink sucks’) just makes me angry at them (and frankly, who cares about the other search engines – they’re irrelevant).

And finally – even with those taken away, searching for many types of information is just a lot harder than it used to be. I guess ‘the web’ has grown, and it’s mostly grown full of rubbish. I had yet another problem with Ubuntu yesterday – now I find 8.04 has major issues with USB mass storage devices on my laptop. Devices will drop out, causing corruption, or refuse to work at all, which is totally unusable at best. It took a lot of searching to find the right terms to use before I found anything about the problem – and that was a lonely post on a forum. I guess the two of us are just unlucky together. Certain very popular terms like Ubuntu, Debian, Fedora and Linux are now so common that they significantly lower the signal-to-noise ratio of any search containing them. And so many sites cross-link with each other that using linkage to weight results is becoming less useful (not that it was always super-great – I remember how Advogato used to figure on the front page of just about any search for people who had an account on it).
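To make the cross-linking complaint a bit more concrete, here is a rough sketch of link-based weighting. It is a simplified PageRank-style iteration, not whatever Google actually runs, and the tiny link graph (`forum_post`, `blog`, `farm1` to `farm3`) is entirely made up. The point is just that three sites linking to each other in a closed ring end up scoring higher than the one lonely post with a single genuine inbound link.

```python
# Simplified PageRank-style scoring over a toy link graph.
# The function name, the damping value and the graph are illustrative assumptions.

def link_scores(links, damping=0.85, iterations=50):
    """Return a score per page, computed by repeatedly passing score along links."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    score = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page gets a small baseline share regardless of links.
        new = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's score evenly among the pages it links to.
                share = damping * score[page] / len(targets)
                for target in targets:
                    new[target] += share
            else:
                # A page with no outbound links spreads its score over everyone.
                for target in pages:
                    new[target] += damping * score[page] / n
        score = new
    return score


if __name__ == "__main__":
    links = {
        "forum_post": [],              # the lonely but useful post
        "blog": ["forum_post"],        # its one genuine inbound link
        "farm1": ["farm2", "farm3"],   # three sites that only link to each other
        "farm2": ["farm1", "farm3"],
        "farm3": ["farm1", "farm2"],
    }
    for page, value in sorted(link_scores(links).items(), key=lambda kv: -kv[1]):
        print(f"{page:12s} {value:.3f}")
```

In this toy graph the ring never passes any score back out, so it soaks up most of the total while the useful post stays near the bottom.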

I’m not sure about Google News either. Today there were at least 4 stories on the iPhone on the Australian front page – 3 in tech (i.e. all of them) and 1 in business. In the tech section itself they were the top 4 stories, with Roadrunner (the fastest supercomputer in the world) pushed down to 5th or 6th (personally I think that is more tech-worthy; the iPhone belongs on the fashion or business pages if you ask me). OK, the iPhone is full of buzz, but one grouped story should surely suffice (Google’s news selection is a bit strange sometimes, but normally it is at least a little better at grouping the same press release).