simpler-dup-finder

I’ve been experimenting with a method to make it easier to check whether a bug has already been filed. It’s not perfect by any means, but it appears to be good enough to be useful. It definitely seems a lot better than showing users the most frequently duplicated bugs (which are likely to be totally unrelated) and just asking if their bug is a dupe of one of those. That’s what we currently do with bug-buddy, though to be fair, that was a big improvement over trying to use the bug-buddy keyword for the same purpose.

The biggest problem currently is that it can list the same bug multiple times (once per comment that appears to be related). My SQL skills really suck, so I haven’t been able to find a reasonable fix for it. Using ‘distinct’ on the bug_id appears to have no effect unless I also group by bug_id, but grouping by bug_id slows the query down dramatically. So for now, the same bug can be listed several times.
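
To make the duplicate-listing problem concrete, here is a minimal sketch using Python’s built-in sqlite3 module as a stand-in for the real MySQL database. The table layout, the sample rows, and the stored `relevance` column are all assumptions for illustration (in the real query the score would come from the fulltext match), and `best_match_per_bug` is a hypothetical helper, not part of the dup finder. It shows how joining bugs to their matching comments yields one row per comment, and one way to collapse those rows in application code rather than with a group by.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with hypothetical bugs and comments."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bugs (bug_id INTEGER PRIMARY KEY, short_desc TEXT);
        CREATE TABLE comments (bug_id INTEGER, thetext TEXT, relevance REAL);
        INSERT INTO bugs VALUES (1, 'Crash on startup'), (2, 'Crash when printing');
        INSERT INTO comments VALUES
            (1, 'crash in main()', 9.0),
            (1, 'same crash here', 7.5),
            (2, 'crash in print dialog', 8.0);
    """)
    return conn


def matching_rows(conn):
    """Return one row per matching comment, so a bug can appear several times."""
    return conn.execute("""
        SELECT bugs.bug_id, short_desc, thetext, relevance
        FROM bugs JOIN comments ON comments.bug_id = bugs.bug_id
        ORDER BY relevance DESC
    """).fetchall()


def best_match_per_bug(rows):
    """Keep only the first (highest-relevance) row seen for each bug_id."""
    seen = set()
    unique = []
    for row in rows:
        if row[0] not in seen:
            seen.add(row[0])
            unique.append(row)
    return unique


conn = build_demo_db()
rows = matching_rows(conn)
unique_rows = best_match_per_bug(rows)
print([r[0] for r in rows])         # bug ids, with bug 1 repeated
print([r[0] for r in unique_rows])  # each bug id once
```

Filtering after the fetch sidesteps the group by at the cost of pulling the repeated rows first; whether that is actually faster against the real database is untested here.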

It can also be used in a format similar to the current simple-dup-finder, and works on both bugs with a stacktrace and those without. A few more details are here.

In other news, Vincent stopped letting me shove the work of coming up with an announcement email off on him. That’s unfortunate, as it means everyone has to suffer through a much lamer announcement (vuntz, J5, kmaraas, jdub, and federico all send out far more creative announcements than I do). I kind of liked the format we had going: take (possibly uneven) turns making releases, and if it was my turn, get Vincent to do the announcement part. But Vincent left early yesterday, so I did the only thing I could do: I copied what I could from Vincent’s last stable release announcement and used it.

One Response to “simpler-dup-finder”

  1. Jamie McCracken says:

    I’ve spent quite a bit of time optimizing fulltext queries while developing Tracker (http://freedesktop.org/wiki/Software/Tracker).

    The slowness is caused by having too complicated a query with fulltext in it, so it’s a good idea to split things up if that’s the case (especially with group bys).

    MySQL has nifty temporary table support, which should make it easier and more efficient to get a unique column and separate out the fulltext results.

    Try the following statements (if the distinct doesn’t work, you can also try a group by; as the temp table will only be 50 rows max, it will still be fast):

    DROP TEMPORARY TABLE IF EXISTS DUPE_SEARCH;

    CREATE TEMPORARY TABLE DUPE_SEARCH
    (
    bug_id INT, -- LONG maps to MEDIUMTEXT in MySQL, so use an integer type
    relevance double
    );

    insert into DUPE_SEARCH
    select distinct bugs.bug_id, $fulltext_search
    from blah where blah limit 50;

    select distinct bugs.bug_id,
    substr(thetext, 1, 5000) AS comment,
    bug_status AS status,
    resolution,
    products.name AS product,
    substr(short_desc, 1, 60) AS summary
    From Blah, DUPE_SEARCH D… where D.bug_id = Blah.bug_id
    Order by D.relevance DESC;

    Drop Temporary table if exists DUPE_SEARCH;