Extrema Sorting Bug

Doug Beardsley dgbeards at southern.edu
Tue Sep 17 09:56:02 CEST 2002


Hello,

I believe there is a bug in the extrema sorting code that hasn't been
caught yet.  The basic idea of the bug is this:  When a new token is
found to be an extrema, it is put into the extrema array and the
previous entry at that point in the array is lost.  I believe this is
faulty because it is possible that the previous entry is in reality the
second largest (now that the larger one has been added), and it should
not be deleted from this list.  Allusions to this behavior were made in
one of the earlier threads.  I have written a new sorting algorithm that
I believe is correct.  It is a little slower than before, but still
maintains a reasonable efficiency and is correct.  It requires that the
list be of size KEEPERS+1.  It then puts the new entry at the end of the
list and a single bubble sort pass puts it in the correct position.
Here is a diff for a new algorithm.

1c1
<
/*****************************************************************************
---
> *****************************************************************************
401c401
<     discrim_t extrema[KEEPERS];
---
>     discrim_t extrema[KEEPERS+1];
410c410
<     int tok;
---
>     int tok, i;
417c417
<     discrim_t *pp, *hit;
---
>     discrim_t *pp, *hit, swaptemp;
472c472,490
<       for (pp = stats.extrema; pp <
stats.extrema+sizeof(stats.extrema)/sizeof(*stats.extrema); pp++)
---
>       flag = 0;
>       //Check for duplicates
>       for (i = 0; i < KEEPERS; i++)
>       {
>               if (stats.extrema[i].key && strcmp(stats.extrema[i].key,
>               yytext)==0)
>               {
>                       flag = 1;
>                       break;
>               }
>       }
>       if ( flag )
>               continue;
>
>       //Put the current entry at the end of the list to avoid an extra
>       comparison each iteration
>       stats.extrema[KEEPERS].prob = prob;
>       strncpy(stats.extrema[KEEPERS].key, yytext, MAXWORDLEN);
>
>       //Run one pass of bubble sort
>       for (i = KEEPERS; i > 0; i-- )
474,483c492
<           // don't allow duplicate tokens in the stats.extrema
<           if (pp->key && strcmp(pp->key, yytext)==0)
<             {
<                 hit=NULL;
<               break;
<             }
<           else
<           {
<                 slotdev=DEVIATION(pp->prob);
<                 if (dev>slotdev && dev>hitdev)
---
>                 if (DEVIATION(stats.extrema[i].prob) >
>                 DEVIATION(stats.extrema[i-1].prob))
485,486c494,496
<                     hit=pp;
<                   hitdev=slotdev;
---
>                       swaptemp=stats.extrema[i];
>                       stats.extrema[i]=stats.extrema[i-1];
>                       stats.extrema[i-1]=temp;
490,494d499
<         if (hit)
<       {
<           hit->prob = prob;
<           strncpy(hit->key, yytext, MAXWORDLEN);
<       }

I would be interested to hear comments on this.

Doug Beardsley



More information about the bogofilter-dev mailing list