invalid html warfare
David Relson
relson at osagesoftware.com
Wed May 28 15:17:04 CEST 2003
At 09:09 AM 5/28/03, Peter Bishop wrote:
>On 28 May 2003 at 7:18, David Relson wrote:
>
> > >This detected 14516 "singleton" tokens out of a total of 72261
> >
> > Greg's tested what happens when singletons are discarded to shrink the
> > wordlist size. Bogofilter's accuracy went _way_ down.
> >
I neglected to mention that, since he uses train-on-error, his counts are
lower than mine would be (I use '-u' (autoupdate) and manually correct any
mistakes).
>That of course is possible - training may add a singleton, but it might be
>used many times during "production" runs. However if it is NOT used in
>production runs, then removing the singletons will make no difference to
>the accuracy.
>
>What I am suggesting is to only remove the "junk" singletons that are never
>likely to appear again in another message. There would still be singletons
>left (hopefully the "useful" ones) - so it is not quite the same as Greg's
>tests where all singletons are removed.
I had missed that detail. Having a "junkiness" parameter for maintenance,
which can be done off-line, could be valuable. All that's missing for
testing is the "junkiness" algorithm.
It might be interesting to take the 5-non-vowel code, create pruned
wordlists, and compare scoring rates of pruned vs. unpruned. Any volunteers?
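For anyone who wants to try it, a minimal sketch of what such a pruning
pass might look like. The "5-non-vowel" rule is interpreted here as "five
or more consecutive non-vowel characters"; the function names and the
exact rule are hypothetical, not bogofilter's actual code:

```python
# Hypothetical "junkiness" test for singleton tokens.
# Assumption: a token is junk if it contains a run of 5+ consecutive
# non-vowel characters (typical of MIME/base64 line noise).
import re

NONVOWEL_RUN = re.compile(r"[^aeiouAEIOU]{5,}")

def is_junk(token):
    """True if the token looks like line noise rather than a real word."""
    return bool(NONVOWEL_RUN.search(token))

def prune_singletons(wordlist):
    """Drop singletons (count == 1) that also look junky; keep the rest.

    wordlist: dict mapping token -> occurrence count.
    """
    return {tok: n for tok, n in wordlist.items()
            if n > 1 or not is_junk(tok)}
```

So "hello" (a clean singleton) survives, "qwrtxz9k" (a junky singleton)
is dropped, and any token with count > 1 is kept regardless.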
>The other thing that has not been tested is whether a "junkiness" heuristic
>for unknown tokens would improve discrimination, e.g. treat each such token
>as if it were a singleton in the spamlist.
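Peter's heuristic could be sketched roughly as follows. This is a
simplified Graham-style per-token probability with made-up names and
counts, not bogofilter's real scoring, just to show where the
"pretend it's a spamlist singleton" substitution would go:

```python
def token_spam_prob(token, spam_counts, ham_counts, spam_total, ham_total):
    """Naive per-token spam probability.

    Unknown tokens are scored as if they occurred once in the spamlist
    (Peter's heuristic), instead of getting a neutral 0.5.
    """
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s == 0 and h == 0:
        s = 1  # heuristic: treat the unknown token as a spamlist singleton
    ps = s / spam_total
    ph = h / ham_total
    return ps / (ps + ph)
```

With this naive formula an unknown token scores 1.0, exactly as a true
spamlist singleton with no ham occurrences would; in practice one would
smooth or clamp the extremes.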