invalid html warfare

Peter Bishop pgb at adelard.com
Wed May 28 15:09:15 CEST 2003


On 28 May 2003 at 7:18, David Relson wrote:

> >This detected 14516 "singleton" tokens out of a total of 72261
> 
> Greg's tested what happens when singletons are discarded to shrink the 
> wordlist size.  Bogofilter's accuracy went _way_ down.
> 

That is certainly possible - training may add a token as a singleton, yet 
that token might be seen many times during "production" runs. However, if 
a singleton is NOT seen again in production runs, then removing it will 
make no difference to the accuracy. 

What I am suggesting is to remove only the "junk" singletons that are 
never likely to appear again in another message. There would still be 
singletons left (hopefully the "useful" ones) - so it is not quite the 
same as Greg's tests, where all singletons are removed.
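
As a rough illustration of the selective pruning idea, here is a minimal
Python sketch. Everything in it is an assumption made for illustration:
the wordlist is modelled as a plain dict of token -> (spam_count,
ham_count), and looks_like_junk() with its thresholds is a hypothetical
test, not anything bogofilter provides.

```python
def looks_like_junk(token):
    """Hypothetical junkiness test; the thresholds are illustrative guesses."""
    if len(token) > 20:
        return True  # very long tokens are unlikely to recur
    letters = [c for c in token if c.isalpha()]
    if not letters:
        return False  # leave purely numeric/punctuation tokens alone
    vowels = sum(c in "aeiou" for c in token.lower())
    # Six or more letters with almost no vowels looks like random filler.
    return len(letters) >= 6 and vowels / len(letters) < 0.15


def prune_junk_singletons(counts):
    """Return a copy of counts without the singletons that look like junk.

    counts maps token -> (spam_count, ham_count). A singleton is a token
    whose two counts add up to exactly 1.
    """
    return {
        tok: (spam_n, ham_n)
        for tok, (spam_n, ham_n) in counts.items()
        if not (spam_n + ham_n == 1 and looks_like_junk(tok))
    }
```

A singleton such as "viagra" survives this filter, while a singleton such
as "xqzptrwk" is dropped; tokens seen more than once are never touched.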

The other thing that has not been tested is whether a "junkiness" heuristic 
for unknown tokens would improve discrimination, e.g. treat each such token 
as if it were a singleton in the spamlist.
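
The unknown-token idea could be tried along these lines. This is only a
sketch under stated assumptions: the scoring uses a Robinson-style
smoothed estimate f = (s*x + n*p) / (s + n), the values of s and x are
illustrative, and is_junk is a hypothetical predicate supplied by the
caller. None of this is taken from bogofilter's code.

```python
def token_score(token, counts, spam_msgs, ham_msgs, is_junk, s=1.0, x=0.5):
    """Spamminess estimate for one token (all names here are hypothetical).

    counts maps token -> (spam_count, ham_count); spam_msgs and ham_msgs
    are the numbers of trained messages and must both be positive.
    """
    spam_n, ham_n = counts.get(token, (0, 0))
    if spam_n + ham_n == 0 and is_junk(token):
        # Unknown junk token: pretend it was seen once in the spam list.
        spam_n = 1
    n = spam_n + ham_n
    if n == 0:
        return x  # unknown, non-junk token gets the neutral prior
    spam_rate = spam_n / spam_msgs
    ham_rate = ham_n / ham_msgs
    p = spam_rate / (spam_rate + ham_rate)
    # Shrink the raw estimate p toward the prior x when n is small.
    return (s * x + n * p) / (s + n)
```

Comparing discrimination with is_junk switched on and off over the same
test messages would show whether the heuristic helps.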
-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





