invalid html warfare
David Relson
relson at osagesoftware.com
Wed May 28 15:17:04 CEST 2003
At 09:09 AM 5/28/03, Peter Bishop wrote:
>On 28 May 2003 at 7:18, David Relson wrote:
>
> > >This detected 14516 "singleton" tokens out of a total of 72261
> >
> > Greg's tested what happens when singletons are discarded to shrink the
> > wordlist size. Bogofilter's accuracy went _way_ down.
> >
I neglected to mention that, since he uses train-on-error, his counts are
lower than mine would be (I use '-u' (autoupdate) and manually correct any
mistakes).
>That of course is possible - training may add a singleton, but it might be
>used many times during "production" runs. However if it is NOT used in
>production runs, then removing the singletons will make no difference to
>the accuracy.
>
>What I am suggesting is to only remove the "junk" singletons that are never
>likely to appear again in another message. There would still be singletons
>left (hopefully the "useful" ones) - so it is not quite the same as Greg's
>tests where all singletons are removed.
I had missed that detail. Having a "junkiness" parameter for maintenance,
which can be done off-line, could be valuable. All that's missing for
testing is the "junkiness" algorithm.
It might be interesting to take the 5-non-vowel code, create pruned
wordlists, and compare scoring rates of pruned vs. unpruned. Any volunteers?
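For anyone who wants to try it, a minimal sketch of what such a pruning
pass might look like. The "5-non-vowel" rule is interpreted here as "five
or more consecutive non-vowel characters"; the function names and the
exact rule are hypothetical, not bogofilter's actual code:

```python
# Hypothetical "junkiness" test for singleton tokens.
# Assumption: a token is junk if it contains a run of 5+ consecutive
# non-vowel characters (typical of MIME/base64 line noise).
import re

NONVOWEL_RUN = re.compile(r"[^aeiouAEIOU]{5,}")

def is_junk(token):
    """True if the token looks like line noise rather than a real word."""
    return bool(NONVOWEL_RUN.search(token))

def prune_singletons(wordlist):
    """Drop singletons (count == 1) that also look junky; keep the rest.

    wordlist: dict mapping token -> occurrence count.
    """
    return {tok: n for tok, n in wordlist.items()
            if n > 1 or not is_junk(tok)}
```

So "hello" (a clean singleton) survives, "qwrtxz9k" (a junky singleton)
is dropped, and any token with count > 1 is kept regardless.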
>The other thing that has not been tested is whether a "junkiness" heuristic
>for unknown tokens would improve discrimination, e.g. treat each such token
>as if it were a singleton in the spamlist.
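Peter's heuristic could be sketched roughly as follows. This is a
simplified Graham-style per-token probability with made-up names and
counts, not bogofilter's real scoring, just to show where the
"pretend it's a spamlist singleton" substitution would go:

```python
def token_spam_prob(token, spam_counts, ham_counts, spam_total, ham_total):
    """Naive per-token spam probability.

    Unknown tokens are scored as if they occurred once in the spamlist
    (Peter's heuristic), instead of getting a neutral 0.5.
    """
    s = spam_counts.get(token, 0)
    h = ham_counts.get(token, 0)
    if s == 0 and h == 0:
        s = 1  # heuristic: treat the unknown token as a spamlist singleton
    ps = s / spam_total
    ph = h / ham_total
    return ps / (ps + ph)
```

With this naive formula an unknown token scores 1.0, exactly as a true
spamlist singleton with no ham occurrences would; in practice one would
smooth or clamp the extremes.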