invalid html warfare
Peter Bishop
pgb at adelard.com
Wed May 28 15:09:15 CEST 2003
On 28 May 2003 at 7:18, David Relson wrote:
> >This detected 14516 "singleton" tokens out of a total of 72261
>
> Greg's tested what happens when singletons are discarded to shrink the
> wordlist size. Bogofilter's accuracy went _way_ down.
>
That of course is possible - training may add a singleton, but it might be
used many times during "production" runs. However, if it is NOT used in
production runs, then removing it will make no difference to the accuracy.
What I am suggesting is to remove only the "junk" singletons that are never
likely to appear again in another message. There would still be singletons
left (hopefully the "useful" ones) - so it is not quite the same as Greg's
tests where all singletons are removed.
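
To make the idea concrete, here is a rough sketch of selective pruning. It is
not bogofilter code: the wordlist is assumed to be a plain dict mapping each
token to a (spam_count, good_count) pair, and looks_like_junk() with its
thresholds is a hypothetical test invented purely for illustration.

```python
import re


def looks_like_junk(token: str) -> bool:
    """Hypothetical junkiness test; the thresholds are illustrative guesses."""
    # Very long tokens are unlikely to recur verbatim.
    if len(token) > 20:
        return True
    # Longer tokens mixing letters and digits (random ids, tracking strings).
    has_digit = any(c.isdigit() for c in token)
    has_alpha = any(c.isalpha() for c in token)
    if has_digit and has_alpha and len(token) >= 8:
        return True
    # Longer alphabetic tokens with no vowels at all.
    if token.isalpha() and len(token) >= 6 and not re.search(r"[aeiouy]", token.lower()):
        return True
    return False


def prune_junk_singletons(wordlist):
    """Drop only singletons that look like junk; keep every other token."""
    kept = {}
    for token, (spam, good) in wordlist.items():
        is_singleton = (spam + good) == 1
        if is_singleton and looks_like_junk(token):
            continue
        kept[token] = (spam, good)
    return kept
```

A singleton such as "viagra" survives because it does not look like junk,
while a one-off random string is dropped. Tokens seen more than once are never
touched, whatever they look like.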
The other thing that has not been tested is whether a "junkiness" heuristic
for unknown tokens would improve discrimination, e.g. treat each such token
as if it were a singleton in the spamlist.
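
A sketch of how that might look, again under assumptions: the wordlist is a
dict of token -> (spam_count, good_count), is_junk is whatever junkiness test
is chosen, and the scoring is a Robinson-style smoothed estimate with
illustrative values s=1.0 and x=0.5. None of the names below are taken from
bogofilter itself.

```python
def token_counts(token, wordlist, is_junk):
    """Return (spam_count, good_count) for a token.

    Unknown tokens judged junky are treated as if they had been seen
    once in spam; other unknown tokens get no counts at all.
    """
    if token in wordlist:
        return wordlist[token]
    if is_junk(token):
        return (1, 0)  # pretend it is a singleton in the spamlist
    return (0, 0)


def spam_score(spam, good, nspam, ngood, s=1.0, x=0.5):
    """Robinson-style smoothed spam probability for one token.

    nspam and ngood are the numbers of spam and good messages trained on
    (assumed to be > 0); s and x are illustrative smoothing parameters.
    """
    n = spam + good
    if n == 0:
        return x  # no evidence: fall back to the neutral prior
    ps = spam / nspam
    pg = good / ngood
    p = ps / (ps + pg)
    return (s * x + n * p) / (s + n)
```

With equal numbers of spam and good messages, an unknown junky token would
score 0.75 under these parameters instead of the neutral 0.5. Whether that
shift helps or hurts discrimination is exactly what would need testing.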
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk