singletons
David Relson
relson at osagesoftware.com
Sat Dec 27 13:50:34 CET 2003
On Fri, 26 Dec 2003 20:57:00 -0800
Jef Poskanzer <jef at acme.com> wrote:
> I mentioned last week that I'd like to see bogofilter use the count
> of singletons in a message as a factor in determining bogosity.
> I didn't have any concrete suggestions on how to do this, though.
> Well, I just thought of one: every time a previously unseen token
> gets scanned, also generate an artificial token called "Singleton"
> or something like that. Then just let the regular Bayesian statistics
> operate.
>
> This might be something to try in the post 1.0 era, along with
> two-token sequences.
> ---
> Jef
Jef,
Good ideas :-)
I've got a "hint" mechanism that's not yet released. Counts are kept of
"interesting" occurrences (of which "singleton" could be one). At the
end of parsing a message, the counts are converted to tokens like
"hint:singleton:x", where x is in a sequence like
1,2,5,10,20,50,100,200,500, etc. If a message had 7 singletons, then the
"...:1", "...:2", and "...:5" tokens would be generated. At final
scoring the hints are treated just like other tokens. More work is
needed to see what's useful.
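
To make the count-to-token conversion concrete, here is a rough sketch in Python rather than bogofilter's C. It is not the unreleased code: the function names, the "known" word set, and the rule that a singleton is a distinct token absent from that set are all assumptions made for illustration.

```python
from itertools import count


def thresholds():
    """Yield 1, 2, 5, 10, 20, 50, 100, 200, 500, ... indefinitely."""
    for exp in count(0):
        for mantissa in (1, 2, 5):
            yield mantissa * 10 ** exp


def count_singletons(tokens, known):
    """Count distinct tokens that are not in the known word set.

    Assumption: 'previously unseen' means 'absent from the wordlist'.
    """
    return sum(1 for tok in set(tokens) if tok not in known)


def hint_tokens(name, n):
    """Return one hint token for every threshold that the count n reaches."""
    result = []
    for t in thresholds():
        if t > n:
            break
        result.append(f"hint:{name}:{t}")
    return result


# A count of 7 reaches the thresholds 1, 2 and 5, but not 10.
print(hint_tokens("singleton", 7))
# ['hint:singleton:1', 'hint:singleton:2', 'hint:singleton:5']
```

Emitting a token for every threshold reached, rather than only the largest, means a message with 7 singletons shares the ":1", ":2" and ":5" tokens with messages that have more.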
Token-pairs are pretty easy. A flag in the get_token() routine and
remembering the previous token will allow the routine to alternate
between returning single tokens and token pairs.
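
As a rough illustration only (Python, not the real get_token() code), a generator can remember the previous token and emit a pair after each single token. The function name, the want_pairs flag, and the "*" separator are assumptions, not anything that exists in bogofilter.

```python
def tokens_with_pairs(tokens, want_pairs=True, sep="*"):
    """Yield each token, followed by a pair built from it and the previous token."""
    prev = None
    for tok in tokens:
        yield tok
        # The first token has no predecessor, so no pair is produced for it.
        if want_pairs and prev is not None:
            yield prev + sep + tok
        prev = tok


print(list(tokens_with_pairs(["buy", "cheap", "pills"])))
# ['buy', 'cheap', 'buy*cheap', 'pills', 'cheap*pills']
```

With want_pairs=False the generator returns the plain token stream, so the existing behaviour is unchanged when the flag is off.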
David