singletons

Sat Dec 27 13:50:34 CET 2003

On Fri, 26 Dec 2003 20:57:00 -0800
Jef Poskanzer <jef at acme.com> wrote:

> I mentioned last week that I'd like to see bogofilter use the count
> of singletons in a message as a factor in determining bogosity.
> I didn't have any concrete suggestions on how to do this, though.
> Well, I just thought of one: every time a previously unseen token
> gets scanned, also generate an artificial token called "Singleton"
> or something like that.  Then just let the regular Bayesian statistics
> operate.
> 
> This might be something to try in the post 1.0 era, along with
> two-token sequences.
> ---
> Jef

Jef,

Good ideas :-)

I've got a "hint" mechanism that's not yet released.  Counts are kept of
"interesting" occurrences (of which "singleton" could be one).  At the
end of parsing a message, the counts are converted to tokens like
"hint:singleton:x", where x is in a sequence like
1, 2, 5, 10, 20, 50, 100, 200, 500, etc.  If a message had 7
singletons, then the "...:1", "...:2", and "...:5" tokens would be
generated.  At final scoring the hints are treated just like other
tokens.  More work is needed to see which hints are useful.
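
To make the bucketing concrete, here is a rough sketch in Python.  It
is not the unreleased bogofilter code; the function names
hint_thresholds() and hint_tokens() are made up for illustration, and
the only things taken from the description above are the 1, 2, 5, 10,
... sequence and the "hint:name:x" token shape.

```python
def hint_thresholds(limit):
    """Yield 1, 2, 5, 10, 20, 50, ... for every value <= limit."""
    decade = 1
    while True:
        for mult in (1, 2, 5):
            value = mult * decade
            if value > limit:
                return
            yield value
        decade *= 10


def hint_tokens(name, count):
    """Turn one hint counter into its list of artificial tokens.

    Hypothetical helper: one token is emitted for each threshold
    that the count has reached, so larger counts also produce the
    tokens of the smaller buckets.
    """
    return [f"hint:{name}:{t}" for t in hint_thresholds(count)]


if __name__ == "__main__":
    for token in hint_tokens("singleton", 7):
        print(token)
```

With a count of 7 this produces the ":1", ":2", and ":5" tokens; a
count of 0 produces none.  Emitting every bucket reached, rather than
only the highest one, means a message with 7 singletons shares tokens
with a message that has 2 or 5.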

Token pairs are pretty easy.  A flag in the get_token() routine and
remembering the previous token will allow the routine to alternate
between returning single tokens and token pairs.
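
The flag-plus-previous-token idea might look something like the sketch
below.  This is Python rather than bogofilter's own code, and the class
name PairingTokenizer, the "*" separator in a pair, and the use of None
as the end-of-input marker are all assumptions made for the example.

```python
class PairingTokenizer:
    """Alternate between single tokens and pairs of adjacent tokens."""

    def __init__(self, tokens):
        self._tokens = iter(tokens)
        self._prev = None           # token seen before the current one
        self._current = None        # most recently returned single token
        self._pair_pending = False  # the flag: is a pair due next?

    def get_token(self):
        """Return the next single token or pair, or None at end of input."""
        if self._pair_pending:
            # Last call returned a single token; now return the pair
            # formed with the token that preceded it.
            self._pair_pending = False
            pair = f"{self._prev}*{self._current}"
            self._prev = self._current
            return pair

        tok = next(self._tokens, None)
        if tok is None:
            return None
        if self._prev is None:
            # First token of the message: nothing to pair it with yet.
            self._prev = tok
        else:
            self._current = tok
            self._pair_pending = True
        return tok


if __name__ == "__main__":
    t = PairingTokenizer(["buy", "cheap", "pills"])
    while (tok := t.get_token()) is not None:
        print(tok)
```

For the input "buy", "cheap", "pills" the calls return the three single
tokens interleaved with the pairs "buy*cheap" and "cheap*pills", so the
caller does not need to know that pairs exist at all.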

David