testing partial wordlists

David Relson relson at osagesoftware.com
Sat Feb 5 23:49:06 CET 2005


On 05 Feb 2005 17:22:33 -0500
Tom Anderson wrote:

> On Sat, 2005-02-05 at 13:51, David Relson wrote:
> > Last week I did some counts of the tokens in my wordlist.  Of the 1.5M
> > tokens I have, approx 1/3 have timestamps more than 2 yrs old and
> > another 1/3 are more than 1 yr old.  I'm giving thought to removing some
> > (or all) of those oldies.
> 
> I'd imagine that removing all tokens older than 13 or 14 months would be
> best, as you'll probably receive holiday and season specific spam and
> ham.  E.g., in winter, "santa" and "snow" might be big scorers, while in
> summer, "beach" and "surf" might be the big ones, with little overlap
> from one season to the next.  Deleting anything newer than about 13
> months is probably shooting yourself in the foot in regards to these
> types of tokens.
> 
> Tom

Hi Tom,

Indeed!!  It'd be interesting to know if there're holiday effects.  I've
got no info one way or t'other. 

Now I've got some actual numbers, so I don't have to guess at what to
keep and what to pitch.  Given those numbers, I've removed hapaxes
(tokens seen only once) and tokens older than a year.
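
For anyone wanting to try the same pruning on a text dump of a wordlist,
here is a minimal sketch, not the procedure actually used above.  It
assumes each dump line looks like "token spam_count ham_count yyyymmdd",
and it treats a hapax as a token whose two counts sum to at most 1.  The
names parse_line and prune and the sample data are hypothetical.

```python
from datetime import date, datetime, timedelta


def parse_line(line):
    """Split an assumed 'token spam ham yyyymmdd' line into typed fields."""
    # Split from the right so only the last three fields are peeled off.
    token, spam, ham, stamp = line.rsplit(None, 3)
    last_seen = datetime.strptime(stamp, "%Y%m%d").date()
    return token, int(spam), int(ham), last_seen


def prune(lines, today, max_age_days=365):
    """Return the lines that are neither hapaxes nor older than max_age_days."""
    cutoff = today - timedelta(days=max_age_days)
    kept = []
    for line in lines:
        token, spam, ham, last_seen = parse_line(line)
        if spam + ham <= 1:      # hapax (assumed definition: seen at most once)
            continue
        if last_seen < cutoff:   # timestamp older than the cutoff date
            continue
        kept.append(line)
    return kept


# Hypothetical sample data, for illustration only.
sample = [
    "santa 12 3 20041220",
    "beach 1 0 20040701",
    "oldword 40 0 20021101",
    "meeting 2 30 20050201",
]

if __name__ == "__main__":
    for line in prune(sample, today=date(2005, 2, 5)):
        print(line)
```

Raising max_age_days to roughly 400 would keep about 13 months of
tokens, which would preserve the previous year's seasonal words.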

I just need to remain vigilant for a while -- in case I've removed too
much!

David


