Converting old wordlist.db to Berkeley format?

RW rwmaillists at googlemail.com
Mon Sep 5 20:15:55 CEST 2016


On Mon, 5 Sep 2016 10:58:34 +0100
Geoff wrote:


> It would be tedious, but if need be I could also retrain.  As soon as
> I understood (back in 2003) that training could be a slow process, I
> began to archive my spam in case I needed to start again. (I already
> archived everything else for purposes of my profession).  I have about
> 93K spam mails, and rather more ham, archived.

If I were you, I'd train a new wordlist on 2000 recent spams and 2000
recent hams, and then run trainbogo.sh on the larger corpus with the
cutoffs set to something like:

ham_cutoff  = 0.001
spam_cutoff = 0.99

trainbogo.sh does a train-on-error pass through the corpus, which is
much faster than training on everything.
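
As a rough illustration of the train-on-error idea (this is not
trainbogo.sh's actual code, and not bogofilter's scoring algorithm),
here is a minimal Python sketch. ToyFilter, its score/train methods and
train_on_error are hypothetical names made up for this example; only the
two cutoff values come from the settings above. A message is registered
only when the current wordlist fails to score it confidently on the
correct side of the cutoffs.

```python
from collections import Counter


class ToyFilter:
    """Hypothetical stand-in for a statistical spam filter.

    Assumption: a crude per-token spam ratio, averaged over the message.
    This is NOT how bogofilter computes its score.
    """

    def __init__(self):
        self.spam = Counter()  # token -> number of spam messages containing it
        self.ham = Counter()   # token -> number of ham messages containing it

    def train(self, text, is_spam):
        # Register each distinct token once per message.
        target = self.spam if is_spam else self.ham
        target.update(set(text.lower().split()))

    def score(self, text):
        # Average spam ratio over known tokens; 0.5 when nothing is known.
        probs = []
        for tok in set(text.lower().split()):
            s, h = self.spam[tok], self.ham[tok]
            if s + h:
                probs.append(s / (s + h))
        return sum(probs) / len(probs) if probs else 0.5


def train_on_error(filt, corpus, ham_cutoff=0.001, spam_cutoff=0.99):
    """One pass over (text, is_spam) pairs; returns how many were trained.

    A message is trained only if it is misclassified or lands in the
    unsure band between the two cutoffs.
    """
    trained = 0
    for text, is_spam in corpus:
        p = filt.score(text)
        needs_training = p < spam_cutoff if is_spam else p > ham_cutoff
        if needs_training:
            filt.train(text, is_spam)
            trained += 1
    return trained


if __name__ == "__main__":
    corpus = [
        ("buy cheap pills", True),
        ("meeting at noon", False),
        ("buy cheap pills", True),   # already scored as spam, skipped
        ("meeting at noon", False),  # already scored as ham, skipped
    ]
    filt = ToyFilter()
    print("messages trained:", train_on_error(filt, corpus))
```

The wide gap between the cutoffs means that borderline messages also get
trained, not just outright errors, while messages the wordlist already
handles confidently are skipped. That skipping is where the speed-up
over training on everything comes from.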

Old, very heavily trained wordlists are not always optimal. They may
contain a lot of training that's no longer relevant, and they can
become over-trained and resistant to change. 

