Using casefolded wordlists
Greg Louis
glouis at dynamicro.on.ca
Fri May 30 13:09:45 CEST 2003
On 20030530 (Fri) at 0900:12 +0100, Peter Bishop wrote:
> If I use the old casefolded wordlist it clearly will not recognise mixed
> case words like FREE OFFER. Ditto for ham email, there will be more
> unrecognised words.
>
> email   spamicity (robinson)
> ham     0.35 (Pi)   0.47 (PI)
> ham2    0.37 (Pi)   0.47 (PI)
> spam    0.62 (Pi)   0.63 (PI)
> spam2   0.61 (Pi)   0.61 (PI)
>
> From my unscientific sample it appears that ham is affected more than spam
> and the effect could be to increase false positives until the wordlists get
> updated with mixed case words
That could be so. I suggested advising people who can't rebuild their
training databases to classify with -Pi but train with -PI for a couple
of months. Alternatively, one could speed up the process (as I did at
work, where I can't rebuild) by training with -PI on a large batch of
new messages (roughly equal numbers of spam and nonspam) right after
the upgrade.
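
To illustrate the effect under discussion, here is a minimal Python
sketch (not bogofilter code) of why case-sensitive tokens can miss
against a wordlist whose keys were all folded to lowercase. The wordlist
contents, the sample tokens and the helper name known_fraction are
hypothetical, and the sketch does not model bogofilter's tokenizer or
the Robinson calculation; it only counts how many tokens are found.

```python
def known_fraction(tokens, wordlist, fold_case):
    """Return the fraction of tokens that are present in the wordlist."""
    if fold_case:
        # Fold the message tokens the same way the wordlist was built.
        tokens = [t.lower() for t in tokens]
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens)


# Hypothetical wordlist built with case folding: every key is lowercase.
folded_wordlist = {"free", "offer", "meeting", "agenda"}

# Hypothetical tokens from one message, with their original case kept.
message = ["FREE", "OFFER", "Meeting", "agenda"]

print(known_fraction(message, folded_wordlist, fold_case=True))   # 1.0
print(known_fraction(message, folded_wordlist, fold_case=False))  # 0.25
```

Unrecognised tokens carry little or no evidence either way, which would
be consistent with scores moving toward the middle until training with
the new tokens catches up.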
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |