Using casefolded wordlists

Greg Louis glouis at dynamicro.on.ca
Fri May 30 13:09:45 CEST 2003


On 20030530 (Fri) at 0900:12 +0100, Peter Bishop wrote:

> If I use the old casefolded wordlist it clearly will not recognise mixed
> case words like FREE OFFER. Ditto for ham email, there will be more
> unrecognised words.
> 
> email   spamicity (robinson)
> ham     0.35 (Pi)   0.47 (PI)
> ham2    0.37 (Pi)   0.47 (PI)
> spam    0.62 (Pi)   0.63 (PI)
> spam2   0.61 (Pi)   0.61 (PI)
> 
> From my unscientific sample it appears that ham is affected more than spam
> and the effect could be to increase false positives until the wordlists get
> updated with mixed case words.

That could be so.  For people who can't rebuild their training
databases, I suggested classifying with -Pi but training with -PI for a
couple of months.  Alternatively, one can speed up the process (as I
did at work, where I can't rebuild) by training with -PI on a large
batch of new messages, with roughly equal numbers of spam and nonspam,
right after the upgrade.
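
To make the effect concrete, here is a toy sketch (not bogofilter's
actual tokenizer or database code) of why a wordlist built with case
folding recognises fewer tokens once folding is switched off.  The
function name lookup_rate, the fold_case parameter and the sample
tokens are all made up for illustration.

```python
# Toy illustration only; this is not how bogofilter stores or looks up tokens.
def lookup_rate(tokens, wordlist, fold_case):
    """Return the fraction of tokens that are found in the wordlist."""
    if fold_case:
        # Fold every token to lowercase before looking it up.
        tokens = [t.lower() for t in tokens]
    hits = sum(1 for t in tokens if t in wordlist)
    return hits / len(tokens)

# A wordlist trained while case folding was on holds only lowercase tokens.
folded_wordlist = {"free", "offer", "meeting", "agenda"}
message = ["FREE", "OFFER", "Meeting", "agenda"]

print(lookup_rate(message, folded_wordlist, fold_case=True))   # 1.0
print(lookup_rate(message, folded_wordlist, fold_case=False))  # 0.25
```

Unrecognised tokens carry no evidence either way, so scores drift
toward the middle until training on case-preserved tokens fills in the
gaps.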

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



