Re casefolding

David Relson relson at osagesoftware.com
Tue May 13 20:23:58 CEST 2003


Peter,

I may be missing something in your results, but I don't see false positives 
mentioned - just false negatives.  False positives matter more than false 
negatives because losing a legitimate message costs more than letting a 
spam through.  I'm not saying that wading through false negatives isn't a 
nuisance - just that it's less serious than losing a real message.  Have 
you any info on false positives?


At 01:56 PM 5/13/03, Peter Bishop wrote:

>I thought I would have another look at case folding
>Same approach as before except I was more careful to preserve
>the mail headers intact when I kludged the test files
>+ I tried to capture all capitalisation of words in the database
>
>This time every capital letter is followed by a hyphen
>e.g.
>FREE  -> F-R-E-E-
>Now -> N-ow
>
>so they are stored in the database as f-r-e-e, n-ow
>There is some risk that this will be the same as existing
>lower case words - but not much:

Using hyphens is a neat trick!  'Tis a good way to persuade bogofilter to 
do something beyond what it really knows.
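
For anyone wanting to reproduce the kludge, here is a minimal sketch in 
Python of the transformation as described above.  It is not Peter's actual 
script: the function names are made up for illustration, only ASCII 
capitals are handled, and the header/body split on the first blank line is 
an assumption about how the headers were kept intact.

```python
import re

def mark_capitals(text: str) -> str:
    """Insert a hyphen after every ASCII capital letter (FREE -> F-R-E-E-)."""
    return re.sub(r"([A-Z])", r"\1-", text)

def mark_message(raw: str) -> str:
    """Apply mark_capitals to the body only, leaving the headers untouched.

    Assumption: the headers end at the first blank line of the message.
    """
    head, sep, body = raw.partition("\n\n")
    return head + sep + mark_capitals(body)

if __name__ == "__main__":
    print(mark_capitals("FREE"))   # all capitals: a hyphen after each letter
    print(mark_capitals("Now"))    # only the leading capital gets a hyphen
    print(mark_message("Subject: Hi\n\nFREE Now"))
```

Once the text is folded to lower case the marked words stay distinct from 
their plain lower-case forms.  How bogofilter's lexer treats the trailing 
hyphen is not modelled here.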

>Increase in database tokens:
>
>2.gz    42585   ->      49211
>3.gz    26565   ->      31296
>ham     12081   ->      14683
>
>False negative performance
>
>test    train   spams   fn      fn(with-caps)
>2.gz    3.gz    3876    19      14
>3.gz    2.gz    1907    9       5
>
>I am a bit suspicious about the first result as the
>count of spams (as split up by formail) changed
>from the unkludged version (increased by 6 to 3822)..
>
>The changes look like they are just about significant:
>one might expect 19+-4 and 9+-3 from chance variation
>
>So maybe capitals have some effect after all !
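
The +-4 and +-3 figures are consistent with treating each false negative 
count as a Poisson count, whose standard deviation is the square root of 
the count.  That reading is an assumption, not something stated in the 
message.  A rough sketch of the check, using the counts from the table 
above (the function names are illustrative):

```python
import math

def poisson_sigma(count: int) -> float:
    """Standard deviation of a Poisson-distributed count: sqrt(count)."""
    return math.sqrt(count)

def sigmas_of_change(before: int, after: int) -> float:
    """Size of the change, measured in standard deviations of 'before'."""
    return (before - after) / poisson_sigma(before)

# (fn, fn with caps) per test set, taken from the table above
results = {"2.gz": (19, 14), "3.gz": (9, 5)}

if __name__ == "__main__":
    for name, (fn_plain, fn_caps) in results.items():
        print(f"{name}: {fn_plain} +- {poisson_sigma(fn_plain):.1f}, "
              f"change = {sigmas_of_change(fn_plain, fn_caps):.2f} sigma")
```

On this crude model both improvements come out at a little over one 
standard deviation.  It ignores that the two runs classify the same 
messages, so it is only a rough guide to significance.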

Bleeding edge bogofilter (cvs after 0.12.3) has a "-Pf" switch, where "P" 
stands for Parsing and "f" stands for case folding.  By default casefolding 
is enabled.  Using "-Pf" disables it.  If you get a current cvs snapshot, 
you'll be able to test a bit more easily.

David




