Re casefolding
David Relson
relson at osagesoftware.com
Tue May 13 20:23:58 CEST 2003
Peter,
I may be missing something in your results, but I don't see false positives
mentioned - just false negatives. False positives matter more than
false negatives because losing a real message costs more than letting a
spam through. I'm not saying that wading through false negatives isn't a
nuisance - just that it's less serious than losing a real message. Have
you any info on false positives?
At 01:56 PM 5/13/03, Peter Bishop wrote:
>I thought I would have another look at case folding
>Same approach as before except I was more careful to preserve
>the mail headers intact when I kludged the test files
>+ I tried to capture all capitalisation of words in the database
>
>This time every capital letter is followed by a hyphen
>e.g.
>FREE -> F-R-E-E-
>Now -> N-ow
>
>so they are stored in the database as f-r-e-e, n-ow
>There is some risk that this will be the same as existing
>lower case words - but not much:
Using hyphens is a neat trick! 'Tis a good way to persuade bogofilter to
do something beyond what it really knows.
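
For anyone wanting to reproduce the kludge, here is a minimal sketch of the marking step in Python. It is an assumption about how the test files could be preprocessed, not Peter's actual script; the names mark_capitals and fold are made up for illustration, and only ASCII capitals are handled.

```python
import re


def mark_capitals(text: str) -> str:
    """Insert a hyphen after every ASCII capital letter (hypothetical helper)."""
    return re.sub(r"([A-Z])", r"\1-", text)


def fold(text: str) -> str:
    """Stand-in for the case folding that bogofilter applies by default."""
    return text.lower()


if __name__ == "__main__":
    for word in ("FREE", "Now", "free"):
        marked = mark_capitals(word)
        print(word, "->", marked, "->", fold(marked))
```

Because the hyphens survive lowercasing, "FREE" and "free" end up as distinct tokens even with case folding enabled. The sketch keeps the trailing hyphen on an all-caps word; how the real tokenizer treats that trailing hyphen is not modelled here.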
>Increase in database tokens:
>
>2.gz 42585 -> 49211
>3.gz 26565 -> 31296
>ham 12081 -> 14683
>
>False negative performance
>
>test   train   spams   fn   fn(with-caps)
>2.gz   3.gz    3876    19   14
>3.gz   2.gz    1907     9    5
>
>I am a bit suspicious about the first result, as the
>count of spams (as split up by formail) changed
>from the unkludged version (increased by 6 to 3822).
>
>The changes look like they are just about significant:
>one might expect a variation of 19+-4 and 9+-3 from chance alone
>
>So maybe capitals have some effect after all !
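
The +-4 and +-3 figures quoted above are consistent with treating each false negative count as a Poisson variable, whose standard deviation is the square root of the count. That reading is an assumption, since the method is not stated. A small sketch of the arithmetic, using the counts from the table (poisson_sigma is a name made up here):

```python
import math

# (test file, fn without caps marking, fn with caps marking), from the table
results = [("2.gz", 19, 14), ("3.gz", 9, 5)]


def poisson_sigma(count: int) -> float:
    """Standard deviation of a Poisson count, assumed here to be sqrt(count)."""
    return math.sqrt(count)


if __name__ == "__main__":
    for name, fn_plain, fn_caps in results:
        sigma = poisson_sigma(fn_plain)
        drop = fn_plain - fn_caps
        print(f"{name}: {fn_plain} +- {sigma:.1f}, "
              f"drop of {drop} = {drop / sigma:.1f} sigma")
```

On this estimate each drop is a little over one standard deviation, which fits the "just about significant" description rather than a clear-cut effect.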
Bleeding edge bogofilter (cvs after 0.12.3) has a "-Pf" switch, where "P"
stands for Parsing and "f" stands for case folding. By default casefolding
is enabled. Using "-Pf" disables it. If you get a current cvs snapshot,
you'll be able to test a bit more easily.
David