Degeneration thought

Sam Hills rcb at bbll.com
Wed Jun 18 19:02:25 CEST 2003



>>>The idea here is that these weird forms are fairly rare so they can be
>>>stored separately with no great storage cost - degeneration is to one of
>>>the standard forms only - not to any of the other weird formats.
>>>      
>>>
>>Time will tell if the weird formats are rare, or not.  Certainly one lookup
>>for the three standard forms is a winning speed strategy.
>>    
>>
>
>I am not sure how valid this is, but we could have a fourth count for *all* 
>weird formats. So tHe, tHE, ThE etc. all count as weird.
>This might result in higher pspam values as all weird forms are counted as 
>one form, but this only happens if the token has several different weird 
>formats (rare event?).
>  
>
In the spam I've seen, it isn't so rare.

Some words (which are already spammish) are mis-capitalized far more 
often than others.  Furthermore, deliberately mis-capitalized words are 
mis-capitalized in every possible way, with seemingly roughly equal 
probability.

IMHO, a single count for 'mis-capitalized' might even give better 
results than maintaining a separate count for each possible variation.
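
To make the single-count idea concrete, here is a minimal sketch in Python. It is not bogofilter code: the function names, the form labels and the classification rule (all lower, all upper, or first letter upper with the rest lower; anything else is weird) are my assumptions for illustration only.

```python
from collections import Counter

def case_form(token):
    """Classify a token as one of three standard forms, or 'weird'."""
    if token == token.lower():
        return "lower"          # e.g. "the"
    if token == token.upper():
        return "upper"          # e.g. "THE"
    if token[0].isupper() and token[1:] == token[1:].lower():
        return "capitalized"    # e.g. "The"
    return "weird"              # e.g. "tHe", "tHE", "ThE"

def count_forms(tokens):
    """Map each case-folded token to a Counter over the four forms."""
    counts = {}
    for tok in tokens:
        counts.setdefault(tok.lower(), Counter())[case_form(tok)] += 1
    return counts

# All weird variants of one word share a single count.
counts = count_forms(["the", "The", "THE", "tHe", "tHE", "ThE"])
```

Note that under this rule a name like McDonald also lands in the weird bucket, so its weird count would simply be high in non-spam.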

>If there are no weird formats, the fourth count need not be stored (if you
>can do right truncation of count fields), so it might be quite space 
>efficient.
>
>Information is lost if we do this, as we cannot regenerate the actual 
>weird tokens and their counts (e.g. via bogoutil); we would have to output 
>the token in a specific weird format, like "tHe", rather than the actual one
>- but maybe it does not matter too much - after all, we lost information 
>about formats in the original casefolded database.
> 
>  
>
I don't think it matters.  Although proper names (such as McDonald or 
BogoFilter) will be capitalized the same way almost every time in 
non-spam and genuine typos (in non-spam) will be relatively rare, 
deliberately mis-capitalized words are mis-capitalized in every possible 
way and they're almost always a strong indication of spam.  It should be 
sufficient to simply report 'mis-capitalized' as a single form -- it 
doesn't really matter which capitalization variants were seen.
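
A rough sketch of how right truncation and the single 'mis-capitalized' report could fit together, again in Python. The field order, the function names and the "(mis-capitalized)" label are all hypothetical; this is neither bogofilter's record layout nor bogoutil's output format.

```python
# Hypothetical field order: three standard forms first, the catch-all last.
FORMS = ("lower", "capitalized", "upper", "weird")

def pack(counts):
    """Right-truncate: drop trailing zero counts before storing."""
    fields = [counts.get(form, 0) for form in FORMS]
    while fields and fields[-1] == 0:
        fields.pop()
    return fields

def unpack(fields):
    """Pad a stored record back out to all four counts."""
    padded = list(fields) + [0] * (len(FORMS) - len(fields))
    return dict(zip(FORMS, padded))

def dump(token, fields):
    """List (label, count) pairs for a folded token, skipping zero counts.

    The weird count gets a generic label because the original spellings
    were never stored.
    """
    counts = unpack(fields)
    labels = {
        "lower": token.lower(),
        "capitalized": token.capitalize(),
        "upper": token.upper(),
        "weird": token.lower() + " (mis-capitalized)",
    }
    return [(labels[form], counts[form]) for form in FORMS if counts[form]]
```

A token never seen in a weird form stores at most three fields, while a record such as [1, 0, 0, 4] dumps as the lower-case form plus one generic mis-capitalized entry.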