Degeneration thought
Sam Hills
rcb at bbll.com
Wed Jun 18 19:02:25 CEST 2003
>>>The idea here is that these weird forms are fairly rare so they can be
>>>stored separately with no great storage cost - degeneration is to one of
>>>the standard forms only - not to any of the other weird formats.
>>>
>>>
>>Time will tell if the weird formats are rare, or not. Certainly one lookup
>>for the three standard forms is a winning speed strategy.
>>
>>
>
>I am not sure how valid this is, but we could have a fourth count for *all*
>weird formats. So tHe, tHE, ThE etc. all count as weird.
>This might result in higher pspam values as all weird forms are counted as
>one form, but this only happens if the token has several different weird
>formats (rare event?).
>
>
In the spam I've seen, it isn't so rare.
Some words (which are already spammish) are mis-capitalized far more
often than others. Furthermore, deliberately mis-capitalized words are
mis-capitalized in every possible way, with roughly equal
probability.
IMHO, a single count for 'mis-capitalized' might even give better
results than maintaining a separate count for each possible variation.
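
A rough sketch of the single-count idea, in Python for brevity. It assumes the three standard forms are all-lowercase, initial-capital and all-uppercase (my guess at what is meant), and every name here (case_form, count_tokens, the bucket constants) is made up for illustration; none of it is bogofilter code.

```python
from collections import defaultdict

# Assumed bucket order: three "standard" forms, then one shared bucket
# for every other capitalization. These names are hypothetical.
LOWER, CAPITALIZED, UPPER, WEIRD = range(4)

def case_form(token):
    """Return the bucket index for a token's capitalization."""
    if token == token.lower():
        return LOWER
    if token == token.upper():
        return UPPER
    if token == token.capitalize():
        return CAPITALIZED
    return WEIRD  # tHe, tHE, ThE, ... all land here

def count_tokens(tokens):
    """Map each casefolded token to its four per-form counts."""
    counts = defaultdict(lambda: [0, 0, 0, 0])
    for tok in tokens:
        counts[tok.lower()][case_form(tok)] += 1
    return dict(counts)

counts = count_tokens(["the", "The", "THE", "tHe", "tHE", "ThE", "McDonald"])
```

Here counts["the"] comes out as [1, 1, 1, 3]: one lookup on the folded token yields all four counts, and the three weird variants share the last one. Note that a name like McDonald also falls into the fourth bucket under this classification.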
>If there are no weird formats, the fourth count need not be stored (if you
>can do right truncation of count fields), so it might be quite space
>efficient.
>
>Information is lost if we do this, as we cannot regenerate the actual
>weird tokens and their counts (e.g. via bogoutil); we would have to output the
>token in one specific weird format, like "tHe", rather than the actual one.
>But maybe it does not matter too much - after all, we already lost information
>about formats in the original casefolded database.
>
>
>
I don't think it matters. Proper names (such as McDonald or
BogoFilter) will be capitalized the same way almost every time in
non-spam, and genuine typos (in non-spam) will be relatively rare.
Deliberately mis-capitalized words, by contrast, are mis-capitalized in
every possible way, and they're almost always a strong indication of
spam. It should be sufficient to simply report 'mis-capitalized' as a
single form; it doesn't really matter which capitalization variants
were seen.
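
To make the storage and reporting side concrete, here is a small Python sketch, assuming four counts per folded token in the order lower, capitalized, upper, mis-capitalized. The function names and the dump format are invented for illustration; this is not bogoutil's actual output or bogofilter's record layout.

```python
def pack_counts(counts):
    """Right-truncate: drop trailing zero counts before storing."""
    end = len(counts)
    while end > 0 and counts[end - 1] == 0:
        end -= 1
    return tuple(counts[:end])

def unpack_counts(packed, width=4):
    """Restore the full-width record, padding missing fields with zero."""
    return list(packed) + [0] * (width - len(packed))

def dump_line(folded, packed):
    """Report the weird bucket as a single 'mis-capitalized' form."""
    labels = ("lower", "capitalized", "upper", "mis-capitalized")
    fields = unpack_counts(packed)
    pairs = " ".join(f"{label}={n}" for label, n in zip(labels, fields))
    return f"{folded} {pairs}"

common = pack_counts([5, 2, 0, 0])   # no weird forms seen: two fields stored
spammy = pack_counts([1, 1, 1, 3])   # weird forms seen: all four stored
line = dump_line("the", spammy)
```

A token that was never seen mis-capitalized costs nothing extra, since its trailing zero is simply not stored, and the dump reports only how often the token was mis-capitalized, not which variants occurred.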