[Fwd: Re: [cvs] Potential for error?]
Tom Allison
tallison at tacocat.net
Tue Oct 22 12:04:01 CEST 2002
Allyn Fratkin wrote:
>> > Also, I noticed that there were a lot of words in my lists that weren't
>> > words. Things like ab34af127 would be listed, but only once. Based on
>> > this, eventually the list files will bloat to infinity.
>
> are you possibly training bogofilter using mailboxes from microsoft
> windows, which use CRLF line endings? bogofilter up through 0.7.5 does not
> recognize and discard base64 attachments correctly with CRLF (the CR
> is throwing it off). it is treating them as normal text and parsing the
> base64 data as words. i submitted a fix for this but it didn't make it
> into 0.7.5.
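(For anyone else bitten by this before a fixed release ships, a one-line
normalization pass before training seems to work around it; the mailbox
names here are just examples:

perl -pe 's/\r$//' windows.mbox > clean.mbox

then train bogofilter from clean.mbox as usual.)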
>
> my good word db went from 50MB to 3MB after i figured out and fixed this
> problem. i guess i get a lot of attachments. :-)
>
> by the way, it occurs to me that bogofilter will think any single word
> on a line is base64 and discard it, based on the regexp it uses to
> "recognize" base64. i guess this is not too serious until spammers
> start sending messages with only one word per line. :-)
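That matches what I see in my lists. A stricter test might insist on a
minimum length before calling a line base64; something like this (untested,
and the 60-character cutoff is just my guess at a typical base64 line width):

# require a long, line-filling run of base64 characters (plus optional
# = padding) before discarding; short single words fall through to the lexer
next if /^[a-zA-Z0-9+\/]{60,}={0,2}\s*$/;

That would still miss the short final line of an attachment, but those few
tokens are noise compared to treating whole attachments as words.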
>
>> Similarly, one could periodically discard any tokens whose good+spam
>> count is 1.
>
> did you mean good=spam? i think you would definitely
> want to keep a word that only appeared in one of the lists.
I was poring through some email that I had been parsing using my "best
guess" at Paul Graham's regex (my version is posted below) and came up with
something to the tune of:
51,896 spam tokens across 8,812 emails where the count was >1. I deleted
approximately 50,000 words that were used only once in spam and not listed
in the good token list. (singularity = appears only once, in only one list)
Similarly, the good token list holds 22,649 tokens drawn from 2,825 emails.
I deleted just under 20,000 words that didn't show up in spam and were used
only once, in one email.
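The pruning itself was just a pair of deletes through DBI; roughly this,
though the table and column names are from my own ad-hoc schema:

use DBI;

# connect to the token database (database name is mine; adjust to taste)
my $dbh = DBI->connect('dbi:Pg:dbname=bogotest', '', '', { RaiseError => 1 });

# drop spam tokens seen exactly once that never appear on the good side
$dbh->do(q{DELETE FROM spam_tokens
            WHERE count = 1
              AND token NOT IN (SELECT token FROM good_tokens)});

# and the mirror image for the good list
$dbh->do(q{DELETE FROM good_tokens
            WHERE count = 1
              AND token NOT IN (SELECT token FROM spam_tokens)});

$dbh->disconnect;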
Looking through this list, and knowing a little about myself, we are no
longer counting words in the English language. I tried to escape from the
email when I found a line matching
/^content-transfer-encoding: base64/i
but my knowledge of email (or lack thereof) still let in some great
base64-looking text. And there are an awful lot of tokens that I don't
recognize as anything definitive in English, HTML, or JavaScript.
It might be helpful to have a better lexer involved (I sketch one idea
after the code below). I picked up almost 50% potential base64 garble
before I deleted all the singularities.
Ordinarily I would not remove a singularity, as many of them are/were
legitimate words. But after running 20,000+ emails I figured I had a good
enough sample for my needs.
I have posted below the Perl snippet I used to break up the word-tokens.
my %words;

while (<>) {
    $_ = lc;                            # force everything to lowercase
    next if /^message-id/o;             # skip Message-ID headers
    last if /^content-transfer-encoding: base64/;   # stop at base64 parts
    foreach my $word (/(\<?[\w\'\-\$]+\>?)/g) {
        next if $word =~ /^[\-\d]+$/o;  # discard pure numbers
        next if $word =~ /^[\-_]+$/o;   # discard dashed/underscore runs
        $word =~ s/[\<\>\']+//g;        # strip angle brackets, apostrophes
        next unless $word =~ /\w\w/o;   # ignore single-character words
        $words{$word}++;
    }
}
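As for the better lexer: one idea I have not tried yet is to let MIME-tools
do the part-splitting and only tokenize decoded text parts, instead of
pattern-matching raw base64. A rough sketch (MIME::Parser is the real
module; the rest is illustrative):

use MIME::Parser;

my $parser = MIME::Parser->new;
$parser->output_to_core(1);            # keep parts in memory, no temp files
my $entity = $parser->parse(\*STDIN);  # parse one message from stdin

# walk the MIME tree, keeping only text/* parts for the word counter
my @stack = ($entity);
while (my $part = pop @stack) {
    if ($part->parts) {
        push @stack, $part->parts;     # multipart: descend into subparts
    }
    elsif ($part->effective_type =~ m{^text/}i) {
        my $body = $part->bodyhandle->as_string;   # body, already decoded
        # ...feed $body through the tokenizing loop above...
    }
}

That would make the base64 escape hatch unnecessary and should also survive
the CRLF mailboxes Allyn mentioned.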
If someone would be willing to review my code and suggest how to improve
the method (including base64 handling?), I would be willing to re-run this
analysis using the same body of emails. It takes a while for Perl and
PostgreSQL to tear apart and load the data, but I think it's more
worthwhile than setiathome can be at the moment.
--
The study of non-linear physics is like the study of non-elephant biology.