[Fwd: Re: [cvs] Potential for error?]
Tom Allison
tallison at tacocat.net
Tue Oct 22 12:04:01 CEST 2002
Allyn Fratkin wrote:
>> > Also, I noticed that there were a lot of words in my lists that weren't
>> > words. Things like ab34af127 would be listed, but only once. Based on
>> > this, eventually the list files will bloat to infinity.
>
> are you possibly training bogofilter using mailboxes from microsoft
> windows, which use CRLF line endings? bogofilter up through 0.7.5 does not
> recognize and discard base64 attachments correctly with CRLF (the CR
> is throwing it off). it is treating them as normal text and parsing the
> base64 data as words. i submitted a fix for this but it didn't make it
> into 0.7.5.
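(For anyone else bitten by this before a fixed release ships, a one-line
normalization pass before training seems to work around it; the mailbox
names here are just examples:

perl -pe 's/\r$//' windows.mbox > clean.mbox

then train bogofilter from clean.mbox as usual.)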
>
> my good word db went from 50MB to 3MB after i figured out and fixed this
> problem. i guess i get a lot of attachments. :-)
>
> by the way, it occurs to me that bogofilter will think any single word
> on a line is base64 and discard it, based on the regexp it uses to
> "recognize" base64. i guess this is not too serious until spammers
> start sending messages with only one word per line. :-)
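That matches what I see in my lists. A stricter test might insist on a
minimum length before calling a line base64; something like this (untested,
and the 60-character cutoff is just my guess at a typical base64 line width):

# require a long, line-filling run of base64 characters (plus optional
# = padding) before discarding; short single words fall through to the lexer
next if /^[a-zA-Z0-9+\/]{60,}={0,2}\s*$/;

That would still miss the short final line of an attachment, but those few
tokens are noise compared to treating whole attachments as words.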
>
>> Similarly, one could periodically discard any tokens whose good+spam
>> count is 1.
>
> did you mean good=spam? i think you would definitely
> want to keep a word that only appeared in one of the lists.
I was poring through some email that I had been parsing using my "best
guess" at Paul Graham's regex (my version is posted below) and came up with
something to the tune of:
51,896 spam tokens across 8,812 emails where the count was >1. I deleted
approximately 50,000 words that were used only once in spam and not listed
in the good token list. (singularity = appears only once, in only one list)
Similarly, the good token list holds 22,649 tokens drawn from 2,825 emails.
I deleted just under 20,000 words that didn't show up in spam and were used
only once, in one email.
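The pruning itself was just a pair of deletes through DBI; roughly this,
though the table and column names are from my own ad-hoc schema:

use DBI;

# connect to the token database (database name is mine; adjust to taste)
my $dbh = DBI->connect('dbi:Pg:dbname=bogotest', '', '', { RaiseError => 1 });

# drop spam tokens seen exactly once that never appear on the good side
$dbh->do(q{DELETE FROM spam_tokens
            WHERE count = 1
              AND token NOT IN (SELECT token FROM good_tokens)});

# and the mirror image for the good list
$dbh->do(q{DELETE FROM good_tokens
            WHERE count = 1
              AND token NOT IN (SELECT token FROM spam_tokens)});

$dbh->disconnect;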
Looking through this list, and knowing a little about myself, we are no
longer counting words in the English language. I tried to escape from the
email when I found a line matching
/^content-transfer-encoding: base64/i
but my knowledge of email (or lack thereof) still let in some great
base64-looking text. And there are an awful lot of tokens that I don't
recognize as anything definitive in English, HTML, or JavaScript.
It might be helpful to have a better lexer involved (I sketch one idea
after the code below). I picked up almost 50% potential base64 garble
before I deleted all the singularities.
Ordinarily I would not remove a singularity, as many of them are/were
legitimate words. But after running 20,000+ emails I figured I had a good
enough sample for my needs.
I have posted below the Perl snippet I used to break up the word-tokens.
my %words;

while (<>) {
    $_ = lc;                            # force everything to lowercase
    next if /^message-id/o;             # skip Message-ID headers
    last if /^content-transfer-encoding: base64/;   # stop at base64 parts
    foreach my $word (/(\<?[\w\'\-\$]+\>?)/g) {
        next if $word =~ /^[\-\d]+$/o;  # discard pure numbers
        next if $word =~ /^[\-_]+$/o;   # discard dashed/underscore runs
        $word =~ s/[\<\>\']+//g;        # strip angle brackets, apostrophes
        next unless $word =~ /\w\w/o;   # ignore single-character words
        $words{$word}++;
    }
}
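As for the better lexer: one idea I have not tried yet is to let MIME-tools
do the part-splitting and only tokenize decoded text parts, instead of
pattern-matching raw base64. A rough sketch (MIME::Parser is the real
module; the rest is illustrative):

use MIME::Parser;

my $parser = MIME::Parser->new;
$parser->output_to_core(1);            # keep parts in memory, no temp files
my $entity = $parser->parse(\*STDIN);  # parse one message from stdin

# walk the MIME tree, keeping only text/* parts for the word counter
my @stack = ($entity);
while (my $part = pop @stack) {
    if ($part->parts) {
        push @stack, $part->parts;     # multipart: descend into subparts
    }
    elsif ($part->effective_type =~ m{^text/}i) {
        my $body = $part->bodyhandle->as_string;   # body, already decoded
        # ...feed $body through the tokenizing loop above...
    }
}

That would make the base64 escape hatch unnecessary and should also survive
the CRLF mailboxes Allyn mentioned.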
If someone would be willing to review my code and suggest how to improve
the method (including base64 handling?), I would be willing to re-run this
analysis using the same body of emails. It takes a while for Perl and
PostgreSQL to tear apart and load the data, but I think it's more
worthwhile than setiathome can be at the moment.
--
The study of non-linear physics is like the study of non-elephant biology.