How to avoid s p lit up wor ds?

Fri Jan 17 21:58:54 CET 2003

At 03:42 PM 1/17/03, Chris Wilkes wrote:

>On Fri, Jan 17, 2003 at 03:29:37PM -0500, David Relson wrote:
> >
> > Do you remember what the spam was using to split up the words?  I'll do
> > some experiments using bogolexer, but it'd be helpful to know what the
> > original looked like.
>
>I'm trying to recall the spam that had it a lot, but here's one using it
>in the subject line:
>   Subject: A n t i * A g i n g   M_i_r_a_c_l_e W_o_r_k_e_r
>which is kind of extreme as you could easily say "An ti  Agi ng"
>
> > 0.9.1.2 is the latest stable version.  The mime processing code is
> > presently only available from cvs on SourceForge.  If you want, I can make
> > source and/or binary rpms available for you with the latest code.
>
>Figured that out right after I sent off the email.  I'm a 0.9.1.2 person
>and will look into checking out the cvs'ed code.
>
>What's happening with MIME right now?  BF just ignores those headers,
>but it still goes into the body of a mime message, right?
>
>Chris

Bogofilter currently recognizes the MIME headers; it acts on some and
ignores others.  Ignored are 7bit, 8bit, etc.; processed are base64, qp
(quoted-printable), and uuencode.  Also, the lexer has been split into three
parts: one for the header, which cares about From, Date, Message-ID, normal
tokens, etc.; a second for plain text, which cares about From, mime
boundaries, IP addresses, normal tokens, etc.; and a third for html, which
cares about From, mime boundaries, html tags, html comments, etc.  At the
moment tokens inside html tags are ignored, though that is subject to
change in the future.
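
For illustration only, here is a rough Python sketch of the decoding step
described above: pick a decoder from the Content-Transfer-Encoding value
and pass everything else (7bit, 8bit, ...) through untouched.  This is not
bogofilter's C code; the function name decode_body and the accepted
spellings of the uuencode label are assumptions made for the example.

```python
import base64
import binascii
import quopri


def decode_body(payload: bytes, encoding: str) -> bytes:
    """Decode a MIME part body according to its transfer encoding.

    Hypothetical helper for illustration; not part of bogofilter.
    """
    enc = encoding.strip().lower()
    if enc == "base64":
        return base64.b64decode(payload)
    if enc in ("quoted-printable", "qp"):
        return quopri.decodestring(payload)
    if enc in ("x-uuencode", "uuencode", "x-uue", "uue"):
        chunks = []
        for line in payload.splitlines():
            # Skip the "begin <mode> <name>" header, the terminator
            # line, the trailing "end", and any blank lines.
            if line.startswith(b"begin ") or line.strip() in (b"end", b"`", b""):
                continue
            chunks.append(binascii.a2b_uu(line))
        return b"".join(chunks)
    # 7bit, 8bit, binary, or anything unrecognized: leave as-is.
    return payload
```

The decoded bytes would then be handed to whichever lexer (plain text or
html) matches the part's content type.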

By the way, I did reproduce your problem with "buy to<br>ner 
car<doodaa>tri<x>dg<z>es".  I'll experiment with it a bit...
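
One possible way to deal with that case, sketched in Python: delete the
html tags before tokenizing, so the fragments on either side of a tag are
joined back into one word.  This is only an assumption about what a fix
could look like, not what bogofilter does; strip_tags and tokens are
made-up names for the example.

```python
import re

# Anything from "<" to the next ">" is treated as a tag, whether or not
# it is a real html element ("<doodaa>" counts too).
TAG = re.compile(r"<[^>]*>")


def strip_tags(text: str) -> str:
    """Remove tags without inserting a separator, so split words rejoin."""
    return TAG.sub("", text)


def tokens(text: str) -> list:
    """Return the alphabetic tokens left after the tags are removed."""
    return re.findall(r"[A-Za-z]+", strip_tags(text))


print(tokens("buy to<br>ner car<doodaa>tri<x>dg<z>es"))
# prints ['buy', 'toner', 'cartridges']
```

The trade-off is that legitimate markup such as "foo<br>bar" also becomes
"foobar", so a real implementation would likely need to treat some tags
as word breaks.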