Tokens including header values?

David Relson relson at osagesoftware.com
Sat Feb 1 01:42:42 CET 2003


At 07:36 PM 1/31/03, Chris Wilkes wrote:

>This spam got past bogofilter today; the -bfvvv one is the result of
>running it through bogofilter -vvv after registering it as spam:
>         http://ladro.com/bf/20020131-01.txt
>         http://ladro.com/bf/20020131-01-bfvvv.txt
>
>As you can see there isn't much there to go off of; the main words I
>could find that show it to be spam are:
>   alice deflowered hymen's
>Which aren't likely to show up in another spam email.  Nor is the
>website, kellu.com.  They'll just change it to be kellv.com next time.
>
>However I have seen quite a lot of spams with my email address in the
>Subject line:
>         Subject: See her get deflowered cwilkes at pobox.com
>Hardly anyone that's going to email me puts my email address in the
>subject.
>
>Likewise, I'm never going to get legitimate email with a bunch of other
>cwilkes's listed in the To:.
>
>Has anyone thought of making bogofilter header and body aware?  I'm not
>sure how much of a gain that would be versus making the code that much
>more complicated.  You could write out to files like spamlist-subject.db
>and goodlist-to.db.
>
>Chris

Chris,

Funny you should mention it.  As bogofilter 0.10.1.x is almost ready for 
promotion as the stable version of bogofilter, my attention is moving 
forward to what's next.

When your message arrived, I was looking at gtkdiff showing bogolexer from the 
current version and an experimental version.  The goal of the experimental 
version is to recognize various standard header phrases and create special 
tokens for each word in that header line.  For example, "Subject: See 
cwilkes at pobox.com" may soon return "subj:see", "subj:cwilkes", and 
"subj:pobox.com".
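
To make the idea concrete, here is a minimal sketch in Python. It is not bogolexer's code, only an illustration of header-prefixed tokens. The function name header_tokens, the prefix table, and the token pattern are all hypothetical; only the "subj:" prefix comes from the example above. It also assumes the " at " in the archived address stands for "@".

```python
import re

# Hypothetical prefix table. Only "subj:" appears in the message above;
# the "to:" and "from:" entries are assumptions added for illustration.
HEADER_PREFIXES = {
    "subject": "subj:",
    "to": "to:",
    "from": "from:",
}

# Assumed token shape: starts with a letter or digit, then letters, digits,
# dots, hyphens, or apostrophes. "@" is not included, so an address splits
# into its local part and its domain.
TOKEN_RE = re.compile(r"[A-Za-z0-9][A-Za-z0-9.'-]*")


def header_tokens(line):
    """Return prefixed, lowercased tokens for a recognized header line.

    Lines without a colon, or with a header name that is not in
    HEADER_PREFIXES, yield an empty list.
    """
    name, sep, value = line.partition(":")
    if not sep:
        return []
    prefix = HEADER_PREFIXES.get(name.strip().lower())
    if prefix is None:
        return []
    # Trim trailing punctuation so "pobox.com." becomes "pobox.com".
    return [prefix + tok.lower().rstrip(".'-") for tok in TOKEN_RE.findall(value)]


print(header_tokens("Subject: See cwilkes@pobox.com"))
# prints ['subj:see', 'subj:cwilkes', 'subj:pobox.com']
```

Because the prefix is part of the token, "subj:cwilkes" and a plain body token "cwilkes" would be counted separately in a single word list, which is one possible alternative to keeping per-header database files.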

Thanks for asking :-)

David





