MIME content-type tokenization

Mon Feb 23 05:19:25 CET 2004

Hello Fluffy,

On Sun, 22 Feb 2004 21:00:20 -0700
fluffy wrote:

> Sorry if this has been discussed before, but a cursory search on the
> various list archives didn't find anything about this specific issue,
> so:
> 
> When bogofilter works with the token "Content-Type," it essentially
> disregards the actual type of the content because of the way it is
> tokenized.  So, whenever someone emails me an audio file for a
> creative-commons music project I'm working on, it almost always gets
> classified as spam, since as far as Bogofilter can tell, a message
> with nothing but a large mp3 is more or less the same as a spammer
> sending a message which is nothing but a huge GIF (which comprises a
> major amount of my spam).

In both cases, bogofilter treats the attachment as binary data that
cannot (and should not) be tokenized.  The message score is therefore
based on the other tokens in the message.

Whether a message is classified as spam depends on the tokens in the
message and on how you've trained bogofilter.  If you want to learn more
about _why_ bogofilter treats a given message as spam, use the "-vvv"
option to have bogofilter list all the tokens in the message and their
individual scores.  That will show you which tokens caused bogofilter to
think the message is spam.

> Has any thought gone into tokenizing headers (particularly MIME) as a
> single chunk?  i.e. tokenizing as "Content-type: audio/mpeg" instead
> of (or in addition to) "Content-Type" "audio" "mpeg" as it does
> currently. This would make bogofilter at least somewhat sensitive to
> the content of MIMEd data without requiring any special processing of
> it (aside from MIME chunks which are text/* which could still continue
> to be processed through the normal tokenizer).

Whether it's better to have one long token or three short ones can only
be determined by building two versions of bogofilter and scoring a lot
of messages to see which way works better.  In my experience, many
fine-grained tokens work better than a few coarse ones, so I'd vote for
the short ones rather than the long one, though I could be wrong.
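
To make the two alternatives concrete, here is a minimal Python sketch
of both tokenizations of one header line.  It is not bogofilter's actual
lexer; the function names and the splitting rule are assumptions made
purely for illustration.

```python
import re

HEADER = "Content-Type: audio/mpeg"


def short_tokens(header):
    """Split a header line into separate word tokens (illustrative rule)."""
    # Split on any run of characters that is not a letter, digit, or hyphen.
    return [tok for tok in re.split(r"[^A-Za-z0-9-]+", header) if tok]


def long_token(header):
    """Keep the whole header line as a single token, whitespace normalized."""
    name, _, value = header.partition(":")
    return [f"{name.strip()}: {value.strip()}"]


print(short_tokens(HEADER))  # ['Content-Type', 'audio', 'mpeg']
print(long_token(HEADER))    # ['Content-Type: audio/mpeg']
```

With the short form, "audio" and "mpeg" each collect their own
statistics; with the long form, only the exact combination does.  Which
one scores better is still an empirical question.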

> Maybe multiple tokenization methods would also benefit the other
> discussion going on regarding how to deal with word-splitting tokens
> (like, the word "Via-gr-a" could be tokenized as both "Viagra" and
> "Via" "gr" "a").

That gives double weight to a token, which is contrary to Bayesian
principles.  Also, the peculiar spellings that spammers use to avoid
rule-based filters are like red flags to a Bayesian filter.  They say
"look at me.  Somebody is furtively trying to escape notice."
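
The double-weight effect can be seen in a small Python sketch.  It uses
a simplified naive Bayes combination of per-token probabilities, which
is an assumption for illustration and not bogofilter's actual scoring
code; the probabilities are made up.

```python
from math import prod


def combine(probs):
    """Combine per-token spam probabilities, assuming independence."""
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# One spammy token (0.9) plus two mildly hammy tokens (0.4 each).
once = combine([0.9, 0.4, 0.4])

# The same message, but the spammy token's evidence is counted twice.
twice = combine([0.9, 0.9, 0.4, 0.4])

print(f"counted once:  {once:.3f}")   # 0.800
print(f"counted twice: {twice:.3f}")  # 0.973
```

Nothing new was learned about the message, yet the score moved
noticeably, because the independence assumption was violated.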

> This could be especially beneficial to, say, a doctor who actually
> *does* handle legitimate email about certain prescription medications.
> :)

Since bogofilter distinguishes between the proper and improper spellings
(based on its training), your doctor friend should be fine :-)
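
As a rough sketch of why that works, here is a Python example with
hypothetical training counts.  The formula is a simple ratio of
per-corpus frequencies, chosen for illustration; it is not claimed to be
bogofilter's exact calculation, and all numbers are invented.

```python
def spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Estimate how spammy a token is from training counts."""
    spam_rate = spam_hits / n_spam
    ham_rate = ham_hits / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical corpus sizes and (spam, ham) message counts per token,
# as they might look for a doctor who gets legitimate mail about the drug.
N_SPAM, N_HAM = 1000, 1000
counts = {"viagra": (50, 50), "via-gr-a": (30, 0)}

scores = {
    tok: spam_probability(s, h, N_SPAM, N_HAM)
    for tok, (s, h) in counts.items()
}
print(scores)  # {'viagra': 0.5, 'via-gr-a': 1.0}
```

The proper spelling ends up neutral because it appears in both kinds of
mail, while the obfuscated spelling only ever shows up in spam.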

David