MIME content-type tokenization

Mon Feb 23 05:00:20 CET 2004

Sorry if this has been discussed before, but a cursory search on the
various list archives didn't find anything about this specific issue,
so:

When bogofilter processes a "Content-Type" header, it essentially
disregards the actual type of the content because of the way the
header is tokenized.  So whenever someone emails me an audio file for
a Creative Commons music project I'm working on, it almost always gets
classified as spam: as far as bogofilter can tell, a message
containing nothing but a large MP3 is more or less the same as a
spammer's message containing nothing but a huge GIF (which makes up a
large share of my spam).

Has any thought gone into tokenizing headers (particularly MIME
headers) as a single chunk?  That is, emitting "Content-Type:
audio/mpeg" as one token instead of (or in addition to) the separate
tokens "Content-Type" "audio" "mpeg", as bogofilter does currently.
This would make bogofilter at least somewhat sensitive to the type of
MIME-encoded data without requiring any special processing of the
data itself (MIME parts of type text/* could still be processed
through the normal tokenizer).
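
A rough sketch of what I mean, in Python rather than bogofilter's C,
and not based on bogofilter's actual lexer. The function name
tokenize_content_type and the lowercased "content-type:audio/mpeg"
token format are made up for illustration; the only point is that the
combined token is emitted alongside the existing split tokens:

```python
import re


def tokenize_content_type(header_line):
    """Return a combined token plus the usual split tokens.

    Hypothetical helper: the token format is an assumption for
    illustration, not what bogofilter actually emits.
    """
    name, _, value = header_line.partition(":")
    name = name.strip().lower()
    # Keep only the media type; drop parameters such as charset or name.
    media_type = value.split(";", 1)[0].strip().lower()
    combined = f"{name}:{media_type}"
    # The split tokens approximate the current behaviour described above.
    split = [name] + [part for part in re.split(r"[/\s]+", media_type) if part]
    return [combined] + split


print(tokenize_content_type('Content-Type: audio/mpeg; name="song.mp3"'))
# ['content-type:audio/mpeg', 'content-type', 'audio', 'mpeg']
```

With the combined token in the word list, "content-type:audio/mpeg"
and "content-type:image/gif" can pick up different spam scores, even
though "content-type" on its own tells the classifier nothing.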

Multiple tokenization methods might also help with the other ongoing
discussion about how to handle words split up by punctuation (for
example, "Via-gr-a" could be tokenized both as "Viagra" and as
"Via" "gr" "a").

This could be especially beneficial to, say, a doctor who actually
*does* handle legitimate email about certain prescription medications.
:)
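
The same idea applied to split-up words might look something like
this. Again, this is only a sketch: tokenize_both_ways is a
hypothetical name, and treating every non-alphanumeric character as a
separator is an assumption on my part, not bogofilter's actual rule:

```python
import re


def tokenize_both_ways(word):
    """Return the joined form of a split-up word plus its pieces.

    Hypothetical helper; the separator rule is an assumption.
    """
    # Split on runs of anything that is not a letter or digit.
    pieces = [p for p in re.split(r"[^A-Za-z0-9]+", word) if p]
    tokens = list(pieces)
    if len(pieces) > 1:
        # Also emit the word with the separators removed.
        tokens.insert(0, "".join(pieces))
    return tokens


print(tokenize_both_ways("Via-gr-a"))
# ['Viagra', 'Via', 'gr', 'a']
```

A word with no separators comes back as a single token, so ordinary
text would be tokenized exactly as before.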

--
http://trikuare.cx/