MIME content-type tokenization

fluffy magenta at trikuare.cx
Mon Feb 23 07:31:50 CET 2004


On Feb 22, 2004, at 9:19 PM, David Relson wrote:

> In both these cases, bogofilter treats the attachment as binary that
> can't (shouldn't) be tokenized.  The message score is based on the
> other tokens of the message.

Yes, I understand that.  What I was referring to was the Content-Type 
MIME header itself - it is currently tokenized as separate words, even 
though semantically it's a single unit of information, and it would be 
more useful for classification if it were kept together as one token.

> The treatment of the message as spam is based on the tokens in the
> message and how you've trained bogofilter.  If you want to learn more
> of _why_ bogofilter treats a given message as spam, use the "-vvv"
> option to have bogofilter list all the tokens in the message and their
> individual scores.  That will tell you which tokens caused bogofilter
> to think the message is spam.

Yes, and it was from running bogofilter -vvv that I saw that the 
Content-Type header was being tokenized in this way.

>> Has any thought gone into tokenizing headers (particularly MIME) as a
>> single chunk?  i.e. tokenizing as "Content-type: audio/mpeg" instead
>> of(or in addition to) "Content-Type" "audio" "mpeg" as it does
>> currently. This would make bogofilter at least somewhat sensitive to
>> the content of MIMEd data without requiring any special processing of
>> it (aside from MIME chunks which are text/* which could still continue
>> to be processed through the normal tokenizer).
>
> Whether it's better to have 1 long token or 3 short ones can only be
> determined by having two versions of bogofilter and scoring a lot of
> messages to see which way works better.  My experience indicates that
> it's better to have more detailed tokens than fewer tokens.  I'd vote
> for the short-ones rather than the long one, though I could be wrong.

Well, I was thinking that it would be used in *both* forms - i.e. 
tokenize it both as 'Content-type: audio/mpeg' *and* 'Content-type' 
'audio' 'mpeg'.
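To make the idea concrete, here's a minimal sketch (not bogofilter's 
actual lexer - the function name and normalization are my own) of 
emitting a Content-Type header both as one combined token and as the 
separate word tokens:

```python
import re

def tokenize_content_type(header_value):
    """Tokenize a Content-Type value both whole and in pieces.

    Hypothetical sketch, not bogofilter's real tokenizer: emits the
    whole type/subtype as a single token, plus each word-like piece
    as its own token (as the current word-splitting lexer would).
    """
    tokens = []
    # Combined token keeps type/subtype together as one unit.
    tokens.append("Content-Type:" + header_value.strip().lower())
    # Word-level tokens, splitting on non-alphanumeric characters.
    tokens.extend(re.findall(r"[A-Za-z0-9]+", header_value))
    return tokens

# tokenize_content_type("audio/mpeg")
#   -> ["Content-Type:audio/mpeg", "audio", "mpeg"]
```

The classifier then gets to learn from whichever form turns out to be 
more discriminating, rather than us deciding up front.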

>> Maybe multiple tokenization methods would also benefit the other
>> discussion going on regarding how to deal with word-splitting tokens
>> (like, the word "Via-gr-a" could be tokenized as both "Viagra" and
>> "Via" "gr" "a").
>
> That's giving double weight to a token, which is contrary to the
> bayesian principles.  Also, the peculiar spellings that spammers use to
> avoid rule based filters are like red flags to a bayesian filter.  They
> say "look at me.  Somebody is furtively trying to escape notice."

It's not giving double weight to a token, it's treating a single group 
of letters as two different tokens.  The red-flag weird spellings would 
still be there.
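The same dual-tokenization idea applied to obfuscated words might look 
like this (again, a hypothetical sketch, not existing bogofilter code): 
the split pieces are still emitted as-is, and the rejoined word is 
added as one more token alongside them:

```python
import re

def tokenize_obfuscated(word):
    """Emit the split pieces of a broken-up word plus the rejoined form.

    Illustrative sketch: "Via-gr-a" yields "Via", "gr", "a" (the
    red-flag spelling is preserved) plus the de-obfuscated "Viagra".
    """
    parts = re.split(r"[-._]+", word)
    tokens = list(parts)
    if len(parts) > 1:
        tokens.append("".join(parts))  # rejoined, de-obfuscated form
    return tokens

# tokenize_obfuscated("Via-gr-a") -> ["Via", "gr", "a", "Viagra"]
# tokenize_obfuscated("Viagra")   -> ["Viagra"]
```

Note that an unbroken word passes through unchanged, so nothing is 
double-counted for ordinary text.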

I just don't see how giving the Bayesian filter more data could be a 
bad thing.  (Also, I didn't think it worth mentioning earlier, but I've 
done quite a bit of work with Bayes and Markov theory in the past for a 
variety of problems, mostly in the realm of image processing, and have 
a firm grasp of statistical models, so you don't have to assume I'm 
just an uninformed random user. :)

>> This could be especially beneficial to, say, a doctor who actually
>> *does* handle legitimate email about certain prescription medications.
>> :)
>
> Since bogofilter distinguishes between the proper and improper
> spellings (based on its training), your doctor friend should be fine :-)

Well, it was just a random example - like, the presence of "Viagra" 
together with "Via" "gr" "a" is distinguishable from the presence of 
"Via" on its own (as in "Go to my house via Central Ave." or whatever).

Again, it's not duplicating data; it's providing additional, 
higher-level data derived from the same amount of input.

Intuitively I think that the multiple-tokenization idea might be worth 
exploring, and I was wondering if any explorations along these lines 
had been done already.  I'd be happy to code a patch to the lexer if so 
requested.

--
http://trikuare.cx/
More information about the bogofilter-dev mailing list