MIME content-type tokenization

Andras Salamon andras at dns.net
Mon Feb 23 11:06:35 CET 2004


On Sun, Feb 22, 2004 at 11:31:50PM -0700, fluffy wrote:
> Yes, I understand that.  What I was referring to was the Content-Type 
> MIME header itself - it is currently tokenized as separate words, when 
> semantically it's a single unit of information, which is more useful 
> for classification when it's kept together in a single token.

Bogofilter already makes some decisions based on high-level semantics:
IIRC some header fields (Received: and Message-Id:) are not tokenized,
and large attachments are excluded as well.  If we have already started
down the path of adding message semantics to the filtering process,
I don't see why we shouldn't make a few other useful tweaks like the one
suggested above.
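
To make the suggestion concrete, here is a rough sketch in Python (not
bogofilter's actual C lexer) of header tokenization that keeps the
Content-Type media type as one token.  The function name and the
"name:word" token format are made up for illustration only.

```python
from email import message_from_string


def content_type_aware_tokens(raw_message):
    """Return header tokens, keeping the Content-Type media type whole.

    Hypothetical helper: the "name:word" token format is an assumption
    for this sketch, not bogofilter's real token format.
    """
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        if name.lower() == "content-type":
            # get_content_type() returns e.g. "text/html", lowercased and
            # without parameters such as charset.
            tokens.append("content-type:" + msg.get_content_type())
        else:
            # All other header fields are still split into separate words.
            tokens.extend(f"{name.lower()}:{word}" for word in value.split())
    return tokens
```

With this, "Content-Type: text/html; charset=us-ascii" contributes a
single "content-type:text/html" token rather than several unrelated words.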

For instance, I have received exactly one piece of ham with a header
field of "Content-Type: text/html", while that header field occurs in
around 17K spams to date.  Ideally I would like to be able to weight
this fact more heavily than the fact that most spam with this header
field now contains reams of random ham-like English text tokens and a
small number of encoded URLs pointing to a web site with the actual
spam content.

However, to avoid feature bloat, it would probably be better to work
out how we can integrate existing tools that do semantic analysis with
bogofilter's pure token classification approach.  For example, a standard
set of procmail filters could take care of high-level classification,
stripping out header fields that should not be tokenized and filing
messages from whitelisted senders, and then the rest could be passed on
to bogofilter for pure token analysis.
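
As a rough illustration of that division of labour, here is a small
Python stand-in for the preprocessing stage (in practice this would be
procmail recipes).  The whitelist address, the function name and the
return convention are all invented for the sketch; the skipped header
fields are the two I mentioned above.

```python
from email import message_from_string
from email.utils import parseaddr

# Header fields to strip before token analysis (the two named earlier).
SKIPPED_HEADERS = ("received", "message-id")
# Hypothetical whitelist, purely for illustration.
WHITELIST = {"friend@example.org"}


def preprocess(raw_message):
    """Return ("whitelisted", None) or ("classify", stripped_text)."""
    msg = message_from_string(raw_message)
    sender = parseaddr(msg.get("From", ""))[1].lower()
    if sender in WHITELIST:
        # File directly; the token classifier never sees this message.
        return ("whitelisted", None)
    for name in SKIPPED_HEADERS:
        # Deleting removes every occurrence and is a no-op if absent.
        del msg[name]
    return ("classify", msg.as_string())
```

Whatever comes back tagged "classify" is what would then be piped on to
bogofilter.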

An ideal solution for me would be one that applied Bayesian techniques of
estimating spam/ham probabilities based on high-level semantic criteria as
well as on the occurrence of tokens.  Maybe what we need here is some
cluster analysis (which high-level criteria have a strongly significant
spam/ham bias?) whose results are then either fed automatically into a
Bayesian engine like bogofilter, or used as now in the manual process of
tweaking the codebase when such criteria are found.  Meta-bogofilter?
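
A very crude first step towards finding such criteria might look like
the sketch below: count how often each candidate criterion fires in
spam and in ham, and turn that into a smoothed spam bias.  The two
criteria, the function name and the add-one smoothing are my own
assumptions here, not bogofilter's actual calculation.

```python
from collections import Counter
from email import message_from_string

# Hypothetical high-level criteria: name -> predicate on a parsed message.
FEATURES = {
    "is-html-only": lambda m: m.get_content_type() == "text/html",
    "has-no-subject": lambda m: m.get("Subject") is None,
}


def feature_spamicity(corpus, features=FEATURES):
    """corpus: iterable of (is_spam, raw_message) pairs.

    Returns {feature name: score in (0, 1)}; scores near 1 lean spam,
    scores near 0 lean ham, scores near 0.5 carry little information.
    """
    spam_hits, ham_hits = Counter(), Counter()
    n_spam = n_ham = 0
    for is_spam, raw in corpus:
        msg = message_from_string(raw)
        if is_spam:
            n_spam += 1
        else:
            n_ham += 1
        for name, test in features.items():
            if test(msg):
                (spam_hits if is_spam else ham_hits)[name] += 1
    scores = {}
    for name in features:
        # Add-one smoothing so a criterion never seen in ham (or spam)
        # does not produce a probability of exactly 0 or 1.
        p_spam = (spam_hits[name] + 1) / (n_spam + 2)
        p_ham = (ham_hits[name] + 1) / (n_ham + 2)
        scores[name] = p_spam / (p_spam + p_ham)
    return scores
```

Criteria whose score ends up far from 0.5 would be the candidates worth
feeding into the classifier as extra tokens.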

-- Andras Salamon                   andras at dns.net




More information about the bogofilter-dev mailing list