The significance of word placement

Matthias Andree matthias.andree at gmx.de
Fri Oct 25 02:26:37 CEST 2002


On Thu, 24 Oct 2002, David Relson wrote:

> If I remember right, Mark Hoffman is working on some advanced tokenizing 
> features.  One part of the project is to generate compound tokens like 
> subject:betreff, from:xyz, etc.
> 
> An idea that just occurred to me is that the prefixes (like subject: or 
> from:) could be recognized and bogofilter could apply different weights 
> (importances) to such tokens.  I'm going to think out loud here for a 
> minute.

I'll have a hard time describing these in English, so feel free to ask
back with the proper terms to make sure we agree.

I'm thinking about constrained probabilities (not sure if that's the
correct term):

We currently look at single tokens. Given four tokens A B C D (in that
order), we only look at A, B, C and D individually. We could also look
at the pairs AB, BC, CD, and perhaps at the triples ABC and BCD. That
way, we may catch runs of "compound" tokens that are indicative of spam
even though the individual tokens that make up the compound don't hint
towards ham or spam.
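
To make the idea concrete, here is a minimal sketch in Python of how the
runs of adjacent tokens could be generated. It assumes the tokens arrive
as a list of strings; the function name adjacent_runs and the max_len
parameter are made up for illustration and are not part of bogofilter.

```python
def adjacent_runs(tokens, max_len=3):
    """Return every run of 1..max_len adjacent tokens.

    Hypothetical helper, not bogofilter code. Runs are listed by
    increasing length, and within one length in input order.
    """
    runs = []
    for n in range(1, max_len + 1):
        # A run of length n can start at positions 0 .. len(tokens) - n.
        for i in range(len(tokens) - n + 1):
            runs.append(" ".join(tokens[i:i + n]))
    return runs


print(adjacent_runs(["A", "B", "C", "D"]))
# ['A', 'B', 'C', 'D', 'A B', 'B C', 'C D', 'A B C', 'B C D']
```

Each run would then be counted like a single token is today, so the
number of distinct entries grows with max_len.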

-- 
Matthias Andree




More information about the bogofilter-dev mailing list