The significance of word placement
David Relson
relson at osagesoftware.com
Fri Oct 25 01:55:25 CEST 2002
At 05:47 PM 10/24/02, Boris 'pi' Piwinger wrote:
>Hi!
>
>Most of the spam not caught by bogofilter for me is in
>German. A significant portion of that is "women" asking for
>"dates". Strange enough the subject is "Betreff" (which
>means: subject or topic). So this is a 100 percenter in my
>filter. Bogofilter often misses it, though.
>
>Would it be significant if a word shows up in the header or
>even the subject, bogofilter would know better. I have
>virtually hundreds of those messages with that exact
>subject. None of which is ham.
>
>pi
pi,
If I remember right, Mark Hoffman is working on some advanced tokenizing
features. One part of the project is to generate compound tokens like
subject:betreff, from:xyz, etc.
An idea that just occurred to me is that the prefixes (like subject: or
from:) could be recognized and bogofilter could apply different weights
(importances) to such tokens. I'm going to think out loud here for a minute.
Suppose the special tokens were very easy to recognize, perhaps by a special
first character, like a colon. With a colon, a special token would
look like :subject:betreff. In the config file, there could be a list of
prefixes for special tokens and the weights to assign to such
tokens. An entry might look like "special_token: subject 3", which would
mean to give tokens like ":subject:betreff" triple importance when figuring
out the spamicity.
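To make the config idea concrete, here is a rough sketch in Python. It is for illustration only and is not bogofilter code. The "special_token: <prefix> <weight>" entry format comes from the example above; the function names parse_special_tokens() and token_weight(), the default weight of 1, and the "from 2" entry are assumptions made up for the sketch. How the weight would actually enter the spamicity calculation is left open.

```python
def parse_special_tokens(config_lines):
    """Collect 'special_token: <prefix> <weight>' entries into a dict.

    Hypothetical helper: the entry format follows the example in the post,
    everything else is an assumption.
    """
    weights = {}
    for line in config_lines:
        line = line.strip()
        if not line.startswith("special_token:"):
            continue
        fields = line[len("special_token:"):].split()
        if len(fields) != 2:
            continue  # this sketch silently skips malformed entries
        prefix, weight = fields
        weights[prefix.lower()] = float(weight)
    return weights


def token_weight(token, weights, default=1.0):
    """Return the configured weight for ':prefix:word' tokens, else default."""
    if not token.startswith(":"):
        return default  # ordinary token, no special weight
    prefix, sep, word = token[1:].partition(":")
    if not sep or not word:
        return default  # leading colon but not of the form ':prefix:word'
    return weights.get(prefix.lower(), default)


config = ["special_token: subject 3", "special_token: from 2"]
weights = parse_special_tokens(config)
print(token_weight(":subject:betreff", weights))  # 3.0
print(token_weight("betreff", weights))           # 1.0
```

Keeping the weights in a plain prefix-to-number table means a site can add or drop header fields without touching the tokenizer.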
The special first character would allow the query_needs_special_handling()
function to very quickly determine whether a token _might_ or _might_not_
need further checking for special handling. Using the config file to store
the weights would allow each bogofilter site to select which header fields are
of interest to them and how much importance to give them. It would even be
possible for the config file entry to have two weights - one to apply to
probabilities from the spam word list and one to apply to probabilities from
the non-spam list.
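The quick check and the two-weight variant might look something like the following Python sketch. The name query_needs_special_handling() is the one used above; the PREFIX_WEIGHTS table, the list_weights() helper and all the numbers are hypothetical. The sketch only looks the weights up. It does not say how they would be combined with the probabilities.

```python
SPECIAL_MARK = ":"  # assumed marker character, as suggested in the post

# prefix -> (weight for the spam list, weight for the non-spam list)
# Illustrative values only.
PREFIX_WEIGHTS = {
    "subject": (3.0, 1.0),
    "from": (2.0, 2.0),
}

NO_WEIGHT = (1.0, 1.0)


def query_needs_special_handling(token):
    """Cheap first-character test.

    False means the token is definitely ordinary; True only means it
    *might* be special and deserves the slower prefix lookup.
    """
    return token.startswith(SPECIAL_MARK)


def list_weights(token):
    """Return (spam_weight, nonspam_weight) for a token."""
    if not query_needs_special_handling(token):
        return NO_WEIGHT  # fast path taken by the vast majority of tokens
    prefix, sep, word = token[1:].partition(":")
    if not sep or not word:
        return NO_WEIGHT  # leading colon, but not ':prefix:word'
    return PREFIX_WEIGHTS.get(prefix, NO_WEIGHT)


print(list_weights(":subject:betreff"))  # (3.0, 1.0)
print(list_weights("betreff"))           # (1.0, 1.0)
```

Because ordinary tokens fail the first-character test, they never reach the prefix parsing or the table lookup.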
Anyhow, the above ideas are what your message brings to my mind.
Comments anyone?
David