token classification
Carlos Paz
capaz at iote.net
Thu May 15 20:53:10 CEST 2003
Hello!
I'm new to this project, but after reading the documentation and
checking the list of tokens stored in the good and spam db files I
created, I feel that some important information doesn't get extracted
by the current tokenizing model, so I have some suggestions that I'd
like to share with you:
It would be great if the tokenizing functions could create classes of
tokens (by a prefix, maybe), adding more meaning to the tokens taken
from the headers of the message.
For example, if the message hop list reports a server like
smtp03.spammer.com, it would be nice to produce tokens like:
hop:smtp03.spammer.com
hop:spammer.com
A header like "From: ... <name@spamdomain.com> ..." could generate
tokens like:
from:name@spamdomain.com
from:name
from:spamdomain.com
Same with subject, cc, to, etc.
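As a rough sketch of the prefixing idea (Python is used only for illustration; hop_tokens and address_tokens are hypothetical names, not bogofilter functions, and taking the last two labels as the parent domain is a simplifying assumption that fails for names like example.co.uk):

```python
def hop_tokens(host):
    """Tokens for one relay host: the full name plus its parent domain."""
    host = host.lower()
    tokens = ["hop:" + host]
    # Assumption: the parent domain is the last two labels of the host name.
    parent = ".".join(host.split(".")[-2:])
    if parent != host:
        tokens.append("hop:" + parent)
    return tokens


def address_tokens(prefix, address):
    """Tokens for an address header: whole address, local part, domain."""
    address = address.lower()
    local, sep, domain = address.partition("@")
    tokens = [prefix + ":" + address]
    if sep and local and domain:
        tokens.append(prefix + ":" + local)
        tokens.append(prefix + ":" + domain)
    return tokens


print(hop_tokens("smtp03.spammer.com"))
print(address_tokens("from", "name@spamdomain.com"))
```

The same address_tokens call would serve "to" and "cc" by changing the prefix argument.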
The thing is that the header sections are semantically very meaningful
about the origin of a message (after all, they were the basis of the
first-generation spam filters), and mixing them with the body of the
message feels like losing that meaning statistically.
I would go further by assigning different weights to different classes
of tokens, i.e., if a body token has a weight of 1, I would give a
"subject" or "from" token a weight of 2, or 1.5 at least (we could play
with this to find the best results). Weights could be implemented
easily by multiplying the token occurrences (n) by a factor based on
the class the token belongs to.
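A minimal sketch of the multiplying factor, assuming tokens carry a "class:" prefix as suggested above. The weight values are just the untuned examples from this message, and CLASS_WEIGHTS, token_weight and weighted_counts are hypothetical names, not part of bogofilter:

```python
from collections import Counter

# Example weights only; they would have to be tuned by experiment.
CLASS_WEIGHTS = {"subject": 2.0, "from": 1.5}
BODY_WEIGHT = 1.0


def token_weight(token):
    """Return the factor for the token's class, or the body weight."""
    prefix, sep, _ = token.partition(":")
    if sep and prefix in CLASS_WEIGHTS:
        return CLASS_WEIGHTS[prefix]
    return BODY_WEIGHT


def weighted_counts(tokens):
    """Multiply each token's occurrence count (n) by its class weight."""
    counts = Counter(tokens)
    return {tok: n * token_weight(tok) for tok, n in counts.items()}


print(weighted_counts(["subject:free", "subject:free", "free", "from:name"]))
```

One caveat with this sketch: a body word that happens to start with "from:" or "subject:" would be weighted as a header token, so a real tokenizer would need to keep the class separate from the token text or escape such words.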
The current tokenizing functions don't store an e-mail address as such;
it gets split at the '@'. I think that's good, but I would add the
entire address as a token too, since it's much easier for a spammer to
forge "some_random_name@mydomain.com" than "laura@mydomain.com", from
whom I've never received spam (heck, if she sends me some spam, that
must be worth looking at).
With split tokens only, the spammer might even get lucky if his address
has "laura" in it or, even easier, if "laura" is in the message text
and "mydomain.com" is already in my databases, marked with a neutral
probability.
I think that using entire e-mail addresses as tokens enables
whitelisting and blacklisting in the traditional sense, but with the
more powerful self-adapting Bayesian approach.
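To illustrate the forging argument, here is a small sketch that compares split-only tokens with split tokens plus the whole address. The function names are hypothetical, and split_tokens only approximates the current behaviour as I described it above, not the actual bogofilter code:

```python
def split_tokens(address):
    """Approximation of the described behaviour: split at '@'."""
    local, _, domain = address.lower().partition("@")
    return {local, domain}


def tokens_with_whole_address(address):
    """Proposed variant: the split parts plus the entire address."""
    return split_tokens(address) | {address.lower()}


def message_tokens(sender, body_words, tokenizer):
    """Tokens of a toy message: sender tokens plus lowercased body words."""
    return tokenizer(sender) | {word.lower() for word in body_words}


trusted = "laura@mydomain.com"
forged_sender = "some_random_name@mydomain.com"
body = ["Hi", "Laura"]

# Split only: the forged message contains every token of the trusted sender.
forged = message_tokens(forged_sender, body, split_tokens)
print(split_tokens(trusted) <= forged)

# With whole addresses: the trusted address token is absent from the forgery.
forged = message_tokens(forged_sender, body, tokens_with_whole_address)
print(trusted in forged)
```

In this toy case the forged message reproduces both of the trusted sender's split tokens, while the whole-address token can only be matched by forging the exact address.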
I would add a final suggestion about token prefixing: I'd try to use
the smallest possible prefixes ("f" for "from", "s" for "subject",
etc.), since this has a direct impact on the database sizes.
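A short sketch of the prefix shortening and the bytes it would save per stored token. Only "f" and "s" come from the suggestion above; the other letters, and the names SHORT_PREFIXES, shorten and bytes_saved, are made up for illustration:

```python
# "f" and "s" are from the text; "t", "c" and "h" are assumed for illustration.
SHORT_PREFIXES = {"from": "f", "subject": "s", "to": "t", "cc": "c", "hop": "h"}


def shorten(token):
    """Replace a long class prefix with its one-letter form, if it has one."""
    prefix, sep, rest = token.partition(":")
    if sep and prefix in SHORT_PREFIXES:
        return SHORT_PREFIXES[prefix] + ":" + rest
    return token


def bytes_saved(tokens):
    """Total bytes saved by shortening the prefixes of the given tokens."""
    return sum(len(t.encode()) - len(shorten(t).encode()) for t in tokens)


print(shorten("from:name@spamdomain.com"))
print(bytes_saved(["from:a", "subject:b", "x"]))
```

The saving is per distinct token stored, so it grows with the number of header tokens in the database rather than with the number of messages.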
I believe in this project; it has a lot of potential, and I would be
glad to contribute to it as much as I can.
I'd really like to read your thoughts about this subject.