token classification

Thu May 15 20:53:10 CEST 2003

Hello!

I'm new to this project, but after reading the documentation and checking 
the list of tokens stored in the good and spam db files I created, I feel 
that some important information isn't extracted by the current 
tokenizing model, so I have some suggestions that I'd like to share with 
you:

It would be great if the tokenizing functions could create classes of 
tokens (by a prefix, maybe), adding more meaning to the tokens taken from 
the message headers.

For example, if the message hop list reports a server like 
smtp03.spammer.com, it would be nice to produce tokens like:

hop:smtp03.spammer.com
hop:spammer.com

A header like "From: ... <name at spamdomain.com> ..." could generate 
tokens like:

from:name at spamdomain.com
from:name
from:spamdomain.com

Same with subject, cc, to, etc.
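
To make the idea concrete, here is a minimal Python sketch of such 
prefixed header tokens. It is only an illustration, not the project's 
tokenizer: the function names (domain_suffixes, hop_tokens, 
address_tokens) are made up for this example, and the address is written 
with a literal "@".

```python
def domain_suffixes(host):
    """Return the host plus its parent domains, stopping at two labels."""
    labels = host.lower().strip(".").split(".")
    # For a.b.c this yields a.b.c and b.c; a bare name yields itself.
    return [".".join(labels[i:]) for i in range(max(len(labels) - 1, 1))]


def hop_tokens(host):
    """Tokens for one server found in the message hop list."""
    return [f"hop:{domain}" for domain in domain_suffixes(host)]


def address_tokens(prefix, address):
    """Tokens for an address header: whole address, local part, domain."""
    address = address.lower()
    tokens = [f"{prefix}:{address}"]
    local, sep, domain = address.partition("@")
    if sep:  # only split when there really is an '@'
        tokens.append(f"{prefix}:{local}")
        tokens.append(f"{prefix}:{domain}")
    return tokens


print(hop_tokens("smtp03.spammer.com"))
print(address_tokens("from", "name@spamdomain.com"))
```

The same address_tokens call would serve "to" and "cc" by changing the 
prefix argument.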

The thing is that the header sections are semantically very meaningful 
about where a message comes from (after all, they were the basis of the 
first-generation spam filters), and mixing them with the body of the 
message feels like losing that meaning statistically.

I would go further and assign different weights to different classes of 
tokens, i.e., if a body token has a weight of 1, I would give a 
"subject" or "from" token a weight of 2, or 1.5 at least (we could play 
with this to find the best results). Weights could be implemented easily 
by multiplying the token occurrences (n) by a factor based on the class 
the token belongs to.
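
A rough Python sketch of that weighting follows. The weights shown (2 for 
"from", 1.5 for "subject", 1 for everything else) are just the starting 
values suggested above, and the names CLASS_WEIGHTS, token_class and 
weighted_counts are hypothetical, not part of the existing code.

```python
# Hypothetical per-class multipliers; unprefixed tokens count as body tokens.
CLASS_WEIGHTS = {"from": 2.0, "subject": 1.5}
BODY_WEIGHT = 1.0


def token_class(token):
    """Return the class named by the token's prefix, or 'body'."""
    prefix, sep, _ = token.partition(":")
    return prefix if sep and prefix in CLASS_WEIGHTS else "body"


def weighted_counts(tokens):
    """Accumulate occurrences (n), scaling each one by its class weight."""
    counts = {}
    for token in tokens:
        weight = CLASS_WEIGHTS.get(token_class(token), BODY_WEIGHT)
        counts[token] = counts.get(token, 0.0) + weight
    return counts


message_tokens = ["from:laura", "subject:hi", "viagra", "viagra"]
print(weighted_counts(message_tokens))
```

Keeping the multipliers in one table makes it easy to experiment with 
different values when looking for the best results.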

The current tokenizing functions don't store an e-mail address as such; 
it gets split at the '@'. I think that's good, but I would add the entire 
address as a token too, since it's much easier for a spammer to forge a 
sender like "some_random_name at mydomain.com" than "laura at mydomain.com", 
from whom I've never received spam (heck, if she sends me some spam, it 
must be worth looking at).

With split tokens only, the spammer might even get lucky if his address 
has "laura" in it or, even easier, if "laura" appears in the message text 
and "mydomain.com" is already in my databases with a neutral probability.

I think that using entire e-mail addresses as tokens enables whitelisting 
and blacklisting in the traditional sense, but with the more powerful 
self-adapting Bayesian approach.
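
The forging scenario can be shown with a small Python sketch. It assumes 
a simplified tokenizer of my own (the tokenize function and the 
ADDRESS_RE pattern are illustrative, not the project's code) and writes 
addresses with a literal "@".

```python
import re

# Deliberately simple address pattern, good enough for the illustration.
ADDRESS_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize(text, keep_whole_address=True):
    """Split addresses at '@' and optionally keep the whole address too."""
    tokens = set()
    for address in ADDRESS_RE.findall(text):
        local, _, domain = address.lower().partition("@")
        tokens.update((local, domain))
        if keep_whole_address:
            tokens.add(f"{local}@{domain}")
    # Tokenize the rest of the text as plain lowercase words.
    remaining = ADDRESS_RE.sub(" ", text)
    tokens.update(word.lower() for word in re.findall(r"[A-Za-z0-9_]+", remaining))
    return tokens


genuine = "From: laura@mydomain.com"
forged = "From: some_random_name@mydomain.com\nlaura says hi"

# With split tokens only, the forged message yields both trusted parts.
print({"laura", "mydomain.com"} <= tokenize(forged, keep_whole_address=False))
# The whole-address token appears only in the genuine message.
print("laura@mydomain.com" in tokenize(genuine))
print("laura@mydomain.com" in tokenize(forged))
```

The split tokens cannot tell the two messages apart, while the 
whole-address token can.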

I would add a final suggestion about token prefixing: I'd try to use the 
shortest possible prefixes ("f" for "from", "s" for "subject", etc.), 
since prefix length has a direct impact on the database sizes.
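
As a rough illustration of the saving, this Python sketch maps long 
prefixes to one-letter ones and compares the total key size. Only "f" and 
"s" come from the suggestion above; "h" for "hop" and the names 
SHORT_PREFIXES, shorten and key_bytes are assumptions for the example, and 
real database overhead per record is ignored.

```python
# 'f' and 's' are from the suggestion; 'h' for 'hop' is an assumption.
SHORT_PREFIXES = {"from": "f", "subject": "s", "hop": "h"}


def shorten(token):
    """Replace a known long prefix with its one-letter form."""
    prefix, sep, rest = token.partition(":")
    if sep and prefix in SHORT_PREFIXES:
        return f"{SHORT_PREFIXES[prefix]}:{rest}"
    return token  # body tokens and unknown prefixes stay unchanged


def key_bytes(tokens):
    """Total size of the token keys in bytes (keys only)."""
    return sum(len(token.encode("utf-8")) for token in tokens)


tokens = ["from:name", "subject:hello", "hop:spammer.com", "viagra"]
short = [shorten(token) for token in tokens]
print(short)
print(key_bytes(tokens) - key_bytes(short), "bytes saved")
```

The saving per token is small, but it applies to every distinct header 
token stored.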

I believe in this project; it has a lot of potential, and I would be glad 
to contribute to it as much as I can.

I'd really like to read your thoughts about this subject.