token classification

Thu May 15 22:11:19 CEST 2003

On 20030515 (Thu) at 1353:10 -0500, Carlos Paz wrote:
> Hello!
> 
> I'm new to this project, but reading the documentation and checking the 
> list of tokens stored in the good and spam db files I created, I feel that 
> there is some important information that doesn't get extracted by the 
> current tokenizing model, so I have some suggestions that I'd like to 
> share with you:

Welcome!

> It would be great if the tokenizing functions could create classes of 
> tokens, (by a prefix maybe), adding more meaning to the tokens on the 
> header of the message.

That is currently done for Subject:, and the code in CVS adds tags for
To:, From: and Return-Path: as well, those being the headers Paul Graham
identified as useful (http://www.paulgraham.com/better.html).  Experiments
that David (the project leader) and I have done confirm that tagging
these headers noticeably improves discrimination.
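
As a rough illustration of what header tagging means here, a minimal
Python sketch follows.  It is not the project's code: the tag strings,
the HEADER_TAGS table, the WORD pattern and the tokenize() function are
all made up for the example, and it only handles single-part messages.

```python
import re
from email import message_from_string

# Hypothetical tag prefixes; the tags actually used in the project may differ.
HEADER_TAGS = {
    "Subject": "subj:",
    "From": "from:",
    "To": "to:",
    "Return-Path": "rtrn:",
}

# Deliberately simple word pattern (an assumption, not the real lexer).
WORD = re.compile(r"[A-Za-z0-9]+")


def tokenize(raw):
    """Return body tokens as-is and header tokens with a class prefix."""
    msg = message_from_string(raw)
    tokens = []
    for header, tag in HEADER_TAGS.items():
        for value in msg.get_all(header, []):
            tokens += [tag + w.lower() for w in WORD.findall(value)]
    body = msg.get_payload()
    if isinstance(body, str):  # multipart messages are skipped in this sketch
        tokens += [w.lower() for w in WORD.findall(body)]
    return tokens


raw = "From: laura@mydomain.com\nSubject: Free offer\n\nfree lunch today\n"
tokens = tokenize(raw)
```

With tagging, "free" in a Subject: line and "free" in the body end up
as two distinct tokens, so each can carry its own spam probability.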

> I would go further by assigning different weights to different classes of 
> tokens; e.g., if a body token has a weight of 1, I would give a 
> "subject" or "from" token a weight of 2 or 1.5 at least (we could play 
> with this to find the best results). Weight could be implemented easily 
> by adding a multiplying factor to the token occurrences (n) based on the 
> class it belongs to.

Why do you think this would help?  I don't say it wouldn't, but I'd be
interested in the theoretical justification for such a proposal.
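
To pin down what is being proposed, it could be sketched roughly as
below (Python).  The CLASS_WEIGHTS table, the tag prefixes and the
weighted_counts() function are hypothetical, and the weights are just
the 2 and 1.5 floated above, not values anyone has tested.

```python
from collections import Counter

# Hypothetical per-class multipliers; untagged (body) tokens count as 1.
CLASS_WEIGHTS = {"subj:": 2.0, "from:": 1.5}


def weighted_counts(tokens):
    """Accumulate token occurrences (n), scaled by the token's class weight."""
    counts = Counter()
    for tok in tokens:
        weight = 1.0
        for prefix, w in CLASS_WEIGHTS.items():
            if tok.startswith(prefix):
                weight = w
                break
        counts[tok] += weight
    return counts


counts = weighted_counts(["subj:free", "free", "free", "from:laura"])
```

Here one occurrence of "subj:free" counts as much as two occurrences
of the body token "free".  Whether that helps discrimination is
exactly the open question.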

> The current tokenizing functions don't store an e-mail address as such; 
> it gets split at '@'. I think that's good, but I would add the entire 
> address as a token too, since it's easier for a spammer to forge as 
> "some_random_name at mydomain.com", but not as "laura at mydomain.com", from 
> whom I've never received SPAM (heck, if she sends me some SPAM that must 
> be worth looking at).
> 
> The spammer might even get lucky if his address has "laura" in it, or 
> even easier, if "laura" is in the message text and "mydomain.com" is 
> already in my databases, marked with a neutral probability.

It might make a difference; if you were to code it and try it, I'd be
very interested in the results.
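
For anyone wanting to try it, the idea might look something like this
in Python.  This is only a sketch under assumptions: the ADDRESS
pattern is a crude approximation of an e-mail address, and
address_tokens() is a hypothetical helper, not part of the existing
tokenizer.

```python
import re

# Crude address pattern, good enough for a sketch (an assumption).
ADDRESS = re.compile(r"[A-Za-z0-9._+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def address_tokens(text):
    """Emit the two halves split at '@', plus the whole address as a token."""
    tokens = []
    for addr in ADDRESS.findall(text):
        addr = addr.lower()
        local, domain = addr.split("@", 1)
        tokens += [local, domain, addr]
    return tokens


result = address_tokens("From: Laura@mydomain.com")
```

The two halves are kept, as now; the only change is the extra
whole-address token, which a message merely mentioning "laura" and
"mydomain.com" separately would not produce.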

> I would add a final suggestion about token prefixing: I'd try to use the 
> smallest prefixes ("f" for "from", "s" for "subject", etc.), since this 
> has a direct impact on the databases sizes.

Not a very great impact.  I think the extra three bytes are worth
spending in the interest of legibility.

0fht:
total 19476
-rw-r--r--    1 root     root     10264576 May 14 12:24 goodlist.db
-rw-r--r--    1 root     root      9646080 May 14 12:24 spamlist.db
2fHt:
total 21148
-rw-r--r--    1 root     root     11190272 May 14 12:43 goodlist.db
-rw-r--r--    1 root     root     10432512 May 14 12:43 spamlist.db

Header tagging with four-byte tags caused less than a ten-percent
increase in size.  Of this, most is due to the fact that with header
tagging, tokens that occur in both bodies and headers appear twice or
more, instead of just once, in the wordlists.
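
The percentages can be checked from the byte counts in the listings.
A small Python sketch, assuming the first listing is the run without
header tagging and the second the run with it (the SIZES table and
percent_increase() are just names for this example):

```python
# Sizes in bytes, copied from the two directory listings above:
# (without header tagging, with header tagging).
SIZES = {
    "goodlist.db": (10264576, 11190272),
    "spamlist.db": (9646080, 10432512),
}


def percent_increase(before, after):
    """Relative growth from before to after, in percent."""
    return 100.0 * (after - before) / before


for name, (before, after) in SIZES.items():
    print(f"{name}: {percent_increase(before, after):.1f}%")

total_before = sum(b for b, _ in SIZES.values())
total_after = sum(a for _, a in SIZES.values())
print(f"both: {percent_increase(total_before, total_after):.1f}%")
```

That works out to about 9.0% for goodlist.db, 8.2% for spamlist.db and
8.6% for the two together.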

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |