Crm114 style context matching. Phrases and partial phrases.

Greg Louis glouis at dynamicro.on.ca
Sat May 17 23:23:27 CEST 2003


On 20030517 (Sat) at 1637:45 -0400, David Relson wrote:

> Any single message won't have W*W tokens.  The concern is more that, 
> over time, the overall corpora will use W*W tokens and give an overly large 
> wordlist.
> 
Yup.  To give you an idea:
-rw-r--r--    1 spamtest users    19558400 May 17 09:25 spamlist.db
-rw-r--r--    1 spamtest users   466722816 May 17 17:17 spamlist.db

The first line is the normal spamlist.db from 11,000 spams; ca 18.7 MB.
The second is from the same 11,000 spams with phrases, n=4; ca 445 MB,
i.e. roughly 24 times larger.

This is why CRM114 trains only on errors; we'll have to learn how to
balance training db size against superior discrimination, if phrases
turn out to make a big difference.
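
To make the growth concrete, here is a rough Python sketch of how phrase
tokens multiply the number of distinct wordlist entries. It is only an
illustration under simple assumptions: phrase_tokens() is a hypothetical
helper that emits every run of 1 to n consecutive words, which is neither
bogofilter's actual lexer nor CRM114's exact windowing scheme, and the
three sample messages are made up. The last line just computes the ratio
of the two file sizes from the listing above.

```python
def phrase_tokens(words, n=4):
    """Yield every run of 1..n consecutive words as a single token.

    Hypothetical helper for illustration; not bogofilter's lexer.
    """
    for start in range(len(words)):
        for length in range(1, n + 1):
            if start + length > len(words):
                break  # phrase would run past the end of the message
            yield " ".join(words[start:start + length])


# Made-up sample messages, standing in for a spam corpus.
messages = [
    "buy cheap pills online now",
    "cheap pills shipped online today",
    "buy pills online now cheap",
]

single = set()   # distinct single-word tokens
phrases = set()  # distinct tokens when phrases up to n=4 are included
for msg in messages:
    words = msg.split()
    single.update(words)
    phrases.update(phrase_tokens(words, n=4))

# Growth in distinct tokens for this toy corpus.
growth = len(phrases) / len(single)

# Ratio of the two spamlist.db sizes (bytes) from the ls listing.
size_ratio = 466722816 / 19558400

print(len(single), len(phrases), growth, size_ratio)
```

Single words recur across messages far more often than longer phrases
do, so the phrase set keeps growing as the corpus grows while the
single-word set levels off.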

I'm about halfway through building the corresponding goodlist.db; with
luck, I should have the test results later this evening.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



