Crm114 style context matching. Phrases and partial phrases.
Greg Louis
glouis at dynamicro.on.ca
Sat May 17 23:23:27 CEST 2003
On 20030517 (Sat) at 1637:45 -0400, David Relson wrote:
> Any single message won't have W*W tokens. The concern is more that,
> over time, the overall corpora will use W*W tokens and give an overly large
> wordlist.
>
Yup. To give you an idea:
-rw-r--r-- 1 spamtest users 19558400 May 17 09:25 spamlist.db
-rw-r--r-- 1 spamtest users 466722816 May 17 17:17 spamlist.db
The first line is a normal spamlist.db from 11,000 spams; ca. 18.7 MB.
The second is from the same 11,000 spams with phrases, n=4; ca. 445 MB,
roughly 24 times larger.
This is why CRM114 trains only on errors; we'll have to learn how to
manage the tradeoff between manageable training db size and superior
discrimination, if phrases turn out to make a big difference.
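To illustrate why the wordlist grows so fast, here is a minimal sketch of phrase tokenization. It is not bogofilter's or CRM114's actual code: the function name phrase_tokens, the sample text, and the choice of contiguous phrases only (no partial phrases with skipped words) are assumptions made for illustration.

```python
def phrase_tokens(words, n=4):
    """Yield every contiguous phrase of 1..n words (hypothetical helper).

    With n=1 this is plain single-word tokenization; with n=4 each word
    position can contribute up to four tokens.
    """
    for i in range(len(words)):
        for k in range(1, n + 1):
            if i + k <= len(words):
                yield " ".join(words[i:i + k])


# Made-up sample text; real corpora will behave differently in detail.
words = "buy cheap pills online now buy cheap watches online now".split()

singles = set(phrase_tokens(words, n=1))   # distinct single words
phrases = set(phrase_tokens(words, n=4))   # distinct words plus phrases

print(len(singles), "distinct tokens with n=1")
print(len(phrases), "distinct tokens with n=4")
```

Repeated words collapse into one wordlist entry, but the phrases around them mostly do not, so the number of distinct tokens keeps climbing as the corpus grows.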
I'm about half way through building the corresponding goodlist.db; with
luck, I should have the test results later this evening.
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |