Crm114 style context matching. Phrases and partial phrases.
Greg Louis
glouis at dynamicro.on.ca
Sun May 18 13:12:18 CEST 2003
On 20030517 (Sat) at 1839:11 -0400, Greg Louis wrote:
> Right now my priority is to discover whether there is or is not a major
> improvement in discrimination with the introduction of phrases. If
> there be such, then it's worth learning how to obtain it at the least
> cost. If not, then we needn't bother optimizing. I'm therefore not
> training on error, but just building a training db with 11,000 spams
> and 11,000 nonspams, not worrying about efficiency at this stage. I
> expect to finish training and start on testing within the next 60
> minutes; I am hoping the testing will go much faster than the training
> did.
>
The first experiment failed, at least in part because of human error.
A quick check shows that the method used to build the training
databases was flawed: .MSG-COUNT came out zero (I should have realized
that would happen). I will need to dump the (huge) training db, set
.MSG-COUNT manually, and then reload (which should cut the size a bit too).
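
The manual fix could be sketched as a filter over a text dump of the
database. This is only an illustration: it assumes the dump holds one
whitespace-separated `token count` record per line (possibly with extra
trailing fields), and the function name, the sample tokens, and their
counts are made up here rather than taken from bogofilter.

```python
def set_msg_count(lines, msg_count, key=".MSG-COUNT"):
    """Return dump lines with the message-count record set to msg_count.

    Assumed (hypothetical) dump format: 'token count [extra fields...]'.
    If no record for `key` exists, one is appended.
    """
    out = []
    seen = False
    for line in lines:
        fields = line.split()
        if fields and fields[0] == key:
            # Replace the count, keep any extra trailing fields untouched.
            fields = [key, str(msg_count)] + fields[2:]
            out.append(" ".join(fields))
            seen = True
        else:
            out.append(line.rstrip("\n"))
    if not seen:
        out.append(f"{key} {msg_count}")
    return out


# Made-up sample records, for illustration only.
dump = ["free 812", ".MSG-COUNT 0", "unsubscribe 4031"]
fixed = set_msg_count(dump, 11000)
```

The corrected lines would then be fed back to whatever tool reloads the
database from text.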

Database size is a _major_ potential problem. With PIPE_SIZE 4, the
spamlist grows by a factor of about 25 and the goodlist by about 10
compared with the lists built from single tokens alone. Total size,
with 11,000 spams and 11,000 nonspams, is about 3/4 GB.
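
To see why the lists grow so quickly, here is a minimal sketch of
phrase generation over a sliding window. It assumes each token is
emitted on its own and also joined to the 1, 2, ... PIPE_SIZE-1 tokens
immediately before it; the scheme actually used in this experiment, and
CRM114's own feature set, may well differ. The function name and the
sample sentence are hypothetical.

```python
PIPE_SIZE = 4  # window length, as in the experiment described above


def phrases(tokens, pipe_size=PIPE_SIZE, sep=" "):
    """Yield every run of 1..pipe_size consecutive tokens ending at each position."""
    for end in range(len(tokens)):
        for n in range(1, pipe_size + 1):
            start = end - n + 1
            if start < 0:
                break  # not enough preceding tokens for a longer phrase
            yield sep.join(tokens[start:end + 1])


tokens = "click here to unsubscribe now".split()
features = list(phrases(tokens))
```

Under this assumption each token contributes up to PIPE_SIZE entries
instead of one. Longer phrases also repeat far less often than single
words, so the number of distinct keys in the database can rise faster
than the raw feature count.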

I'll report back when the training db has been fixed and the tests
rerun.
--
| G r e g L o u i s | gpg public key: finger |
| http://www.bgl.nu/~glouis | glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |