Crm114-like Phrases and partial phrases; database size

Mon May 19 13:50:21 CEST 2003

On 20030518 (Sun) at 1914:41 -0400, Greg Louis wrote:

> > Database size is a _major_ potential problem.  With PIPE_SIZE 4, the
> > spamlist grows by a factor of about 25 and the goodlist by about 10 in
> > comparison to the lists built with just single tokens.  Total size,
> > with 11,000 spams and 11,000 nonspams, is about 3/4 Gb.
> 
> This remains true.

The database sizes quoted above are what one sees after training
directly from mbox files of spam and nonspam.  A bogofilter training db
can be shrunk significantly by dumping it and reloading it into a fresh
file:

$ bogoutil -d goodlist.db | bogoutil -l goodlist.new
$ bogoutil -d spamlist.db | bogoutil -l spamlist.new
$ mv goodlist.new goodlist.db
$ mv spamlist.new spamlist.db

I did that with the databases used to compare normal and phrase-enabled
bogofilter, and got

$ ls -l normal phrase
normal:
total 50784
-rw-r--r--    1 spamtest users    32382976 May 19 07:22 goodlist.db
-rw-r--r--    1 spamtest users    19558400 May 19 07:23 spamlist.db

phrase:
total 523760
-rw-r--r--    1 spamtest users    213622784 May 19 07:19 goodlist.db
-rw-r--r--    1 spamtest users    322174976 May 19 07:15 spamlist.db

Overall, the training db with phrases is 511 MB, 10.3 times the size of
the one built with just single tokens.
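
For anyone who wants to check that arithmetic, here is a minimal Python
sketch that recomputes the totals and the ratio from the byte counts in
the listing above.  The variable and function names are mine, chosen for
illustration; only the four file sizes come from the listing.

```python
# File sizes in bytes, copied from the `ls -l normal phrase` listing.
normal = {"goodlist.db": 32382976, "spamlist.db": 19558400}
phrase = {"goodlist.db": 213622784, "spamlist.db": 322174976}

MIB = 1024 * 1024  # bytes per binary megabyte


def total_bytes(sizes):
    """Sum the sizes of all wordlist files in one training db."""
    return sum(sizes.values())


normal_total = total_bytes(normal)
phrase_total = total_bytes(phrase)
ratio = phrase_total / normal_total

print(f"normal: {normal_total / MIB:.1f} MB")
print(f"phrase: {phrase_total / MIB:.1f} MB")
print(f"phrase/normal: {ratio:.1f}")

# Per-list growth, to see which wordlist accounts for most of the increase.
for name in normal:
    print(f"{name}: {phrase[name] / normal[name]:.1f}x")
```

The per-list figures show that, after compaction, the spamlist still
grows considerably more than the goodlist.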

> Using phrases (PIPE_SIZE 4) reduced the false-negative count by 74.8
> percent.

Some people will consider the growth in database size undesirable
enough to outweigh the improvement in discrimination.  Throughput might
also become a problem, especially at larger installations.

We can improve both throughput and database size by using a single
wordlist instead of two; code to do that has already been written and
tested, though not with phrases.  The gains are only on the order of
ten percent, however.  In most corpora, very many tokens appear in only
one of the two wordlists, so the combined wordlist isn't much smaller
than the two separate ones put together.  And although token lookup is
much faster with a single list, lookup with two lists accounts for only
about 20% of the time needed to process a message; lexical analysis and
classification account for the rest.
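
The throughput side of that argument can be sketched in a few lines of
Python.  This is only an illustration: the 20% lookup share is the
figure given above, but the speedup factors are hypothetical values I
picked to show the shape of the curve, not measurements.

```python
def relative_time(lookup_share, lookup_speedup):
    """Per-message time relative to baseline when only lookup gets faster.

    lookup_share   -- fraction of baseline time spent in token lookup
    lookup_speedup -- factor by which lookup alone is accelerated
    """
    return (1.0 - lookup_share) + lookup_share / lookup_speedup


LOOKUP_SHARE = 0.20  # from the text: lookup is about 20% of the time

# Hypothetical speedups for the single-wordlist lookup (assumed values).
for speedup in (2, 5, 10):
    saved = 1.0 - relative_time(LOOKUP_SHARE, speedup)
    print(f"lookup {speedup}x faster: {saved:.0%} less time per message")

# Upper bound: even if lookup cost nothing, the other 80% remains.
best_case = 1.0 - relative_time(LOOKUP_SHARE, float("inf"))
print(f"lookup free: {best_case:.0%} less time per message")
```

Under those assumptions, no lookup improvement can save more than 20%
of the per-message time, and a twofold lookup speedup saves about 10%.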

FWIW, my own feeling is that we should continue to work on improving
single-token bogofilter for the present.  After release 1.0, we might
want to consider adding an option to use phrases.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |