Crm114-like Phrases and partial phrases; database size

Greg Louis glouis at dynamicro.on.ca
Tue May 20 00:45:28 CEST 2003


On 20030520 (Tue) at 0829:08 +1000, michael at optusnet.com.au wrote:
> Greg Louis <glouis at dynamicro.on.ca> writes:
> > On 20030518 (Sun) at 1914:41 -0400, Greg Louis wrote:
> > 
> > > > Database size is a _major_ potential problem.  [...]
> [...]
> > Some people will consider that the database size expansion is
> > sufficiently undesirable to outweigh the improvement in discrimination. 
> > Throughput might become a problem as well, especially for larger
> > installations.
> 
> I don't know if I count as a 'larger installation' or not (planning
> to use it to filter about 3 to 5 million emails per day), but some thoughts:
> 
> Given sufficient RAM, the drop in throughput should be proportional
> to the log of the database size. So a 10-fold increase in size
> should be only a 20% drop in throughput.

True, given sufficient RAM, and it is precisely the larger installations
(yes, I would imagine you qualify ;) that should be able to deal with that.
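
To put a rough number on the log argument, here is a small sketch.  It
assumes that per-lookup cost is exactly proportional to the log of the
number of database entries and that nothing else contributes to the
cost; both are simplifications, and the function name is made up for
illustration.  Under that model the size of the drop depends on how big
the database was to begin with:

```python
import math

def throughput_ratio(n_entries, growth=10.0):
    """Throughput after the database grows `growth`-fold, relative to before.

    Assumption: cost per lookup is proportional to log(n_entries), so
    throughput is proportional to 1 / log(n_entries).
    """
    return math.log(n_entries) / math.log(n_entries * growth)

# Hypothetical starting sizes, chosen only to show the trend.
for n in (10**4, 10**5, 10**6):
    drop = 1.0 - throughput_ratio(n)
    print(f"{n:>9,d} entries: about {drop:.0%} drop")
```

With 10,000 entries to start with, this model gives the 20% figure; a
larger starting database loses proportionally less from the same
10-fold growth.  Real numbers will depend on caching and disk access,
which the sketch ignores.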

> The other point I'd mention is that accuracy matters.

To you, to me, not to everybody when hardware muscle becomes an issue
(been there, done that, got the rotten tomatoes).

IMHO there is still work to be done in the single-token environment
that will yield significant improvement at much less cost.  I've said,
and I stand by it, that once 1.0 is a reality we ought to come back to
considering phrases, and I do think that if we take enough trouble to
understand the issues, we can bump bogofilter performance up another
notch.

-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |



