Ignore lists [was: Keeping the cruft out ...]

Thu Mar 4 22:58:21 CET 2004

On Thu, 4 Mar 2004 22:46:25 +0100 (MET)
Pavel Kankovsky wrote:

> On Wed, 3 Mar 2004, David Relson wrote:
> 
> > It was eventually realized that, since the ignore list was likely
> > pretty small, most all words would require _two_ searches when an
> > ignore list was used.
> 
> One could put everything into one db file but mark ignored entries:
> perhaps using a special value of counters (e.g. spam count = ham count
> = 0 or ~0UL) etc. The programs would recognize the mark and skip
> marked tokens. A single db lookup per token.

Hi Pavel,

It _could_ be done that way, but it seems messy.  The old code required
the existence of an ignore.db file (which could be built from a simple
text file using bogoutil).  An alternative would be to skip the db and
read that text file directly.
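
For illustration only, here is a minimal sketch of the text-file approach.
It is in Python rather than bogofilter's C, and it assumes a file format
of one token per line with '#' comments.  The names load_ignore_list and
filter_tokens are hypothetical, not bogofilter functions.

```python
import io


def load_ignore_list(lines):
    """Build an in-memory set of ignored tokens from an iterable of lines.

    Assumed (hypothetical) format: one token per line; blank lines and
    lines starting with '#' are skipped.
    """
    ignored = set()
    for line in lines:
        token = line.strip()
        if token and not token.startswith("#"):
            ignored.add(token)
    return ignored


def filter_tokens(tokens, ignored):
    """Drop ignored tokens before any database lookup happens."""
    return [t for t in tokens if t not in ignored]


# An in-memory file stands in for the plain text ignore list.
sample = io.StringIO("# tokens to ignore\nthe\nand\n\n")
ignored = load_ignore_list(sample)
kept = filter_tokens(["the", "viagra", "and", "meeting"], ignored)
```

Since the membership test happens in memory, each token that survives
the filter would still need only one db search.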

> There would be a small drawback: the list of ignored tokens would be
> lost whenever a db was rebuilt from scratch from a corpus. Perhaps
> a separate db of ignored tokens should be kept around but it would not
> be used until a new token was to be added to the main db (in this
> case, the program would check ignored.db to determine whether it
> should add a regular "live" token or a marked "dead" token).

Having "ignored" tokens in the regular database calls for adding an
"ignore" flag or using an unlikely value (such as 0xFFFFFFFF).  I expect
special checks would quickly spread throughout bogofilter, which would
be bad.
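
To make the concern concrete, here is a rough sketch of the marked-token
idea.  It is Python, not bogofilter code: a plain dict stands in for the
wordlist db, and IGNORED, lookup and register are hypothetical names
chosen for this example.

```python
IGNORED = 0xFFFFFFFF  # assumed sentinel: an unlikely 32-bit counter value


def lookup(db, token):
    """One lookup per token; returns None for unknown or ignored tokens."""
    counts = db.get(token)
    if counts is None or counts[0] == IGNORED:
        return None
    return counts


def register(db, token, spam, ham):
    """Add to a token's counts; this path needs the sentinel check too."""
    old = db.get(token, (0, 0))
    if old[0] == IGNORED:
        return  # must not turn a marked "dead" token back into a live one
    db[token] = (old[0] + spam, old[1] + ham)


# Toy database: "the" is marked as ignored, "viagra" is a live token.
db = {"the": (IGNORED, IGNORED), "viagra": (12, 0)}
register(db, "the", 1, 0)
register(db, "viagra", 1, 0)
```

Even in this toy, both the scoring path and the update path have to know
about the mark.  I would expect dumping, loading and maintenance code to
need the same check.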

David