Ignore lists [was: Keeping the cruft out ...]
David Relson
relson at osagesoftware.com
Thu Mar 4 22:58:21 CET 2004
On Thu, 4 Mar 2004 22:46:25 +0100 (MET)
Pavel Kankovsky wrote:
> On Wed, 3 Mar 2004, David Relson wrote:
>
> > It was eventually realized that, since the ignore list was likely
> > pretty small, most all words would require _two_ searches when an
> > ignore list was used.
>
> One could put everything into one db file but mark ignored entries:
> perhaps using a special value of counters (e.g. spam count = ham count
> = 0 or ~0UL) etc. The programs would recognize the mark and skip
> marked tokens. A single db lookup per token.
Hi Pavel,
It _could_ be done that way, but it seems messy. The old code required
the existence of an ignore.db file (which could be built from a simple
text file using bogoutil). An alternative approach would be to read
the text file directly.
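A rough sketch of that text-file approach, in Python rather than
bogofilter's C. The file format (one token per line, '#' for comments)
and the function names are assumptions made for illustration, not
anything bogofilter defines. The list is loaded into memory once, so
each token still costs only one search of the wordlist:

```python
import io


def load_ignore_list(stream):
    """Read one token per line; skip blank lines and '#' comments."""
    ignored = set()
    for line in stream:
        token = line.strip()
        if token and not token.startswith("#"):
            ignored.add(token)
    return ignored


def live_tokens(tokens, ignored):
    """Return the tokens that should go on to the wordlist lookup."""
    return [t for t in tokens if t not in ignored]


# An in-memory file stands in for the plain text ignore list.
sample = io.StringIO("# tokens to ignore\nthe\nand\n\n")
ignored = load_ignore_list(sample)
kept = live_tokens(["the", "viagra", "and", "meeting"], ignored)
```

Since the ignore list is expected to be small, holding it in a set
should be cheap, and the set membership test replaces the second
database search.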
> There would be a small drawback: the list of ignored tokens would be
> lost whenever a db was rebuilt from scratch from a corpus. Perhaps
> a separate db of ignored tokens should be kept around but it would not
> be used until a new token was to be added to the main db (in this
> case, the program would check ignored.db to determine whether it
> should add a regular "live" token or a marked "dead" token).
Having "ignored" tokens in the regular database calls for adding an
"ignore" flag or using an unlikely value (such as 0xFFFFFFFF). I expect
special checks would quickly spread throughout bogofilter, which would
be bad.
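To make the concern concrete, here is a small Python sketch of the
single-database scheme. A dict stands in for the wordlist, and the
sentinel and helper names are hypothetical. Note that both the writer
and the reader need their own check for the mark:

```python
IGNORED = 0xFFFFFFFF  # sentinel count, assumed never to be a real count


def is_ignored(counts):
    """True if both counters carry the sentinel mark."""
    spam, ham = counts
    return spam == IGNORED and ham == IGNORED


def add_token(db, ignore_list, token, spam=0, ham=0):
    """Add or update a token; a new token on the ignore list is stored marked."""
    if token not in db:
        if token in ignore_list:
            db[token] = (IGNORED, IGNORED)  # "dead" token
            return
        db[token] = (0, 0)  # "live" token
    if is_ignored(db[token]):
        return  # special check needed in every writer
    s, h = db[token]
    db[token] = (s + spam, h + ham)


def lookup(db, token):
    """One lookup per token; returns None for unknown or ignored tokens."""
    counts = db.get(token)
    if counts is None or is_ignored(counts):
        return None  # special check needed in every reader
    return counts


db = {}
ignore_list = {"the"}
add_token(db, ignore_list, "the", spam=1)
add_token(db, ignore_list, "offer", spam=2)
add_token(db, ignore_list, "offer", ham=1)
```

Any other code that touches the counts (dumping, merging, pruning and
so on) would need the same is_ignored() test, which is the spread I am
worried about.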
David