Ignore lists [was: Keeping the cruft out ...]

Thu Mar 4 22:46:25 CET 2004

On Wed, 3 Mar 2004, David Relson wrote:

> It was eventually realized that, since the ignore list was likely
> pretty small, most all words would require _two_ searches when an
> ignore list was used.

One could put everything into one db file but mark the ignored entries,
perhaps with a special counter value (e.g. spam count = ham count = 0,
or ~0UL). The programs would recognize the mark and skip marked tokens,
so each token would still cost a single db lookup.
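
To make the idea concrete, here is a rough sketch in Python rather than
anything taken from the actual code. A plain dict stands in for the db
file, and the names DEAD, lookup and score_tokens are made up for the
illustration. The mark is assumed to be both counters set to the 32-bit
equivalent of ~0UL.

```python
# Sketch only: a dict stands in for the on-disk db; all names are hypothetical.
DEAD = (0xFFFFFFFF, 0xFFFFFFFF)  # assumed mark: both counters all ones (32-bit ~0UL)

def lookup(db, token):
    """Return (spam, ham) counts, or None if the token is unknown or marked."""
    counts = db.get(token)  # the only db access made for this token
    if counts is None or counts == DEAD:
        return None
    return counts

def score_tokens(db, tokens):
    """Collect counts for the tokens that are present and not marked as ignored."""
    found = {}
    for tok in tokens:
        counts = lookup(db, tok)
        if counts is not None:
            found[tok] = counts
    return found

db = {
    "viagra": (120, 1),
    "meeting": (2, 85),
    "the": DEAD,  # ignored entry stored alongside the regular ones
}
result = score_tokens(db, ["the", "viagra", "unknown", "meeting"])
```

Here "the" is skipped after one lookup, exactly like a token that is not
in the db at all, so no second search is needed.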

There would be a small drawback: the list of ignored tokens would be
lost whenever a db was rebuilt from scratch from a corpus. Perhaps a
separate db of ignored tokens should be kept around, but it would not be
consulted until a new token was to be added to the main db (in that case,
the program would check ignored.db to determine whether it should add a
regular "live" token or a marked "dead" token).
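
A rough sketch of that registration step, again in Python and again
with invented names (register, rebuild, DEAD): a dict plays the main db
and a set plays ignored.db. The point is only that ignored.db is touched
when a token is first added, never during ordinary lookups.

```python
# Sketch only: dict = main db, set = ignored.db; all names are hypothetical.
DEAD = (0xFFFFFFFF, 0xFFFFFFFF)  # assumed mark for ignored ("dead") tokens

def register(main_db, ignored_db, token, spam=0, ham=0):
    """Add counts for token; consult ignored_db only if the token is new."""
    counts = main_db.get(token)
    if counts is None:
        # New token: the only point where ignored_db is checked.
        if token in ignored_db:
            main_db[token] = DEAD
            return
        counts = (0, 0)
    elif counts == DEAD:
        return  # already marked, leave the entry alone
    main_db[token] = (counts[0] + spam, counts[1] + ham)

def rebuild(corpus, ignored_db):
    """Rebuild the main db from scratch from (token, is_spam) pairs."""
    main_db = {}
    for token, is_spam in corpus:
        register(main_db, ignored_db, token,
                 spam=int(is_spam), ham=int(not is_spam))
    return main_db

ignored_db = {"the", "and"}
corpus = [("the", True), ("viagra", True), ("the", False),
          ("meeting", False), ("viagra", True)]
main_db = rebuild(corpus, ignored_db)
```

After the rebuild, "the" is back in the main db as a dead entry, so the
ignore list survives without costing an extra search per token.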

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."