Ignore lists [was: Keeping the cruft out ...]
peak at argo.troja.mff.cuni.cz
Thu Mar 4 16:46:25 EST 2004
On Wed, 3 Mar 2004, David Relson wrote:
> It was eventually realized that, since the ignore list was likely
> pretty small, most all words would require _two_ searches when an
> ignore list was used.
One could put everything into one db file but mark the ignored entries,
perhaps with a special counter value (e.g. spam count = ham count = 0,
or ~0UL). The programs would recognize the mark and skip marked tokens,
so each token would still need only a single db lookup.
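A rough sketch of the marking idea, in Python for brevity. It assumes the
db behaves like a mapping from token to a (spam, ham) pair; the all-ones
sentinel and the names IGNORED, counts_for and live_tokens are made up for
illustration and are not taken from the actual code:

```python
# Hypothetical mark: both counters set to all-ones (the ~0UL idea, 32-bit here).
IGNORED = (0xFFFFFFFF, 0xFFFFFFFF)

def counts_for(db, token):
    """One lookup per token: return (spam, ham), or None if marked ignored."""
    counts = db.get(token, (0, 0))   # unknown tokens count as (0, 0)
    return None if counts == IGNORED else counts

def live_tokens(db, tokens):
    """Drop every token whose db entry carries the ignore mark."""
    return [t for t in tokens if counts_for(db, t) is not None]

db = {"viagra": (120, 1), "the": IGNORED, "meeting": (2, 80)}
kept = live_tokens(db, ["the", "viagra", "unseen", "meeting"])
```

Using all-ones rather than 0/0 as the mark keeps "ignored" distinct from
"never seen", at the price of reserving one counter value.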
There would be a small drawback: the list of ignored tokens would be
lost whenever a db was rebuilt from scratch from a corpus. Perhaps a
separate db of ignored tokens should be kept around, but it would be
consulted only when a new token was about to be added to the main db: in
that case, the program would check ignored.db to determine whether it
should add a regular "live" token or a marked "dead" token.
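The insert-time check could look roughly like this. Again a sketch only:
the dbs are modelled as a Python dict and set, and IGNORED, register and
rebuild are hypothetical names, not existing functions:

```python
# Hypothetical mark for "dead" tokens in the main db.
IGNORED = (0xFFFFFFFF, 0xFFFFFFFF)

def register(main_db, ignored_db, token, spam=0, ham=0):
    """Add counts for a token; consult ignored_db only for brand-new tokens."""
    counts = main_db.get(token)
    if counts is None:
        if token in ignored_db:          # the one place ignored_db is read
            main_db[token] = IGNORED     # store a marked "dead" token
            return
        counts = (0, 0)                  # start a regular "live" token
    if counts == IGNORED:
        return                           # dead tokens never accumulate counts
    main_db[token] = (counts[0] + spam, counts[1] + ham)

def rebuild(corpus, ignored_db):
    """Rebuild the main db from scratch; the ignore marks are recreated."""
    main_db = {}
    for tokens, is_spam in corpus:
        for t in tokens:
            register(main_db, ignored_db, t,
                     spam=int(is_spam), ham=int(not is_spam))
    return main_db

corpus = [(["the", "offer"], True), (["the", "meeting"], False)]
main_db = rebuild(corpus, ignored_db={"the"})
```

After the rebuild, "the" is present in the main db only as a dead entry,
so scoring still needs just the one lookup and never touches ignored.db.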
--Pavel Kankovsky aka Peak [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."