advice on ignore-files

Sat Oct 5 22:33:15 CEST 2002

Eric,

Your question essentially breaks down into two parts:
1) How do we make ignore-list maintenance easy?
2) How do we make ignore-list functionality high performance?

I think that a two-step maintenance process for ignore-lists is justified
for a significant performance gain. Don't just use a flat text file with a
linear search, and accept the performance penalty to make the administration
and code simple.

Based on my experience, binary searching a sorted flat text file usually
outperforms a BerkeleyDB lookup when the list of words is "small".
Obviously, once the list of data to be scanned becomes too large to fit in
an ordinary disk buffer, it starts getting faster to hash it or btree it
instead. Most measurements I have taken showed that database searches start
to look attractive when you have more than a thousand words in your list.
Ignore lists are almost always going to be "small".

I would suggest using a flat file, and simply require the administrator to
run the file through the Unix "sort" command after he/she modifies the file.
You can keep the code simple by using a DB_RECNO interface to the flat file,
which will automatically mmap it for you. The line number of the file
becomes the index of the array. Using a binary search for the word you want
in this array is a good approach for "small" data sets.

I think that a database format should be available as an option for long
lists, which should use the same database format that we use for wordlists.
In the event that an administrator has an ignore-list that's several
thousand words long, he can select the database format we use for wordlists
and get better performance.

Adrian