advice on ignore-files
David Relson
relson at osagesoftware.com
Sat Oct 5 05:48:11 CEST 2002
<x-flowed>
Eric,
Your message touches on lots of interesting subjects...
Performance:
Current code collects words, i.e. gets a set of (unduplicated) words from
the input message. Then the "giant" word lists (spam and non-spam) are
searched for each word. Then spamicity is calculated. Number of searches:
2 big ones for each word.
With ignore lists, the words would be collected as before. Then the
(presumably) small ignore list would be searched. If the word is in the small
list, we're done. If not, search the two giant lists. Then calculate
spamicity (as before). Number of searches: 1 small one for each word and 2
big ones for each word not in the ignore list. Search time will be less
for words found in the ignore list and will be greater for words not in the
ignore list. How much time is saved depends on number of "ignore" words
encountered, which is partially a function of size of the ignore list.
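The lookup order described above can be sketched roughly as follows. This is
only an illustration in Python, not bogofilter's actual code: the function name
`classify_lookups` is made up, and in-memory sets and dicts stand in for the
real word lists.

```python
def classify_lookups(words, ignore_list, spam_list, good_list):
    """Look up each distinct word, consulting the ignore list first.

    Hypothetical sketch: ignore_list is a set, spam_list and good_list are
    dicts mapping word -> count. Returns the number of ignore-list searches,
    the number of giant-list searches, and the counts found per word.
    """
    ignore_searches = 0
    giant_searches = 0
    counts = {}
    for word in set(words):          # the (unduplicated) words of the message
        ignore_searches += 1         # one small search for every word
        if word in ignore_list:
            continue                 # ignored: skip both giant lists
        giant_searches += 2          # one search each in spam and good lists
        counts[word] = (spam_list.get(word, 0), good_list.get(word, 0))
    return ignore_searches, giant_searches, counts
```

The spamicity calculation would then run over `counts` exactly as it does
today; only the words that survive the ignore list reach it.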
To use some random numbers, posit a 1000 word message and a 100 word ignore
list. At best 100 of the 1000 words will be in the ignore list. Assuming
this case, there will be 1000 searches of the ignore list and 1800 searches
of the giant lists. Without ignore lists, there would have been 2000
searches of the giant lists.
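The arithmetic can be checked with a few lines of Python. The function name
`search_counts` and its parameters are made up for this illustration.

```python
def search_counts(message_words, ignored_hits):
    """Return (giant searches without an ignore list,
               ignore-list searches with one,
               giant searches with one)."""
    without_ignore = 2 * message_words               # two giant lists per word
    small = message_words                            # every word hits the ignore list once
    giant = 2 * (message_words - ignored_hits)       # only non-ignored words go on
    return without_ignore, small, giant


print(search_counts(1000, 100))  # (2000, 1000, 1800)
```

In this best case the ignore list saves 200 giant searches at the price of 1000
small ones, so it only pays off if a small search costs less than about a fifth
of a giant search.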
At the moment my basic spamlist has 54,788 words from 3,588 messages and my
goodlist has 221,783 words from 26,469 messages. The two lists differ
significantly in size, which affects something - though I can't say what or
how.
There are lots of possible questions that can be asked about list sizes and
the impact on performance.
Have you measured the performance difference when using ignore lists?
Perhaps the best thing to do is add the code, measure some reasonable
loads, and see how the performance is affected.
Red Herrings:
I noticed the same thing with months as you have. I also noticed it with
userids, as I had lots of spam, but little saved good mail, for some of my
users. lexer.l is already discarding lots of HTML tags and keywords. Too
bad it can't be easily extended to include ignore lists.
List maintenance:
You've obviously thought about the issues a great deal. You've done a good
job of presenting pros and cons.
I'm inclined to pick choice 2 - a user-maintained plain text file with
convert-to-db capability. A plain text file is very easy to
maintain. Given manual maintenance, the convert-to-db capability will be
used infrequently. Given Gyepi's lock handling routines, it should be easy
to avoid race conditions when using the convert utility.
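To make choice 2 concrete, here is a rough sketch of what such a convert
utility might look like. Everything in it is an assumption for illustration:
the function name `convert_ignore_list`, the one-word-per-line format, the `#`
comment convention, the separate lock file, and the use of Python's `dbm` and
`fcntl.flock` (Unix only) in place of the real database code and Gyepi's lock
routines.

```python
import dbm
import fcntl


def convert_ignore_list(text_path, db_path, lock_path):
    """Rebuild the ignore db from a plain text file, one word per line.

    Hypothetical sketch: holds an exclusive lock on lock_path for the whole
    rebuild so that two converters cannot write the db at the same time.
    Returns the number of distinct words written.
    """
    seen = set()
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)        # blocks until the lock is free
        try:
            # "n" always creates a new, empty database
            with open(text_path, encoding="utf-8") as src, \
                    dbm.open(db_path, "n") as db:
                for line in src:
                    word = line.strip()
                    if not word or word.startswith("#"):
                        continue                     # skip blanks and comments (assumed format)
                    db[word] = "1"                   # the value is just a presence marker
                    seen.add(word)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return len(seen)
```

Readers of the db would need to take the same lock (shared rather than
exclusive) for this to actually prevent races.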
I don't know if this message is helpful or not. It's an attempt to explore
thoughts about ignore lists and to determine what remains to be learned. I
hope this message _has_ been helpful.
David
</x-flowed>
More information about the bogofilter-dev
mailing list