advice on ignore-files

Sat Oct 5 05:48:11 CEST 2002

<x-flowed>
Eric,

Your message touches on lots of interesting subjects...

Performance:

The current code collects words, i.e. it builds a set of unique words from
the input message.  Each word is then looked up in both "giant" word lists
(spam and non-spam), and the spamicity is calculated.  Number of searches:
2 big ones for each word.

With ignore lists, the words would be collected as before.  Then the
(presumably) small ignore list would be searched.  If a word is in the
small list, it is done.  If not, the two giant lists are searched.  Then
spamicity is calculated (as before).  Number of searches: 1 small one for
each word and 2 big ones for each word not in the ignore list.  Search time
will be less for words found in the ignore list and greater for words not
in it.  How much time is saved depends on the number of "ignore" words
encountered, which is partly a function of the size of the ignore list.
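
The lookup order described above can be sketched roughly as follows.  This
is only an illustration in Python, not the existing code: the function name
lookup_counts is made up, and plain in-memory sets and dicts stand in for
the ignore list and the two giant word lists.

```python
def lookup_counts(message_words, ignore_list, spam_list, good_list):
    """Look up each unique word, consulting the ignore list first.

    Returns (counts, small_searches, big_searches), where counts maps each
    non-ignored word to its (spam_count, good_count) pair.
    """
    counts = {}
    small_searches = 0
    big_searches = 0
    for word in set(message_words):      # words are unduplicated first
        small_searches += 1              # one search of the small list
        if word in ignore_list:
            continue                     # ignored: skip both giant lists
        big_searches += 2                # one search per giant list
        counts[word] = (spam_list.get(word, 0), good_list.get(word, 0))
    return counts, small_searches, big_searches
```

The spamicity calculation would then run over counts only, so ignored
words never contribute to the score.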

To use some random numbers, posit a 1000 word message and a 100 word ignore
list.  At best, 100 of the 1000 words will be in the ignore list.  Assuming
this case, there will be 1000 searches of the ignore list and 1800 searches
of the giant lists (2 for each of the remaining 900 words).  Without ignore
lists, there would have been 2000 searches of the giant lists.
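
The same arithmetic, written out as a small Python sketch.  The function
names and the per-search cost parameters are assumptions added for
illustration; no such costs have been measured.

```python
def search_counts(message_words, ignored_hits):
    """Count searches with and without an ignore list."""
    return {
        "without": {"ignore": 0, "giant": 2 * message_words},
        "with": {
            "ignore": message_words,
            "giant": 2 * (message_words - ignored_hits),
        },
    }


def net_saving(message_words, ignored_hits, small_cost, big_cost):
    """Time saved by the ignore list, given hypothetical per-search costs.

    Each ignored word avoids 2 giant searches; every word pays for 1 small
    search.  A positive result means the ignore list is a net win.
    """
    return 2 * ignored_hits * big_cost - message_words * small_cost
```

With the numbers above, search_counts(1000, 100) gives 1000 ignore-list
searches plus 1800 giant-list searches, against 2000 giant-list searches
without the ignore list.  Under this simple model the ignore list only pays
off in that case if a giant-list search costs more than 5 times as much as
an ignore-list search.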

At the moment my basic spamlist has 54,788 words from 3,588 messages and my 
goodlist has 221,783 words from 26,469 messages.  The two lists differ 
significantly in size, which affects something - though I can't say what or 
how.

There are lots of possible questions that can be asked about list sizes and 
the impact on performance.

Have you measured the performance difference when using ignore lists?

Perhaps the best thing to do is add the code, measure some reasonable 
loads, and see how the performance is affected.

Red Herrings:

I have noticed the same thing with months that you have.  I also noticed it
with userids, as I had lots of spam, but little saved good mail, for some
of my users.  lexer.l is already discarding lots of html tags and keywords.
Too bad it can't be easily extended to include ignore lists.

List maintenance:

You've obviously thought about the issues a great deal.  You've done a good 
job of presenting pros and cons.

I'm inclined to pick choice 2 - user maintained plain text file with 
convert-to-db capability.  A plain text file is very easy to 
maintain.  Given manual maintenance, the convert-to-db capability will be 
used infrequently.  Given Gyepi's lock handling routines, it should be easy 
to avoid race conditions when using the convert utility.
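
A rough sketch of what such a convert utility could look like, in Python
rather than C.  Everything here is an assumption for illustration: the
function name, the one-word-per-line file format with "#" comments, and the
use of Python's dbm and fcntl.flock in place of Gyepi's lock handling
routines.

```python
import dbm
import fcntl


def convert_ignore_list(text_path, db_path, lock_path):
    """Rebuild the ignore db from a plain text file (one word per line).

    An exclusive lock on lock_path is held for the whole rebuild, so two
    conversions cannot interleave.  Returns the number of distinct words.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)    # blocks until available
        try:
            words = set()
            with open(text_path, encoding="utf-8") as src:
                for line in src:
                    word = line.strip().lower()
                    if word and not word.startswith("#"):
                        words.add(word)
            with dbm.open(db_path, "n") as db:   # "n": always start empty
                for word in words:
                    db[word] = "1"
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return len(words)
```

Readers of the db would need to take the same lock (shared rather than
exclusive) for this to actually prevent a race with the converter.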

I don't know if this message is helpful or not.  It's an attempt to explore 
thoughts about ignore lists and to determine what remains to be learned.  I 
hope this message _has_ been helpful.

David

</x-flowed>