wordlist size [was: Crm114 style context matching...]

David Relson relson at osagesoftware.com
Sun May 18 19:16:51 CEST 2003


At 12:57 PM 5/18/03, you wrote:

>Greg Louis's results suggest that database size will increase A LOT
>with 4 word phrases - maybe 20 times bigger
>(But I guess 2 and 3 word phrases would not be so bad)
>
>Clearly Greg needs to find out if storing phrases offers any significant
>improvement. And (I guess) Greg is going to compare performance with
>different phrase lengths.
>
>Maybe a compromise using word pairs (N=2) will give an
>improvement without too much size increase.
>But if the databases get too big, here is a possible way of
>restricting database size when storing phrases:
>
>1) check that all words are frequent in spam messages
>    (e.g. pbad > 0.05)
>2) check that all words indicate spam
>    (e.g. pbad / pgood > 1.5)
>
>A similar rule applies when adding phrases to the goodlist.
>
>The rationale for this is that we don't want the databases to be full of
>"rare phrases", e.g. the junk letter sequences added to make spams unique.
>In any case, "rare phrases" are not going to be found in many messages,
>so they won't be much use for spam detection.
>
>Also we don't want phrases that are equally likely to exist in both lists
>as they will not give much discrimination.
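
The pruning rule quoted above can be sketched roughly as follows.  This is 
only an illustration: the function name, the dict-based probability tables 
and the sample numbers are made up for the example, and the thresholds are 
the ones suggested in the quoted text, not tested values.

```python
def keep_spam_phrase(words, pbad, pgood, min_pbad=0.05, min_ratio=1.5):
    """Return True if every word in the phrase passes both quoted checks.

    pbad and pgood are hypothetical dicts mapping a word to its
    probability of appearing in spam and in ham respectively.
    """
    for word in words:
        bad = pbad.get(word, 0.0)
        good = pgood.get(word, 0.0)
        # Check 1: the word must be frequent in spam (pbad > 0.05).
        if bad <= min_pbad:
            return False
        # Check 2: the word must indicate spam (pbad / pgood > 1.5).
        # Written as a multiplication so that pgood == 0 is not an error.
        if bad <= min_ratio * good:
            return False
    return True


# Made-up probabilities, for illustration only.
pbad = {"free": 0.20, "money": 0.10, "the": 0.60, "xqzv": 0.001}
pgood = {"free": 0.05, "money": 0.02, "the": 0.58, "xqzv": 0.0}

print(keep_spam_phrase(["free", "money"], pbad, pgood))  # stored
print(keep_spam_phrase(["free", "the"], pbad, pgood))    # "the" does not discriminate
print(keep_spam_phrase(["free", "xqzv"], pbad, pgood))   # "xqzv" is too rare
```

For the goodlist, the same function could be called with the two tables 
swapped.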

Greg's first step is to determine how useful phrases are.  A second step 
could be to measure how phrase length relates to usefulness and wordlist 
size.  Only once usefulness is established is it worth worrying about 
managing wordlist size.

FWIW, Greg & I have code for a single wordlist that holds both spam and ham 
counts for each token.  It can be faster, depending on how the BerkeleyDB 
cache size is set.  The default cache size gives poor results.  A cache of 
roughly 25% of wordlist size, or one larger than the wordlist, works well.  
A cache of about 2/3 of wordlist size gives really awful performance.  These 
numbers are approximate and we're still trying to understand the 
relationship between cache size and performance.
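
As a rough illustration of the two ideas above, here is a minimal sketch.  
It does not show bogofilter's actual code or on-disk format: the function 
names and the in-memory dicts are assumptions made for the example.  The 
only Berkeley DB detail relied on is that its set_cachesize call takes the 
cache size split into a gigabytes part and a bytes part (plus a cache 
count), so the helper returns the size in that form.  The 25% default is 
the approximate figure mentioned above, not a tuned value.

```python
GIB = 1 << 30  # bytes in one gibibyte


def merge_wordlists(spam_counts, ham_counts):
    """Combine two token->count dicts into one token->(spam, ham) dict."""
    merged = {}
    for token in set(spam_counts) | set(ham_counts):
        merged[token] = (spam_counts.get(token, 0), ham_counts.get(token, 0))
    return merged


def cache_size_args(wordlist_bytes, fraction=0.25):
    """Return (gbytes, bytes) for a cache sized as a fraction of the wordlist.

    The pair matches the gbytes/bytes split that Berkeley DB's
    set_cachesize expects; the fraction itself is only a starting point.
    """
    cache = int(wordlist_bytes * fraction)
    return divmod(cache, GIB)


# Hypothetical counts, for illustration only.
wordlist = merge_wordlists({"free": 12}, {"free": 1, "meeting": 7})
print(wordlist["free"])      # one lookup yields both counts
print(cache_size_args(40 * 1024 * 1024))
```

With a single lookup returning both counts, each token costs one database 
read instead of two, which is one plausible reason for the speedup.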




