Crm114 style context matching. Phrases and partial phrases.

Peter Bishop pgb at adelard.com
Sun May 18 18:57:30 CEST 2003


Greg Louis's results suggest that database size will increase a lot
with 4-word phrases, maybe to 20 times bigger.
(But I guess 2- and 3-word phrases would not be so bad.)
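
To make "N-word phrases" concrete, here is a rough sketch, assuming a
phrase is simply N consecutive words from the message. The function name
is made up for illustration, it is not bogofilter or crm114 code, and it
does not model the partial phrases mentioned in the subject line.

```python
def word_phrases(words, n):
    """Return every run of n consecutive words, joined by single spaces.

    Hypothetical helper for illustration only.
    """
    # A message with fewer than n words yields no phrases at all.
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


if __name__ == "__main__":
    tokens = "buy cheap pills now".split()
    for n in (1, 2, 3, 4):
        print(n, word_phrases(tokens, n))
```

Each message contributes about as many phrases as it has words, but
longer phrases are much less likely to repeat from one message to the
next, so the number of distinct entries to store grows.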

Clearly Greg needs to find out whether storing phrases offers any
significant improvement. And (I guess) Greg is going to compare
performance with different phrase lengths.

Maybe a compromise using word pairs (N=2) will give an improvement
without too much size increase. But if the databases get too big, here
is a possible thought for restricting database size when storing
phrases. Before adding a phrase to the spam list:

1) check that all words are frequent in spam messages
   (e.g. pbad > 0.05)
2) check that all words indicate spam
   (e.g. pbad / pgood > 1.5)

A similar rule applies when adding phrases to the goodlist
(with pbad and pgood swapped).
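
A rough sketch of the two checks, for the spam side only. The function
name, the dictionaries and the default thresholds are assumptions for
illustration (the thresholds are just the example values above); this
is not bogofilter code. pbad and pgood are taken to be per-word
probabilities that have already been computed.

```python
def should_store_spam_phrase(words, pbad, pgood,
                             min_pbad=0.05, min_ratio=1.5):
    """Return True if every word in the phrase passes both checks.

    pbad / pgood: dicts mapping a word to its spam / non-spam
    probability (hypothetical inputs; unknown words count as 0.0).
    """
    for word in words:
        bad = pbad.get(word, 0.0)
        good = pgood.get(word, 0.0)
        # Check 1: the word must be frequent in spam messages.
        if bad <= min_pbad:
            return False
        # Check 2: the word must indicate spam, pbad / pgood > min_ratio.
        # Written as a multiplication so that pgood == 0 does not divide
        # by zero.
        if bad <= min_ratio * good:
            return False
    return True


if __name__ == "__main__":
    pbad = {"free": 0.2, "money": 0.1, "the": 0.5, "xqzv": 0.001}
    pgood = {"free": 0.05, "money": 0.02, "the": 0.5}
    for phrase in (["free", "money"], ["free", "the"], ["free", "xqzv"]):
        print(phrase, should_store_spam_phrase(phrase, pbad, pgood))
```

The goodlist version would be the same function called with the two
dictionaries swapped.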

The rationale for this is that we don't want the databases to be full
of "rare phrases", e.g. the junk letter sequences added to make each
spam unique. In any case, "rare phrases" are not going to be found in
many messages, so they won't be much use for spam detection.

Also we don't want phrases that are equally likely to exist in both lists 
as they will not give much discrimination.



-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





