Crm114 style context matching. Phrases and partial phrases.
Peter Bishop
pgb at adelard.com
Sun May 18 18:57:30 CEST 2003
Greg Louis's results suggest that database size will increase A LOT
with 4-word phrases, maybe 20 times bigger.
(But I guess 2- and 3-word phrases would not be so bad.)
Clearly Greg needs to find out whether storing phrases offers any significant
improvement. And (I guess) Greg is going to compare performance with
different phrase lengths.
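As a rough illustration of why the phrase length matters for size, here is a
minimal sketch that counts the distinct N-word phrases in a token stream for
N = 1 to 4. The function names (phrases, distinct_phrase_counts) and the sample
text are made up for this example; this is not how bogofilter or CRM114
tokenizes mail.

```python
from collections import Counter


def phrases(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_phrase_counts(tokens, max_n=4):
    """Map each phrase length 1..max_n to the number of distinct phrases."""
    return {n: len(Counter(phrases(tokens, n))) for n in range(1, max_n + 1)}


# Hypothetical sample text; a real test would use the training corpus.
tokens = "buy cheap pills now buy cheap watches now".split()
counts = distinct_phrase_counts(tokens)
for n, count in counts.items():
    print(f"N={n}: {count} distinct phrases")
```

A toy sample like this is far too small to show the growth Greg measured;
running the same count over a real spam and non-spam corpus would be needed
to see how quickly the number of entries rises with N.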
Maybe a compromise using word pairs (N=2) will give some
improvement without too much size increase.
But if the databases get too big, here is a possible thought for
restricting database size when storing phrases. Before adding a phrase
to the spam list:
1) check that all words are frequent in spam messages
(e.g. pbad > 0.05)
2) check that all words indicate spam
(e.g. pbad / pgood > 1.5)
A similar rule applies when adding phrases to the goodlist.
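The two checks above can be sketched roughly as follows. This assumes pbad
and pgood are per-word values looked up from dictionaries (the fraction of
spam and non-spam messages containing the word); that interpretation, the
function name keep_phrase, and the handling of words never seen in good mail
are assumptions made for this example. The thresholds 0.05 and 1.5 are the
example values given above.

```python
def keep_phrase(words, pbad, pgood, min_freq=0.05, min_ratio=1.5):
    """Return True if every word in the phrase passes both checks.

    pbad and pgood are assumed to map a word to the fraction of spam
    and non-spam messages that contain it (hypothetical data layout).
    """
    for word in words:
        bad = pbad.get(word, 0.0)
        good = pgood.get(word, 0.0)
        # Check 1: the word must be frequent in spam.
        if bad <= min_freq:
            return False
        # Check 2: the word must indicate spam. A word never seen in
        # good mail is treated as passing (assumption; avoids dividing by 0).
        if good > 0 and bad / good <= min_ratio:
            return False
    return True


# Made-up numbers, for illustration only.
pbad = {"cheap": 0.2, "pills": 0.1, "the": 0.9, "xqzv": 0.001}
pgood = {"cheap": 0.02, "pills": 0.01, "the": 0.9}

print(keep_phrase(("cheap", "pills"), pbad, pgood))  # both words pass
print(keep_phrase(("the", "pills"), pbad, pgood))    # "the" is not discriminating
print(keep_phrase(("cheap", "xqzv"), pbad, pgood))   # "xqzv" is too rare
```

For the goodlist, the same function could be called with the two tables
swapped, i.e. keep_phrase(words, pgood, pbad).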
The rationale for this is that we don't want the databases to be full of "rare
phrases", e.g. the junk letter sequences added to make each spam unique.
In any case, "rare phrases" are not going to be found in many messages,
so they won't be much use for spam detection.
Also we don't want phrases that are equally likely to exist in both lists
as they will not give much discrimination.
--
Peter Bishop
pgb at adelard.com
pgb at csr.city.ac.uk