Crm114 style context matching. Phrases and partial phrases.

Anthony Clarke anthonyc at compsoc.man.ac.uk
Sat May 17 10:44:22 CEST 2003


Hi,

I've cobbled together a preprocessing script which allows phrases and
partial phrases to be categorised the way crm114 does.

It works like this. Suppose you have a sequence of words A B C D E F G H
I J....

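For the first 4 words, A B C D, output A, A_B, A_C, A_B_C, A_D, A_B_D,
A_C_D, A_B_C_D. That is, A on its own and A joined to every in-order
combination of the three words that follow it.

Then move along one word to B C D E and do the same, then C D E F,
and so on.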
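
A minimal sketch of the scheme described above, in Python. This is not
the script from this post: the function name phrase_tokens, the window
parameter and the choice to emit shorter windows at the end of the
sequence are all assumptions made for illustration.

```python
def phrase_tokens(words, window=4):
    """Return the head word of each window joined to every in-order
    subset of the words that follow it inside the window."""
    tokens = []
    for i, head in enumerate(words):
        # Up to window-1 words after the head (fewer near the end;
        # handling the tail this way is an assumption).
        tail = words[i + 1 : i + window]
        # Each bit of mask says whether the matching tail word is kept.
        for mask in range(1 << len(tail)):
            picked = [w for j, w in enumerate(tail) if mask & (1 << j)]
            tokens.append("_".join([head] + picked))
    return tokens


print(phrase_tokens("A B C D E".split())[:8])
# ['A', 'A_B', 'A_C', 'A_B_C', 'A_D', 'A_B_D', 'A_C_D', 'A_B_C_D']
```

Counting the mask upwards happens to reproduce the order of the list
given for A B C D.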

By doing this, you are allowing for missing words, e.g. if you have a
spam which has the phrase BUY VIAGRA NOW!, and you register it, you will
also get a hit for BUY CHEAP INK NOW!, because both phrases produce the
token BUY_NOW!.
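
To see which tokens the two phrases have in common, here is a small
self-contained Python check. It assumes words are split on whitespace
(so the "!" stays attached to NOW) and a window of 4; the helper name
phrase_tokens is made up for this example.

```python
def phrase_tokens(words, window=4):
    """Head word of each window plus every in-order subset of its followers."""
    tokens = []
    for i, head in enumerate(words):
        tail = words[i + 1 : i + window]
        for mask in range(1 << len(tail)):
            picked = [w for j, w in enumerate(tail) if mask & (1 << j)]
            tokens.append("_".join([head] + picked))
    return tokens


registered = set(phrase_tokens("BUY VIAGRA NOW!".split()))
incoming = set(phrase_tokens("BUY CHEAP INK NOW!".split()))
shared = sorted(registered & incoming)
print(shared)
# ['BUY', 'BUY_NOW!', 'NOW!']
```

BUY and NOW! would match with plain single-word tokens anyway; BUY_NOW!
is the extra hit that comes from skipping the words in between.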

crm114 is supposed to be extremely good as a classifier using this method.

I don't think I have enough messages (1600 spam, 200 nonspam) to try out
the tuning scripts and get some firm results for this.

The main disadvantage is that wordlists expand considerably, since each
word now produces up to 8 tokens instead of one.
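
A rough way to gauge the expansion is to count how many tokens a message
of n words emits. The Python sketch below assumes a window of 4 and that
the last few words get shorter windows; token_count is a hypothetical
helper, and the figure is an upper bound on new wordlist entries, since
repeated tokens are only stored once.

```python
def token_count(n_words, window=4):
    """Tokens emitted for n_words: 2**k per word, where k is the number
    of following words that fit in the window (at most window - 1)."""
    return sum(
        1 << min(window - 1, n_words - 1 - i)
        for i in range(n_words)
    )


print(token_count(10))
# 63
```

So the ten words A to J give 63 tokens rather than 10, and for long
messages the factor approaches 8.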

Anthony.






