A case for Markovian

Tom Anderson tanderso at oac-design.com
Tue May 11 16:57:30 CEST 2004


I just received a "nigerian" spam that was scored ~0.16, which is in my
"unsure" zone, but almost a false negative.  While most of the tokens in
this "long story" spam are hammy on their own, consider phrases such as "the
former president of Kenya" or "over this confidential matter" or
"transferring funds to foreign accounts".  It's phrases like these that
immediately alert my own internal filter that this is spam.  Using a
single-token filter, I could possibly push this into my spam zone by
training on it enough that tokens like "Kenya" or "Gideon" or the IP/ASN
become extremely spammy, but even then the glut of hammy words will weigh
heavily.
I don't believe that using ESFs or any other single-token method addresses
this problem.

So what's the downside of using multiple-word phrases?  Obviously the
database size.  But trimming the database regularly to remove hapaxes and
perhaps very neutral common phrases should be enough to keep it within
reason.  And if the phrase "window" size is configurable, then individuals
can decide their own accuracy/size trade-off.
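To make the idea concrete, here is a minimal sketch (my own illustration, not bogofilter's actual tokenizer) of generating phrase tokens with a configurable window size, so that a phrase like "former president of Kenya" can be scored as a single token alongside its constituent words:

```python
def phrase_tokens(text, window=3):
    """Yield all word n-grams of length 1..window as phrase tokens.

    window=1 reduces to ordinary single-token behavior; larger windows
    trade database size for phrase-level accuracy.
    """
    words = text.lower().split()
    for n in range(1, window + 1):
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])

tokens = list(phrase_tokens("the former president of Kenya", window=3))
# includes "kenya" but also "former president of" and
# "president of kenya", which can be trained as spammy phrases
```

Each n-gram becomes one entry in the token database, which is exactly why trimming hapaxes matters: the token count grows roughly linearly in the window size.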

Tom



