A case for Markovian
Tom Allison
tallison at tacocat.net
Wed May 12 01:03:20 CEST 2004
Tom Anderson wrote:
> I just received a "nigerian" spam that was scored ~0.16, which is in my
> "unsure" zone, but almost a false negative. While most of the tokens in
> this "long story" spam are hammy on their own, consider phrases such as "the
> former president of Kenya" or "over this confidential matter" or
> "transferring funds to foreign accounts". It's phrases such as these which
> immediately alert my own internal filter that this is spam. Using a
> single-token filter, I could possibly push this into my spam zone by
> registering it enough times that tokens like "Kenya" or "Gideon" or the
> IP/ASN become extremely spammy, but even then the glut of hammy words
> will weigh heavily.
> I don't believe that using ESFs or any other single-token method addresses
> this problem.
>
> So what's the downside of using multiple-word phrases? Obviously the
> database size. But trimming the database regularly to remove hapaxes and
> perhaps very neutral common phrases should be enough to keep it within
> reason. And if the phrase "window" size is configurable, then individuals
> can decide their own accuracy/size trade-off.
>
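The configurable phrase-window idea above might look something like this minimal sketch; the window size, the word pattern, and the function name are illustrative assumptions, not bogofilter's actual tokenizer:

```python
import re

def phrase_tokens(text, window=2):
    """Yield every run of `window` consecutive words as a single token."""
    # A deliberately simple word pattern; real tokenizers are fussier.
    words = re.findall(r"[a-z0-9@.$'-]+", text.lower())
    for i in range(len(words) - window + 1):
        yield " ".join(words[i:i + window])

# With window=3, "the former president of Kenya" produces phrase tokens
# like "the former president" that a single-token filter never sees.
print(list(phrase_tokens("the former president of Kenya", window=3)))
```

Trimming hapaxes after each training pass, as suggested, would keep the resulting token database from growing with every novel phrase.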
Consider that each word has variations that can still be understood:
Viagra
vi at gr@
v.i.a....
you get the picture. But if I keep going I'll be blocked by everyone's
bogofilter ruleset!
Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
and the problem isn't just the 30,000 words squared, but the spelling
variations of each word as well. If you guess at 5 variations plus the
original, you hit 32,400,000,000 instead of the 900,000,000 that we
originally remarked upon. That's a big database for only two words.
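The arithmetic above checks out; here it is spelled out, with the vocabulary size and the 5-variations-per-word figure taken as the guesses they are:

```python
vocab = 30_000                 # assumed base vocabulary size
pairs = vocab ** 2             # all two-word phrases over the base vocabulary
print(pairs)                   # 900,000,000

variants = 6                   # the original spelling plus 5 guessed obfuscations
obfuscated_vocab = vocab * variants
obfuscated_pairs = obfuscated_vocab ** 2
print(obfuscated_pairs)        # 32,400,000,000 -- a 36x blowup for two-word tokens
```

Each extra word in the phrase window multiplies the token space by another factor of the (obfuscated) vocabulary, which is why trimming hapaxes becomes essential rather than optional.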
More information about the Bogofilter mailing list