A case for Markovian
tallison at tacocat.net
Tue May 11 19:03:20 EDT 2004
Tom Anderson wrote:
> I just received a "nigerian" spam that was scored ~0.16, which is in my
> "unsure" zone, but almost a false negative. While most of the tokens in
> this "long story" spam are hammy on their own, consider phrases such as "the
> former president of Kenya" or "over this confidential matter" or
> "transferring funds to foreign accounts". It's phrases such as these which
> immediately alert my own internal filter that this is spam. Using a
> single-token filter, I can possibly get this into my spam zone registering
> it enough such that tokens like "Kenya" or "Gideon" or the IP/ASN are
> extremely spammy, but even then the glut of hammy words will weigh heavily.
> I don't believe that using ESFs or any other single-token method addresses
> this problem.
> So what's the downside of using multiple-word phrases? Obviously the
> database size. But trimming the database regularly to remove hapaxes and
> perhaps very neutral common phrases should be enough to keep it within
> reason. And if the phrase "window" size is configurable, then individuals
> can decide their own accuracy/size trade-off.
Consider that each word has variations that can still be understood:
vi@gr@
you get the picture. But if I keep going I'll be blocked by everyone's
spam filters.
Now multiply that by the variations_that_become_ava.ila.ble to_some 1 else
and the problem is no longer just the 30,000 words squared, but also the
spelling variations of each word. If you guess at just 5 variations plus
the original, that is 180,000 tokens per position, and you hit
32,400,000,000 pairs instead of the 900,000,000 that we originally
remarked upon. That's a big database for only two words.
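
To make the arithmetic above easy to check, here is a rough sketch in
Python. The function name phrase_space and its parameters are made up for
illustration and are not part of bogofilter. It only computes the
worst-case number of distinct phrases, assuming every word can follow
every other word, which a real corpus will never reach.

```python
def phrase_space(vocab_size: int, variants_per_word: int = 0, window: int = 2) -> int:
    """Worst-case count of distinct phrases of `window` consecutive tokens.

    Hypothetical helper: assumes every token can appear in every position,
    so the result is an upper bound, not a prediction of real database size.
    """
    # Each word contributes itself plus its obfuscated spellings.
    effective_vocab = vocab_size * (1 + variants_per_word)
    return effective_vocab ** window


# 30,000 plain words, two-word phrases
print(f"{phrase_space(30_000):,}")      # 900,000,000
# same vocabulary with 5 spelling variations per word
print(f"{phrase_space(30_000, 5):,}")   # 32,400,000,000
```

Raising window to 3 multiplies the bound by the effective vocabulary
again, so the configurable phrase window makes the size trade-off steeper
with every extra word.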