A case for Markovian
Jason Crowe
jcrowe at midwestglove.com
Tue May 11 17:35:47 CEST 2004
Isn't this basically what CRM114 does?
> -----Original Message-----
> From: bogofilter-bounces+jcrowe=midwestglove.com at bogofilter.org
> [mailto:bogofilter-bounces+jcrowe=midwestglove.com at bogofilter.org]
> On Behalf Of Tom Anderson
> Sent: Tuesday, May 11, 2004 9:58 AM
> To: bogofilter at bogofilter.org
> Subject: A case for Markovian
>
>
> I just received a "Nigerian" spam that was scored ~0.16, which is in my
> "unsure" zone, but almost a false negative. While most of the tokens in
> this "long story" spam are hammy on their own, consider phrases such as
> "the former president of Kenya" or "over this confidential matter" or
> "transferring funds to foreign accounts". It's phrases such as these which
> immediately alert my own internal filter that this is spam. Using a
> single-token filter, I can possibly get this into my spam zone by
> registering it enough times that tokens like "Kenya" or "Gideon" or the
> IP/ASN are extremely spammy, but even then the glut of hammy words will
> weigh heavily. I don't believe that using ESFs or any other single-token
> method addresses this problem.
>
> So what's the downside of using multiple-word phrases? Obviously the
> database size. But trimming the database regularly to remove hapaxes and
> perhaps very neutral common phrases should be enough to keep it within
> reason. And if the phrase "window" size is configurable, then individuals
> can decide their own accuracy/size trade-off.
>
> Tom
>
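
For anyone who wants to experiment with the idea Tom describes, here is a
rough Python sketch of phrase tokens with a configurable window, plus
trimming of hapaxes (tokens seen only once). It is only an illustration and
is not how bogofilter or CRM114 is actually implemented. The names
phrase_tokens, prune_hapaxes, window and min_count are made up for this
example, and the tokenizer is a deliberately naive regex split.

```python
import re
from collections import Counter


def phrase_tokens(text, window=3):
    """Return every run of 1..window consecutive words as one token.

    Assumption: words are lowercase runs of letters, digits and apostrophes.
    A real mail tokenizer would also handle headers, IPs, MIME parts, etc.
    """
    words = re.findall(r"[a-z0-9']+", text.lower())
    tokens = []
    for size in range(1, window + 1):
        for start in range(len(words) - size + 1):
            tokens.append(" ".join(words[start:start + size]))
    return tokens


def prune_hapaxes(counts, min_count=2):
    """Drop tokens seen fewer than min_count times to limit database size."""
    return Counter({tok: n for tok, n in counts.items() if n >= min_count})


# Two toy "messages" sharing one of the phrases from the quoted mail.
messages = [
    "the former president of Kenya",
    "son of the former president of Kenya",
]

counts = Counter()
for msg in messages:
    counts.update(phrase_tokens(msg, window=3))

pruned = prune_hapaxes(counts)
```

With window=1 this falls back to plain single-word tokens, so the window
setting is the accuracy/size knob mentioned above. Raising it multiplies the
number of distinct tokens, which is why the pruning step matters.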