[bogofilter] spamitarium & block_on_subnets results
tanderso at oac-design.com
Thu May 6 09:49:53 EDT 2004
From: <tallison at tacocat.net>
> Considering that I used the same configurations through all of the tests
> and training, the percentage of various scores may not be ideal when
> compared to other configurations. However, the important thing to note
> here is not the absolute value of the accuracy (or lack thereof) but the
> comparative differences between which one is better/worse.
The robx and min_dev play a role in the comparative differences. In your
configuration, if bogofilter has not seen a token before, then it will
assign it a value of 0.6. Normally we want such tokens to not play a role
in classification since we have no basis in experience on which to classify
them. However, your min_dev range is 0.45 to 0.55, so you're telling
bogofilter to classify every single new token as spammy. This is pushing
all of your scores toward the spam direction pretty drastically. And since
spamitarium reduces certain unwanted redundancy and introduces new helo-,
ASN, rDNS, and IP tokens, the spamitarium results likely have more hapaxes.
This is the core purpose of spamitarium, as when these tokens are seen
again, they help filter the email appropriately. However, when these tokens
are unknowns/hapaxes (being classified as robx for the first time), your
config values are causing them to push all emails (notably hams) into the
spam direction. This is not ideal behavior. A better test of spamitarium
(and a better usage of bogofilter, IMHO) is to keep your robx value within
the min_dev exclusion range. Otherwise, you're biasing the test against
spamitarium's core purpose. This same phenomenon is, I believe, responsible
for your poor results with block-on-subnets, since there are again more
unknown tokens under those conditions.
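To make the effect concrete, here is a tiny sketch of my own (not bogofilter's actual Fisher-based combining, and the token probabilities are invented) showing how a robx that falls outside the min_dev window lets never-seen tokens drag a message spamward:

```python
def score(tokens, known, robx, min_dev=0.05):
    """Toy spamicity: average the per-token probabilities that survive
    the min_dev filter. Unknown tokens receive robx. Real bogofilter
    combines tokens with Fisher's method, but a plain mean is enough
    to show the direction of the bias."""
    used = [p for p in (known.get(t, robx) for t in tokens)
            if abs(p - 0.5) >= min_dev]   # keep only "deviating" scores
    return sum(used) / len(used) if used else 0.5

# Hypothetical wordlist probabilities and a hammish message containing
# two never-seen tokens ("frobnicate", "quux").
known = {"meeting": 0.2, "agenda": 0.3, "viagra": 0.97}
msg = ["meeting", "agenda", "frobnicate", "quux", "viagra"]

# robx = 0.6 lies outside the (0.45, 0.55) window, so both unknowns
# count as spammy evidence and pull the score upward.
biased = score(msg, known, robx=0.6)

# robx = 0.52 lies inside the window, so the unknowns are ignored.
neutral = score(msg, known, robx=0.52)

print(biased, neutral)   # biased > neutral
```

With the settings from your tests, any ham rich in novel vocabulary accumulates 0.6-valued evidence and drifts toward the spam cutoff; keeping robx inside the min_dev window leaves unknowns neutral.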
> scored as Unsure than Ham scored as Yes. I consider Unsure scores to be a
> minor error and false scores to be very major errors.
I agree with this. However, your settings do not follow your philosophy.
Right now, if you get a ham email from someone discussing a topic with lots
of words unknown to your wordlist, and just a few spammy words thrown in,
then you'll probably classify it as spam due to the bias you've set up with
your robx and min_dev values. I would give such an email the benefit of the
doubt by not deciding on the unknown words until I've seen them more than
once.
> ran, the ideal argument would be to run all the training based on a
> configuration file that was generated by bogotune exclusively. But I'm
> not convinced that this is going to make much of a difference in the end. I
> believe we are looking for statistically significant "shifts" in the data
> more than we are looking for specific target values of attribute/variable
> data. This perspective removes a dependency on the clause "YMMV".
You don't need to run bogotune. Just move your robx closer to 0.5,
preferably favoring ham slightly, and make sure its deviation from 0.5 is
smaller than min_dev, so that unknown tokens fall inside the exclusion
window. This is important to properly test spamitarium. The cutoffs don't
really matter from a comparative perspective, but classifying unknowns as
spam is detrimental.
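As a concrete illustration (the exact numbers are my own, not tuned values), that suggestion might look like this in bogofilter.cf:

```
# Illustrative values only -- pick your own, but keep |robx - 0.5| < min_dev
robx    = 0.48    # unknown tokens lean very slightly hammish
min_dev = 0.1     # ignore token scores in (0.4, 0.6), which includes robx
```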