question about new spam encoding

David Relson relson at osagesoftware.com
Thu Nov 20 15:39:06 CET 2003


On 20 Nov 2003 09:21:40 -0500
Tom Anderson <tanderso at oac-design.com> wrote:

> On Wed, 2003-11-19 at 19:13, David Relson wrote:
> > Tokens are limited to 30 chars, so long URLs are excluded :-(
> 
> That sounds dangerous... maybe we should make an exception for URLs
> only?  It seems to me that URLs are one of the most important tokens
> we can use.  At a minimum, we should break it up and record
> the domain but leave off query string junk and maybe the subdomain.
> BTW, www.quick-home-loan-search.biz is only 30 characters, and
> quick-home-loan-search.biz is only 26, so these would fit current
> limits if broken up.

Don't forget to add 5 for the "head:" prefix.
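
To spell out the arithmetic, here is a small Python sketch. It is not bogofilter code: the 30-char limit and the "head:" prefix come from the discussion above, while the fits() helper and the assumption that the limit is inclusive are made up for illustration.

```python
MAXTOKENLEN = 30   # token length limit mentioned above
PREFIX = "head:"   # the 5-character prefix

def fits(token, prefix=PREFIX, limit=MAXTOKENLEN):
    """Return True if prefix + token stays within the limit (assumed inclusive)."""
    return len(prefix) + len(token) <= limit

for host in ("www.quick-home-loan-search.biz", "quick-home-loan-search.biz"):
    # Print the bare length, the prefixed length, and whether it still fits.
    print(host, len(host), len(PREFIX) + len(host), fits(host))
```

With the prefix counted, the two hostnames come to 35 and 31 characters, so neither fits in 30.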

> Chances are, spammers are going to use the same domain for a while
> since it's an investment, so that's the ideal spam indicator.  It's at
> least as important as any other two tokens, so let's give it two
> tokens' character limits and make it 60.
> 
> Otherwise, you'll be getting URLs like:
> http://haha.imaspammer.you-loser-cant-bogofilter-my-emails.com

While numeric URLs, e.g. 1.2.3.4, get special treatment, text URLs do
not.  Increasing MAXTOKENLEN will also allow
VeryLongJunkTokensToBeAddedToTheWordList.
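
For reference, the "break it up and keep only the domain" idea could be sketched roughly as below. This is a hypothetical Python illustration, not how bogofilter's lexer works: domain_token() is an invented helper, and keeping the last two labels is a naive assumption that gets names like example.co.uk wrong.

```python
from urllib.parse import urlsplit

MAXTOKENLEN = 30  # current limit, per the discussion above

def domain_token(url):
    """Drop path, query string, and subdomains; keep the last two host labels."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    if all(label.isdigit() for label in labels):
        return host  # numeric hosts such as 1.2.3.4 are left whole
    return ".".join(labels[-2:])

for url in ("http://www.quick-home-loan-search.biz/apply?id=123",
            "http://haha.imaspammer.you-loser-cant-bogofilter-my-emails.com"):
    tok = domain_token(url)
    # Show the reduced token, its length, and whether it is within the limit.
    print(tok, len(tok), len(tok) <= MAXTOKENLEN)
```

Note that the second URL still reduces to a 39-character domain, so trimming alone does not bring every domain under the current limit.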



