question about new spam encoding
David Relson
relson at osagesoftware.com
Thu Nov 20 15:39:06 CET 2003
On 20 Nov 2003 09:21:40 -0500
Tom Anderson <tanderso at oac-design.com> wrote:
> On Wed, 2003-11-19 at 19:13, David Relson wrote:
> > Tokens are limited to 30 chars, so long URLs are excluded :-(
>
> That sounds dangerous... maybe we should make an exception for URLs
> only? It seems to me that URLs are one of the most important tokens
> we can use. At a minimum, we should break it up and record
> the domain but leave off query string junk and maybe the subdomain.
> BTW, www.quick-home-loan-search.biz is only 30 characters, and
> quick-home-loan-search.biz is only 26, so these would fit current
> limits if broken up.

Don't forget to add 5 for the "head:" prefix.
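
To make the arithmetic concrete, here is a minimal Python sketch of the length check. It assumes the 30-character limit and the 5-character "head:" prefix mentioned in this thread, and it assumes the limit is inclusive. The names MAX_TOKEN_LEN, prefixed_length and fits are hypothetical and are not bogofilter's actual code.

```python
# Illustrative only: these names are hypothetical, not bogofilter internals.
MAX_TOKEN_LEN = 30   # token limit quoted in this thread
PREFIX = "head:"     # prefix mentioned in the reply (5 characters)


def prefixed_length(token, prefix=PREFIX):
    """Length of the token once the prefix has been prepended."""
    return len(prefix) + len(token)


def fits(token, prefix=PREFIX, limit=MAX_TOKEN_LEN):
    """True if the prefixed token stays within the limit (assumed inclusive)."""
    return prefixed_length(token, prefix) <= limit


if __name__ == "__main__":
    for host in ("www.quick-home-loan-search.biz", "quick-home-loan-search.biz"):
        print(host, len(host), prefixed_length(host), fits(host))
```

Under these assumptions both hostnames from the quoted message exceed the limit once the prefix is counted (35 and 31 characters).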

> Chances are, spammers are going to use the same domain for a while
> since it's an investment, so that's the ideal spam indicator. It's at
> least as important as any other two tokens, so let's give it two
> tokens' character limits and make it 60.
>
> Otherwise, you'll be getting URLs like:
> http://haha.imaspammer.you-loser-cant-bogofilter-my-emails.com

While numeric URLs, e.g. 1.2.3.4, get special treatment, text URLs do
not. Increasing MAXTOKENLEN will allow
VeryLongJunkTokensToBeAddedToTheWordList.
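
For illustration, the suggestion quoted above (record the domain, drop the query string and maybe the subdomain) could be sketched in Python roughly as follows. This is a naive sketch, not how bogofilter's lexer works: the function name domain_token is hypothetical, and keeping only the last two labels is an assumption that gives the wrong answer for suffixes such as .co.uk.

```python
import ipaddress
from urllib.parse import urlsplit


def domain_token(url, labels=2):
    """Return a shortened host token for a URL (hypothetical helper).

    The path and query string are dropped by looking only at the hostname.
    Keeping the last `labels` labels is a naive way to strip subdomains;
    a real implementation would need a public-suffix list.
    """
    host = urlsplit(url).hostname or ""
    try:
        # Numeric hosts such as 1.2.3.4 are returned whole.
        ipaddress.ip_address(host)
        return host
    except ValueError:
        pass
    return ".".join(host.split(".")[-labels:])


if __name__ == "__main__":
    print(domain_token("http://haha.imaspammer.you-loser-cant-bogofilter-my-emails.com"))
    print(domain_token("http://www.quick-home-loan-search.biz/apply?id=123"))
```

Note that the shortened token for the long example URL above is still 39 characters, so splitting alone would not bring it under a 30-character limit.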