StudlyCaps

Tom Allison tallison at tacocat.net
Fri Jul 9 02:57:55 CEST 2004


Tom Anderson wrote:
> On Thu, 2004-07-08 at 09:20, David Relson wrote:
> 
>>On Thu, 8 Jul 2004 09:14:43 -0400
>>Tom Anderson wrote:
>>
>>
>>>From: "Tom Allison" <tallison at tacocat.net>
>>>
>>>>Could you modify anything that exceeds the MAXTOKENLEN to become the
>>>>token, "MAXTOKENLEN" with a counter (+1) against it?
>>>>
>>>>This would tend to pool all these excessively long tokens into one
>>>>"virtual" token to measure for spamicity.
>>>
>>>Good idea, but it would also count email addresses and URLs and
>>>perhaps signatures and whatnot.  I'm not sure I'd appreciate an email
>>>full of URLs from a friend being counted as spam just because they all
>>>exceed the max length.
>>>
>>>Tom
>>
>>It would just be a single token among many.  It would have little effect
>>on a hammish message but might be valuable for an unsure.
> 
> 
> My interpretation was that every single token which went over the max
> would simply be converted to "MAXTOKENLEN" for scoring.  Suppose I
> had an email that said something like, "Here are the articles: URL1 ...
> URLN", where URL1 through URLN are URLs longer than MAXTOKENLEN.  It
> would be better to not convert those all to a single presumably spammy
> token.  I prefer the idea of breaking on case transitions to that.
> 
> Then again, maybe this "problem" doesn't need a solution at all... let's
> see how it plays out for a while.
> 
> Tom
> 
> 

Your interpretation is correct.

However, I am making some assumptions.  The first is that tokens are
separated on non-word characters (.,\/|:), which would mean that your
URLs would be broken up more sanely than you indicate.  The proof is in
the tokens for domains: they are broken up at the periods in the domain
name.

The other assumption is that MAXTOKENLEN is probably on the order of
128 characters.  A limit that high would let through nearly all real
words and would maintain a reasonable break between what's really a word
and what isn't.
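
To make the two assumptions concrete, here is a rough sketch in Python.
It is not bogofilter's actual lexer: the delimiter set (the characters
above plus whitespace), the 128-character limit, and the function name
tokenize are all my own guesses for illustration.

```python
import re
from collections import Counter

# Assumed values for illustration; not taken from bogofilter's source.
MAXTOKENLEN = 128
# Split on whitespace and on the non-word characters . , \ / | :
DELIMITERS = re.compile(r"[\s.,\\/|:]+")


def tokenize(text, max_len=MAXTOKENLEN):
    """Count tokens, pooling every over-long token into "MAXTOKENLEN"."""
    counts = Counter()
    for token in DELIMITERS.split(text):
        if not token:
            continue  # skip empty strings produced at the edges of the split
        if len(token) > max_len:
            counts["MAXTOKENLEN"] += 1  # one shared "virtual" token
        else:
            counts[token] += 1
    return counts


if __name__ == "__main__":
    message = "Here are the articles: http://www.example.com/a/b " + "x" * 200
    for token, count in sorted(tokenize(message).items()):
        print(count, token)
```

With these assumptions a long URL never reaches the limit as a whole,
because it is split into short pieces such as "www" and "example" first.
Only an unbroken run of more than 128 characters ends up counted under
"MAXTOKENLEN".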

But these may be solutions for something that has yet to be a problem?
