counters [was: StudlyCaps]

Andreas Pardeike andreas at pardeike.net
Fri Jul 9 00:23:53 CEST 2004


On 2004-07-08, at 15.17, David Relson wrote:

>> Could you modify anything that exceeds MAXTOKENLEN to become the
>> token "MAXTOKENLEN", with a counter (+1) against it?
>>
>> This would tend to pool all these excessively long tokens into one
>> "virtual" token to measure for spamicity.
>>
>> You might only get one token per email, but it helps.
>
> Long tokens could simply be truncated to MAXTOKENLEN.
>
> At one time, bogofilter had some feature counting code.  The lexer would
> count various features (like no_body, html_break, html_comment,
> html_tag, html_unk, ipaddr, html_char, url_char, money, ...) and create
> tokens giving counts.  Perhaps I'll resurrect the code to see if it's of
> value.

Or every token exceeding MAXTOKENLEN could be transformed into a new
token named after its excess length, e.g. 'MAXTOKENLEN+12' for a token
that is actually MAXTOKENLEN + 12 characters long. That would record the
length of overlong tokens in the database, so long tokens of different
lengths would not all be pooled together, which should minimize the bad
effect on ham containing similar long tokens.
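
To make the three options in this thread concrete, here is a minimal
sketch in Python. It is not bogofilter's actual lexer code (which is
written in C); the limit of 30 and the function names truncate, pool and
length_tag are assumptions chosen only for illustration.

```python
# Assumed limit for illustration only; not necessarily bogofilter's value.
MAXTOKENLEN = 30


def truncate(token, limit=MAXTOKENLEN):
    """Option 1: cut an overlong token down to the limit."""
    return token[:limit]


def pool(token, limit=MAXTOKENLEN):
    """Option 2: replace every overlong token with one shared virtual token."""
    if len(token) <= limit:
        return token
    return "MAXTOKENLEN"


def length_tag(token, limit=MAXTOKENLEN):
    """Option 3: replace an overlong token with a virtual token that
    encodes how far it exceeds the limit, e.g. 'MAXTOKENLEN+12'."""
    excess = len(token) - limit
    if excess <= 0:
        return token
    return f"MAXTOKENLEN+{excess}"


if __name__ == "__main__":
    long_token = "x" * (MAXTOKENLEN + 12)
    for transform in (truncate, pool, length_tag):
        print(transform.__name__, "->", transform(long_token))
```

With the third option, a 42-character token and a 90-character token end
up as different database entries, so each length gets its own spam and
ham counts instead of sharing a single score.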

Andreas Pardeike



