PATCH: MAXTOKENLEN+delta [was: counters...]

David Relson relson at osagesoftware.com
Fri Jul 9 01:57:08 CEST 2004


On Fri, 9 Jul 2004 00:23:53 +0200
Andreas Pardeike wrote:

> On 2004-07-08, at 15.17, David Relson wrote:
> 
> >> Could you modify anything that exceeds the MAXTOKENLEN to become the
> >> token, "MAXTOKENLEN" with a counter (+1) against it?
> >>
> >> This would tend to pool all these excessively long tokens into one
> >> "virtual" token to measure for spamicity.
> >>
> >> You might only get one token per email, but it helps.
> >
> > Long tokens could simply be truncated to MAXTOKENLEN.
> >
> > At one time, bogofilter had some feature counting code.  The lexer
> > would count various features (like no_body, html_break, html_comment,
> > html_tag, html_unk, ipaddr, html_char, url_char, money, ...) and
> > create tokens giving counts.  Perhaps I'll resurrect the code to see
> > if it's of value.
> 
> Or every token exceeding MAXTOKENLEN could be transformed into a new
> token called 'MAXTOKENLEN+12' (i.e. if it was actually maxtokenlen +
> 12 letters). That would include the length of tokens in the database
> and thus would minimize the bad effect on ham with similar tokens.
> 
> Andreas Pardeike

Andreas,

Creating such a token is well nigh trivial.  If you're so inclined,
below is such a patch.  Let us know if it helps or hurts.

Regards,

David

Index: token.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/token.c,v
retrieving revision 1.89
diff -u -r1.89 token.c
--- token.c	8 Jul 2004 12:12:48 -0000	1.89
+++ token.c	8 Jul 2004 23:54:30 -0000
@@ -267,6 +267,9 @@
 	    fputc('\n', dbgout);
 	}
 
+	if (yylval->leng > MAXTOKENLEN)
+	    yylval->leng = sprintf(yylval->text, "MAXTOKENLEN+%d", yylval->leng - MAXTOKENLEN);
+
 	/* eat all long words */
 	if (yylval->leng <= MAXTOKENLEN)
 	    done = true;
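
For anyone who wants to try the transformation outside the lexer, here is a
minimal standalone sketch of the same idea.  It is not bogofilter code: the
helper name pool_long_token() and the limit of 30 are assumptions made for
illustration, and it uses snprintf() with an explicit buffer size instead of
the bare sprintf() in the patch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Assumption: 30 is an illustrative limit, not necessarily bogofilter's. */
#define MAXTOKENLEN 30

/*
 * Hypothetical helper (not part of bogofilter).
 *
 * If the token held in buf is longer than MAXTOKENLEN, overwrite it with
 * the virtual token "MAXTOKENLEN+<excess>", where <excess> is the number
 * of characters beyond the limit, and return the new length.  Tokens
 * within the limit are left untouched and leng is returned unchanged.
 */
size_t pool_long_token(char *buf, size_t bufsize, size_t leng)
{
    int n;

    assert(buf != NULL && bufsize > 0);

    if (leng <= MAXTOKENLEN)
        return leng;

    n = snprintf(buf, bufsize, "MAXTOKENLEN+%lu",
                 (unsigned long)(leng - MAXTOKENLEN));

    if (n < 0) {                /* encoding error: produce an empty token */
        buf[0] = '\0';
        return 0;
    }
    if ((size_t)n >= bufsize)   /* output was truncated to fit the buffer */
        return bufsize - 1;

    return (size_t)n;
}
```

With these assumed values, a 42 character token becomes "MAXTOKENLEN+12",
which is 14 characters long.  That is within the limit, so a check like the
"eat all long words" test in the patch would then accept the replacement.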



