PATCH: MAXTOKENLEN+delta [was: counters...]
David Relson
relson at osagesoftware.com
Fri Jul 9 01:57:08 CEST 2004
On Fri, 9 Jul 2004 00:23:53 +0200
Andreas Pardeike wrote:
> On 2004-07-08, at 15.17, David Relson wrote:
>
> >> Could you modify anything that exceeds MAXTOKENLEN to become the
> >> token "MAXTOKENLEN", with a counter (+1) against it?
> >>
> >> This would tend to pool all these excessively long tokens into one
> >> "virtual" token to measure for spamicity.
> >>
> >> You might only get one token per email, but it helps.
> >
> > Long tokens could simply be truncated to MAXTOKENLEN.
> >
> > At one time, bogofilter had some feature counting code. The lexer
> > would count various features (like no_body, html_break, html_comment,
> > html_tag, html_unk, ipaddr, html_char, url_char, money, ...) and
> > create tokens giving counts. Perhaps I'll resurrect the code to see
> > if it's of value.
>
> Or every token exceeding MAXTOKENLEN could be transformed into a new
> token called 'MAXTOKENLEN+12' (e.g. if it was actually MAXTOKENLEN +
> 12 letters long). That would record the length of such tokens in the
> database and thus would minimize the bad effect on ham with similar
> tokens.
>
> Andreas Pardeike

Andreas,

Creating such a token is well nigh trivial. If you're so inclined,
below is such a patch. Let us know if it helps or hurts.

Regards,

David

Index: token.c
===================================================================
RCS file: /cvsroot/bogofilter/bogofilter/src/token.c,v
retrieving revision 1.89
diff -u -r1.89 token.c
--- token.c 8 Jul 2004 12:12:48 -0000 1.89
+++ token.c 8 Jul 2004 23:54:30 -0000
@@ -267,6 +267,9 @@
 	    fputc('\n', dbgout);
 	}
+	if (yylval->leng > MAXTOKENLEN)
+	    yylval->leng = sprintf(yylval->text, "MAXTOKENLEN+%d", yylval->leng - MAXTOKENLEN);
+
 	/* eat all long words */
 	if (yylval->leng <= MAXTOKENLEN)
 	    done = true;
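
For anyone who wants to try the idea outside the lexer, here is a minimal standalone sketch of the same rewrite. It is not bogofilter code: the token_t struct, the pool_long_token() name, the 256-byte buffer and the MAXTOKENLEN value of 30 are all assumptions made for illustration, and snprintf is used in place of sprintf only to bound the write.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed value for illustration; bogofilter defines its own MAXTOKENLEN. */
#define MAXTOKENLEN 30

/* Hypothetical stand-in for the lexer's token record (text plus length). */
typedef struct {
    char text[256];
    int  leng;
} token_t;

/* Replace an over-long token with "MAXTOKENLEN+<excess>" and update leng.
 * Tokens at or below the limit are left untouched. */
static void pool_long_token(token_t *tok)
{
    assert(tok != NULL);
    if (tok->leng > MAXTOKENLEN)
        tok->leng = snprintf(tok->text, sizeof tok->text,
                             "MAXTOKENLEN+%d", tok->leng - MAXTOKENLEN);
}
```

Under these assumptions, a 42-character token is rewritten to "MAXTOKENLEN+12" with leng set to 14. Because the replacement is itself shorter than MAXTOKENLEN, it then passes the "leng <= MAXTOKENLEN" check that follows in the patch.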