MAXTOKENLEN [was: StudlyCaps]
Jason A. Smith
jazbo at jazbo.dyndns.org
Fri Jul 9 10:53:40 EDT 2004
How about combining the two approaches by saving each long token as its
truncated string plus its delta size? That should keep long tokens
reasonably distinct without bloating the database too much.
I don't like the truncate-only approach, since it makes tokens that are
exactly MAXTOKENLEN characters long indistinguishable from longer tokens
that begin with the same characters, although this might not happen very
often. The MAXTOKENLEN+delta approach loses all information about what
the beginning of the token looked like, because it saves only the token
size.
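To make the comparison concrete, here is a rough sketch of the three encodings. It is not bogofilter code: the tiny MAXTOKENLEN value, the function names, and the exact "+delta" text format are assumptions chosen only for illustration, with delta taken to mean the number of characters beyond MAXTOKENLEN.

```python
# Illustrative only: the limit is kept tiny so the examples stay readable.
MAXTOKENLEN = 8


def truncate_only(token: str) -> str:
    """Keep just the first MAXTOKENLEN characters."""
    return token[:MAXTOKENLEN]


def length_plus_delta(token: str) -> str:
    """Replace a long token with a marker that records only its extra length.

    The literal "MAXTOKENLEN+<n>" spelling is a hypothetical format.
    """
    if len(token) <= MAXTOKENLEN:
        return token
    return f"MAXTOKENLEN+{len(token) - MAXTOKENLEN}"


def truncate_plus_delta(token: str) -> str:
    """Combined idea: keep the truncated prefix and append the extra length."""
    if len(token) <= MAXTOKENLEN:
        return token
    return f"{token[:MAXTOKENLEN]}+{len(token) - MAXTOKENLEN}"


exact = "abcdefgh"       # exactly MAXTOKENLEN characters
longer = "abcdefghijkl"  # same beginning, 4 characters longer
other = "zyxwvutsrqpo"   # different beginning, same length as `longer`

# Truncate-only cannot tell `exact` and `longer` apart.
print(truncate_only(exact), truncate_only(longer))

# Size-only cannot tell `longer` and `other` apart.
print(length_plus_delta(longer), length_plus_delta(other))

# The combined form keeps all three distinct.
print(truncate_plus_delta(exact), truncate_plus_delta(longer),
      truncate_plus_delta(other))
```

The combined form costs only a few extra characters per long token. Two long tokens would still collide if they share both the same first MAXTOKENLEN characters and the same length.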
On Fri, 2004-07-09 at 08:29, David Relson wrote:
> Here are my thoughts on MAXTOKENLEN.
> Breaking MixedCaseStuFF into separate tokens is above and beyond
> bogofilter's charter and is a bad idea.
> Two reasonable approaches are:
>     truncate long tokens to MAXTOKENLEN
>     convert long tokens to MAXTOKENLEN+delta
> Alternatively, bogofilter can ignore long tokens, as it does now.