wordlist oddness?
David Relson
relson at osagesoftware.com
Thu Nov 7 06:23:32 CET 2002
At 11:59 PM 11/6/02, Allyn Fratkin wrote:
>0.8.0.rc2 actually found 80 *more* words in my good corpora than 0.7.6
>and the exact same number of words in my spam corpora.
>I don't really understand why; the data shouldn't have changed at all.
>
>Anyway, it doesn't seem like this is a new problem that should stop 0.8.0,
>but it is one that should be looked at for a future release.
Allyn,
There have been some changes to lexer.l. As you may be aware, the current
lexer throws away lines that appear to be base64-encoded. That check has
been modified so that a word of up to 20 characters that appears alone on
a line is accepted as a token (rather than being wrongly discarded as
base64). That might account for the additional words.
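
To illustrate the idea, here is a rough sketch in Python of the kind of
check described above. It is not the code in lexer.l (which is a flex
grammar); the names BASE64_LINE, MAX_LONE_WORD, looks_like_base64 and
tokens_from_line are made up for this example, and the base64 test is a
simplifying assumption (the whole line consists of base64 alphabet
characters).

```python
import re

# Hypothetical sketch, not the actual lexer.l rules.
# Assumption: a line "appears to be base64" when it consists solely of
# base64 alphabet characters, optionally followed by '=' padding.
BASE64_LINE = re.compile(r"[A-Za-z0-9+/]+={0,2}")

# Length limit for a lone word, as mentioned in the message.
MAX_LONE_WORD = 20


def looks_like_base64(line):
    """Return True if the whole line is made of base64 characters."""
    return BASE64_LINE.fullmatch(line.strip()) is not None


def tokens_from_line(line):
    """Return the tokens taken from one line of a message body."""
    stripped = line.strip()
    if looks_like_base64(stripped):
        # A short word alone on a line also matches the base64 pattern;
        # keep it as a token instead of discarding it.
        if len(stripped) <= MAX_LONE_WORD:
            return [stripped]
        # Longer runs are treated as encoded data and dropped.
        return []
    # Ordinary text: split on whitespace (a real lexer does much more).
    return stripped.split()
```

With this sketch, a line containing only "hello" yields one token, while
a long unbroken run of base64 characters yields none. Under the older
behaviour described above, both lines would have been thrown away, which
is consistent with a newer version counting a few more words.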
David