wordlist oddness?

David Relson relson at osagesoftware.com
Thu Nov 7 06:23:32 CET 2002


At 11:59 PM 11/6/02, Allyn Fratkin wrote:
>0.8.0.rc2 actually found 80 *more* words in my good corpora than 0.7.6
>and the exact same number of words in my spam corpora.
>I don't really understand why; the data shouldn't have changed at all.
>
>Anyway, it doesn't seem like this is a new problem that should stop 0.8.0,
>but it is one that should be looked at for a future release.

Allyn,

There have been some changes to lexer.l.  As you may be aware, the lexer 
throws away lines that appear to be base64 encoded.  That check has been 
modified so that a word of up to 20 characters that appears alone on a 
line is accepted as a token (rather than being wrongly discarded as 
base64).  That might account for the additional words.
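
To illustrate the idea, here is a minimal sketch in Python.  It is not the 
actual flex rule from lexer.l; the regular expressions, the function name 
classify_line, and the definition of a "word" as a run of letters are all 
assumptions made for the example.  Only the 20-character limit comes from 
the description above.

```python
import re

# The "up to 20 characters" limit described above.
MAX_LONE_WORD = 20

# Assumption: a line "looks like base64" when it consists only of
# characters from the base64 alphabet, with optional '=' padding.
BASE64_LINE = re.compile(r"^[A-Za-z0-9+/]+={0,2}$")

# Assumption: a "word" is a run of ASCII letters.
LONE_WORD = re.compile(r"^[A-Za-z]+$")


def classify_line(line):
    """Return 'token', 'base64', 'text', or 'empty' for one input line."""
    text = line.strip()
    if not text:
        return "empty"
    # Exception to the base64 check: a short word alone on a line
    # is kept as a token instead of being thrown away.
    if LONE_WORD.match(text) and len(text) <= MAX_LONE_WORD:
        return "token"
    # Anything else made up purely of base64 characters is discarded.
    if BASE64_LINE.match(text):
        return "base64"
    # Ordinary text goes on to normal tokenizing.
    return "text"
```

Without the short-word exception, a line containing only "unsubscribe" 
would match the base64 pattern and be dropped, so any corpus with such 
lines would yield fewer words.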

David





