Radical lexers
David Relson
relson at osagesoftware.com
Wed Dec 10 17:18:48 CET 2003
On Wed, 10 Dec 2003 17:07:42 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:
...[snip]...
> > At a rough count, there are fewer than 200 single character tokens
> > (256 characters, less 32 control characters and 25 or so special
> > symbols). Have you ever looked at their spam/ham counts? Are any
> > of them significantly hammish or spammish? Running 'bogoutil -d
> > wordlist.db | grep "^. "' would list them all.
>
> Yes, but not for this lexer. It turned out that there are
> significant tokens of length one, but very few; there are more
> with two bytes. But this lexer will certainly produce different
> results.
>
> What makes it hard to see their value is that you cannot
> simply see it with bogoutil.
The following should show all the single character tokens (with
approximate scores):
bogoutil -d wordlist.db | grep "^. " | awk '{print $1}' | bogoutil -p wordlist.db
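To see what the filtering stage of that pipeline does without touching a real wordlist, here is a minimal sketch. The function names and the sample lines are made up for illustration, and the dump layout (token, spam count, ham count, date) is an assumption about what 'bogoutil -d' prints. The sketch uses "." in the pattern because in a basic regular expression "." matches any one character, whereas "?" is a literal question mark.

```shell
#!/bin/sh
# Hypothetical stand-in for 'bogoutil -d wordlist.db' output.
# Assumed layout: token, spam count, ham count, date (all values invented).
sample_dump() {
    printf '%s\n' \
        'a 12 3 20031210' \
        'ab 7 1 20031209' \
        '$ 40 2 20031210' \
        '? 5 5 20031208' \
        'free 90 4 20031210'
}

# Keep only lines whose token is exactly one character long
# (one character followed by a space), then print just the token.
single_char_tokens() {
    grep '^. ' | awk '{print $1}'
}

sample_dump | single_char_tokens
```

Against a real wordlist, 'bogoutil -d wordlist.db' would take the place of sample_dump, and the resulting token list would be piped into 'bogoutil -p wordlist.db' as in the command above.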