Radical lexers

David Relson relson at osagesoftware.com
Wed Dec 10 17:18:48 CET 2003


On Wed, 10 Dec 2003 17:07:42 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

...[snip]...

> > At a rough count, there are fewer than 200 single-character tokens
> > (256 characters, less 32 control characters and 25 or so special
> > symbols).  Have you ever looked at their spam/ham counts?  Are any
> > of them significantly hammish or spammish?  Running 'bogoutil -d
> > wordlist.db | grep "^. "' would list them all.
> 
> Yes, but not for this lexer. It turned out that there are
> significant tokens of length one, but very few, and more of length
> two. But this lexer will certainly produce different results.
> 
> What makes it hard to see their value is that you cannot
> simply see it with bogoutil.

The following should show all the single-character tokens (with
approximate scores); the "." in the grep pattern matches any one
character:

bogoutil -d wordlist.db | grep "^. " | awk '{print $1}' | bogoutil -p wordlist.db
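
As a rough sketch of the filtering step only, the selection of
single-character tokens can be tried without a wordlist.  The sample
lines below are invented for illustration, and the column layout
(token, spam count, ham count, date) is an assumption about the dump
format, not something taken from a real wordlist.  The helper names
sample_dump and single_char_tokens are hypothetical.  The filter uses
awk's length() instead of a regular expression, so it does not depend
on how grep treats "?".

```shell
# Hypothetical stand-in for the output of `bogoutil -d wordlist.db`.
# Assumed layout: token, spam count, ham count, date (whitespace separated).
sample_dump() {
cat <<'EOF'
a 12 340 20031201
? 57 3 20031205
free 410 22 20031209
$ 88 5 20031210
ab 7 9 20031130
EOF
}

# Keep only lines whose first field is exactly one character long,
# and print just that token (one per line).
single_char_tokens() {
    awk 'length($1) == 1 { print $1 }'
}

sample_dump | single_char_tokens
```

With a real wordlist, the output of single_char_tokens could be piped
into 'bogoutil -p wordlist.db' in place of the grep and awk stages.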



