Much simplified lexer (was: lexer change)

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Nov 12 13:36:09 CET 2003


Boris 'pi' Piwinger wrote:

>> The main benefit of the Bayesian
>> method is that it's not hindered by aging of rules like SpamAssassin
>> is.  We shouldn't be deciding based on a few more incorrect
>> classifications here or there to institute a new rule. 
> 
> Basically I agree. But somehow you have to determine what a
> word is (and hence if a word can start with a $-sign). But
> you are right, I cannot give any reason besides testing for
> not allowing tokens of length one or numbers. You would
> actually expect that those are useful.

Inspired by our discussion, Tom, I changed the lexer to be
more in the fashion you describe. If you want to see
whether it works for you, it is attached.

>> I might agree
>> with a rule if there were a fundamental underlying philosophical reason,
>> but just tweaking the output is not a good enough reason.
> 
> I can follow you there. I'd be happy to add numbers and
> short tokens as well as tokens starting with $ of any form.

I will allow $ at any place in the word ($cientology,
Micro$oft, etc.). I will allow numbers at any place in the
word (this includes tokens consisting of digits only). I
will allow tokens of length one and two.
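
In flex terms this boils down to a small change in the
character classes. Here is a rough sketch of what the
relaxed definitions could look like (a sketch only; the
attached lexer_v3.l.new is the authoritative version):

/* Sketch only -- see the attached lexer_v3.l.new.            */
/* The old front class also excluded [:digit:]; dropping      */
/* that allows digits (and pure numbers), and the [$]         */
/* alternative re-admits the dollar sign, so $cientology,     */
/* Micro$oft and 12345 all survive as tokens.                 */
TOKENFRONT	([^[:blank:][:cntrl:][:punct:]]|[$])
/* As before, but with "$" removed from the exclusions.       */
TOKENMID	[^[:blank:]<>;=():&%#@+|/\\{}^\"?*,[:cntrl:][\]]*
/* TOKENMID matches zero or more characters, so making the    */
/* whole tail optional admits tokens of length one and two    */
/* ({TOKENBACK} as discussed below).                          */
TOKEN		{TOKENFRONT}({TOKENMID}{TOKENBACK})?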

> Here is once more pretty much what a token is:
>> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
>> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]

> If I can trust my eyes (I usually cannot;-) those characters
> are allowed to show up in the middle of a word, but not at
> the beginning: !'-._`~ (which looks OK).
> 
> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there. I'd disallow those.

I do that. And I missed the ~, which I'll also disallow.

> And by your argument also remove ! -- even though it "works".

And that. This makes TOKENFRONT and TOKENBACK the same.
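
Sketched in the same fashion (again, the attached file is
what actually counts):

/* With !, ', ` and ~ gone from the trailing set, the back    */
/* class collapses into the front class, so one definition    */
/* serves both positions.                                     */
TOKENBACK	{TOKENFRONT}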

> I don't know anything about ´.

It is not in ASCII; [:punct:], though, is a subset of
ASCII. We have problems with punctuation, blanks, etc. in
other charsets anyway (we cannot always recognize them).


A first test to see how it performs (fn = false negatives,
fp = false positives; each fn/fp row lists the cutoff
followed by the individual counts and their sum). Recall:

Test with the last release: 2.7M
                       spam   good
.MSG_COUNT              592    307
wo (fn):  0.500000    26     23     19     68
wo (fp):  0.500000     5      4      4     13
wi (fn):  0.581092    50     41     41    132
wi (fp):  0.581092     3      2      1      6
wi (fn):  0.499993    26     23     19     68
wi (fp):  0.499993     6      4      5     15
wi (fn):  0.457261    15     15     14     44
wi (fp):  0.457261    14     10      8     32

Allowing two-byte tokens: 2.8M
                       spam   good
.MSG_COUNT              630    284
wo (fn):  0.500000    24     22     22     68
wo (fp):  0.500000     4      4      3     11
wi (fn):  0.544564    40     30     31    101
wi (fp):  0.544564     3      1      2      6
wi (fn):  0.499999    24     22     21     67
wi (fp):  0.499999     5      4      4     13
wi (fn):  0.419627     8     12     15     35
wi (fp):  0.419627    12      8     11     31

With the attached new lexer: 2.9M
                       spam   good
.MSG_COUNT              554    308
wo (fn):  0.500000    18     20     21     59
wo (fp):  0.500000     6      4      6     16
wi (fn):  0.584458    43     30     30    103
wi (fp):  0.584458     3      1      2      6
wi (fn):  0.503945    21     21     24     66
wi (fp):  0.503945     6      4      4     14
wi (fn):  0.471097    13     15     14     42
wi (fp):  0.471097    13      7     11     31

Again we can read different things from the results. On
the one hand, the number of false positives increases,
which is bad. On the other hand, if you look at different
false positive targets (roughly .05%, .1%, .25%), it
performs better than the last release, especially at very
low false positive rates. Compared to only adding two-byte
tokens to the last release it performs a bit worse, but
not by much (except for the initial false positive rate).

So let's wrap things up. This (experimental!) lexer
removes several special rules introduced over time (with
good reason, but at some point a review is worthwhile),
and it becomes simpler to read, with fewer definitions and
rules. There might well be more to be done.

As Tom argued, we were excluding several tokens for no
good reason; these now get into the list without any
external judgement about whether that is helpful or not.

This version performs reasonably well, so one might
question several of the exceptions we have had so far.

pi
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lexer_v3.l.new
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031112/1457d080/attachment.ksh>

