Much simplified lexer (was: lexer change)
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Nov 12 13:36:09 CET 2003
Boris 'pi' Piwinger wrote:
>> The main benefit of the Bayesian
>> method is that it's not hindered by aging of rules like SpamAssassin
>> is. We shouldn't be deciding based on a few more incorrect
>> classifications here or there to institute a new rule.
>
> Basically I agree. But somehow you have to determine what a
> word is (and hence if a word can start with a $-sign). But
> you are right, I cannot give any reason besides testing for
> not allowing tokens of length one or numbers. You would
> actually expect that those are useful.
Inspired by our discussion, Tom, I changed the lexer to be
more in the fashion you describe. If you want to see whether
it works for you, it is attached.
>> I might agree
>> with a rule if there were a fundamental underlying philosophical reason,
>> but just tweaking the output is not a good enough reason.
>
> I can follow you there. I'd be happy to add numbers and
> short tokens as well as tokens starting with $ of any form.
I will allow $ at any place in the word ($cientology,
Mircro$oft, etc.). I will allow numbers at any place in the
word (this includes tokens consisting only of numbers). I
will allow tokens of length one and two.
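The relaxed rules can be sketched roughly as follows. This is a hedged Python emulation, not the attached lexer_v3.l.new; the character classes are simplified to plain ASCII letters, and the `tokens` helper is hypothetical:

```python
import re

# Sketch of the relaxed rules described above (assumption: simplified
# ASCII-only classes, not the real flex definitions): '$' and digits
# are allowed at any place in a token, and tokens of length one and
# two are kept rather than discarded.
TOKEN = re.compile(r"[A-Za-z0-9$]+")

def tokens(text):
    """Return all tokens of length >= 1 under the relaxed rules."""
    return TOKEN.findall(text)

print(tokens("Mircro$oft $cientology 42 ok"))
```

Note that under these rules a purely numeric string such as "42" and a two-character token such as "ok" both survive tokenization, which the earlier lexer would have dropped.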
> Here is once more pretty much what a token is:
>> TOKENFRONT [^[:blank:][:cntrl:][:digit:][:punct:]]
>> TOKENMID [^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
>> TOKENBACK [^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
> If I can trust my eyes (I usually cannot;-) those characters
> are allowed to show up in the middle of a word, but not at
> the beginning: !'-._`~ (which looks OK).
>
> At the end of a word we only allow !'` in addition to those
> allowed at the front. I cannot say why ' or ` should be
> there. I'd disallow those.
I do that. And I missed the ~, which I'll also disallow.
> And by your argument also remove ! -- even though it "works".
And that. This makes TOKENFRONT and TOKENBACK the same.
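With !, ', ` and ~ disallowed at the end, the front and back classes coincide. A hedged Python sketch of the merged edge class (an emulation of the POSIX classes, not the actual flex rule; the `is_token_edge` helper is hypothetical):

```python
import string

# Assumption: the merged TOKENFRONT/TOKENBACK class is roughly
# "not blank, not control, not punctuation", except that '$' is now
# allowed anywhere (and digits are no longer excluded at the front).
# [:blank:] ~ space and tab; [:cntrl:] ~ ASCII control characters;
# [:punct:] ~ string.punctuation (ASCII punctuation only).
EXCLUDED = set(" \t")
EXCLUDED |= {chr(c) for c in range(32)} | {"\x7f"}
EXCLUDED |= set(string.punctuation) - {"$"}

def is_token_edge(ch):
    """True if ch may both begin and end a token under the merged rule."""
    return ch not in EXCLUDED

print(is_token_edge("a"), is_token_edge("$"), is_token_edge("!"))
```

Since the same predicate now governs both edges, separate front/back definitions are no longer needed.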
> I don't know anything about ´.
It is not in ASCII; [:punct:], however, is a subset of
ASCII. We have problems with punctuation, blanks, etc. in
other charsets anyway (we cannot always recognize them).
First test to see how it performs. Recall:
Test with the last release: 2.7M
spam good
.MSG_COUNT 592 307
wo (fn): 0.500000 26 23 19 68
wo (fp): 0.500000 5 4 4 13
wi (fn): 0.581092 50 41 41 132
wi (fp): 0.581092 3 2 1 6
wi (fn): 0.499993 26 23 19 68
wi (fp): 0.499993 6 4 5 15
wi (fn): 0.457261 15 15 14 44
wi (fp): 0.457261 14 10 8 32
Allowing two-byte-tokens: 2.8M
spam good
.MSG_COUNT 630 284
wo (fn): 0.500000 24 22 22 68
wo (fp): 0.500000 4 4 3 11
wi (fn): 0.544564 40 30 31 101
wi (fp): 0.544564 3 1 2 6
wi (fn): 0.499999 24 22 21 67
wi (fp): 0.499999 5 4 4 13
wi (fn): 0.419627 8 12 15 35
wi (fp): 0.419627 12 8 11 31
With the attached new lexer: 2.9M
spam good
.MSG_COUNT 554 308
wo (fn): 0.500000 18 20 21 59
wo (fp): 0.500000 6 4 6 16
wi (fn): 0.584458 43 30 30 103
wi (fp): 0.584458 3 1 2 6
wi (fn): 0.503945 21 21 24 66
wi (fp): 0.503945 6 4 4 14
wi (fn): 0.471097 13 15 14 42
wi (fp): 0.471097 13 7 11 31
Again we can read different things from the results. On one
hand, the number of false positives increases, which is bad.
On the other hand, if you look at different false-positive
targets (roughly .05%, .1%, .25%), it performs better than
the last release, especially for very few false positives.
Compared to only adding two-byte tokens to the last release
it performs a bit worse, but not by much (except for the
initial false-positive rate).
So let's wrap things up. This (experimental!) lexer removes
several special rules introduced over time (with good
reason, but at some point a review is worth it); it is
simpler to read, with fewer definitions and rules. There
might well be more to be done.
As Tom argued, we were excluding several tokens with no good
reason; these now get into the list without any external
judgement as to whether this is helpful or not.
This version performs reasonably well, so one might question
several exceptions we had so far.
pi
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: lexer_v3.l.new
URL: <http://www.bogofilter.org/pipermail/bogofilter/attachments/20031112/1457d080/attachment.ksh>