lexer change
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Tue Nov 11 15:39:58 CET 2003
Tom Anderson wrote:
>> My test yesterday actually showed it does not help to allow
>> those tokens.
>
> Yes, but your tests are always going to be limited to current or recent
> emails.
Right.
> What about future emails?
We don't know.
> The main benefit of the Bayesian
> method is that it's not hindered by aging of rules like SpamAssassin
> is. We shouldn't be deciding based on a few more incorrect
> classifications here or there to institute a new rule.
Basically I agree. But somehow you have to determine what a
word is (and hence if a word can start with a $-sign). But
you are right, I cannot give any reason besides testing for
not allowing tokens of length one or numbers. You would
actually expect that those are useful.
> It should be a
> drastic difference, as in >10%, to even consider it. Who decided on the
> "[^[:blank:][:cntrl:][:digit:][:punct:]]" rule, and why?
I don't know who made it. But not allowing punctuation in
general seems reasonable for a word. "word" should give
'word' not '"word"'. For the $-sign we could add it, I don't
have a strong opinion here, it seems that that rule doesn't
change much anyway.
> I might agree
> with a rule if there were a fundamental underlying philosophical reason,
> but just tweaking the output is not a good enough reason.
I can follow you there. I'd be happy to add numbers and
short tokens as well as tokens starting with $ of any form.
This is pretty much it, I think.
Here is once more pretty much what a token is:
> TOKENFRONT [^[:blank:][:cntrl:][:digit:][:punct:]]
> TOKENMID [^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
> TOKENBACK [^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
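As a rough illustration of the three classes quoted above, here is a Python sketch. Python's re module has no POSIX bracket classes, so [:blank:], [:cntrl:], [:digit:] and [:punct:] are spelled out for ASCII only. The way the pieces are combined (one FRONT character, the MID run, one BACK character) is my assumption and not taken from the lexer source. The names negated_class() and tokens() are made up for this example.

```python
import re
import string

# ASCII-only stand-ins for the POSIX classes (an assumption of this sketch).
BLANK = " \t"
CNTRL = "".join(chr(c) for c in range(32)) + "\x7f"
DIGIT = string.digits
PUNCT = string.punctuation

# Characters each class refuses, copied from the quoted definitions.
FRONT_EXCLUDED = BLANK + CNTRL + DIGIT + PUNCT
MID_EXCLUDED = BLANK + CNTRL + "<>;=():&%$#@+|/\\{}^\"?*,[]"
BACK_EXCLUDED = MID_EXCLUDED + "._+-"


def negated_class(chars):
    """Build a regex class matching any character NOT in chars."""
    return "[^" + re.escape(chars) + "]"


# Assumed combination: FRONT, then MID (already starred in the quote), then BACK.
TOKEN = re.compile(
    negated_class(FRONT_EXCLUDED)
    + negated_class(MID_EXCLUDED) + "*"
    + negated_class(BACK_EXCLUDED)
)


def tokens(text):
    """Return all non-overlapping token matches in text."""
    return TOKEN.findall(text)


print(tokens('"word"'))      # ['word']
print(tokens("free!"))       # ['free!']
print(tokens("$100 $cash"))  # ['cash']
```

Under these assumptions the quotes are stripped from "word", a trailing ! is kept, and neither a number nor a leading $ survives, which is what the discussion here is about.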
I even had to look up which of these characters listed
are not in [:punct:], which is AFAICS this list: ! " # $ % & '
( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
(http://www.gnu.org/software/grep/doc/grep_8.html)
If I can trust my eyes (I usually cannot;-) those characters
are allowed to show up in the middle of a word, but not at
the beginning: !'-._`~ (which looks OK).
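Rather than trusting eyes, the same comparison can be done with a set difference. This is only a sketch: it assumes the ASCII [:punct:] list given above (which is what Python's string.punctuation contains) and my transcription of the punctuation excluded by TOKENMID. The variable names are made up.

```python
import string

# ASCII [:punct:], none of which may start a token per TOKENFRONT.
PUNCT = set(string.punctuation)

# Punctuation excluded by TOKENMID, transcribed from the quoted pattern.
MID_EXCLUDED = set("<>;=():&%$#@+|/\\{}^\"?*,[]")

# Punctuation that may appear inside a token but not at its start.
mid_only = "".join(sorted(PUNCT - MID_EXCLUDED))
print(mid_only)  # !'-._`~
```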
BTW: There is a + doubled in TOKENBACK.
At the end of a word we only allow !'`~ in addition to those
allowed at the front. I cannot say why ' or ` should be
there. I'd disallow those. And by your argument also remove
! -- even though it "works".
I don't know anything about ´.
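The two observations about TOKENBACK (the doubled + and the punctuation accepted at the end of a word) can be checked the same way. Again a sketch under the assumption of ASCII [:punct:] and my own transcription of the class; BACK_LISTED and the other names are hypothetical.

```python
import string
from collections import Counter

# Literal characters listed in TOKENBACK, in the order they appear
# (the [:blank:] and [:cntrl:] classes are left out, they hold no punctuation).
BACK_LISTED = "<>;=():&%$#@+|/\\{}^\"?*,[]._+-"

# Characters that are listed more than once in the class.
duplicates = [ch for ch, n in Counter(BACK_LISTED).items() if n > 1]
print(duplicates)  # ['+']

# Punctuation that TOKENBACK still accepts as the last character of a token.
back_allowed = "".join(sorted(set(string.punctuation) - set(BACK_LISTED)))
print(back_allowed)  # !'`~
```

Under that reading the sketch gives ! ' ` ~ as the punctuation accepted at the end of a token. The ´ character is outside ASCII, so it is not covered here.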
pi