lexer change

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Tue Nov 11 15:39:58 CET 2003


Tom Anderson wrote:

>> My test yesterday actually showed it does not help to allow
>> those tokens.
> 
> Yes, but your tests are always going to be limited to current or recent
> emails. 

Right.

> What about future emails? 

We don't know.

> The main benefit of the Bayesian
> method is that it's not hindered by aging of rules like SpamAssassin
> is.  We shouldn't be deciding based on a few more incorrect
> classifications here or there to institute a new rule. 

Basically I agree. But somehow you have to determine what a
word is (and hence if a word can start with a $-sign). But
you are right, I cannot give any reason besides testing for
not allowing tokens of length one or numbers. You would
actually expect that those are useful.

> It should be a
> drastic difference, as in >10%, to even consider it.  Who decided on the
> "[^[:blank:][:cntrl:][:digit:][:punct:]]" rule, and why? 

I don't know who made it. But not allowing punctuation in
general seems reasonable for a word. "word" should give
'word' not '"word"'. For the $-sign we could add it, I don't
have a strong opinion here, it seems that that rule doesn't
change much anyway.

> I might agree
> with a rule if there were a fundamental underlying philosophical reason,
> but just tweaking the output is not a good enough reason.

I can follow you there. I'd be happy to add numbers and
short tokens as well as tokens starting with $ of any form.
This is pretty much it, I think.

Here is once more pretty much what a token is:
> TOKENFRONT	[^[:blank:][:cntrl:][:digit:][:punct:]]
> TOKENMID	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]]*
> TOKENBACK	[^[:blank:]<>;=():&%$#@+|/\\{}^\"?*,[:cntrl:][\]._+-]
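
To make the three classes easier to play with, here is a rough Python
sketch of them. It is not bogofilter's lexer: it models only ASCII, it
assumes a token is TOKENFRONT followed by TOKENMID followed by TOKENBACK,
and the names negated_class, token_regex, tokenize and the allow_front
knob are made up for illustration.

```python
import re
import string

BLANK = " \t"                                        # [:blank:]
CNTRL = "".join(chr(c) for c in range(32)) + "\x7f"  # [:cntrl:], ASCII only
MID_LISTED = '<>;=():&%$#@+|/\\{}^"?*,[]'            # characters listed in TOKENMID
BACK_EXTRA = "._+-"                                  # extra characters in TOKENBACK


def negated_class(excluded):
    """Build a regex class matching any character not in `excluded`."""
    return "[^" + "".join(re.escape(c) for c in sorted(set(excluded))) + "]"


def token_regex(allow_front=""):
    """Assumed composition: TOKENFRONT TOKENMID TOKENBACK.

    `allow_front` is a hypothetical knob: characters to permit at the
    start of a token although TOKENFRONT normally rejects them.
    """
    front_excluded = set(BLANK + CNTRL + string.digits + string.punctuation)
    front_excluded -= set(allow_front)
    front = negated_class(front_excluded)
    mid = negated_class(BLANK + CNTRL + MID_LISTED) + "*"
    back = negated_class(BLANK + CNTRL + MID_LISTED + BACK_EXTRA)
    return re.compile(front + mid + back)


def tokenize(text, allow_front=""):
    """Return all substrings of `text` that match the token pattern."""
    return token_regex(allow_front).findall(text)


sample = 'say "word" and win $100 now!'
print(tokenize(sample))                   # ['say', 'word', 'and', 'win', 'now!']
print(tokenize(sample, allow_front="$"))  # same list, plus '$100' before 'now!'
```

Under these assumptions "word" comes out without its quotes, $100 is
dropped unless $ is allowed at the front, and a trailing ! stays attached.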

I'll even have to look up which of these characters listed
are not in [:punct:] which is AFAICS this list: ! " # $ % & '
( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~
(http://www.gnu.org/software/grep/doc/grep_8.html)

If I can trust my eyes (I usually cannot;-) those characters
are allowed to show up in the middle of a word, but not at
the beginning: !'-._`~ (which looks OK).
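
Rather than trusting eyes, the same check can be done with a few lines
of Python. This is only a sketch: it assumes the C locale, where
[:punct:] is the 32 ASCII characters listed above (the same set as
Python's string.punctuation), and the variable names are made up here.

```python
import string

PUNCT = set(string.punctuation)                  # ASCII [:punct:]
MID_LISTED = set('<>;=():&%$#@+|/\\{}^"?*,[]')   # characters listed in TOKENMID

# TOKENFRONT rejects all punctuation, so whatever TOKENMID still accepts
# is allowed in the middle of a word but not at the beginning.
mid_not_front = "".join(sorted(PUNCT - MID_LISTED))
print(mid_not_front)   # !'-._`~
```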

BTW: There is a + doubled in TOKENBACK.
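
A quick way to spot such repeats, as a sketch in Python. The string
below is just the literal characters of the TOKENBACK bracket list with
the [:blank:] and [:cntrl:] classes left out. A repeated character in a
bracket expression is harmless, so this is purely cosmetic.

```python
from collections import Counter

# Literal characters listed in TOKENBACK, in the order quoted above.
BACK_LISTED = '<>;=():&%$#@+|/\\{}^"?*,[]._+-'

duplicates = [ch for ch, n in Counter(BACK_LISTED).items() if n > 1]
print(duplicates)   # ['+']
```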

At the end of a word we only allow !'`~ in addition to those
allowed at the front. I cannot say why ' or ` should be
there. I'd disallow those. And by your argument also remove
! -- even though it "works".
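
The end-of-word set can be checked the same way. Again a sketch that
assumes ASCII [:punct:] and uses made-up variable names. Note that ~ is
not in the quoted TOKENBACK list, so it is accepted at the end as well.

```python
import string

MID_LISTED = set('<>;=():&%$#@+|/\\{}^"?*,[]')   # characters listed in TOKENMID
BACK_LISTED = MID_LISTED | set("._+-")           # TOKENBACK lists these on top

# Punctuation that TOKENBACK, as quoted, still accepts at the end of a word.
back_punct = "".join(sorted(set(string.punctuation) - BACK_LISTED))
print(back_punct)   # !'`~

# Hypothetical tightening discussed above: also drop ' and ` and !
proposed = "".join(sorted(set(back_punct) - set("'`!")))
print(proposed)     # ~
```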

I don't know anything about ´.

pi




