Test with different lexers

Boris 'pi' Piwinger 3.14 at logic.univie.ac.at
Wed Dec 3 08:52:05 CET 2003


Tom Anderson <tanderso at oac-design.com> wrote:

>Punctuation is not used by the average person the way it is described in
>your 8th grade grammar textbook.

It is, but ...

>From ascii-art to emoticons to special
>emphasis, people use punctuation in creative ways.

It is also used for things like this.

>These can be valuable tokens.

Maybe.

>> In your paragraph above we would have these tokens:
>> email:
>> "special"
>> token.
>> me,
>> _does_
>> etc.
>> 
>> It seems very unnatural from the definition of punctuation
>> in any language I know to do that as opposed to just:
>> email
>> special
>> token
>> me
>> does
>
>Which of the above list do you think is more strongly indicative of spam
>or ham, the watered-down dictionary-only words, or the punctuated
>variations?

The real words. Where the punctuation ends up seems pretty
arbitrary (at most, some words are more likely to have a
comma before them, but that would not be matched anyway).
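To make the two variants concrete, here is a minimal Python sketch.
It is not bogofilter's actual lexer; the function names and the
sample sentence are made up for illustration, and "punctuation" is
assumed to mean Python's string.punctuation set.

```python
import string

def tokens_with_punctuation(text):
    # Variant 1: split on whitespace only, so any punctuation
    # attached to a word stays part of the token.
    return text.split()

def tokens_stripped(text):
    # Variant 2: split on whitespace, then strip leading and
    # trailing punctuation from every token; drop empty results.
    stripped = (word.strip(string.punctuation) for word in text.split())
    return [word for word in stripped if word]

# Hypothetical sample text, chosen to contain the tokens listed above.
sample = 'Send email: a "special" token. To me, it _does_ matter.'

print(tokens_with_punctuation(sample))
print(tokens_stripped(sample))
```

In the first variant "token." and "token" would be counted as
different tokens; in the second they collapse into one.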

>If "special" (with the quotes) is used commonly to describe
>a sale at an online drugstore, it may be a strong indicator of spam.  

Could be.

>If only my friends use the underscore emphasis technique (or sign as a
>number ;), then that is very hammish.  

Yes, *if*. And if it is used for the same words and if it is
_not used for phrases_.

>Prices and percents are another huge category.  

Prices in which currency? A percent sign looks OK as the
last character of a token.

>Why would we want to remove these important clues?

Just because of their function in language. In most cases
they will hide the content in arbitrary places.

>> On the other hand we do allow some punctuation in words to
>> cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you are
>> right and this is wrong.
>
>I think it should be assumed the correct way due to the fact that we are
>using the Bayesian method.  Creating rules a priori to supplement Bayes
>seems hypocrisy to me.

I fail to see that this is true for normal punctuation. It
might be a nice idea not to allow punctuation inside words,
so those get split up.
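A rough Python sketch of the difference, under the assumption that
"punctuation in words" means a single '.', '_' or '-' between
alphanumeric runs. The patterns and names are mine, for illustration
only, and do not reproduce bogofilter's lexer rules.

```python
import re

# Assumed rule A: a single '.', '_' or '-' is allowed between
# alphanumeric runs, so such words stay in one piece.
INTERNAL_ALLOWED = re.compile(r"[A-Za-z0-9]+(?:[._-][A-Za-z0-9]+)*")

# Assumed rule B: no punctuation inside words, so they get split up.
ALNUM_ONLY = re.compile(r"[A-Za-z0-9]+")

def lex_keep_internal(text):
    return INTERNAL_ALLOWED.findall(text)

def lex_split_internal(text):
    return ALNUM_ONLY.findall(text)

# Hypothetical sample built from the three examples quoted above.
sample = "Buy V.I.A.G.R.A now, up-to-date MSG_COUNT."

print(lex_keep_internal(sample))
print(lex_split_internal(sample))
```

With rule B, V.I.A.G.R.A degrades into single letters, which is the
trade-off to weigh against treating all punctuation the same way.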

pi



