Test with different lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 3 08:52:05 CET 2003
Tom Anderson <tanderso at oac-design.com> wrote:
>Punctuation is not used by the average person the way it is described in
>your 8th grade grammar textbook.
It is, but ...
>From ascii-art to emoticons to special
>emphasis, people use punctuation in creative ways.
It is also used for things like this.
>These can be valuable tokens.
Maybe.
>> In your paragraph above we would have these tokens:
>> email:
>> "special"
>> token.
>> me,
>> _does_
>> etc.
>>
>> It seems very unnatural from the definition of punctuation
>> in any language I know to do that as opposed to just:
>> email
>> special
>> token
>> me
>> does
>
>Which of the above list do you think is more strongly indicative of spam
>or ham, the watered-down dictionary-only words, or the punctuated
>variations?
The real words. Where the punctuation lands seems pretty
arbitrary (at most, some words are more likely to have a
comma before them, but that would not be matched anyway).
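To make the two token lists above concrete, here is a minimal sketch in Python. It is not Bogofilter's lexer; the function names and the sample string are made up for illustration. The first function splits on whitespace only, the second also strips punctuation from both ends of each word while leaving punctuation inside a word alone.

```python
import string


def tokens_with_punctuation(text):
    """Split on whitespace only, so attached punctuation stays in the token."""
    return text.split()


def tokens_without_punctuation(text):
    """Split on whitespace, then strip leading and trailing punctuation.

    Punctuation inside a word (V.I.A.G.R.A, up-to-date, MSG_COUNT) is kept.
    """
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that consisted of punctuation only.
    return [word for word in stripped if word]


# Hypothetical sample built from the tokens quoted above.
sample = 'email: "special" token. me, _does_'
print(tokens_with_punctuation(sample))
print(tokens_without_punctuation(sample))
```

With the first function, "token." and "token" are counted as two unrelated tokens; with the second, they are pooled into one.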
>If "special" (with the quotes) is used commonly to describe
>a sale at an online drugstore, it may be a strong indicator of spam.
Could be.
>If only my friends use the underscore emphasis technique (or sign as a
>number ;), then that is very hammish.
Yes, *if*. And if it is used for the same words and if it is
_not used for phrases_.
>Prices and percents are another huge category.
Prices in which currency? Percents look OK as the last
character.
>Why would we want to remove these important clues?
Just because of their function in language. In most cases
they will hide the content in arbitrary places.
>> On the other hand we do allow some punctuation in words to
>> cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you are
>> right and this is wrong.
>
>I think it should be assumed the correct way due to the fact that we are
>using the Bayesian method. Creating rules a priori to supplement Bayes
>seems hypocrisy to me.
I fail to see that this is true for normal punctuation. It
might be a nice idea not to allow punctuation inside words,
so those words get split up.
pi
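As a rough illustration of that last idea (again a Python sketch, not Bogofilter code; the function name is hypothetical and only ASCII letters and digits are treated as word characters), splitting at every punctuation character would look like this:

```python
import re


def split_at_punctuation(text):
    """Return runs of ASCII letters and digits; any other character separates tokens."""
    return re.findall(r"[A-Za-z0-9]+", text)


for word in ("V.I.A.G.R.A", "up-to-date", "MSG_COUNT"):
    print(word, split_at_punctuation(word))
```

Under this rule V.I.A.G.R.A falls apart into single letters, which is the trade-off against the current behaviour of allowing some punctuation inside words.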