Test with different lexers
Tom Anderson
tanderso at oac-design.com
Wed Dec 3 02:55:48 CET 2003
On Tue, 2003-12-02 at 11:53, Boris 'pi' Piwinger wrote:
> That certainly is an interesting idea, but punctuation is
> different in my opinion, this is not because it works, but
> because of the function in language which makes it special
> which then should be reflected in a parser.
Punctuation is not used by the average person the way it is described in
your 8th grade grammar textbook. From ASCII art to emoticons to special
emphasis, people use punctuation in creative ways. These can be
valuable tokens.
> In your above paragraph we would have those tokens:
> email:
> "special"
> token.
> me,
> _does_
> etc.
>
> It seems very unnatural from the definition of punctuation
> in any language I know to do that as opposed to just:
> email
> special
> token
> me
> does
Which of the above lists do you think is more strongly indicative of spam
or ham: the watered-down dictionary-only words, or the punctuated
variations? If "special" (with the quotes) is commonly used to describe
a sale at an online drugstore, it may be a strong indicator of spam. If
only my friends use the underscore emphasis technique (or sign as a
number ;), then that is very hammish. Prices and percentages are another
huge category. Why would we want to remove these important clues?
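To make the comparison concrete, here is a rough sketch of the two lexing
styles in Python. The function names and the sample sentence are made up
for illustration, and neither function is bogofilter's actual lexer.

```python
import re

def lex_whitespace(text):
    """Split on whitespace only, so punctuation stays attached to each token."""
    return text.split()

def lex_words_only(text):
    """Keep only runs of ASCII letters and digits, discarding all punctuation."""
    return re.findall(r"[A-Za-z0-9]+", text)

# Hypothetical sample text, not taken from any real message.
sample = 'This email: a "special" token for me, it _does_ cost $9.99 (50% off!)'

punctuated = lex_whitespace(sample)
stripped = lex_words_only(sample)

print(punctuated)
print(stripped)
```

With the whitespace lexer, the quoted word, the underscore emphasis, and
the price each survive as one distinct token. The words-only lexer
collapses the first two into plain dictionary words and splits the price
into bare numbers.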
> On the other hand we do allow some punctuation in words to
> cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you are
> right and this is wrong.
I think it should be assumed to be the correct way, because we are
using the Bayesian method. Creating a priori rules to supplement Bayes
seems like hypocrisy to me.
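As an illustration of letting the statistics decide, here is a minimal
Python sketch. It assumes a simple frequency-ratio estimate per token,
which is a simplification and not bogofilter's actual computation. The
toy messages and the function name token_spam_prob are invented for
this example.

```python
from collections import Counter

def token_spam_prob(token, spam_counts, ham_counts, n_spam, n_ham):
    """Frequency-ratio estimate of how spammy a token is; 0.5 if never seen."""
    s = spam_counts[token] / n_spam  # fraction of spam messages containing it
    h = ham_counts[token] / n_ham    # fraction of ham messages containing it
    if s + h == 0:
        return 0.5
    return s / (s + h)

# Made-up toy corpora, lexed on whitespace so punctuation is preserved.
spam_msgs = ['"special" offer today', 'our "special" price', 'special delivery notice']
ham_msgs = ['nothing special here', 'it _does_ work', 'a special case in the parser']

# Count each token at most once per message.
spam_counts = Counter(tok for m in spam_msgs for tok in set(m.split()))
ham_counts = Counter(tok for m in ham_msgs for tok in set(m.split()))

n_spam, n_ham = len(spam_msgs), len(ham_msgs)
p_quoted = token_spam_prob('"special"', spam_counts, ham_counts, n_spam, n_ham)
p_plain = token_spam_prob('special', spam_counts, ham_counts, n_spam, n_ham)
p_emph = token_spam_prob('_does_', spam_counts, ham_counts, n_spam, n_ham)

print(p_quoted, p_plain, p_emph)
```

In this toy data the quoted form scores as strongly spammy, the
underscored form as strongly hammish, and the bare word lands in
between. That distinction comes from the counts alone, with no
hand-written rule about punctuation.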
Tom