Test with different lexers

Tom Anderson tanderso at oac-design.com
Wed Dec 3 02:55:48 CET 2003


On Tue, 2003-12-02 at 11:53, Boris 'pi' Piwinger wrote:
> That certainly is an interesting idea, but punctuation is
> different in my opinion, this is not because it works, but
> because of the function in language which makes it special
> which then should be reflected in a parser.

The average person does not use punctuation the way your 8th grade
grammar textbook describes it.  From ASCII art to emoticons to special
emphasis, people use punctuation in creative ways.  These can be
valuable tokens.

> In your paragraph above we would have these tokens:
> email:
> "special"
> token.
> me,
> _does_
> etc.
> 
> It seems very unnatural from the definition of punctuation
> in any language I know to do that as opposed to just:
> email
> special
> token
> me
> does

Which of the two lists above do you think is more strongly indicative of
spam or ham: the watered-down dictionary-only words, or the punctuated
variations?  If "special" (with the quotes) is commonly used to describe
a sale at an online drugstore, it may be a strong indicator of spam.  If
only my friends use the underscore emphasis technique (or sign as a
number ;), then that is very hammish.  Prices and percentages are another
huge category.  Why would we want to remove these important clues?
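
To make the two token lists concrete, here is a minimal sketch of both
lexing strategies.  It is not bogofilter's lexer; the function names and
the sample sentence are made up for illustration only.

```python
import re

def tokens_with_punctuation(text):
    # Split on whitespace only, so attached punctuation stays in the token.
    return text.split()

def tokens_words_only(text):
    # Keep only runs of ASCII letters; all punctuation is discarded.
    return re.findall(r"[A-Za-z]+", text)

# Hypothetical sample sentence, not taken from a real message.
sample = 'This email: a "special" token. For me, it _does_ matter.'

print(tokens_with_punctuation(sample))
print(tokens_words_only(sample))
```

The first lexer keeps email:, "special" and _does_ as distinct tokens;
the second collapses them into email, special and does, so the
classifier can no longer tell the punctuated uses apart from the plain
ones.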

> On the other hand we do allow some punctuation in words to
> cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you are
> right and this is wrong.

I think keeping punctuation should be assumed to be the correct way,
since we are using the Bayesian method.  Creating a priori rules to
supplement Bayes seems hypocritical to me.
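
As a rough illustration of letting the data decide, the sketch below
estimates a per-token spam score from message counts.  It is a
simplified frequency ratio, not bogofilter's actual calculation, and all
counts are invented for the example.

```python
def spam_probability(spam_count, ham_count, n_spam, n_ham):
    # Fraction of spam and ham messages that contain the token.
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)

# Hypothetical counts out of 100 spam and 100 ham messages.
quoted = spam_probability(spam_count=40, ham_count=2, n_spam=100, n_ham=100)
plain = spam_probability(spam_count=45, ham_count=30, n_spam=100, n_ham=100)

print(f'"special" -> {quoted:.2f}')
print(f'special   -> {plain:.2f}')
```

With these made-up counts the quoted token scores far more spammy than
the plain word.  No rule about punctuation was written by hand; the
difference comes entirely from the training counts.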

Tom