Test with different lexers

David Relson relson at osagesoftware.com
Wed Dec 3 03:07:43 CET 2003


On 02 Dec 2003 20:55:48 -0500
Tom Anderson <tanderso at oac-design.com> wrote:

> On Tue, 2003-12-02 at 11:53, Boris 'pi' Piwinger wrote:
> > That certainly is an interesting idea, but punctuation is
> > different in my opinion. That is not because it works, but
> > because its function in language makes it special, and that
> > should then be reflected in a parser.
> 
> Punctuation is not used by the average person the way it is described
> in your 8th grade grammar textbook.  From ascii-art to emoticons to
> special emphasis, people use punctuation in creative ways.  These can
> be valuable tokens.

True.

> Which of the above list do you think is more strongly indicative of
> spam or ham, the watered-down dictionary-only words, or the punctuated
> variations?  If "special" (with the quotes) is used commonly to
> describe a sale at an online drugstore, it may be a strong indicator
> of spam.  If only my friends use the underscore emphasis technique (or
> sign as a number ;), then that is very hammish.  Prices and percents
> are another huge category.  Why would we want to remove these
> important clues?
> 
> > On the other hand we do allow some punctuation in words to
> > cover V.I.A.G.R.A or up-to-date or MSG_COUNT. Maybe you are
> > right and this is wrong.
> 
> I think it should be assumed correct, since we are using the
> Bayesian method.  Creating a priori rules to supplement Bayes seems
> hypocritical to me.

I have trouble believing that including normal punctuation is useful. 
For example, nearly every sentence ends with a period.  So this message
will produce the tokens "useful." and "period.", and my use of commas
adds "example,".
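
A minimal sketch of the effect, assuming a lexer that simply splits on
whitespace (this is not bogofilter's actual lexer; both function names
are made up for illustration).  The second function strips punctuation
from the edges of each token only, so embedded forms such as
V.I.A.G.R.A, up-to-date and MSG_COUNT survive, while a trailing "%" or
a leading "$" is lost:

```python
import string


def tokens_with_punctuation(text):
    """Hypothetical lexer: split on whitespace, keep punctuation as-is."""
    return text.split()


def tokens_without_edge_punctuation(text):
    """Hypothetical lexer: drop leading/trailing punctuation from each token."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    # Tokens made only of punctuation (e.g. ";)") become empty; skip them.
    return [tok for tok in stripped if tok]


sample = "For example, every sentence ends with a period."
print(tokens_with_punctuation(sample))
# ['For', 'example,', 'every', 'sentence', 'ends', 'with', 'a', 'period.']
print(tokens_without_edge_punctuation(sample))
# ['For', 'example', 'every', 'sentence', 'ends', 'with', 'a', 'period']
```

With the first lexer, "period." and "period" are counted as two
unrelated tokens, which is the dilution I am worried about.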

Modify the parser and run the experiment so we can see how much
including punctuation helps (or hurts).
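
As a rough outline of such an experiment (not a patch to bogofilter):
train and test the same classifier twice, changing only the lexer, and
compare accuracy.  Everything below is an assumption made for
illustration: the two lexers, the simple smoothed log-odds scorer
(a stand-in, not bogofilter's own scoring), and the four-message toy
corpus, which is far too small to show anything.  A real run would use
real spam and ham corpora.

```python
import math
import string
from collections import Counter


def lex_keep(text):
    """Hypothetical lexer that keeps punctuation attached to tokens."""
    return text.split()


def lex_strip(text):
    """Hypothetical lexer that strips punctuation from token edges."""
    tokens = (t.strip(string.punctuation) for t in text.split())
    return [t for t in tokens if t]


def train(messages, lexer):
    """Count, per class, how many messages contain each token."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in messages:
        counts[label].update(set(lexer(text)))
    return counts


def score(text, counts, lexer):
    """Sum of add-one-smoothed log-odds; positive means 'looks spammy'."""
    vocab = len(set(counts["spam"]) | set(counts["ham"])) or 1
    total = {label: sum(c.values()) for label, c in counts.items()}
    result = 0.0
    for tok in set(lexer(text)):
        p_spam = (counts["spam"][tok] + 1) / (total["spam"] + vocab)
        p_ham = (counts["ham"][tok] + 1) / (total["ham"] + vocab)
        result += math.log(p_spam / p_ham)
    return result


def accuracy(train_set, test_set, lexer):
    """Fraction of test messages classified correctly with this lexer."""
    counts = train(train_set, lexer)
    hits = sum((score(text, counts, lexer) > 0) == (label == "spam")
               for text, label in test_set)
    return hits / len(test_set)


# Toy, made-up messages; replace with real corpora for a meaningful result.
train_set = [
    ('Huge "special" today, 50% off!', "spam"),
    ("Order now, prices from $9.99.", "spam"),
    ("Lunch on Friday? That would be _really_ nice.", "ham"),
    ("The patch is attached, please review.", "ham"),
]
test_set = [
    ('Another "special" at 50% off.', "spam"),
    ("Please review the attached patch.", "ham"),
]

for name, lexer in (("keep punctuation", lex_keep),
                    ("strip punctuation", lex_strip)):
    print(name, accuracy(train_set, test_set, lexer))
```

The point of the structure is that the lexer is the only variable, so
any difference in the numbers can be attributed to punctuation handling.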




