Test with different lexers

Tom Anderson tanderso at oac-design.com
Wed Dec 3 14:41:03 CET 2003


On Wed, 2003-12-03 at 02:52, Boris 'pi' Piwinger wrote:
> The real words. Where you have punctuation seems pretty
> arbitrary (it is more like some words more likely have a
> comma before them, but that would not be matched anyway).

> Yes, *if*. And if it is used for the same words and if it is
> _not used for phrases_.

If you often emphasize phrases, or often end a sentence with the same
word, then including the leading or trailing punctuation can still be
useful.  Maybe "word." will have a higher spamicity than "word".  This
may be especially true of "free!" or "$10!".  If we can distinguish
"madam," or "madam:" from "madam", that distinction will likely matter.
For a real-world example, I just received a spam titled "*National
Attention as an Innovator*", so "*National" and "Innovator*" with the
asterisks included may be much stronger spam indicators than the same
words without them.
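
To make the distinction concrete, here is a minimal Python sketch of two
toy lexers: one splits on whitespace and keeps leading and trailing
punctuation, the other strips it.  Neither is Bogofilter's actual lexer;
the function names and the use of string.punctuation as the set of
characters to strip are assumptions made purely for illustration.

```python
import string


def tokens_keep_punct(text):
    """Split on whitespace only, so '*National' and 'free!' stay intact."""
    return text.split()


def tokens_strip_punct(text):
    """Split on whitespace, then strip leading/trailing ASCII punctuation."""
    stripped = (tok.strip(string.punctuation) for tok in text.split())
    # Drop tokens that consisted of punctuation only.
    return [tok for tok in stripped if tok]


subject = "*National Attention as an Innovator*"
kept = tokens_keep_punct(subject)    # asterisks survive as part of the tokens
bare = tokens_strip_punct(subject)   # asterisks are discarded before scoring
```

With the first lexer the classifier gets to decide whether "*National"
differs from "National"; with the second, that information is gone
before any statistics are collected.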

> Prices in which currency? Percents look OK as the last
> character.

That's just my point... you shouldn't even ask questions such as "which
currency", as that is immaterial.  Just allow all tokens and you'll
cover all currencies.  Don't try to be "more clever than the
statistics".  If someone wants to include a token in an email, even if
it is complete gibberish, we should still be ranking and scoring it. 
Bogofilter doesn't need to care about semantics.

> >I think it should be assumed the correct way due to the fact that we are
> >using the Bayesian method.  Creating rules a priori to supplement Bayes
> >seems hypocrisy to me.
> 
> I fail to see this is true with normal punctuation. It might
> be a nice idea to not allow them in words, so those get
> split up.

I can see why you would want to parse out "normal punctuation" as a
decoding step rather than a filtering step; however, it is really
difficult to tell "normal" from "abnormal".  Moreover, normal
punctuation can be indicative of spam too, as I argued above, so we
should still be scoring it anyway.  And if some or most normal
punctuation does not give us a scoring advantage, the scoring should
simply degenerate to that of the punctuation-less form.  If it does, we
don't have to worry about removing it beforehand.
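
As a rough illustration of that last point, the sketch below computes a
naive per-token spamicity from token counts.  It is not Bogofilter's
actual scoring; the formula is a simplified rate ratio, and all counts
and names are made up for the example.  Where punctuation is spread
evenly across spam and ham ("madam," vs. "madam"), the punctuated token
scores the same as the bare one; where it is not ("free!" vs. "free"),
the difference shows up on its own.

```python
from collections import Counter


def spamicity(token, spam_counts, ham_counts, n_spam, n_ham):
    """Naive estimate: spam rate / (spam rate + ham rate)."""
    spam_rate = spam_counts[token] / n_spam
    ham_rate = ham_counts[token] / n_ham
    if spam_rate + ham_rate == 0:
        return 0.5  # unseen token: treat as neutral
    return spam_rate / (spam_rate + ham_rate)


# Hypothetical counts, invented for illustration only.
spam_counts = Counter({"free!": 40, "free": 10, "madam,": 5, "madam": 5})
ham_counts = Counter({"free!": 1, "free": 10, "madam,": 5, "madam": 5})
n_spam = n_ham = 100  # number of messages in each toy corpus

for tok in ("free!", "free", "madam,", "madam"):
    score = spamicity(tok, spam_counts, ham_counts, n_spam, n_ham)
    print(f"{tok:8s} {score:.3f}")
```

One caveat with this approach: keeping punctuation splits the counts
for a word across several tokens, so each variant has less data behind
it than the stripped form would.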

As I said before, I hardly have time to participate in this discussion,
let alone fiddle with the parser myself.  I'd surely appreciate it if
anyone tried out this sort of change, but I cannot do so myself at the
moment.

Tom
