Test with different lexers

Wed Dec 3 14:52:09 CET 2003

Tom Anderson wrote:

> If you often emphasize phrases or you often end a sentence with the same
> word, then including that leading or trailing punctuation can still be
> useful.  Maybe "word." will have a higher spamicity than "word".  This
> may especially be true of "free!" or "$10!".  If we can distinguish
> between "madam," or "madam:" and "madam", it will likely be important. 
> For a real world example, I just received a spam titled "*National
> Attention as an Innovator*", so "*National" and "Innovator*" with the
> asterisks included may be much stronger spam indicators than without the
> asterisks.

Could be, but then those words won't be recognized at all in
many cases.
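
To make the trade-off concrete, here is a minimal Python sketch. It is
not bogofilter's actual lexer; the function names and the choice of
ASCII punctuation are assumptions made only for illustration. It shows
that a token with attached punctuation is a different token from the
bare word, so a wordlist that only knows one form will not recognize
the other.

```python
import string

def tokens_keep_punct(text):
    """Split on whitespace only, so punctuation stays attached to words."""
    return text.split()

def tokens_strip_punct(text):
    """Split on whitespace, then strip leading/trailing ASCII punctuation."""
    stripped = (word.strip(string.punctuation) for word in text.split())
    # Drop tokens that consisted of punctuation only.
    return [word for word in stripped if word]

subject = "*National Attention as an Innovator*"
kept = tokens_keep_punct(subject)
bare = tokens_strip_punct(subject)

# A hypothetical wordlist that was trained on bare words only.
known_words = {"National", "Attention", "Innovator"}
unrecognized = [tok for tok in kept if tok not in known_words]
```

With the punctuation kept, "*National" and "Innovator*" end up in
unrecognized even though the bare words are in the wordlist.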

>> Prices in which currency? Percents look OK as the last
>> character.
> 
> That's just my point... you shouldn't even ask questions such as "which
> currency", as that is immaterial.   Just allow all tokens and you'll
> cover all currencies.

Prices like "1,69 EUR" will never be matched. It is very
common to write them this way.
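
A small Python sketch of what I mean, under the assumption that tokens
are delimited by whitespace. The PRICE pattern is hypothetical and is
not a rule taken from any existing lexer. Splitting on whitespace
yields "1,69" and "EUR" as two unrelated tokens, while a dedicated
pattern sees the price as one unit.

```python
import re

# Hypothetical pattern: an amount with '.' or ',' as decimal separator,
# one whitespace character, then a three-letter upper-case currency code.
PRICE = re.compile(r"\d+(?:[.,]\d+)?\s[A-Z]{3}")

def whitespace_tokens(text):
    """Tokens as seen by a lexer that splits on whitespace only."""
    return text.split()

def price_tokens(text):
    """Prices matched as single units by the hypothetical PRICE pattern."""
    return PRICE.findall(text)

line = "only 1,69 EUR today"
split_result = whitespace_tokens(line)
price_result = price_tokens(line)
```

Here split_result contains "1,69" and "EUR" separately, and
price_result contains the single token "1,69 EUR".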

> Bogofilter doesn't need to care about semantics.

I still believe the idea is to read words in a broader
sense. So it does matter. But your idea will be worth an
experiment. If it does not fail badly, I'll take your point.

>> >I think it should be assumed the correct way due to the fact that we are
>> >using the Bayesian method.  Creating rules a priori to supplement Bayes
>> >seems hypocrisy to me.
>> 
>> I fail to see that this is true for normal punctuation. It might
>> be a nice idea to not allow it in words, so those get
>> split up.
> 
> I can see why you would want to parse out "normal punctuation" as a
> decoding rather than filtering process, however it is really difficult
> to know "normal" from "abnormal".  Moreover, normal punctuation can be
> indicative of spam too, as I argued above, so we should still be scoring
> it anyway. 

That was the idea behind allowing ! at word ends. In my test I
could not see any significant effect.

> And if some or most normal punctuation does not give us a
> scoring advantage, we should simply degenerate to the punctuation-less
> form. 

Now that sounds reasonable, but again it makes things more complicated.
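
One way such a fallback could look, as a rough Python sketch. The
counts dictionary, the min_count threshold and the function name are
all hypothetical, and a plain observation count stands in here for
"gives us a scoring advantage", which would need a real measure. It is
only meant to show where the extra complexity comes from.

```python
import string

def effective_token(token, counts, min_count=5):
    """Return the punctuated token if it has been seen often enough,
    otherwise fall back to the form without surrounding punctuation."""
    if counts.get(token, 0) >= min_count:
        return token
    bare = token.strip(string.punctuation)
    # Keep the original token if stripping would leave nothing.
    return bare or token

# Hypothetical wordlist counts, for illustration only.
counts = {"free!": 12, "madam": 30}

kept_form = effective_token("free!", counts)       # seen often, kept as is
fallback_form = effective_token("madam,", counts)  # rare, falls back
```

Every lookup now needs a second decision, and the threshold is one
more parameter that has to be tuned.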

> As I said before, I hardly have time to participate in this discussion
> let alone fiddle with the parser myself.  I'd surely appreciate anyone
> who tried out this sort of change, but cannot do so myself at the
> moment.

If you do find some time to compile and run a test (which
only takes a relatively small amount of time where your
attention is needed), I'll send you a special version of the
lexer.

The next thing I will test (I don't promise it any time soon) is
not allowing any punctuation at all.

pi