Test with different lexers
Boris 'pi' Piwinger
3.14 at logic.univie.ac.at
Wed Dec 3 14:52:09 CET 2003
Tom Anderson wrote:
> If you often emphasize phrases or you often end a sentence with the same
> word, then including that leading or trailing punctuation can still be
> useful. Maybe "word." will have a higher spamicity than "word". This
> may especially be true of "free!" or "$10!". If we can distinguish
> between "madam," or "madam:" and "madam", it will likely be important.
> For a real world example, I just received a spam titled "*National
> Attention as an Innovator*", so "*National" and "Innovator*" with the
> asterisks included may be much stronger spam indicators than without the
> asterisks.
Could be, but also those words won't be recognized at all, in
many cases.
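To make the trade-off concrete, here is a small sketch (not bogofilter's actual lexer, whose rules are written in flex) contrasting a tokenizer that keeps a leading or trailing punctuation character with one that splits on punctuation. The patterns and function names are mine, for illustration only:

```python
import re

# Hypothetical sketch: the regexes below are illustrative, not
# bogofilter's real lexer rules.

def tokenize_keep(text):
    # Keep a leading * and one trailing !, ., ,, : or * on the word.
    return re.findall(r"\*?\w[\w$]*[!.,:*]?", text)

def tokenize_strip(text):
    # Plain words only; any punctuation splits tokens.
    return re.findall(r"\w+", text)

sample = "*National Attention as an Innovator* -- free!"
print(tokenize_keep(sample))
# ['*National', 'Attention', 'as', 'an', 'Innovator*', 'free!']
print(tokenize_strip(sample))
# ['National', 'Attention', 'as', 'an', 'Innovator', 'free']
```

With the first tokenizer, "*National" and "free!" get their own counts in the word database; with the second, they merge into the plain words, which is exactly the point of contention here.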
>> Prices in which currency? Percents look OK as the last
>> character.
>
> That's just my point... you shouldn't even ask questions such as "which
> currency", as that is immaterial. Just allow all tokens and you'll
> cover all currencies.
Prices like "1,69 EUR" will never be matched. It is very
common to write them this way.
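For example, a pattern that treats such a price as a single token would have to accept the comma as a decimal separator. A rough sketch (the currency list and the two-digit-cents rule are assumptions of mine, not anything bogofilter does):

```python
import re

# Illustrative only: match European-style prices like "1,69 EUR"
# as one token.  Currency codes and format are assumptions.
PRICE = re.compile(r"\d+(?:[.,]\d{2})?\s?(?:EUR|USD|GBP)\b")

for text in ("now only 1,69 EUR per item", "save 10.99 USD today"):
    m = PRICE.search(text)
    print(m.group() if m else None)
# 1,69 EUR
# 10.99 USD
```

A lexer that splits on "," and treats "$"/currency words separately would instead see "1", "69" and "EUR" as unrelated tokens.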
> Bogofilter doesn't need to care about semantics.
I still believe the idea is to read words in a broader
sense. So it does matter. But your idea would be worth an
experiment. If it does not fail badly, I'll take your point.
>> >I think it should be assumed the correct way due to the fact that we are
>> >using the Bayesian method. Creating rules a priori to supplement Bayes
>> >seems hypocrisy to me.
>>
>> I fail to see this is true with normal punctuation. It might
>> be a nice idea to not allow them in words, so those get
>> split up.
>
> I can see why you would want to parse out "normal punctuation" as a
> decoding rather than filtering process, however it is really difficult
> to know "normal" from "abnormal". Moreover, normal punctuation can be
> indicative of spam too, as I argued above, so we should still be scoring
> it anyway.
That was the idea of allowing ! at word ends. In my test I
could not see any significance.
> And if some or most normal punctuation does not give us a
> scoring advantage, we should simply degenerate to the punctuation-less
> form.
Now that sounds reasonable, but again it makes things more complicated.
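The fallback Tom suggests could look roughly like this: score the exact token if the database knows it, otherwise degenerate to the punctuation-less form. The spamicity values and the set of stripped characters below are made up for illustration:

```python
# Hypothetical word database: token -> spamicity.
spamicity = {"free!": 0.98, "free": 0.65, "madam": 0.30}

def score(token):
    # Exact form wins if it has been seen before.
    if token in spamicity:
        return spamicity[token]
    # Otherwise degenerate to the punctuation-less form.
    stripped = token.strip("!.,:;*")
    return spamicity.get(stripped)  # None if still unknown

print(score("free!"))   # 0.98, exact hit
print(score("madam:"))  # 0.3, falls back to "madam"
```

The extra complication is real, though: every lookup can now touch the database twice, and the lexer and the scoring code both need to agree on what "the punctuation-less form" is.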
> As I said before, I hardly have time to participate in this discussion
> let alone fiddle with the parser myself. I'd surely appreciate anyone
> who tried out this sort of change, but cannot do so myself at the
> moment.
If you do find some time to compile and run a test (which
only takes a relatively small amount of time where your
attention is needed), I'll send you a special version of the
lexer.
The next thing I will test (I don't promise any time soon)
is not allowing any punctuation at all.
pi