What is a word (lexertest)

Tom Allison tallison at tacocat.net
Wed Oct 23 23:03:38 CEST 2002


David Relson wrote:
> At 06:59 AM 10/22/02, Boris 'pi' Piwinger wrote:
> 
>> Hi!
>>
>> Even though I don't code here, I tested something;-)
>>
>> [3.14 at pi ~/local/bogolists]$ echo "»cmsg newgroup«"|lexertest
>> get_token: 1 '»cmsg'
>> get_token: 1 'newgroup«'
>> [3.14 at pi ~/local/bogolists]$ echo "bla"|lexertest
>>
>> Both results are not really satisfying. There might be a reason why
>> the second does not return anything, but the first is wrong. Well,
>> here we have the problem that we cannot tell without looking at the
>> charset.
>>
>> pi
> 
> 
> pi,
> 
> There _is_ a problem with the lexer.
> 
> If a line contains exactly one token (composed only of letters and 
> digits), the lexer will ignore it.
> 
> If there're delimiters (spaces, punctuation, control characters) at the 
> beginning or the end of the line, the lexer will return it.
> 
> If there're special characters (underscore, dash, etc) in the token, the 
> lexer will return it.
> 
> We need our lexer expert here !!!  Clint Adams, are you watching ???
> 
> David
> 

I was thinking about this word stuff...

What is a word?  I mean, what is a word that we are likely to care 
about when it comes to reading spam?  Most of us read English, and 
from the English words alone we are able to determine whether an 
email is spam or non-spam.

Given that functional role of a word, it might be as simple as 
the Perl regex /(\w+)/ocg for parsing out words.
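
Untested, but roughly what I mean (the sample line here is just 
something I made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Naive tokenizer: every run of "word" characters (\w) is a token.
    my $line = "Buy now!!! Limited_offer ends 2002-10-31";
    my @tokens = $line =~ /(\w+)/g;
    print "$_\n" for @tokens;
    # prints: Buy, now, Limited_offer, ends, 2002, 10, 31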

And with regard to base64 encoding and the detection of "words" 
that are a whole line long: can you name any word in English (or 
most Roman/Greek-based languages) that is going to be 66 
characters in length?

When I was parsing email into PostgreSQL, I had to set the word 
limit to char(50) to keep it "sane".

I would suggest we consider the simplistic approach of defining a 
word to be something like (I only know how to express this in Perl):

/(\w{1,40})/ocg and not /^-?\d+$/

I believe that this would be more in line with the lexicographical 
rules/patterns of most Greco-Roman based languages.
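
As an untested sketch (the sample text is invented; note too that 
with this pattern a run longer than 40 characters gets chopped 
into 40-character pieces rather than thrown away):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Tokens are 1-40 word characters; purely numeric tokens are dropped.
    sub tokens {
        my ($text) = @_;
        return grep { !/^-?\d+$/ } $text =~ /(\w{1,40})/g;
    }

    print "$_\n" for tokens("Call 1-800-555-0100 for FREE pills");
    # keeps: Call, for, FREE, pills; the number groups are dropped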


Considering further the idea that spam is determined by what is 
communicated to the human being involved, it could also be argued 
that we should ignore all HEADER information except for the 
SUBJECT.  After all, how many people using email are going to 
read through the X- headers and such?

If this can be accepted, then the HEADER information can be ignored.
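
A crude, untested sketch of what that might look like, pulling 
just the Subject and the body out of a raw message on STDIN 
(folded headers and MIME structure are glossed over here):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Read a raw RFC 822 message and keep only the Subject plus the body.
    my $raw = do { local $/; <STDIN> };
    my ($header, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    my ($subject) = $header =~ /^Subject:[ \t]*(.*)$/mi;
    $subject = '' unless defined $subject;

    # This is the text a human actually reads, and (by this argument)
    # the only text worth handing to the word scanner.
    print "$subject\n$body";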


Now, if I haven't been thrown out of the party at this point, I 
would like to present another potential problem that I've been 
struggling with when trying to implement an intelligent and 
adaptive email filter: how to correct false readings from a POP 
server.

If you accept the constraint that only BODY + SUBJECT are 
considered candidates for spam/non-spam evaluation, then the 
earlier-mentioned problem of returning email to the server for 
bogofilter corrections becomes independent of the email HEADER, 
which is modified on a Forward/Reply message.

The SUBJECT prefix of [Fwd:] (or whatever) that gets added can 
easily be stripped from the SUBJECT, returning it to its original 
form.
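
For instance (untested; the exact prefix text varies by mail 
client, so these patterns are only illustrative):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Remove a "[Fwd: ...]" wrapper (or a plain "Fwd:"/"Re:" prefix)
    # from a Subject value, restoring something close to the original.
    sub strip_forward {
        my ($subject) = @_;
        $subject =~ s/^\s*\[Fwd:\s*(.*)\]\s*$/$1/i;  # "[Fwd: x]" -> "x"
        $subject =~ s/^\s*(?:Fwd|Fw|Re):\s*//i;      # leading "Fwd:" etc.
        return $subject;
    }

    print strip_forward("[Fwd: Make money fast]"), "\n";  # "Make money fast"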

By sending email back through the server, procmail can be used 
to manage the bogofilter correction (-N/-S) processing based only 
on the BODY and SUBJECT of the returned emails.  This would allow 
bogofilter to be deployed more readily on a mail server, rather 
than waiting for email clients to develop a bogofilter management 
technique (like bounce).

Thoughts?
-- 
It was Penguin lust... at its ugliest.
