What is a word (lexertest)

David Relson relson at osagesoftware.com
Tue Oct 22 17:28:35 CEST 2002


At 11:15 AM 10/22/02, Allyn Fratkin wrote:

>David Relson wrote:
>
>>If a line contains exactly one token (composed only of letters and
>>digits), the lexer will ignore it.
>>
>>If there're delimiters (spaces, punctuation, control characters) at the
>>beginning or the end of the line, the lexer will return it.
>>
>>If there're special characters (underscore, dash, etc) in the token, the
>>lexer will return it.
>
>yes, i mentioned this last night in another email. if a line consists
>only of a-zA-Z0-9+/ characters it is "assumed" to be base64 and discarded.
>this is the reason it is ignored by the lexer.  in other words, this is
>a purposeful design decision.
>
>yes, it would be better if base64 data was recognized correctly.
>it would be better still if base64 text attachments were decoded.
>but unless and until one of those things happens, bogofilter needs to
>ignore base64 data any way it can.  if it loses some single word lines,
>then that is a tradeoff i would be willing to make.
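
A minimal sketch of the rule as described above, for illustration only. It assumes the check is applied to one whole line at a time. The regex and the name looks_like_base64 are hypothetical and are not taken from bogofilter's lexer.

```python
import re

# Assumption: this pattern mirrors the rule as worded in the thread
# (a line made up solely of a-zA-Z0-9+/ characters). It is not copied
# from bogofilter's lexer source.
BASE64_LIKE = re.compile(r"[A-Za-z0-9+/]+")


def looks_like_base64(line: str) -> bool:
    """Return True if the whole line consists only of base64 alphabet characters."""
    # Strip only the line terminator. Leading or trailing spaces are kept,
    # so a line with delimiters around the token does not match.
    return BASE64_LIKE.fullmatch(line.rstrip("\r\n")) is not None


if __name__ == "__main__":
    for sample in ["hello", " hello", "hello_world", "hello world"]:
        verdict = "discarded" if looks_like_base64(sample) else "tokenized"
        print(repr(sample), verdict)
```

Under this sketch a bare single-word line is indistinguishable from a line of base64 data, which is the tradeoff described above. A line with surrounding delimiters, or with an underscore or dash in the token, falls outside the character set and is kept.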

Allyn,

It's in my head that BASE64 is intricate and complicated.  Perhaps I'm 
thinking of messages about decoding it.  Perhaps I've got it confused with 
MIME encoding.  Whatever ...

BASE64, as I recall, is simply an encoding that takes three 8-bit characters
(a total of 24 bits) and converts them to four printable characters, each
carrying 6 bits.  This is done so the message doesn't get munged by 7-bit
mailers.  That conversion isn't too tough, though I don't know when to apply it.
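
The 3-to-4 conversion can be sketched as follows. This is an illustration under stated assumptions: the helper name encode_group is hypothetical, it handles exactly one full 3-byte group, and it ignores the "=" padding that real base64 uses for shorter final groups.

```python
import string

# The standard base64 alphabet: A-Z, a-z, 0-9, '+', '/' (64 symbols, 6 bits each).
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def encode_group(three_bytes: bytes) -> str:
    """Encode exactly three 8-bit bytes as four base64 characters."""
    if len(three_bytes) != 3:
        # Assumption: padding of short groups is out of scope for this sketch.
        raise ValueError("this sketch handles exactly 3 bytes")
    # Pack the three bytes into one 24-bit integer.
    n = int.from_bytes(three_bytes, "big")
    # Slice it into four 6-bit values, most significant first.
    return "".join(ALPHABET[(n >> shift) & 0x3F] for shift in (18, 12, 6, 0))


if __name__ == "__main__":
    print(encode_group(b"Man"))
```

As for when to apply it: in MIME mail, a body part encoded this way is marked with a "Content-Transfer-Encoding: base64" header, so a decoder would key off that header rather than guess from the characters on a line.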

Anyhow, now I understand what you were saying about BASE64 last night.

Your persistence is appreciated.

David




