HTML processing

Matthias Andree matthias.andree at gmx.de
Tue Dec 17 13:10:51 CET 2002


David Relson <relson at osagesoftware.com> writes:

>>So what we're effectively talking about is a two-stage approach:
>>
>>1. Source decoding: decode/reduce the mail content to its useful
>>    parts. Like: kill HTML comments, kill non-text attachments/inlines,
>>    decode quoted-printable and base64. (We may need to take a second
>>    look at the file names of application/octet-stream parts; there are
>>    dozens of misconfigured webmailers that do not recognize MIME types
>>    properly, so we might still parse application/octet-stream when the
>>    name ends in .rtf, .txt, .bat, .vbs or the like.)
>>
>>2. Consume the reduced content, parse it for tokens, and figure out
>>    whether the mail is spam.
>
> Yes!! By the way, how goes your work on processing attachments?

Parked for more important work on other projects.
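
For what it's worth, the file name check from step 1 could be little
more than a suffix comparison. Rough sketch only; the helper name and
the suffix list are made up:

/* Sketch: treat application/octet-stream parts as text when the
 * declared file name carries a known text-ish suffix. */
#include <string.h>
#include <strings.h>        /* strcasecmp() */

static const char *text_suffixes[] = { ".rtf", ".txt", ".bat", ".vbs", NULL };

int looks_like_text(const char *name)
{
    size_t len = strlen(name);
    const char **s;

    for (s = text_suffixes; *s; s++) {
        size_t sl = strlen(*s);
        if (len >= sl && strcasecmp(name + len - sl, *s) == 0)
            return 1;       /* parse this part despite its MIME type */
    }
    return 0;
}

strcasecmp() keeps the match case-insensitive, since those names come
from random mailers.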

Given the above idea, I'm wondering whether we're going to end up with
two lexers: one to parse the input and strip the cruft, and one to do
the actual tokenization. The only thing I'm a bit unsure about is how
we'd continue to provide the -p (passthrough) mode; for now I assume
we'd have to merge the two lexers into one program.
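
To make that concrete: assuming we stay with flex, start conditions
could keep both stages in one scanner. Untested sketch; the token
pattern and the got_token() hook are placeholders:

%{
/* Sketch of a single scanner covering both stages.  got_token() is a
 * placeholder for whatever hands tokens to the classifier. */
extern void got_token(const char *tok);
%}

%option noyywrap
%x HTMLCOMMENT

%%
"<!--"                   BEGIN(HTMLCOMMENT);  /* stage 1: drop HTML comments */
<HTMLCOMMENT>"-->"       BEGIN(INITIAL);
<HTMLCOMMENT>.|\n        ;                    /* swallow the comment body */
[A-Za-z][A-Za-z0-9'.-]+  got_token(yytext);   /* stage 2: emit a token */
.|\n                     ;                    /* everything else is noise */
%%

Whether that scales to the base64/quoted-printable decoding is another
question, but at least -p would still see a single pass over the input.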

-- 
Matthias Andree
