HTML processing

David Relson relson at osagesoftware.com
Tue Dec 17 13:37:20 CET 2002


At 07:10 AM 12/17/02, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> >>So what we're effectively about is a two-stage approach:
> >>
> >>1. source coding, decode/reduce mail content to useful parts. Like: kill
> >>    HTML comments, kill non-text attachments/inlines, decode
> >>    quoted-printable and base64. (We may need to have a second look at
> >>    the file names of application/octet-stream stuff, there are dozens of
> >>    misconfigured webmailers that do not recognize MIME types properly,
> >>    and we may try to look at application/octet-stream when the name ends
> >>    in .rtf, .txt or .bat or .vbs or something.)
> >>
> >>2. consume the reduced, parse for tokens and figure if the mail is spam.
> >
> > Yes!! By the way, how goes your work on processing attachments?
>
>Parked for more important work on other projects.
>
>After the above idea, I'm wondering if we're going to have two
>lexers. One to parse and help reduce cruft, and one to do the actual
>tokenization. The only thing I'm a bit unsure about is how we'd continue
>to provide the -p mode. We'd have to merge two lexers in one program I
>assume for now.
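
(To make the "source coding" stage quoted above concrete, here is a minimal,
self-contained sketch of just one of its jobs -- killing HTML comments in a
buffer.  It is illustrative only; the name and the in-place rewrite are
hypothetical, not bogofilter's actual lexer code.)

    #include <stdio.h>
    #include <string.h>

    /* Drop "<!-- ... -->" comments from buf in place and return the new
       length.  Illustrative only -- real SGML comment syntax is messier. */
    static size_t strip_html_comments(char *buf, size_t len)
    {
        size_t in = 0, out = 0;

        while (in < len) {
            if (in + 4 <= len && memcmp(buf + in, "<!--", 4) == 0) {
                size_t j = in + 4;
                while (j + 3 <= len && memcmp(buf + j, "-->", 3) != 0)
                    j++;
                in = (j + 3 <= len) ? j + 3 : len;  /* unterminated: drop rest */
            } else {
                buf[out++] = buf[in++];
            }
        }
        return out;
    }

    int main(void)
    {
        char text[] = "buy<!-- sneaky -->now";
        size_t n = strip_html_comments(text, strlen(text));
        fwrite(text, 1, n, stdout);                 /* prints "buynow" */
        putchar('\n');
        return 0;
    }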

I've been thinking about bogofilter's use of RAM.  First, an assumption: 
that HTML content will be marked as such.  That may be false, and time will 
quickly show whether it is.

Anyhow, the passthrough implementation already loads the whole message into 
RAM, a line at a time.  It seems we could (should) simply read the entire 
message into RAM up front.  Parsing would then proceed token by token until 
a Content-Transfer-Encoding header appears, at which point the appropriate 
decoder would be invoked.  That might leave second copies of some portions 
of the message in RAM, but I think that's a reasonable price to pay for 
correct handling of the content.
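
(As a rough illustration of that flow: read the entire message into one
buffer, then scan for the Content-Transfer-Encoding lines that tell us which
decoder to hand the following body to.  This is a standalone sketch under my
own names, not bogofilter code; the real version would hook into the lexer.)

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <strings.h>        /* strncasecmp (POSIX) */

    /* Read all of stdin into one malloc'd buffer -- the "read the whole
       message into RAM" part. */
    static char *slurp(FILE *fp, size_t *lenp)
    {
        size_t cap = 4096, len = 0;
        char *buf = malloc(cap);

        while (buf != NULL) {
            size_t n = fread(buf + len, 1, cap - len, fp);
            len += n;
            if (n == 0)
                break;                          /* EOF or read error */
            if (len == cap) {                   /* grow and keep reading */
                char *tmp = realloc(buf, cap * 2);
                if (tmp == NULL)
                    break;
                buf = tmp;
                cap *= 2;
            }
        }
        *lenp = len;
        return buf;
    }

    int main(void)
    {
        size_t len;
        char *msg = slurp(stdin, &len);
        char *p = msg;

        /* Scan line by line; each Content-Transfer-Encoding header (top
           level or inside a MIME part) marks where a quoted-printable or
           base64 decoder would take over for the body that follows. */
        while (msg != NULL && p < msg + len) {
            char *eol = memchr(p, '\n', (size_t)(msg + len - p));
            size_t linelen = eol ? (size_t)(eol - p) : (size_t)(msg + len - p);

            if (linelen >= 26 &&
                strncasecmp(p, "Content-Transfer-Encoding:", 26) == 0)
                printf("decoder needed for: %.*s\n", (int)linelen, p);

            if (eol == NULL)
                break;
            p = eol + 1;
        }
        free(msg);
        return 0;
    }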

David