html processing
David Relson
relson at osagesoftware.com
Tue Dec 17 13:37:20 CET 2002
At 07:10 AM 12/17/02, Matthias Andree wrote:
>David Relson <relson at osagesoftware.com> writes:
>
> >>So what we're effectively after is a two-stage approach:
> >>
> >>1. source coding, decode/reduce mail content to useful parts. Like: kill
> >> HTML comments, kill non-text attachments/inlines, decode
> >> quoted-printable and base64. (We may need to have a second look at
> >> the file names of application/octet-stream stuff, there are dozens of
> >> misconfigured webmailers that do not recognize MIME types properly,
> >> and we may try to look at application/octet-stream when the name ends
> >> in .rtf, .txt or .bat or .vbs or something.)
> >>
> >>2. consume the reduced, parse for tokens and figure if the mail is spam.
> >
> > Yes!! By the way, how goes your work on processing attachments?
>
>Parked for more important work on other projects.
>
>Given the above idea, I'm wondering if we're going to need two
>lexers: one to parse and help reduce cruft, and one to do the actual
>tokenization. The only thing I'm a bit unsure about is how we'd continue
>to provide the -p mode. For now I assume we'd have to merge the two
>lexers into one program.
I've been thinking about bogofilter's usage of RAM. First, an assumption:
that HTML content will be marked. This may be false, and time will quickly
show whether it is.
Anyhow, the passthrough implementation already loads the whole message into
RAM, a line at a time. It seems we could (should) just read the whole message
into RAM at once. Parsing would then continue token by token until a
Content-Transfer-Encoding header appears, at which point the appropriate
decoder would be used. This might leave second copies of some portions of the
message in RAM, but I think that's a reasonable price to pay for correct
handling of content.
David
More information about the bogofilter-dev mailing list