HTML Processing Idea

Tue Dec 17 20:43:30 CET 2002

On Tue, Dec 17, 2002 at 12:43:05PM -0500, Doug Beardsley wrote:
> I have not been keeping up with this list very much, but I glanced at
> some of the recent messages and wanted to throw out an idea.  HTML
> processing could possibly be largely avoided through the addition of one
> piece of data to the spam and non-spam corpora.  I propose that in
> addition to the word, we store an extra piece of info that tells whether
> the word was seen as text/html or text/plain.  Therefore, there would be
> separate entries for hello as it appears in html and as it appears in
> plain text.  This is obviously directed at the html problem, but it
> could be generalized to store multiple encoding types.  This would
> present a large increase in the corpus size, but I don't think that is a
> huge penalty for the benefit of defeating spam tricks.  Has anyone else
> thought about this idea?  What do you think?

I have been thinking about a similar idea, an extension of one that has
been alluded to and discussed in the past, to wit:
associate contextual information with words stored in the database whenever possible.

For example, each word from a header would be associated with the header name,
and each word from a MIME section would be associated with the MIME type or
(if the type is plain text) the encoding.

Subject: hello there -> subject:hello, subject:there
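
A rough Python sketch of the header case, just to make the mapping concrete.
The function name header_tokens and the whitespace-only word splitting are
assumptions made for illustration, not how any existing filter tokenizes.

```python
import re

def header_tokens(name, value):
    """Prefix each word of a header value with the lowercased header name.

    Hypothetical helper: words are just runs of non-whitespace characters.
    """
    prefix = name.strip().lower()
    return [f"{prefix}:{word.lower()}" for word in re.findall(r"\S+", value)]

print(header_tokens("Subject", "hello there"))
# prints ['subject:hello', 'subject:there']
```

The same word seen in the body would be stored without the 'subject:' prefix,
so the two occurrences get separate database entries.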

Every word from a base64-encoded plain-text message is then prefixed with 'base64:'.
While it would be useful to know that a base64-encoded HTML message was such, it might start to
defeat the purpose to prefix every word from that message with 'html:base64:'. Perhaps the solution is
to store words twice, once for the encoding and once for the MIME type.

Unencoded plain-text messages, as the default, would not have their words prefixed at all, primarily to maintain
backwards compatibility.
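
Putting the body rules together, something along these lines might work. It is
only a sketch using Python's standard email package; the function name
body_tokens, the 'html:' prefix for text/html parts, and the whitespace word
splitting are assumptions, and no HTML tags are stripped here.

```python
from email import message_from_string

def body_tokens(part):
    """Tokenize one non-multipart MIME part, prefixing words by context.

    Each word is stored once per applicable prefix ('base64:' for the
    encoding, 'html:' for the MIME type) rather than with a combined
    'html:base64:' prefix. Unencoded text/plain words stay unprefixed.
    """
    ctype = part.get_content_type()
    encoding = (part.get("Content-Transfer-Encoding") or "7bit").strip().lower()
    payload = part.get_payload(decode=True) or b""
    charset = part.get_content_charset() or "ascii"
    words = payload.decode(charset, errors="replace").split()

    prefixes = []
    if encoding == "base64":
        prefixes.append("base64")
    if ctype == "text/html":
        prefixes.append("html")

    tokens = []
    for word in words:
        word = word.lower()
        if prefixes:
            # Store the word once for each kind of context.
            tokens.extend(f"{p}:{word}" for p in prefixes)
        else:
            # Default case: no prefix, so existing databases keep working.
            tokens.append(word)
    return tokens

raw = ("Content-Type: text/plain\n"
       "Content-Transfer-Encoding: base64\n"
       "\n"
       "aGVsbG8gdGhlcmU=\n")
print(body_tokens(message_from_string(raw)))
# prints ['base64:hello', 'base64:there']
```

For a multipart message one would call this on each leaf part from msg.walk().
Storing each word once per prefix doubles the entries for base64 HTML, but
avoids a separate database key for every encoding/type combination.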

-Gyepi