HTML Processing Idea

Tue Dec 17 18:43:05 CET 2002

I have not been keeping up with this list very much, but I glanced at
some of the recent messages and wanted to throw out an idea.  HTML
processing could possibly be largely avoided through the addition of one
piece of data to the spam and non-spam corpora.  I propose that in
addition to the word, we store an extra piece of info that tells whether
the word was seen as text/html or text/plain.  Therefore, there would be
separate entries for hello as it appears in html and as it appears in
plain text.  This is obviously directed at the html problem, but it
could be generalized to store other content types as well.  This would
increase the corpus size considerably (up to twice as many entries with
just these two types), but I don't think that is a
huge penalty for the benefit of defeating spam tricks.  Has anyone else
thought about this idea?  What do you think?
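
To make the idea concrete, here is a minimal sketch in Python of what I
have in mind, not a patch against any existing code.  The names
tagged_tokens, train and TRACKED_TYPES are made up for illustration, and
plain whitespace splitting stands in for whatever tokenizer is really
used.  The only point is that each corpus is keyed on (content type,
word) instead of on the word alone, and the HTML is not parsed at all.

```python
import email
from collections import Counter

# Content types that get their own entries (an assumption for this sketch).
TRACKED_TYPES = ("text/plain", "text/html")


def tagged_tokens(raw_message):
    """Yield (content_type, word) pairs for every text part of a message."""
    msg = email.message_from_string(raw_message)
    for part in msg.walk():
        ctype = part.get_content_type()
        if ctype not in TRACKED_TYPES:
            continue  # skips multipart containers, attachments, etc.
        payload = part.get_payload(decode=True)
        if payload is None:
            continue
        charset = part.get_content_charset() or "latin-1"
        try:
            text = payload.decode(charset, errors="replace")
        except LookupError:
            # Unknown charset name: fall back to a decoding that cannot fail.
            text = payload.decode("latin-1")
        # Naive whitespace tokenizer; tags are deliberately left in place.
        for word in text.lower().split():
            yield (ctype, word)


def train(counts, raw_message):
    """Add one message's tagged tokens to a corpus (a Counter)."""
    counts.update(tagged_tokens(raw_message))


if __name__ == "__main__":
    spam, ham = Counter(), Counter()
    train(spam, "Content-Type: text/html\n\nhello <b>there</b>\n")
    train(ham, "Content-Type: text/plain\n\nhello there\n")
    # "hello" now has separate entries depending on where it was seen.
    print(spam[("text/html", "hello")], ham[("text/plain", "hello")])
```

With this layout, "hello" seen in an HTML part and "hello" seen in a
plain-text part never share a count, so each can end up with its own
spam probability.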

Doug Beardsley