html processing [was: spaced out spam words]
David Relson
relson at osagesoftware.com
Sat Jun 10 16:32:56 CEST 2006
On Sat, 10 Jun 2006 16:26:08 +0400
Mikhail Zabaluev wrote:
> On Fri, 09/06/2006 at 17:53 -0400, David Relson wrote:
> > What sort of flexibility are people interested in? Anybody have the
> > time and energy to work on it?
>
> I don't believe it will add much for the number of extra tokens it
> would generate. The spammers may invent tricks with the content, but
> bogofilter gets a lot of information from message headers. I use
> Bogofilter with IP tracking turned on, and I rarely receive anything
> that seems to have fooled the lexer.
>
> What could be more helpful is special handling of HTML markup tokens:
> use of the font tag and the color attribute can be a good warning
> sign.
Mikhail,
Bogofilter _already_ gives special handling to "font" tags, as well as
to "a" and "img" tags. Run "bogolexer -p < file" on the file below to
see the result:
---begin---
Content-Type: text/html;
    charset="iso-8859-1"

<font doodah>some text between fonts</font>
<a href="http://groups.yahoo.com" target="_top">
<img src="http://us.i1.yimg.com/us.yimg.com/i/yg/img/logo/yg.gif"
width="266" height="37" border="0" alt="Yahoo! Groups"> </a>
<font color="#800000"><strong>Dan Farber: Jobs and his new mini Mac
</strong></font> <br>
<font size="4"> </font>>
---end---
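As a rough illustration of what "special handling" of a few tags could look like, here is a minimal Python sketch. It is not bogofilter's lexer and its output will not match what bogolexer prints. The names KEPT_TAGS, TagTokenCollector and tokenize are hypothetical, and the rule it applies (keep the tag name and attributes for "a", "font" and "img", drop all other markup, split text on whitespace) is an assumption made only for the example.

```python
from html.parser import HTMLParser

# Assumption: only the three tags named in the message are kept.
# The real lexer's rules may differ.
KEPT_TAGS = {"a", "font", "img"}


class TagTokenCollector(HTMLParser):
    """Collect text tokens plus the attributes of selected tags."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag not in KEPT_TAGS:
            return  # markup of all other tags is dropped
        self.tokens.append(tag)
        for name, value in attrs:
            # A valueless attribute such as "doodah" arrives with value None.
            self.tokens.append(name if value is None else f"{name}={value}")

    def handle_data(self, data):
        # Text between tags is split on whitespace.
        self.tokens.extend(data.split())


def tokenize(html_text):
    """Return the list of tokens for an HTML fragment (hypothetical helper)."""
    collector = TagTokenCollector()
    collector.feed(html_text)
    collector.close()
    return collector.tokens


if __name__ == "__main__":
    sample = '<font color="#800000"><strong>Dan Farber</strong></font> <br>'
    print(tokenize(sample))
```

With a rule like this, the font tag and its color attribute survive as tokens that a classifier could score, while tags such as "strong" and "br" contribute nothing.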