html processing [was: spaced out spam words]

David Relson relson at osagesoftware.com
Sat Jun 10 16:32:56 CEST 2006


On Sat, 10 Jun 2006 16:26:08 +0400
Mikhail Zabaluev wrote:

> On Fri, 09/06/2006 at 17:53 -0400, David Relson wrote:
> > What sort of flexibility are people interested in?  Anybody have the
> > time and energy to work on it?
> 
> I don't believe it will add much for the number of extra tokens it
> would generate. The spammers may invent tricks with the content, but
> bogofilter gets a lot of information from message headers. I use
> Bogofilter with IP tracking turned on, and I rarely receive anything
> that seems to have fooled the lexer.
> 
> What could be more helpful is special handling of HTML markup tokens:
> use of the font tag and the color attribute can be a good warning
> sign.

Mikhail,

Bogofilter is _already_ giving special handling to "font" tags as well
as "a" and "img".  Run the command "bogolexer -p < file" on the file
below:

---begin---
Content-Type: text/html;
	charset="iso-8859-1"

<font doodah>some text between fonts</font>
<a href="http://groups.yahoo.com" target="_top">
<img src="http://us.i1.yimg.com/us.yimg.com/i/yg/img/logo/yg.gif"
width="266" height="37" border="0" alt="Yahoo! Groups"> </a>

<font color="#800000"><strong>Dan Farber: Jobs and his new mini Mac
</strong></font> <br>

<font size="4">   </font>>

---end---
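
For anyone without bogolexer at hand, the general idea of turning
markup into tokens can be sketched in Python.  This is only an
illustration: it is not bogofilter's lexer, and the names
TRACKED_TAGS, MarkupTokenizer, markup_tokens and the "tag:"/"attr:"
token format are invented for the example.  Only the choice of the
"font", "a" and "img" tags comes from the message above.

```python
from html.parser import HTMLParser

# Tags singled out in the message above. The token format below is
# hypothetical and does not reflect what bogolexer actually prints.
TRACKED_TAGS = {"font", "a", "img"}


class MarkupTokenizer(HTMLParser):
    """Collect one token per tracked tag and one per attribute."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag not in TRACKED_TAGS:
            return
        self.tokens.append(f"tag:{tag}")
        for name, value in attrs:
            if value is None:
                # Bare attribute with no value, e.g. <font doodah>
                self.tokens.append(f"attr:{tag}.{name}")
            else:
                self.tokens.append(f"attr:{tag}.{name}={value}")


def markup_tokens(html):
    """Return the markup tokens found in an HTML string."""
    parser = MarkupTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Feeding it the font line from the sample, e.g.
markup_tokens('<font color="#800000">text</font>'), yields a tag token
plus a token carrying the color value, which is the kind of signal a
classifier could then score like any other word.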


