obscured URL not being tokenized

David Relson relson at osagesoftware.com
Sat Dec 20 19:54:50 CET 2003


On 20 Dec 2003 13:18:17 -0500
Tom Anderson <tanderso at oac-design.com> wrote:

> On Sat, 2003-12-20 at 11:16, Dan Singletary wrote:
> > filter.  I've mentioned it before, but there should be some way to
> > tell bogofilter to ignore text that is the same color as its
> > background. I know this would require more interpretation of the
> > HTML, and I'm not sure how much more code there would need to be for
> > this.  I've attached the entire offending email for your reference.
> 
> That would be simple enough for only white backgrounds and
> specifically "white" fonts.  But once you start getting
> color="#fffffe", then it gets more difficult.  Moreover, it is nearly
> impossible to compare a background _image_ to a foreground color
> without doing all kinds of image recognition.  Even if it were
> possible, the overhead would be extreme.
> 
> However, if you simply leave everything alone as is, and just register
> your spams, the Bayesian method should start to recognize things like
> color="white" and background="something.jpg" as spammish tokens.
> 
> Assuming, that is, that bogofilter doesn't throw away such valuable
> information as an equals sign and quotes.
> 
> Tom

Tom,

Bogofilter ignores the internals of most html tags on the grounds that
there's little information there worth keeping.  Only the internals of
a, img, and font tags are parsed and scored.  Given this, bogofilter
sees the FFFFFF and WHITE tokens when the color is set in one of those
tags (e.g. font) but not when it is set in any other tag.  Scoring
these tokens is helpful, but won't conclusively identify any message as
spam (or ham), since the scores of one or two tokens aren't enough
information to score a message.
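
To make the "some cases but not others" point concrete, here is a rough
sketch in Python.  It is not bogofilter's lexer (which is a flex
scanner written in C); the names SCORED_TAGS, TOKEN_RE, TagTokenizer,
and tokenize are made up for illustration, and the token pattern is an
assumption.  It keeps attribute text only for a, img, and font tags.

```python
import re
from html.parser import HTMLParser

# Tags whose internals are kept, per the description above.
SCORED_TAGS = {"a", "img", "font"}

# Assumed token pattern: runs of letters and digits only.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+")


class TagTokenizer(HTMLParser):
    """Collect tokens from body text and from selected tags' attributes."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Internals of every other tag are dropped entirely.
        if tag not in SCORED_TAGS:
            return
        for name, value in attrs:
            self.tokens.extend(TOKEN_RE.findall(name))
            if value:
                self.tokens.extend(TOKEN_RE.findall(value))

    def handle_data(self, data):
        # Visible text is always tokenized.
        self.tokens.extend(TOKEN_RE.findall(data))


def tokenize(html):
    parser = TagTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens
```

With this sketch, a color set in a font tag yields a token, while the
same color set in a body tag yields nothing.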

Bogofilter has never included the equals sign or quotes in tokens.  The
tokens without the special characters have been considered good enough.

I suppose it's time for a test to see what happens when they ('=' and
'"') are allowed and what happens when tags allowing color info are
scanned.  Are there any volunteers to change the lexer to support this
test?  If so, make the changes, send me the patch, and I'll run the
test.
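
For anyone considering it, the two tokenizations being compared look
roughly like the Python sketch below.  This is only an illustration of
the idea, not a patch; the real change would go in the flex lexer.
ATTR_RE, attr_tokens, and the keep_punct switch are hypothetical names,
and the attribute pattern is a simplifying assumption.

```python
import re

# Assumed pattern for one attribute: name, '=', then a double-quoted,
# single-quoted, or bare value.
ATTR_RE = re.compile(
    r'''([A-Za-z][A-Za-z0-9-]*)\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s>]+))'''
)


def attr_tokens(tag_internals, keep_punct=False):
    """Tokenize the inside of one tag, with or without '=' and quotes."""
    tokens = []
    for m in ATTR_RE.finditer(tag_internals):
        name = m.group(1).lower()
        value = m.group(2) or m.group(3) or m.group(4) or ""
        if keep_punct:
            # Proposed variant: one token that keeps '=' and the quotes.
            tokens.append(f'{name}="{value}"')
        else:
            # Current style: separate tokens, special characters dropped.
            tokens.append(name)
            tokens.extend(re.findall(r"[A-Za-z0-9]+", value))
    return tokens
```

Called on font color="white", the default mode gives the tokens color
and white, while keep_punct=True gives the single token color="white".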

David



