obscured URL not being tokenized

Sun Dec 21 05:07:07 CET 2003

On the subject of whether or not bogofilter should be 'bloated' to 
include recognition of hidden text, I think that the following should be 
considered:

The theory behind the filter is sound, but only if you follow the 
rules.  In this case, the rule is that you train the filter, and test 
messages, on what the end user actually sees in their email client.  
That is what the spammers truly care about: what you see.  They know 
that your spam filter sees a lot of content that they can hide from you, 
so they include that content in the message to trick the filter.

Now, I understand that we want to keep bogofilter as simple as 
possible, but we should also consider whether we want bogofilter to be 
comparing apples and oranges.

I think that we need to decide either not to attack the issue at ALL 
or, if we do address it, to do it the right way.  The right way is to 
"render" the email message virtually within bogofilter and then run the 
filter on the rendered text.  The rendering would be much simpler than a 
graphical renderer, but it would interpret the HTML much as a browser 
would.  This would definitely bloat the code, but seeing how a lot of my 
spam IS in HTML, it might be beneficial.
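
To make the "render, then filter" idea concrete, here is a minimal 
sketch in Python.  It is only an illustration under stated assumptions, 
not bogofilter code: the names VisibleText and render are hypothetical, 
and the only "rendering" done is dropping comments, skipping script and 
style contents, and decoding character references.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Hypothetical helper: collects roughly what a reader would see."""

    SKIP = {"script", "style"}          # contents are never displayed
    BREAKS = {"br", "p", "div"}         # treated as word separators

    def __init__(self):
        # convert_charrefs=True decodes references such as &#97; to 'a'
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BREAKS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    # Comments reach handle_comment(), which does nothing by default,
    # so text split by a comment is joined back together.


def render(html_text):
    """Return the visible text with whitespace collapsed."""
    parser = VisibleText()
    parser.feed(html_text)
    parser.close()                      # flush any buffered text
    return " ".join("".join(parser.parts).split())
```

The filter would then tokenize the output of render() instead of the 
raw message body.  A real implementation would also have to deal with 
colors, sizes and tables, which this sketch ignores.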

In the interim, I think that David's ideas for giving bogofilter some 
ability to recognize the color tags are definitely a step in the right 
direction.

-Dan

David Relson wrote:
> Tom,
> 
> Of necessity, bogofilter is already doing some html interpretation:
> 
> comments - removed from consideration so "ju<!xxx>nk" is recognized
> tags a, img, font - innards parsed because of demonstrable benefits
> html character decoding - "&#97;" converted to 'a'
> url character decoding - "%30" converted to '0'
> etc
> 
> For the heck of it, I'm experimenting with color as regards hidden text.
>  The following ideas have become clear:
> 
> Proper color processing requires a stack to track levels (nesting) and
> attributes.
> 
> Including additional html tags in the parser _may_ be space efficient --
> remains to be seen.  Including additional html tags in the parser will
> _definitely_ be faster than a preprocessor.
> 
> An alternate experiment could accept "size=-5", "color=red",
> "color=#ff0000", etc as tokens.
> 
> It'd be interesting to see which method is more effective.
> 
> David
>
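
The stack David mentions for tracking nested color attributes could 
look something like the following Python sketch.  It is an illustration 
only, not his experiment: ColorTracker, split_by_visibility and the 
small NAMED table are hypothetical, the default colors are assumptions, 
and text counts as hidden only when the current font color exactly 
matches the background.

```python
from html.parser import HTMLParser

# Assumed, deliberately tiny table of named colors.
NAMED = {"white": "#ffffff", "black": "#000000", "red": "#ff0000"}


def normalize(color):
    """Lower-case a color value and map known names to hex form."""
    color = color.strip().lower()
    return NAMED.get(color, color)


class ColorTracker(HTMLParser):
    """Sorts words into visible and hidden by tracking <font color>."""

    def __init__(self):
        super().__init__()
        self.background = "#ffffff"     # assumed default background
        self.stack = ["#000000"]        # assumed default text color
        self.visible = []
        self.hidden = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "body" and attrs.get("bgcolor"):
            self.background = normalize(attrs["bgcolor"])
        elif tag == "font":
            color = attrs.get("color")
            # Push the inherited color when none is given, so that the
            # matching </font> always pops exactly one level.
            self.stack.append(normalize(color) if color else self.stack[-1])

    def handle_endtag(self, tag):
        if tag == "font" and len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        same = self.stack[-1] == self.background
        (self.hidden if same else self.visible).extend(data.split())


def split_by_visibility(html_text):
    """Return (visible_words, hidden_words) for an HTML fragment."""
    parser = ColorTracker()
    parser.feed(html_text)
    parser.close()
    return parser.visible, parser.hidden
```

For example, given white text on a white background with a red word 
nested inside it, only the red word and the text outside the font tags 
end up in the visible list.  Near-matching colors, CSS and tiny font 
sizes are not handled here.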