obscured URL not being tokenized

Dan Singletary dvsing at sonicspike.net
Sun Dec 21 20:58:51 CET 2003


Has any thought been put into not only registering single tokens, as 
bogofilter does now, but also registering pairs of adjacent tokens, so 
that "color, white" or "display, none" would each be a token?  This 
might improve bogofilter's accuracy, because two adjacent tokens often 
carry more meaning than either one alone -- "click here" comes to mind.
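
Something along these lines is what I mean (a rough Python sketch, not 
bogofilter's actual lexer; the word pattern is simplified):

import re

def tokens(text):
    # Simplified stand-in for bogofilter's lexer: lowercase words.
    return re.findall(r"[a-z][a-z0-9'.-]*", text.lower())

def unigrams_and_bigrams(text):
    toks = tokens(text)
    # Register each single token as bogofilter does now...
    for tok in toks:
        yield tok
    # ...plus each adjacent pair as a compound token.
    for first, second in zip(toks, toks[1:]):
        yield first + " " + second

# "click here" becomes a token of its own, alongside "click" and "here".
print(list(unigrams_and_bigrams("Click here for a free offer")))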

-Dan

Tom Anderson wrote:
> On Sat, 2003-12-20 at 23:07, Dan Singletary wrote:
> 
>>Now, I understand that we want to keep bogofilter as simple as 
>>possible... but we should also take into consideration whether we 
>>want bogofilter to be comparing apples and oranges or not.
> 
> 
> Obviously we want to ignore comments and other invisible stuff, however,
> not at the expense of turning bogofilter from a Bayesian filter into a
> rules-based filter.  The surprising and wonderful characteristic of a
> Bayesian filter is that it finds significance in things you would just
> throw away, and discovers that what you thought was significant hardly
> even registers.  If "display:none" is used extensively in spams, and
> almost never in hams, then that'll tip the scale way into the spam
> direction irrespective of the 50 slightly hammish terms following it. 
> It might slip past the filter for the first dozen times, but as you
> train on error, bogofilter quickly learns.
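> 
> To make that concrete, here's a rough Python sketch of Graham-style
> scoring (the counts are invented and this isn't bogofilter's exact
> formula, but it shows why one strong token dominates):
> 
> def spamicity(spam_count, ham_count, total_spam, total_ham):
>     # Per-token spam probability from training counts.
>     g = ham_count / max(total_ham, 1)
>     b = spam_count / max(total_spam, 1)
>     return b / (b + g) if (b + g) > 0 else 0.5
> 
> def combine(probs, keep=15):
>     # Combine only the most "interesting" tokens (furthest from
>     # 0.5), as in Graham's scheme, via the naive-Bayes product rule.
>     extreme = sorted(probs, key=lambda x: abs(x - 0.5), reverse=True)
>     p = q = 1.0
>     for x in extreme[:keep]:
>         p *= x
>         q *= 1.0 - x
>     return p / (p + q)
> 
> # "display:none" seen in 990 of 1000 spams, 1 of 1000 hams: ~0.999.
> strong = spamicity(990, 1, 1000, 1000)
> # Fifty slightly hammish tokens (0.45 each) barely move the verdict:
> print(combine([strong] + [0.45] * 50))   # ~0.98, still clearly spam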
> 
> If you really wanted full html rendering, my suggestion would be to use
> a pre-filter which does that and then wraps the offending text in an
> html comment ("<!-- -->").  This way, bogofilter could contain only very
> simplistic html removal and remain very slim and fast, concentrating on
> its Bayesian functionality.  The html rendering could thus be turned on
> or off modularly.
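> 
> A trivial version of that pre-filter might look like this (Python
> sketch; the regex is a toy stand-in for a real renderer's visibility
> test, but the comment-wrapping is the essential part):
> 
> import re
> 
> # Toy visibility test: catch inline display:none spans only.  A real
> # pre-filter would ask the renderer what is actually invisible.
> INVISIBLE = re.compile(
>     r'(<span[^>]*display\s*:\s*none[^>]*>.*?</span>)',
>     re.IGNORECASE | re.DOTALL)
> 
> def prefilter(html):
>     # Wrap hidden text in an html comment so that bogofilter's
>     # simple comment-stripping discards it downstream.
>     return INVISIBLE.sub(r'<!-- \1 -->', html)
> 
> print(prefilter('buy<span style="display:none">stock</span>now'))
> # -> buy<!-- <span style="display:none">stock</span> -->now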
> 
> 
>>rendering would be much simplified from a graphical renderer, but would 
>>basically render the HTML just as any other browser would.  This would 
>>definitely bloat the code, but seeing how a lot of my spam IS in HTML, 
>>it might be beneficial.
> 
> 
> For the html module, I'd suggest pulling some code from the Mozilla
> project.  You can't do it half-assed or you'll miss many of the
> spammers' sneaky tricks.  It needs to be a fully functional renderer,
> including css and javascript.  If you can use a Gecko API, then upgrades
> to future versions of html, etc., will be simple.  Then on top of that,
> you need to add image recognition to determine what color of a
> background image sits behind which color of text.  Then you'll need
> character recognition to parse out the text contained in an image.  This
> won't be a small module, nor necessarily a very effective one.
> 
> 
>>In the interim, I think that David's ideas for giving bogofilter some 
>>ability to recognize the color tags are definitely a step in the right 
>>direction.
> 
> 
> All a spammer needs to do is encode the spam message in a "noisy" image
> above the fold, and then put some hammish terms way at the bottom,
> hidden by position rather than trickery of color.  Then you'd be back to
> relying on Bayes, which would probably pull out "width='500'" and
> "height='400'" from the IMG tag as very spammy.
> 
> My point is that with rules-based filters, you're in an arms race
> against spammers.  Tricks and counter-tricks.  With a Bayesian filter,
> this isn't the case... just train the filter on errors and it quickly
> learns the new tricks and flags them appropriately without modifying a
> single line of code.
> 
> Tom



