obscured URL not being tokenized
Dan Singletary
dvsing at sonicspike.net
Sun Dec 21 20:58:51 CET 2003
Has any thought been put into registering not only single tokens, as
bogofilter does now, but also dual tokens, so that "color, white" or
"display, none" would each be a token? This might enhance bogofilter's
accuracy, because you often get extra meaning from looking at two
adjacent tokens -- "click here" comes to mind.
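The dual-token idea is essentially bigram tokenization. A rough sketch (a
hypothetical illustration in Python, not bogofilter's actual lexer, which
is written in C):

```python
import re

def tokenize(text):
    """Split a message into lowercase word tokens (much simplified)."""
    return re.findall(r"[a-z0-9'$!-]+", text.lower())

def bigrams(tokens):
    """Emit each adjacent pair of tokens as one combined token,
    so "click here" gets its own wordlist entry."""
    return [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]

tokens = tokenize("Click here for a free offer")
print(bigrams(tokens))
# ['click here', 'here for', 'for a', 'a free', 'free offer']
```

In practice you would register both the unigrams and the bigrams, at the
cost of a noticeably larger wordlist.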
-Dan
Tom Anderson wrote:
> On Sat, 2003-12-20 at 23:07, Dan Singletary wrote:
>
>>Now, I understand that we want to keep bogofilter as simple as
>>possible.... but we should also consider whether we want bogofilter
>>to be comparing apples and oranges or not.
>
>
> Obviously we want to ignore comments and other invisible stuff, however,
> not at the expense of turning bogofilter from a Bayesian filter into a
> rules-based filter. The surprising and wonderful characteristic of a
> Bayesian filter is that it finds significance in things you would just
> throw away, and discovers that what you thought was significant hardly
> even registers. If "display:none" is used extensively in spams, and
> almost never in hams, then that'll tip the scale way into the spam
> direction irrespective of the 50 slightly hammish terms following it.
> It might slip past the filter for the first dozen times, but as you
> train on error, bogofilter quickly learns.
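A toy example of why one heavily spammy token can outweigh several mildly
hammish ones -- this is a Graham-style naive-Bayes combination for
illustration only, not bogofilter's actual Robinson-Fisher calculation:

```python
def combined_spamicity(probs):
    """Combine per-token spam probabilities naive-Bayes style:
    P(spam) = prod(p) / (prod(p) + prod(1 - p))."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in probs:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

# One token seen almost only in spam (0.99) against five mildly
# hammish tokens (0.40 each) still scores well above the 0.5 cutoff:
score = combined_spamicity([0.99] + [0.40] * 5)
print(round(score, 2))  # 0.93
```

So "display:none" showing up as a 0.99 token really can tip the scale on
its own, exactly as described above.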
>
> If you really wanted full html rendering, my suggestion would be to use
> a pre-filter which does that and then outputs an html comment "<!-- -->"
> around the offending text. This way, bogofilter could contain only very
> simplistic html removal and remain very slim and fast, concentrating on
> its Bayesian functionality. The html rendering could thus be turned on
> or off modularly.
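Such a pre-filter might look roughly like this sketch, which comments out
the contents of display:none spans so a simple downstream tokenizer skips
them (an assumption-laden toy: a real pre-filter would need full CSS
cascade handling, not one regex):

```python
import re

# Match a <span> whose inline style hides it, capturing its contents.
HIDDEN = re.compile(
    r'<span[^>]*style="[^"]*display:\s*none[^"]*"[^>]*>(.*?)</span>',
    re.IGNORECASE | re.DOTALL)

def comment_out_hidden(html):
    """Wrap text hidden by display:none in an HTML comment so a
    filter with simplistic comment-stripping never tokenizes it."""
    return HIDDEN.sub(lambda m: "<!-- " + m.group(1) + " -->", html)

msg = '<p>Buy now</p><span style="display:none">innocent words</span>'
print(comment_out_hidden(msg))
# <p>Buy now</p><!-- innocent words -->
```

Run as a pipe stage in front of bogofilter, this keeps the rendering
logic modular, as suggested above.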
>
>
>>rendering would be much simplified from a graphical renderer, but would
>>basically render the HTML just as any other browser would. This would
>>definitely bloat the code, but seeing how a lot of my spam IS in HTML,
>>it might be beneficial.
>
>
> For the html module, I'd suggest pulling some code from the Mozilla
> project. You can't do it half-assed or you'll miss many of the
> spammers' sneaky tricks. It needs to be a fully functional renderer,
> including css and javascript. If you can use a Gecko API, then upgrades
> to future versions of html, etc., will be simple. Then on top of that,
> you need to add in image recognition to determine what color of a
> background image is behind which color of text. Then you'll need
> character recognition to parse out the text contained in an image. This
> won't be a small module, and not necessarily very effective.
>
>
>>In the interim, I think that David's ideas for giving bogofilter some
>>ability to recognize the color tags are definitely a step in the right
>>direction.
>
>
> All a spammer needs to do is encode the spam message in a "noisy" image
> above the fold, and then put some hammish terms way at the bottom,
> hidden by position rather than trickery of color. Then you'd be back to
> relying on Bayes, which would probably pull out "width='500'" and
> "height='400'" from the IMG tag as very spammy.
>
> My point is that with rules-based filters, you're in an arms race
> against spammers. Tricks and counter-tricks. With a Bayesian filter,
> this isn't the case... just train the filter on errors and it quickly
> learns the new tricks and flags them appropriately without modifying a
> single line of code.
>
> Tom