spaced out spam words

David Relson relson at osagesoftware.com
Sat Jun 10 12:46:13 CEST 2006


On Sat, 10 Jun 2006 12:05:12 +0200 (CEST)
Tony L. Svanstrom wrote:

> On Fri, 9 Jun 2006 the voices made David Relson write:
> 
> DR> HTML is a complex issue.  There are lots of tricks possible, for
> DR> example bogus tags and putting single letters in each cell of a
> DR> table.  HTML also allows "camouflaged" text (think white on white)
> DR> that a human won't see but a computer program will.  I'm unaware
> DR> of algorithms for successfully dealing with camo.
> 
>  Considering that you'd have to be as close as possible to 100%
> compatible with the current webbrowsers out there (or there'll be
> hacks to get around the filter), you probably would have to use a
> webbrowser to render the page, and then use OCR to compare the
> resulting image/page with the source (to find if there's hidden text,
> and probably using the source to improve the OCR's accuracy).
> 
>  But, of course, that you could get around simply by using javascript
> to set the webpage to the size of the screen and then use a :hover on
> the main bodytag to show/hide material.
> 
> 
> And people wonder why I block HTML-emails from most of my
> emailaccounts... =/
> 
> 
> 	/Tony

Tony,

Good idea.  I think you've got it!  Which engine should we use?

As you indicate, dealing with HTML is highly complex due to issues like
camouflage, JavaScript, and the difference between plain text
(underlying the email protocols) and how a message is rendered (for
human reading).  Using a browser engine would make bogofilter much
slower, which we don't want.
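
To make the camouflage problem concrete, here is a minimal sketch of the
cheapest possible check: scan the HTML source, without rendering it, for
text whose inline style sets the same foreground and background color.
This is an illustration under stated assumptions, not anything bogofilter
does; the names CamoFinder and find_camouflaged_text are hypothetical.
It only compares literal inline style values, so it misses stylesheets,
<font> attributes, near-identical colors, and anything done by script,
which is exactly why a real answer would seem to need a rendering engine.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


def parse_style(style):
    """Parse an inline style attribute into a {property: value} dict."""
    props = {}
    for decl in style.split(";"):
        if ":" in decl:
            name, value = decl.split(":", 1)
            props[name.strip().lower()] = value.strip().lower()
    return props


class CamoFinder(HTMLParser):
    """Hypothetical helper: collects text styled with identical
    'color' and 'background-color' values (inline styles only)."""

    def __init__(self):
        super().__init__()
        self.stack = []        # (tag, is_camouflaged) for each open element
        self.hidden_text = []  # text found inside camouflaged elements

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        style = parse_style(dict(attrs).get("style") or "")
        fg = style.get("color")
        bg = style.get("background-color")
        camo = fg is not None and fg == bg
        # Children of a camouflaged element are treated as camouflaged too.
        inherited = bool(self.stack) and self.stack[-1][1]
        self.stack.append((tag, camo or inherited))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerates sloppy markup.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][1] and data.strip():
            self.hidden_text.append(data.strip())


def find_camouflaged_text(html):
    """Return the list of text fragments the naive check considers hidden."""
    finder = CamoFinder()
    finder.feed(html)
    finder.close()
    return finder.hidden_text
```

A spammer defeats this by writing color:#fff against
background-color:#fffffe, so treat it as a demonstration of the limits
of source-level inspection rather than as a usable filter.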

My local server statistics for June are 4211 spam classified as spam and
3 classified as unsure.  No false positives that I'm aware of, no false
negatives that I remember.

Bogofilter at osagesoftware.com is doing fine.  Bogofilter is also
checking non-subscriber messages to the bogofilter mailing lists.  Each
day it classifies those messages and emails a list of spam scores and
subject lines so I can decide if any of the messages need to be
accepted.  If not, they're automatically archived.  The process is much
quicker than the old "gotta run mailman and review messages" routine.

I question the value of run-time token length setting and
multi-word tokens.  However, they're not too difficult to implement ...
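
For readers wondering what those two features would look like, here is
a rough sketch in Python rather than bogofilter's C.  Everything in it
is an assumption made for illustration: the function name tokenize, the
parameters max_token_len and max_words, the "*" separator, and the
character class used to split words.  It is not bogofilter's lexer.

```python
import re

# Assumed word pattern for this sketch: letters, digits, and a few
# characters that commonly appear inside spammy words.
WORD_RE = re.compile(r"[A-Za-z0-9$'-]+")


def tokenize(text, max_token_len=30, max_words=2):
    """Return single-word tokens plus multi-word tokens of up to
    max_words adjacent words, joined with '*'.

    Words longer than max_token_len are dropped before grouping, so a
    multi-word token may span a dropped word.
    """
    words = [w for w in WORD_RE.findall(text.lower())
             if len(w) <= max_token_len]
    tokens = list(words)
    for n in range(2, max_words + 1):
        for i in range(len(words) - n + 1):
            tokens.append("*".join(words[i:i + n]))
    return tokens


print(tokenize("Buy cheap meds"))
# ['buy', 'cheap', 'meds', 'buy*cheap', 'cheap*meds']
```

With max_words raised to 6, a spaced-out word such as "v i a g r a"
yields the single token "v*i*a*g*r*a" alongside the one-letter tokens,
which is the case for multi-word tokens in this thread.  The cost is a
larger wordlist, since every extra word of span adds nearly one more
token per word of message text.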

Regards,

David


