Including html-tag contents may be unnecessary

David Relson relson at osagesoftware.com
Mon May 12 02:36:42 CEST 2003


Tony,

A good reply.  I'm glad you wrote it before heading for bed :-)

At 07:41 PM 5/11/03, Tony L. Svanstrom wrote:

>On Sun, 11 May 2003 the voices made David Relson write:
>
>DR> Looking at the list of suggestions, I see them as dividing into 3 types:
>DR>
>DR> 1 - case folding
>DR> 2 - changed token definitions (tagged header fields, money,
>DR>     exclamation point)
>DR> 3 - html changing (process a, font, and img tags)
>
>  #3 IMHO HTML should be ignored (in the sense that you only deal with the
>text as it would be viewed by someone in the spammer's target group; a very
>complicated way of ignoring it). Once that's working you start looking at what
>tokens you can extract/use.

Bogofilter _does_ need some more work on html in "eye space".  Currently it
doesn't distinguish between tags that separate text into words, for example
<br> and <p>, and ones that don't, for example <font...>.  Processing of
tags for "meaningful" information, for example urls, is a separate task.
Likely, the easier task will be done first - with "easier" being determined
by whoever takes on html processing.
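
To make the "eye space" distinction concrete, here is a minimal sketch in
Python (not bogofilter's C code, and not how bogofilter currently behaves).
The tag list BREAKING_TAGS and the function names are assumptions made up
for illustration; a real lexer would also need to handle comments, entities,
and malformed markup.

```python
import re

# Hypothetical set of tags that visually separate words when rendered.
# This list is an illustration only, not taken from bogofilter.
BREAKING_TAGS = {"br", "p", "div", "td", "tr", "li"}

# Matches an opening or closing tag and captures the tag name.
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def visible_text(html):
    """Replace word-separating tags with a space and drop all other tags."""
    def repl(match):
        return " " if match.group(1).lower() in BREAKING_TAGS else ""
    return TAG_RE.sub(repl, html)


def eye_space_tokens(html):
    """Split the rendered-looking text into whitespace-delimited tokens."""
    return visible_text(html).split()


if __name__ == "__main__":
    # <font> tags vanish, so the fragments rejoin; <br> splits the words.
    print(eye_space_tokens("vi<font size=1>ag</font>ra<br>now"))
```

With this approach a word that a spammer splits with <font> tags comes back
as one token, while text on either side of <br> or <p> stays separate.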

>  #1 Case folding... standard should be to ignore upper/lower case; but for
>those of us that get a lot of e-mails (10k per month?) there should be the
>option to not ignore it. It'd also be nice to be able to set different
>expiration dates on tokens depending on how common they are; this ought to
>be a good way to control how large one's database becomes.

The patch I sent out earlier has a command line switch and a config file 
option for enabling case sensitivity.  Bogofilter's default (case 
insensitivity) hasn't changed.
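
The effect of such an option can be sketched as follows. This is an
illustration in Python, not the patch itself; the parameter name fold_case
is hypothetical and does not correspond to the actual switch or config
option name.

```python
from collections import Counter


def tokenize(text, fold_case=True):
    """Split on whitespace; lower-case each token when fold_case is set."""
    words = text.split()
    return [w.lower() for w in words] if fold_case else words


if __name__ == "__main__":
    sample = "FREE Free free offer"
    # Folding merges the three spellings into a single token.
    print(Counter(tokenize(sample)))
    # Case-sensitive mode keeps them as distinct tokens.
    print(Counter(tokenize(sample, fold_case=False)))
```

Case-sensitive mode keeps more distinct tokens, which is why it mostly makes
sense for users with enough mail to train on.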

>  #2 I won't say much about this, besides that it should be easy for the
>user to pick what headers to ignore, or not ignore.

At the moment, tagging header fields is an all or nothing capability.  The 
default is to _not_ do it and there's a switch and an option to turn it on.

With the patch, tagging applies to "To:", "From:", "Subject:", and 
"Return-Path:".  Why do you think it's necessary to be more selective 
(finer grained)?
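
For readers unfamiliar with the idea, header tagging can be sketched like
this in Python. The four field names come from the patch description above;
the "field:" prefix format, the tag_fields parameter, and the function name
are assumptions for illustration and may not match what the patch emits.

```python
from email import message_from_string

# The four header fields named in the patch description.
TAGGED_FIELDS = ("To", "From", "Subject", "Return-Path")


def header_tokens(raw_message, tag_fields=False):
    """Return tokens from the selected headers, optionally prefixed.

    The "field:" prefix is a hypothetical format chosen for this sketch.
    """
    msg = message_from_string(raw_message)
    tokens = []
    for field in TAGGED_FIELDS:
        for value in msg.get_all(field, []):
            for word in value.split():
                tokens.append(f"{field.lower()}:{word}" if tag_fields else word)
    return tokens


if __name__ == "__main__":
    raw = "From: a@example.com\nSubject: free offer\n\nbody free\n"
    print(header_tokens(raw, tag_fields=True))
```

Tagging lets "free" in a Subject line be scored separately from "free" in
the body. A finer-grained version would take the field list as a user
setting instead of a fixed tuple.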

>  Or should I have deleted this e-mail and gone to bed, as I was planning
>before I thought I'd write something smart about this? =D

Discussion of ideas is good.  I'm glad you didn't delete it.

David





