Including html-tag contents may be unnecessary
David Relson
relson at osagesoftware.com
Mon May 12 02:36:42 CEST 2003
Tony,
A good reply. I'm glad you wrote it before heading for bed :-)
At 07:41 PM 5/11/03, Tony L. Svanstrom wrote:
>On Sun, 11 May 2003 the voices made David Relson write:
>
>DR> Looking at the list of suggestions, I see them as dividing into 3 types:
>DR>
>DR> 1 - case folding
>DR> 2 - changed token definitions (tagged header fields, money, exclamation point)
>DR> 3 - html changing (process a, font, and img tags)
>
> #3 IMHO HTML should be ignored (in the sense that you only deal with the text
>as it would be viewed by someone in the spammer's target group; a very
>complicated way of ignoring it). Once that's working you start looking at what
>tokens you can extract/use.
Bogofilter _does_ need some more work on html in "eye space". Currently it
doesn't distinguish between tags that separate text into words, for example
<br> and <p>, and ones that don't, for example <font...>. Processing of
tags for "meaningful" information, for example urls, is separate. Likely,
the easier task will be done first - with "easier" being determined by
whoever takes on html processing.
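
To make the "eye space" distinction concrete, here is a minimal sketch in Python. It is not bogofilter's lexer; the tag list and the function names are assumptions made up for illustration. Tags assumed to break text into words are replaced by a space, and all other tags are simply dropped so the text on either side joins up.

```python
import re

# Hypothetical set of tags assumed to separate text into words.
WORD_BREAKING_TAGS = {"br", "p", "div", "td", "tr", "li"}

# Matches an opening or closing tag and captures its name.
TAG_RE = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def eye_space_text(html):
    """Return the text roughly as a reader would see it."""
    def replace_tag(match):
        name = match.group(1).lower()
        # Word-breaking tags become a space; other tags vanish.
        return " " if name in WORD_BREAKING_TAGS else ""
    return TAG_RE.sub(replace_tag, html)


def tokens(html):
    """Split the eye-space text on whitespace."""
    return eye_space_text(html).split()
```

With this sketch, tokens('V<font color="red">ia</font>gra<br>now') gives the two tokens "Viagra" and "now", because <font...> does not split the word while <br> does.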
> #1 Case folding... standard should be to ignore upper/lower case; but for
>those of us that get a lot of e-mails (10k per month?) there should be the
>option to not ignore it. It'd also be nice to be able to set different
>expiration dates on tokens depending on how common they are; this ought to be a
>good way to control how large one's database becomes.
The patch I sent out earlier has a command line switch and a config file
option for enabling case sensitivity. Bogofilter's default (case
insensitivity) hasn't changed.
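
As a rough illustration of what the option changes (a sketch only; fold_case is a hypothetical parameter name, not the patch's actual switch or config keyword), folding case merges tokens that differ only in capitalization, while case sensitivity keeps them as separate entries:

```python
def make_token(word, fold_case=True):
    """Return the form of the word used as the token key."""
    return word.lower() if fold_case else word


def count_tokens(words, fold_case=True):
    """Count occurrences of each token, optionally folding case."""
    counts = {}
    for word in words:
        key = make_token(word, fold_case)
        counts[key] = counts.get(key, 0) + 1
    return counts
```

Counting "FREE", "free", and "Free" with folding on yields a single token seen three times; with folding off it yields three tokens seen once each, which is why a case-sensitive wordlist grows faster.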
> #2 I won't say much about this, besides that it should be easy for the user to
>pick what headers to ignore, or not ignore.
At the moment, tagging header fields is an all-or-nothing capability. The
default is to _not_ do it and there's a switch and an option to turn it on.
With the patch, tagging applies to "To:", "From:", "Subject:", and
"Return-Path:". Why do you think it's necessary to be more selective
(finer grained)?
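
For anyone unsure what tagging means here, a minimal sketch follows. The prefix format (lowercased field name plus a colon) and the names header_tokens and tag_headers are assumptions for illustration, not necessarily what the patch emits. The idea is that a word in a tagged header becomes a different token from the same word in the body.

```python
# The four fields the patch tags, per the text above.
TAGGED_HEADERS = ("To", "From", "Subject", "Return-Path")
_TAGGED_LOWER = {name.lower() for name in TAGGED_HEADERS}


def header_tokens(name, value, tag_headers=False):
    """Tokenize one header value, prefixing tokens when tagging is on."""
    words = value.split()
    if tag_headers and name.lower() in _TAGGED_LOWER:
        # Hypothetical prefix scheme: "subject:free", "from:bob", ...
        prefix = name.lower() + ":"
        return [prefix + word for word in words]
    return words
```

With tagging on, a "Subject: free money" line yields "subject:free" and "subject:money"; with it off (the default), it yields plain "free" and "money". A header outside the four listed is left untagged either way.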
> Or should I have deleted this e-mail and gone to bed, as I was planning before
>I thought I'd write something smart about this? =D
Discussion of ideas is good. I'm glad you didn't delete it.
David