Including html-tag contents may be unnecessary

Tony L. Svanstrom tony at moon.pp.se
Mon May 12 01:41:31 CEST 2003


On Sun, 11 May 2003 the voices made David Relson write:

DR> Looking at the list of suggestions, I see them as dividing into 3 types:
DR>
DR> 1 - case folding
DR> 2 - changed token definitions (tagged header fields, money, exclamation point)
DR> 3 - html changing (process a, font, and img tags)

 #3 IMHO HTML should be ignored, in the sense that you only deal with the text
as it would be viewed by someone in the spammer's target group (a very
complicated way of ignoring it). Once that's working, you can start looking at
which tokens you can extract/use.
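
 To make "the text as it would be viewed" a bit more concrete, here is a rough
Python sketch using the standard library's html.parser. It is only an
illustration, not how bogofilter does it: the names VisibleText and
visible_text, and the SKIP and BLOCK tag lists, are made up for this example,
and a real mail reader hides or shows far more than this handles.

```python
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect only the text a reader would plausibly see (hypothetical helper)."""

    # Assumption: content of these elements is never displayed in the body.
    SKIP = {"script", "style", "title"}
    # Assumption: these tags visually separate words, so emit a space for them.
    BLOCK = {"p", "br", "div", "td", "tr", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append(" ")

    def handle_data(self, data):
        # Comments never reach handle_data, so "V<!-- x -->iagra" rejoins itself.
        if self.skip_depth == 0:
            self.parts.append(data)


def visible_text(html):
    parser = VisibleText()
    parser.feed(html)
    parser.close()
    # Join the fragments, then collapse runs of whitespace.
    return " ".join("".join(parser.parts).split())
```

 Inline tags such as font contribute nothing, so a word split by a comment or
a font tag comes back as one token, which is the point of ignoring the markup.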

 #1 Case folding... the default should be to ignore upper/lower case, but for
those of us who get a lot of e-mail (10k per month?) there should be an
option not to ignore it. It'd also be nice to be able to set different
expiration dates on tokens depending on how common they are; this ought to be a
good way to control how large one's database becomes.
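
 A small Python sketch of both ideas, under loudly stated assumptions: the
function names, the (count, last_seen_day) record layout, and the 30/365-day
limits are all invented for illustration, and the rule "rarer tokens expire
sooner" is just one possible reading of "depending on how common they are".

```python
def tokenize(text, fold_case=True):
    """Split on whitespace; fold to lower case unless the user opts out."""
    words = text.split()
    return [w.lower() for w in words] if fold_case else words


def is_expired(count, last_seen_day, today, base_days=30, max_days=365):
    """Assumed policy: a token's lifetime grows with how often it was seen."""
    lifetime = min(max_days, base_days * count)
    return today - last_seen_day > lifetime


def prune(db, today):
    """Return a copy of db without expired tokens.

    db maps token -> (count, last_seen_day); this layout is hypothetical.
    """
    return {
        tok: (count, last)
        for tok, (count, last) in db.items()
        if not is_expired(count, last, today)
    }
```

 With case folding off, "FREE" and "free" become separate tokens, so the
database grows faster; that is why pruning rare tokens early matters more for
people who turn the option on.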

 #2 I won't say much about this, except that it should be easy for the user to
pick which headers to ignore, or not ignore.
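
 For illustration only, user-selectable headers could look something like the
Python sketch below, using the standard email module. The function name
header_tokens, the default ignore set, and the "name:word" tagging format are
assumptions made up for this example, not an existing bogofilter option.

```python
from email import message_from_string


def header_tokens(raw_message, ignore=frozenset({"received", "message-id"})):
    """Return header words tagged with their field name, skipping ignored fields.

    ignore holds lower-cased header names chosen by the user (hypothetical).
    """
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        key = name.lower()
        if key in ignore:
            continue
        # Tag each word so "subject:free" and "from:free" stay distinct.
        tokens.extend(f"{key}:{word}" for word in value.split())
    return tokens
```

 Passing a different ignore set is all a user would need to do to pick which
headers count.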


 Or should I have deleted this e-mail and gone to bed, as I was planning to
before I thought I'd write something smart about this? =D

-- 
  .-------------------------------------------------------------------.
  | Per scientiam ad libertatem! (Through knowledge towards freedom!) |
  `-------------------------------------------------------------------´
                   << ©1998-2003 tony at svanstrom.com >>





More information about the Bogofilter mailing list