Including html-tag contents may be unnecessary

Tony L. Svanstrom tony at moon.pp.se
Mon May 12 14:07:04 CEST 2003


On Sun, 11 May 2003 the voices made David Relson write:

DR> A good reply.  I'm glad you wrote it before heading for bed :-)

 Thank you; it isn't always easy to tell when one's been out drinking while
watching those stupid Canadians beating Sweden for the world championship
(hockey). =D

DR> At 07:41 PM 5/11/03, Tony L. Svanstrom wrote:

DR> >  #3 IMHO HTML should be ignored (in the sense that you only deal with the
DR> >text as it would be viewed by someone in the spammer's target group; a very
DR> >complicated way of ignoring it). Once that's working you start looking at what
DR> >tokens you can extract/use.
DR>
DR> Bogofilter _does_ need some more work on html in "eye space".  Currently it
DR> doesn't distinguish between tags that separate text into words, for example
DR> <br> and <p>, and ones that don't, for example <font...>.  Processing of
DR> tags for "meaningful" information, for example urls, is separate.  Likely,
DR> the easier task will be done first - with "easier" being determined by
DR> whoever takes on html processing.

 The problem is that once you start keeping lists of "inline" and "object" tags
the spammers will start using CSS to change things around...

 The real nightmare would be spam which consists of JavaScript using CSS to
position each and every character; bogofilter could learn to treat the JS-code
as spammy, but it'd take a long time before it catches most versions (since
there's a lot you can do to avoid using the same tokens again and again).
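
 For the tag distinction discussed above, a rough Python sketch (not bogofilter
code) of "eye space" text extraction could look like the following. The
BLOCK_TAGS set and the names EyeSpaceText and visible_words are assumptions
made for illustration; the list is incomplete, and it does nothing about the
CSS and JavaScript tricks mentioned above.

```python
from html.parser import HTMLParser

# Assumed, incomplete list of tags that visually separate words.
# Anything not listed here (for example <font>) is treated as inline.
BLOCK_TAGS = {"br", "p", "div", "td", "tr", "li", "h1", "h2", "h3"}


class EyeSpaceText(HTMLParser):
    """Collect text roughly as a reader would see it."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # A block-level tag breaks the word; an inline tag does not.
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def visible_words(html):
    """Return the whitespace-separated words of the rendered text."""
    parser = EyeSpaceText()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).split()
```

 With this, 'V<font color="red">ia</font>gra<br>now' yields the words "Viagra"
and "now", since <font> is dropped without splitting the word and <br> acts as
a separator. HTML comments are ignored by the parser, so they do not split
words either.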

DR> >  #1 Case folding... standard should be to ignore upper/lower case; but for
DR> >those of us that get a lot of e-mails (10k per month?) there should be the
DR> >option to not ignore it. It'd also be nice to be able to set different
DR> >expiration dates on tokens depending on how common they are; this ought to
DR> >be a good way to control how large one's database becomes.
DR>
DR> The patch I sent out earlier has a command line switch and a config file
DR> option for enabling case sensitivity.  Bogofilter's default (case
DR> insensitivity) hasn't changed.

 I liked what Peter Bishop's done, but I think one might want to extend that a
bit; something like this maybe...

 V14GR4 => v14gr4, zz.viagra, yy.v14gr4.
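
 One possible reading of that expansion, as a Python sketch: the case-folded
token is always kept, and when digit-for-letter substitution changes it, two
extra tokens are added. The meaning of the "zz." and "yy." prefixes is a guess
here ("zz." for the de-obfuscated word, "yy." for the obfuscated spelling), and
the substitution table and the name expand_token are assumptions, not anything
from bogofilter or the patch.

```python
# Assumed digit-for-letter substitutions; purely illustrative.
LEET = str.maketrans({"1": "i", "4": "a", "3": "e", "0": "o", "5": "s", "7": "t"})


def expand_token(token):
    """Return the case-folded token plus extra tokens for obfuscated words."""
    folded = token.lower()
    tokens = [folded]
    plain = folded.translate(LEET)
    # Only expand when the substitution changed something and the token
    # contains at least one letter, so plain numbers are left alone.
    if plain != folded and any(c.isalpha() for c in folded):
        tokens.append("zz." + plain)   # guessed meaning: de-obfuscated form
        tokens.append("yy." + folded)  # guessed meaning: marked as obfuscated
    return tokens
```

 Here expand_token("V14GR4") gives v14gr4, zz.viagra and yy.v14gr4, matching
the example above, while an ordinary word such as "Hello" only gives "hello".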

DR> >  #2 I won't say much about this, besides that it should be easy for the
DR> >user to pick what headers to ignore, or not ignore.
DR>
DR> At the moment, tagging header fields is an all or nothing capability.  The
DR> default is to _not_ do it and there's a switch and an option to turn it on.
DR>
DR> With the patch, tagging applies to "To:", "From:", "Subject:", and
DR> "Return-Path:".  Why do you think it's necessary to be more selective
DR> (finer grained)?

 It's possible that you might want to look at certain headers simply because
that will be a good thing for friends sending e-mails via the same servers all
the time; and you might want to use all but a few if those few are added by a
server via which you're getting something like 1 ham in every 10'000 spam
(yeah, I got an address like that; I'm using procmail to catch that 1 ham
though, so it's mostly just a spamtrap).

 Basically it comes down to this: I use procmail to mess around with the
headers quite a lot, and that means that I know which headers contain good
spam/ham signs, and I'd like bogofilter to use those headers.
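
 As an illustration of finer-grained header selection, here is a small Python
sketch in which the user supplies the list of headers to tag. The function name
header_tokens and the "headername:word" token format are assumptions for this
example; they are not bogofilter's actual tagging scheme or configuration.

```python
from email import message_from_string


def header_tokens(raw_message, wanted):
    """Return tagged tokens for the headers named in `wanted` only."""
    msg = message_from_string(raw_message)
    wanted_lc = {name.lower() for name in wanted}
    tokens = []
    for name, value in msg.items():
        key = name.lower()
        if key in wanted_lc:
            # Prefix each word with its header name so that a word in one
            # header is counted separately from the same word elsewhere.
            for word in value.split():
                tokens.append(f"{key}:{word.lower()}")
    return tokens
```

 Passing ["Subject", "X-Trap"] would tag only those two headers and skip
everything else, including headers added by a server that mostly relays spam.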

 Let's end this e-mail with a feature request, a feature I'd love: Being able to
tell bogofilter how spammy I think the e-mail is; maybe something like this:

 -u 1	= 99% sure it's ham.
 -u 2	= ham(ish)
 -u 3	= dunno (as -u is today)
 -u 4	= spammy
 -u 5	= very very spammy

 I'm doing a lot of white- and blacklisting today (autowhitelist all outgoing
e-mails etc), but since there's always the risk of me getting spam with a
forged sender I'd rather do a -u 2 for autowhitelisted, and maybe -u 1 for
manually whitelisted e-mail addresses.
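
 A minimal Python sketch of how such levels might feed into token counts is
shown below. Everything in it is hypothetical: the LEVELS table, the weights,
the register function and the layout of the counts are invented for
illustration and do not describe how bogofilter stores or updates its word
lists.

```python
# Hypothetical mapping from the proposed "-u N" level to (class, weight).
LEVELS = {
    1: ("ham", 2),    # 99% sure it's ham
    2: ("ham", 1),    # ham(ish)
    3: (None, 1),     # dunno: defer to the classifier's own verdict
    4: ("spam", 1),   # spammy
    5: ("spam", 2),   # very very spammy
}


def register(counts, tokens, level, verdict):
    """Add the message's tokens to `counts` using the user's level.

    `verdict` is the classifier's own result ("ham" or "spam") and is
    only used when the level is 3.
    """
    label, weight = LEVELS[level]
    if label is None:
        label = verdict
    for tok in tokens:
        entry = counts.setdefault(tok, {"ham": 0, "spam": 0})
        entry[label] += weight
    return counts
```

 In this sketch an autowhitelisted sender would be registered with level 2 and
a manually whitelisted one with level 1, so a forged sender on the automatic
list does less damage to the ham counts.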

-- 
  .-------------------------------------------------------------------.
  | Per scientiam ad libertatem! (Through knowledge towards freedom!) |
  `-------------------------------------------------------------------´
                   << ©1998-2003 tony at svanstrom.com >>




