Including html-tag contents may be unnecessary

David Relson relson at osagesoftware.com
Mon May 12 14:45:59 CEST 2003


At 08:07 AM 5/12/03, Tony L. Svanstrom wrote:

>  The problem is that once you start keeping lists of "inline" and "object"
>tags the spammers will start using CSS to change things around...

Hi Tony,

I'm not too worried about that.  Spammers are already using MIME multipart 
messages to send one payload as plain text and another as HTML.  Bogofilter 
is handling that satisfactorily.  With the min_dev parameter, words never 
seen before don't matter.  Assuming the message gets added to the spamlist, 
the next time those words are used, the message will be scored as spam.
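
A rough sketch of the min_dev point, not bogofilter's actual code.  It 
assumes each token gets a Robinson-style score f(w) = (s*x + n*p) / (s + n) 
and that tokens scoring within min_dev of 0.5 are left out of the 
calculation.  The function names and the values used for robx, robs, and 
min_dev are illustrative assumptions.

```python
def token_score(spam_count, ham_count, spam_msgs, ham_msgs,
                robx=0.415, robs=0.01):
    """Robinson-style smoothed score for one token (assumed formula).

    robx is the score assumed for a never-seen word and robs is the
    weight given to that assumption; both values are illustrative.
    """
    n = spam_count + ham_count
    if n == 0:
        # Never seen before: fall back to the assumed prior.
        return robx
    spam_freq = spam_count / spam_msgs
    ham_freq = ham_count / ham_msgs
    p = spam_freq / (spam_freq + ham_freq)
    return (robs * robx + n * p) / (robs + n)


def significant_tokens(scores, min_dev=0.1):
    """Keep only the tokens whose score is at least min_dev away from 0.5."""
    return {tok: f for tok, f in scores.items() if abs(f - 0.5) >= min_dev}


# "xqzzy" stands in for a word the wordlists have never seen.
scores = {
    "viagra": token_score(5, 0, 100, 100),
    "xqzzy": token_score(0, 0, 100, 100),
}
print(sorted(significant_tokens(scores)))

# Once the message has been registered as spam, the same word counts.
scores["xqzzy"] = token_score(1, 0, 100, 100)
print(sorted(significant_tokens(scores)))
```

With these assumed values the unseen word sits 0.085 away from 0.5 and is 
ignored on first sight; after a single registration as spam its score 
moves close to 1.0 and it takes part in scoring.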

>  The real nightmare would be spam which consists of JavaScript using CSS to
>position each and every character; bogofilter could learn to treat the JS-code
>as spammy, but it'd take a long time before it catches most versions (since
>there's a lot you can do to avoid using the same tokens again and again).

JavaScript has a limited number of keywords which will come to be 
recognized.  Function and variable identifiers can be whatever the spammer 
wants.  Again, new and different identifiers will be treated just like new 
and different words and bogofilter will deal with them.

>DR> >  #1 Case folding... standard should be to ignore upper/lower case; but for
>DR> >those of us that get a lot of e-mails (10k per month?) there should be the
>DR> >option to not ignore it. It'd also be nice to be able to set different
>DR> >expiration dates on tokens depending on how common they are; this ought to
>DR> >be a good way to control how large one's databases become.
>DR>
>DR> The patch I sent out earlier has a command line switch and a config file
>DR> option for enabling case sensitivity.  Bogofilter's default (case
>DR> insensitivity) hasn't changed.
>
>  I liked what Peter Bishop's done, but I think one might want to extend that a
>bit; something like this maybe...
>
>  V14GR4 => v14gr4, zz.viagra, yy.v14gr4.
>
>DR> >  #2 I won't say much about this, besides that it should be easy for the
>DR> >user to pick what headers to ignore, or not ignore.
>DR>
>DR> At the moment, tagging header fields is an all or nothing capability.  The
>DR> default is to _not_ do it and there's a switch and an option to turn it on.
>DR>
>DR> With the patch, tagging applies to "To:", "From:", "Subject:", and
>DR> "Return-Path:".  Why do you think it's necessary to be more selective
>DR> (finer grained)?
>
>  It's possible that you might want to look at certain headers simply because
>that will be a good thing for friends sending e-mails via the same servers all
>the time; and you might want to use all but a few if those few are added by a
>server via which you're getting something like 1 ham in every 10'000 spam
>(yeah, I got an address like that; I'm using procmail to catch that 1 ham
>though, so it's mostly just a spamtrap).

We know which headers Paul Graham thinks are important.  Which ones do you 
think are important?

Given Graham's involvement with Bayesian filters, I'm willing to implement 
his suggestions and test them to verify their usefulness.  When we find out 
that a different set of headers is better, bogofilter's parsing can be changed.
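
To make the header tagging idea concrete, here is a minimal sketch of 
marking tokens by the header they came from, so that "free" in a Subject: 
line is counted separately from "free" elsewhere.  It is not the patch's 
code: the prefix strings, the tokenizing regular expression, and the names 
TAGGED_HEADERS and header_tokens are assumptions made for illustration.

```python
import re
from email import message_from_string

# Headers to tag and the (hypothetical) prefix used for each of them.
TAGGED_HEADERS = {
    "to": "to:",
    "from": "from:",
    "subject": "subj:",
    "return-path": "rtrn:",
}

# Deliberately simple token pattern; a real lexer is more careful.
TOKEN_RE = re.compile(r"[A-Za-z0-9@._'-]+")


def header_tokens(raw_message, tagged=TAGGED_HEADERS):
    """Return lower-cased header tokens, prefixed for the tagged headers."""
    msg = message_from_string(raw_message)
    tokens = []
    for name, value in msg.items():
        # Headers outside the table get no prefix at all.
        prefix = tagged.get(name.lower(), "")
        for word in TOKEN_RE.findall(value):
            tokens.append(prefix + word.lower())
    return tokens


raw = (
    "From: Bob <bob@example.com>\n"
    "Subject: Free offer\n"
    "X-Mailer: foo\n"
    "\n"
    "body text\n"
)
print(header_tokens(raw))
```

Because the set of headers is just a table in this sketch, trying a 
different selection means editing one dictionary and rerunning the tests.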

>  Basically it comes down to this: I use procmail to mess around with the
>headers quite a lot, and that means that I know which headers contain good
>spam/ham signs, and I'd like bogofilter to use those headers.
>
>  Let's end this e-mail with a feature request, a feature I'd love: being able to
>tell bogofilter how spammy I think the e-mail is; maybe something like this:
>
>  -u 1   = 99% sure it's ham.
>  -u 2   = ham(ish)
>  -u 3   = dunno (as -u is today)
>  -u 4   = spammy
>  -u 5   = very very spammy
>
>  I'm doing a lot of white- and blacklisting today (autowhitelist all outgoing
>e-mails etc), but since there's always the risk of me getting spam with a
>forged sender I'd rather do a -u 2 for autowhitelisted, and maybe -u 1 for
>manually whitelisted e-mail addresses.

Right now we have 1, 3, and 5, and label them No, Unsure, and Yes.  The 
meanings of 2 & 4 aren't given in sufficient detail.
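
For reference, the existing three-way labeling can be sketched as two 
cutoffs on the message score.  This is an illustration under assumptions, 
not bogofilter's source: the cutoff values are made up, and the function 
name classify is hypothetical.

```python
def classify(score, ham_cutoff=0.45, spam_cutoff=0.99):
    """Map a message score in [0, 1] to a three-state label.

    The cutoff values here are illustrative assumptions.
    """
    if score >= spam_cutoff:
        return "Yes"      # spam, level 5 in the proposal
    if score <= ham_cutoff:
        return "No"       # ham, level 1 in the proposal
    return "Unsure"       # level 3 in the proposal


for score in (0.02, 0.70, 0.995):
    print(score, classify(score))
```

Levels 2 and 4 have no counterpart in this scheme; a patch would have to 
say what they mean in terms of scores or registration before they could be 
added.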

If you'd like to write some code to implement your idea and post a patch to 
the list, people can try it and see how well it works for them.

David




