Including html-tag contents may be unnecessary
David Relson
relson at osagesoftware.com
Mon May 12 16:11:45 CEST 2003
At 09:44 AM 5/12/03, Tony L. Svanstrom wrote:
>On Mon, 12 May 2003 the voices made David Relson write:
>
>DR> At 08:07 AM 5/12/03, Tony L. Svanstrom wrote:
>
>DR> JavaScript has a limited number of keywords which will come to be
>DR> recognized. Function and variable identifiers can be whatever the spammer
>DR> wants. Again, new and different identifiers will be treated just like new
>DR> and different words and bogofilter will deal with them.
>
> Well... you could write HTML+CSS+JavaScript so that if you ignore the
>JavaScript you get one text, but the CSS+JavaScript will position characters
>A-Za-z0-9._ with a background set so that those characters will cover the
>"real" text.
> The result would be an innocent looking text put together using the most
>common words/phrases, and then lists telling the JavaScript where to position
>the characters. Those lists could themselves consist of only the most common
>words, which the JavaScript turns into positions for the CSS.
> There'd be a lot of noise though, so these spams would be quite long for a
>short message.
>
> You'd be drowning in good tokens, with only a limited few bad ones; and the
>worst part is that it'd lower the accuracy of any Bayesian filter which is
>learning these spams as spam.
Sounds very complicated. Let's put that on our "we'll deal with it when we
_have_ it" list.
>DR> We know which headers Paul Graham thinks are important. Which ones do you
>DR> think are important?
>
> Different ones depending on the route the e-mails take etc, I'd like to be
>able to control that myself; maybe setting a list of headers like this: To,
>From, Received(1), Received(-5). Where the positive numbers are counting from
>the first server that added the header, and negative numbers counting from the
>last.
The flex scanner is built at compile time. If headers A, B, and C are of
interest (or of possible interest), they need to be included in the .l
file. Right now, when an interesting line is encountered, function
set_tag(char *prefix) is called to set a prefix for all tokens encountered
on the rest of the line. The prefix values currently in use are "to:",
"from:", "rtrn:", and "subj:". It would be pretty easy for set_tag() to
check the prefix against a list and ignore any that don't match. A list
like "to:|from:|rtrn:|subj:" would suffice.
Does that sound useful to you? Any fields besides Received: that you'd
like to see?
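A rough sketch of the set_tag() check described above, for illustration only. It is not the code in bogofilter: the names wanted_prefixes, current_prefix, and prefix_in_list() are made up here, the parameter is declared const, and "ignore" is assumed to mean "leave the tokens unprefixed".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical: '|'-separated list of prefixes to keep.
 * In practice this might come from a config file. */
const char *wanted_prefixes = "to:|from:|rtrn:|subj:";

/* Hypothetical: prefix applied to tokens on the current header line.
 * An empty string means "no prefix". */
const char *current_prefix = "";

/* Return true if 'prefix' is a complete entry in the '|'-separated 'list'. */
bool prefix_in_list(const char *prefix, const char *list)
{
    size_t len;
    const char *p = list;

    assert(prefix != NULL && list != NULL);
    len = strlen(prefix);
    if (len == 0)
        return false;

    while (*p != '\0') {
        const char *end = strchr(p, '|');
        size_t entry_len = (end != NULL) ? (size_t)(end - p) : strlen(p);

        /* Compare whole entries so "to" does not match "to:". */
        if (entry_len == len && strncmp(p, prefix, len) == 0)
            return true;
        if (end == NULL)
            break;
        p = end + 1;
    }
    return false;
}

/* Sketch of set_tag(): keep the prefix only if it is on the list,
 * otherwise clear it (an assumption about what "ignore" should do). */
void set_tag(const char *prefix)
{
    current_prefix = prefix_in_list(prefix, wanted_prefixes) ? prefix : "";
}
```

With the list above, set_tag("subj:") would keep the prefix, while a new prefix such as "rcvd:" would be dropped until it is added to the list.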
The positive and negative numbers seem like overkill. Implementing the
idea calls for a history list from which the proper items would be selected
and tokenized. I'm nixing that idea.
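To show what that bookkeeping would involve, here is a minimal sketch, not anything that exists in bogofilter. The struct received_history and select_received() are hypothetical names. It assumes the Received: lines are stored in the order they appear in the message, which is newest hop first, since each server prepends its own line.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical history list: Received: lines in message order,
 * so lines[0] is the topmost header (added by the last server). */
struct received_history {
    const char **lines;
    size_t count;
};

/* Select one entry using the Received(n) notation from the thread:
 *   n > 0 counts from the first server that added a header
 *         (the bottom of the header block),
 *   n < 0 counts from the last server (the top of the header block).
 * Returns NULL when n is 0 or out of range. */
const char *select_received(const struct received_history *h, int n)
{
    size_t k;

    assert(h != NULL);
    if (n == 0)
        return NULL;

    /* Absolute value of n, widened first so the negation cannot overflow. */
    k = (n > 0) ? (size_t)n : (size_t)(-(long long)n);
    if (k > h->count)
        return NULL;

    return (n > 0) ? h->lines[h->count - k] : h->lines[k - 1];
}
```

Every Received: line has to be buffered before any of them can be chosen, because Received(1) is not known until the last header has been read. That is the extra machinery the scanner does not have today.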
>DR> Right now we have 1,3,5 and label them as Yes/No/Unsure. The meanings of 2
>DR> & 4 aren't given in sufficient detail.
>
> Sorry, I guess I didn't explain it well enough... I meant like an "incoming
>score", giving bogofilter a nudge towards, or from, spaminess.
>
> I could do that today using different config-files with different values for
>what is to be considered spam, it just would be a lot easier if I could attach
>a value to the -u switch instead; telling bogofilter how mean/nice it should be
>to that particular e-mail.
Switch '-u' is already used to tell bogofilter to update the wordlists
after classifying the message. Other letters are available if the feature
is deemed useful.
>DR> If you'd like to write some code to implement your idea and post a patch to
>DR> the list, people can try it and see how well it works for them.
>
> Not a C-programmer, nor do I currently have the time to become one, so unless
>I can figure out a quick way to hide perlcode inside c-code I just have to keep
>on complaining every now and then on this list. ;-)
I confess that I still don't grok how values "2" and "4" would affect
bogofilter's behavior. Can you write some simple Perl that would
illustrate your idea?