Including html-tag contents may be unnecessary

Mon May 12 16:11:45 CEST 2003

At 09:44 AM 5/12/03, Tony L. Svanstrom wrote:

>On Mon, 12 May 2003 the voices made David Relson write:
>
>DR> At 08:07 AM 5/12/03, Tony L. Svanstrom wrote:
>
>DR> JavaScript has a limited number of keywords which will come to be
>DR> recognized.  Function and variable identifiers can be whatever the spammer
>DR> wants.  Again, new and different identifiers will be treated just like new
>DR> and different words and bogofilter will deal with them.
>
>  Well... you could write HTML+CSS+JavaScript so that if you ignore the
>JavaScript you get one text, but the CSS+JavaScript will position characters
>A-Za-z0-9._ with a background set so that those characters will cover the
>"real" text.
>  The result would be an innocent-looking text put together using the most
>common words/phrases, and then lists telling the JavaScript where to position
>the characters. Those lists could themselves consist of only the most common
>words, which the JavaScript turns into positions for the CSS.
>  There'd be a lot of noise though, so these spams would be quite long for a
>short message.
>
>  You'd be drowning in good tokens, with only a limited few bad ones; and the
>worst part is that it'd lower the accuracy of any Bayesian filter which is
>learning these spams as spam.

Sounds very complicated.  Let's put that on our "we'll deal with it when we 
_have_ it" list.

>DR> We know which headers Paul Graham thinks are important.  Which ones do you
>DR> think are important?
>
>  Different ones depending on the route the e-mails take, etc.  I'd like to be
>able to control that myself, maybe by setting a list of headers like this: To,
>From, Received(1), Received(-5), where the positive numbers count from the
>first server that added the header and the negative numbers count from the
>last.

The flex scanner is built at compile time.  If headers A, B, and C are of 
interest (or of possible interest), they need to be included in the .l 
file.  Right now, when an interesting line is encountered, function 
set_tag(char *prefix) is called to set a prefix for all tokens encountered 
on the rest of the line.  The prefix values currently in use are "to:", 
"from:", "rtrn:", and "subj:".  It would be pretty easy for set_tag() to 
check the prefix against a list and ignore any that don't match.  A list 
like "to:|from:|rtrn:|subj:" would suffice.
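
To make the list check concrete, here is a minimal sketch in Python rather 
in bogofilter's C.  The names Tagger, parse_prefix_list, and tag_token are 
made up for illustration; only set_tag and the "to:|from:|rtrn:|subj:" list 
format come from the description above.  It also assumes that a prefix not 
on the list simply leaves the following tokens untagged.

```python
# Hypothetical sketch; bogofilter's real set_tag() is C code used by the lexer.
ALLOWED_PREFIXES = "to:|from:|rtrn:|subj:"  # list format suggested above


def parse_prefix_list(spec):
    """Split a '|'-separated prefix list into a set, skipping empty entries."""
    return {p for p in spec.split("|") if p}


class Tagger:
    def __init__(self, spec=ALLOWED_PREFIXES):
        self.allowed = parse_prefix_list(spec)
        self.current = ""  # empty string means "no prefix in effect"

    def set_tag(self, prefix):
        """Adopt the prefix only if it is on the list; otherwise clear it."""
        # Assumption: "ignore" means tokens on that line get no prefix.
        self.current = prefix if prefix in self.allowed else ""

    def tag_token(self, token):
        """Return the token with the active prefix (if any) prepended."""
        return self.current + token
```

With a list of "to:|subj:", a call to set_tag("subj:") would tag the tokens 
that follow, while set_tag("from:") would leave them bare.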

Does that sound useful to you?  Any fields besides Received: that you'd 
like to see?

The positive and negative numbers seem like overkill.  Implementing the 
idea calls for a history list from which the proper items would be selected 
and tokenized.  I'm nixing that idea.
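
For the record, the bookkeeping would look roughly like the sketch below 
(Python, not bogofilter code).  The function name select_received is 
hypothetical, and the numbering is my reading of the proposal: 1 is the 
first server to add a Received: line, -1 is the last.  Since each relay 
puts its Received: line at the top, the lines have to be collected for the 
whole header and reversed before any of them can be picked.

```python
def select_received(received, wanted):
    """Pick Received: lines by hop number.

    received: header values in the order they appear in the message,
              i.e. most recent hop first.
    wanted:   nonzero ints; positive counts from the first hop,
              negative counts from the last hop (assumed notation).
    """
    chronological = list(reversed(received))  # oldest hop first
    picked = []
    for n in wanted:
        if n == 0:
            continue  # 0 has no meaning in this numbering
        idx = n - 1 if n > 0 else n  # 1 -> index 0, -1 -> last element
        if -len(chronological) <= idx < len(chronological):
            picked.append(chronological[idx])
    return picked
```

Out-of-range numbers, such as Received(-5) on a message with three hops, 
are silently skipped here; that choice is also an assumption.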

>DR> Right now we have 1,3,5 and label them as Yes/No/Unsure.  The meanings
>DR> of 2 & 4 aren't given in sufficient detail.
>
>  Sorry, I guess I didn't explain it well enough... I meant like an "incoming
>score", giving bogofilter a nudge towards, or away from, spamminess.
>
>  I could do that today using different config files with different values for
>what is to be considered spam; it just would be a lot easier if I could attach
>a value to the -u switch instead, telling bogofilter how mean/nice it should
>be to that particular e-mail.

Switch '-u' is already used to tell bogofilter to update the wordlists 
after classifying the message.  Other letters are available if the feature 
is deemed useful.

>DR> If you'd like to write some code to implement your idea and post a
>DR> patch to the list, people can try it and see how well it works for them.
>
>  Not a C programmer, nor do I currently have the time to become one, so
>unless I can figure out a quick way to hide Perl code inside C code I just
>have to keep on complaining every now and then on this list. ;-)

I confess that I still don't grok how values "2" and "4" would affect 
bogofilter's behavior.  Can you write some simple Perl that would 
illustrate your idea?