HTML treatment [was: how many tokens?]

Wed Feb 26 20:57:15 CET 2003

At 01:48 PM 2/26/03, Chris Wilkes wrote:

>Running this email
>         http://ladro.com/bf/20030226-01.txt
>through bogolexer version 0.10.1.2 it looks like I'm missing the parsing
>of this img src url:
>         http://www.homebusinesszone.net/printer/clipart/specs.gif
>What's odd is that I cut out that part of the email, available here
>         http://ladro.com/bf/20030226-02.txt
>and it correctly gets out the "printer" "clipart" tokens.
>
> > It seems that all html tags can be abused by including random character
> > sequences.  Some of the listed choices are given with the thought of
> > keeping the "good" stuff and discarding the random stuff.
>
>Yep.  I've seen some spam come through with totally random characters in
>it, probably there to throw off some BF like spam programs.
>
> > At the current time, bogofilter discards the innards.  It's a trivial
> > change to tokenize them.  The other options are more difficult.
>
>It looks like bogolexer keeps around some innards, but not all the time.
>Maybe it's me misusing the tool.

Hi Chris,

Nope.  You're not misusing it and it's not misbehaving.  I spotted the 
difference about 1 second after looking at the two files.  What's happening 
is that bogofilter's behavior is more complex than you thought :-)

In message headers and plain text message bodies, angle brackets are 
another kind of token delimiter.  Your #2 message isn't identified as 
"Content-Type: text/html", so it's treated as plain text.  This causes the 
text between the angle brackets to be seen - which is exactly what you'd 
want because that's how email addresses are delimited.

In html mode, angle brackets are special.  Currently, they cause the text 
between them to be discarded.  Your #1 message has a "Content-Type: 
text/html" declaration in its header.  When bogofilter gets to the body of 
the message, it's using html parsing rules (rather than plain text 
rules).  So, the innards are discarded.
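
To make the difference concrete, here is a rough Python sketch of the two modes. It is not bogofilter's lexer (which is far more involved); the function name tokenize, the simplified token pattern, and the lowercasing are all assumptions made for illustration only.

```python
import re

# Hypothetical, heavily simplified token pattern: runs of letters and digits.
# The real bogolexer rules are more elaborate than this.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize(body, content_type="text/plain"):
    """Split a message body into tokens according to its content type."""
    if content_type == "text/html":
        # HTML mode: everything between angle brackets is discarded.
        body = re.sub(r"<[^>]*>", " ", body)
    # Plain text mode: angle brackets simply act as delimiters, so the
    # text between them is still seen.
    return WORD_RE.findall(body.lower())


sample = '<img src="http://www.homebusinesszone.net/printer/clipart/specs.gif"> Buy now'
plain_tokens = tokenize(sample)              # contains "printer" and "clipart"
html_tokens = tokenize(sample, "text/html")  # only "buy" and "now" survive
```

The same body yields the URL pieces in plain text mode and loses them in html mode, which matches what you saw with your two files.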

> > Also, should bogofilter convert items like &123; to their characters?
>
>Along those lines, would it be helpful to convert any IP only URLs into
>some magic token?  A lot of the spam sites don't list domain names in
>them, but rather the IP address of the server.

Independent of the html question, bogofilter has a config file option,
block_on_subnets, which gives special treatment to numeric URLs.  With
this option enabled, a numeric URL becomes 4 tokens: the full IP address
plus its Class C, B, and A subnets.  For example, address
192.10.20.30 generates "url:192.10.20.30", "url:192.10.20", "url:192.10",
and "url:192".
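
As a small illustration of that expansion, the following Python sketch produces the four tokens from a dotted-quad address. The function name subnet_tokens is hypothetical; this only mirrors the example above and is not taken from bogofilter's source.

```python
def subnet_tokens(ip):
    """Return the 4 "url:" tokens for a dotted-quad IPv4 address.

    Illustrative only: reproduces the token shapes from the example,
    not bogofilter's actual implementation.
    """
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    # Full address first, then drop one trailing octet at a time.
    return ["url:" + ".".join(octets[:n]) for n in range(4, 0, -1)]


tokens = subnet_tokens("192.10.20.30")
```

Because the shorter tokens are shared by every host in the same network, a spammer who moves to a neighboring address still hits the subnet tokens.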

How useful this is is unclear.  I do know that once or twice a week, this
option causes bogofilter to give an "Unsure" classification to a
message.  What happens is that each night my mail server sends me all the
anomalous messages from /var/log/syslog.  Spam from bogus addresses
generates "domain not verified" messages which are included in the nightly
mail.  Bogofilter looks at the IP addresses in these messages and
recognizes them as spammish.  The result is a message with some strong ham
indicators and some strong spam indicators.  The Robinson-Fisher algorithm
evaluates them and concludes that it can't classify the message as either
Ham or Spam - hence the Unsure classification.
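
For the curious, here is a minimal Python sketch of why mixed evidence lands in the middle. It follows the Fisher-style combining that Gary Robinson described publicly, using the inverse chi-square function; it is not bogofilter's code, and the names fisher_score and classify as well as the 0.1 and 0.9 cutoffs are assumptions chosen for the example.

```python
import math


def chi2q(x, dof):
    """Upper-tail chi-square probability; valid for even dof only."""
    m = x / 2.0
    term = math.exp(-m)
    total = term
    for i in range(1, dof // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def fisher_score(probs):
    """Combine per-token spam probabilities (each strictly between 0 and 1)."""
    n = len(probs)
    # Near 0 when many tokens have a very low spam probability.
    q_f = chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    # Near 0 when many tokens have a very high spam probability.
    q_not_f = chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    # Near 1 for spam, near 0 for ham, near 0.5 when the evidence conflicts.
    return (1.0 + q_f - q_not_f) / 2.0


def classify(score, ham_cutoff=0.1, spam_cutoff=0.9):
    """Three-way decision; the cutoff values here are hypothetical."""
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"


mixed = fisher_score([0.99, 0.99, 0.01, 0.01])  # strong ham and strong spam tokens
spammy = fisher_score([0.99, 0.99, 0.99, 0.99])
hammy = fisher_score([0.01, 0.01, 0.01, 0.01])
```

With equally strong indicators on both sides, the two chi-square terms cancel and the score sits at about 0.5, well inside the Unsure band.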

It's fun to see how programs (mis)interpret the information available to 
them :-)

David