Including html-tag contents may be unnecessary

Greg Louis glouis at dynamicro.on.ca
Sun May 11 21:36:14 CEST 2003


On 20030511 (Sun) at 14:53:23 -0400, David Relson wrote:
> At 01:46 PM 5/11/03, Greg Louis wrote:
> 
> I wonder if we'd do better by parsing the innards differently.  Rather than 
> use the usual broad definition of a token, bogofilter could be more 
> selective when parsing innards.  The lexer can be changed quite easily to 
> recognize http://{token}/junk or ftp://{token}/whatever or font={token}.
> 
> Let me know if you want to run such a test and I'll create a patch.  If you 
> have other ideas of patterns to look for, suggest them.
> 
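The selective parsing Relson describes could be prototyped outside the flex lexer first. A minimal sketch, assuming hypothetical pattern names and sample input (this is an illustration, not bogofilter's actual lexer):

```python
import re

# Pull out only the interesting innards: the host part of an
# http:// or ftp:// URL, or the value of a font= attribute,
# instead of tokenizing everything between the angle brackets.
PATTERNS = [
    re.compile(r'(?:http|ftp)://([\w.-]+)'),  # http://{token}/junk, ftp://{token}/whatever
    re.compile(r'font=["\']?([\w-]+)'),       # font={token}
]

def selective_tokens(text):
    """Yield only the selected innards of tags and URLs."""
    for pat in PATTERNS:
        for m in pat.finditer(text):
            yield m.group(1)

tokens = list(selective_tokens(
    '<a href="http://spammer.example/junk"><font face=Arial font=bold>'
))
# tokens is now ['spammer.example', 'bold']
```

In the real lexer these would become flex rules rather than regexes, but the effect is the same: the hostname and attribute value survive as tokens while the surrounding tag noise is discarded.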
Well, Paul Graham says: (quote)

Now I have a more complicated definition of a token:

   1. Case is preserved.

   2. Exclamation points are constituent characters.

   3. Periods and commas are constituents if they occur between two
   digits. This lets me get ip addresses and prices intact.

   4. A price range like $20-25 yields two tokens, $20 and $25.

   5. Tokens that occur within the To, From, Subject, and Return-Path
lines, or within urls, get marked accordingly. E.g. "foo" in the
Subject line becomes "Subject*foo". (The asterisk could be any
character you don't allow as a constituent.)

...

Finally, what should one do about html? I've tried the whole spectrum
of options, from ignoring it to parsing it all. Ignoring html is a bad
idea, because it's full of useful spam signs. But if you parse it all,
your filter might degenerate into a mere html recognizer. The most
effective approach seems to be the middle course, to notice some tokens
but not others. I look at a, img, and font tags, and ignore the rest.
Links and images you should certainly look at, because they contain
urls.

(end quote).
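For concreteness, Graham's rules 1-5 above could be sketched roughly as follows. The regexes and function names here are mine, chosen for illustration; they are not bogofilter's lexer:

```python
import re

# Rule 2: '!' is a constituent character (allowed inside a token).
# Rule 3: '.' and ',' are constituents only between digits, so IP
#         addresses and prices stay intact (first alternative).
TOKEN_RE = re.compile(r'\$?\d+(?:[.,]\d+)+'          # 192.168.0.1, $1,000
                      r'|\$?[A-Za-z\d][A-Za-z\d!]*') # ordinary words, FREE!!!

def graham_tokens(text, header=None):
    # Rule 4: a price range like $20-25 yields two tokens, $20 and $25.
    text = re.sub(r'\$(\d+)-(\d+)', r'$\1 $\2', text)
    for m in TOKEN_RE.finditer(text):
        tok = m.group(0)  # Rule 1: case is preserved, no lowercasing
        # Rule 5: tokens from To/From/Subject/Return-Path (or urls)
        # get the header name prepended with an asterisk.
        yield f'{header}*{tok}' if header else tok

list(graham_tokens('FREE!!! offer', header='Subject'))
# -> ['Subject*FREE!!!', 'Subject*offer']
```

Whether this finer-grained token definition pays off for bogofilter is exactly the experimental question raised below.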

Worth trying?
-- 
| G r e g  L o u i s          | gpg public key: finger     |
|   http://www.bgl.nu/~glouis |   glouis at consultronics.com |
| http://wecanstopspam.org in signatures fights junk email |

More information about the Bogofilter mailing list