Including html-tag contents may be unnecessary

David Relson relson at osagesoftware.com
Sun May 11 22:24:18 CEST 2003


At 03:36 PM 5/11/03, Greg Louis wrote:

>On 20030511 (Sun) at 1453:23 -0400, David Relson wrote:
> > At 01:46 PM 5/11/03, Greg Louis wrote:
> >
> > I wonder if we'd do better by parsing the innards differently.  Rather than
> > use the usual broad definition of a token, bogofilter could be more
> > selective when parsing innards.  The lexer can be changed quite easily to
> > recognize http://{token}/junk or ftp://{token}/whatever or font={token}.
> >
> > Let me know if you want to run such a test and I'll create a patch.  If you
> > have other ideas of patterns to look for, suggest them.
> >
>Well, Paul Graham says: (quote)
>
>Now I have a more complicated definition of a token:
>
>    1. Case is preserved.
>
>    2. Exclamation points are constituent characters.
>
>    3. Periods and commas are constituents if they occur between two
>    digits. This lets me get ip addresses and prices intact.
>
>    4. A price range like $20-25 yields two tokens, $20 and $25.
>
>    5. Tokens that occur within the To, From, Subject, and Return-Path
>lines, or within urls, get marked accordingly. E.g. `foo'' in the
>Subject line becomes `Subject*foo''. (The asterisk could be any
>character you don't allow as a constituent.)
>
>...
>
>Finally, what should one do about html? I've tried the whole spectrum
>of options, from ignoring it to parsing it all. Ignoring html is a bad
>idea, because it's full of useful spam signs. But if you parse it all,
>your filter might degenerate into a mere html recognizer. The most
>effective approach seems to be the middle course, to notice some tokens
>but not others. I look at a, img, and font tags, and ignore the rest.
>Links and images you should certainly look at, because they contain
>urls.
>
>(end quote).
>
>Worth trying?

Seeing as how Paul Graham's essay popularized the Bayesian approach to spam
filtering, anything he suggests is worth considering.
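
To make the five token rules quoted above concrete, here is a rough sketch in
Python (not bogofilter's lexer, and not Graham's code). The constituent set
(letters, digits, apostrophe, dollar sign, dash, plus "!") is my assumption,
as are the names tokenize() and the context argument:

```python
import re

# Assumed constituent characters: letters, digits, "$", "'", "-", and "!" (rule 2).
# A period or comma is kept only when it sits between two digits (rule 3).
WORD = r"[A-Za-z0-9$'!-]+"
TOKEN_RE = re.compile(WORD + r"(?:(?<=\d)[.,](?=\d)" + WORD + r")*")

# Rule 4: "$20-25" is a price range and becomes "$20" and "$25".
PRICE_RANGE_RE = re.compile(r"^\$(\d+(?:[.,]\d+)*)-(\d+(?:[.,]\d+)*)$")


def tokenize(text, context=None):
    """Split text into tokens; case is preserved (rule 1).

    context is a hypothetical marker such as "Subject" or "url"; when given,
    each token is emitted as "context*token" (rule 5).
    """
    tokens = []
    for match in TOKEN_RE.finditer(text):
        word = match.group(0)
        # Skip runs made only of punctuation, e.g. a lone "-" or "!".
        if not any(ch.isalnum() for ch in word):
            continue
        price_range = PRICE_RANGE_RE.match(word)
        if price_range:
            parts = ["$" + price_range.group(1), "$" + price_range.group(2)]
        else:
            parts = [word]
        for part in parts:
            tokens.append(f"{context}*{part}" if context else part)
    return tokens
```

For example, tokenize("foo", context="Subject") gives ["Subject*foo"], and an
IP address such as 192.168.0.1 comes through as a single token.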

Do we want to try all of his suggestions at once, or make each one selectable?
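
As a starting point for discussion, here is a small Python sketch of the
"middle course" for html: look only inside a, img, and font tags, keep the
host part of http:// and ftp:// URLs, and keep font attributes. It uses the
standard html.parser module rather than our lexer. The class name, the tags
argument, and the "url*" / "font*" prefixes are all made up for illustration,
and reading font={token} as "any attribute of a font tag" is my guess:

```python
from html.parser import HTMLParser
from urllib.parse import urlsplit


class SelectiveTagTokens(HTMLParser):
    """Collect tokens from the innards of chosen tags; ignore all other tags.

    Body text between tags is not handled here; only tag attributes are.
    """

    def __init__(self, tags=("a", "img", "font")):
        super().__init__()
        self.tags = set(tags)  # hypothetical knob: which tags to look inside
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names already lowercased.
        if tag not in self.tags:
            return
        for name, value in attrs:
            if not value:
                continue
            if name in ("href", "src"):
                parts = urlsplit(value)
                # Keep only the host; the path ("/junk", "/whatever") is dropped.
                # urlsplit() lowercases the hostname.
                if parts.scheme in ("http", "ftp") and parts.hostname:
                    self.tokens.append("url*" + parts.hostname)
            elif tag == "font":
                self.tokens.append(f"font*{name}={value}")


def tag_tokens(html, tags=("a", "img", "font")):
    """Return the tokens found in the selected tags of an html fragment."""
    parser = SelectiveTagTokens(tags)
    parser.feed(html)
    parser.close()
    return parser.tokens
```

Passing a shorter tags tuple, for example tag_tokens(html, tags=("a",)), is
one way to test each kind of tag separately instead of all at once.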







