Including html-tag contents may be unnecessary

David Relson relson at osagesoftware.com
Sun May 11 20:53:23 CEST 2003


At 01:46 PM 5/11/03, Greg Louis wrote:

>An experiment was performed to determine whether including contents of
>html tags (supported in bogofilter 0.12.3) improves discrimination:

... [snip] ...

>As expected, there were a lot more tokens when html tag contents were
>extracted from spam messages, and a proportionately smaller increase
>among the nonspams.  Next the test files were classified:

...[snip]...

>It seems that including contents of html tags makes a difference to the
>distribution of scores.  We need to shift the spam cutoff, so as to get
>roughly the same numbers of false positives; then we can compare the
>false-negative counts fairly.  The default spam cutoff was 0.65, so the
>classification with html tag contents was repeated with cutoff 0.75:

...[snip]...

>Including contents of html tags did not significantly improve
>discrimination when the shift in distribution is taken into account; R
>was used to run an analysis of variance suggesting that the difference
>is probably insignificant statistically, as well as practically:

Greg,

An interesting result!  It's certainly not what I'd have expected.  'Tis
common knowledge that html innards include URLs, image names, and other
important info from/for/about the products being promoted in the
spam.  With the current tokenize_html_tags capability, the innards are
analyzed by bogofilter's usual rule, which is, roughly: a sequence of
alphanumeric characters (with a certain few punctuation marks) makes up a
token.
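
To illustrate how broad that rule is, here is a rough Python sketch.  It is
not bogofilter's actual lexer (which is written for flex); the function
name broad_tokens is made up, and the punctuation set (. ' _ -) is only an
assumption, since the exact characters aren't listed here.

```python
import re

# Assumed approximation of the "usual rule": runs of alphanumerics,
# optionally joined by a few punctuation marks. The real punctuation
# set used by bogofilter may differ.
BROAD_TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:[.'_-][A-Za-z0-9]+)*")

def broad_tokens(text):
    """Return every broad token found in text, in order of appearance."""
    return BROAD_TOKEN_RE.findall(text)

if __name__ == "__main__":
    # Tag innards yield markup words as well as the interesting parts.
    print(broad_tokens('<a href="http://example.com/buy.html">'))
```

Under these assumptions, tag innards produce markup words such as "a",
"href" and "http" alongside the host and file names.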

I wonder if we'd do better by parsing the innards differently.  Rather than 
use the usual broad definition of a token, bogofilter could be more 
selective when parsing innards.  The lexer can be changed quite easily to 
recognize http://{token}/junk or ftp://{token}/whatever or font={token}.
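
As a rough sketch of what "more selective" might look like (in Python
rather than as a flex patch, purely for illustration), the three patterns
could be tried like this.  The names TOKEN, PATTERNS and selective_tokens
are hypothetical, and the definition of {token} as letters, digits, dots
and hyphens is an assumption.

```python
import re

# Assumed stand-in for {token}: a host-like or name-like word.
TOKEN = r"[A-Za-z0-9][A-Za-z0-9.-]*"

# One (label, regex) pair per pattern; only the {token} part is captured,
# so the trailing /junk or /whatever is dropped.
PATTERNS = [
    ("http", re.compile(r"http://(" + TOKEN + ")", re.IGNORECASE)),
    ("ftp", re.compile(r"ftp://(" + TOKEN + ")", re.IGNORECASE)),
    ("font", re.compile(r"font=[\"']?(" + TOKEN + ")", re.IGNORECASE)),
]

def selective_tokens(tag_innards):
    """Return (label, token) pairs, grouped by pattern in PATTERNS order."""
    found = []
    for label, pattern in PATTERNS:
        for match in pattern.finditer(tag_innards):
            found.append((label, match.group(1)))
    return found

if __name__ == "__main__":
    print(selective_tokens('a href="http://example.com/junk/page.html" font=arial'))
```

Everything in the innards that matches none of the patterns is simply
ignored, which is the point of the narrower rule.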

Let me know if you want to run such a test and I'll create a patch.  If you 
have other ideas of patterns to look for, suggest them.

David




