Including html-tag contents may be unnecessary
David Relson
relson at osagesoftware.com
Sun May 11 20:53:23 CEST 2003
At 01:46 PM 5/11/03, Greg Louis wrote:
>An experiment was performed to determine whether including contents of
>html tags (supported in bogofilter 0.12.3) improves discrimination:
... [snip] ...
>As expected, there were a lot more tokens when html tag contents were
>extracted from spam messages, and a proportionately smaller increase
>among the nonspams. Next the test files were classified:
...[snip]...
>It seems that including contents of html tags makes a difference to the
>distribution of scores. We need to shift the spam cutoff, so as to get
>roughly the same numbers of false positives; then we can compare the
>false-negative counts fairly. The default spam cutoff was 0.65, so the
>classification with html tag contents was repeated with cutoff 0.75:
...[snip]...
>Including contents of html tags did not significantly improve
>discrimination when the shift in distribution is taken into account; R
>was used to run an analysis of variance suggesting that the difference
>is probably insignificant statistically, as well as practically:
Greg,
An interesting result! It's certainly not what I'd have expected. 'Tis
common knowledge that html innards include urls and image names and other
important info from/for/about the products being promoted in the
spam. With the current tokenize_html_tags capability, the innards are
analyzed by bogofilter's usual rule, which is, roughly, that a sequence of
alphanumeric characters (plus a certain few punctuation marks) makes up a
token.
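To make the "usual rule" concrete, here is a rough Python sketch of that kind of broad tokenizing applied to tag innards. It is only an illustration, not bogofilter's actual lexer: the name broad_tokens is made up, and the punctuation set (".", "'", "_", "-") is an assumption, since the exact characters aren't listed here.

```python
import re

# ASSUMPTION: the thread does not say which punctuation marks are allowed
# inside a token; ".", "'", "_" and "-" are placeholders for illustration.
# A token must start with a letter or digit.
BROAD_TOKEN = re.compile(r"[A-Za-z0-9][A-Za-z0-9.'_-]*")


def broad_tokens(tag_innards):
    """Return every broad token found in the contents of an html tag.

    Hypothetical helper: it mimics the rough rule described above,
    not bogofilter's real lexer.
    """
    return [token.lower() for token in BROAD_TOKEN.findall(tag_innards)]


if __name__ == "__main__":
    print(broad_tokens('a href="http://example.com/buy/now.html"'))
```

Note how the tag name, the attribute name and the "http" scheme all come out as tokens alongside the host name, so much of what gets counted is markup rather than anything specific to the sender.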
I wonder if we'd do better by parsing the innards differently. Rather than
use the usual broad definition of a token, bogofilter could be more
selective when parsing innards. The lexer can be changed quite easily to
recognize http://{token}/junk or ftp://{token}/whatever or font={token}.
Let me know if you want to run such a test and I'll create a patch. If you
have other ideas of patterns to look for, suggest them.
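For discussion, the selective approach might look roughly like the Python sketch below. It is not the proposed patch (that would be a change to the lexer itself); the names HOST, SELECTIVE and selective_tokens are hypothetical, and the character classes standing in for {token} are assumptions.

```python
import re

# ASSUMPTION: {token} is taken to be a run of letters, digits, "." and "-"
# for hosts, and letters, digits and "-" for font values.
HOST = r"([A-Za-z0-9][A-Za-z0-9.-]*)"

# Hypothetical patterns modelled on the three examples given above:
# http://{token}/junk, ftp://{token}/whatever and font={token}.
SELECTIVE = [
    re.compile(r"(?:http|ftp)://" + HOST, re.IGNORECASE),
    re.compile(r"\bfont=[\"']?([A-Za-z0-9-]+)", re.IGNORECASE),
]


def selective_tokens(tag_innards):
    """Return only the {token} parts matched by the selective patterns.

    Results are grouped by pattern (urls first, then fonts), not by
    their position in the input.
    """
    tokens = []
    for pattern in SELECTIVE:
        tokens.extend(match.lower() for match in pattern.findall(tag_innards))
    return tokens


if __name__ == "__main__":
    print(selective_tokens('a href="http://www.example.com/junk/page.html"'))
    print(selective_tokens('img src=ftp://files.example.net/whatever.gif font=arial'))
```

Everything that doesn't match a pattern (tag names, attribute names, path components) is dropped, which is the point of being more selective.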
David