invalid html warfare

Peter Bishop pgb at adelard.com
Wed May 28 12:13:22 CEST 2003


On 27 May 2003 at 23:53, John McCain wrote:

> imho, I think the days of using code-level html to identify spam are gone.  I 
> think the only way statistical filters are going to continue to be effective 
> is if they see the message exactly as a human would.  I've seen a great deal 
> of statistical filter evasion such as the examples I cited.  The best case 
> scenario right now seems to be that the filter will still catch the message, 
> but our databases will gradually degrade with garbage data such as 
> <gmurfoophead>, assuming we are maintaining the training database

With regard to junk words, perhaps we could define heuristics for detecting 
them. One possible test is a sequence of 5 non-vowels in a token.
I tried this test on my spamlist.db, looking for junk words that only appear 
once, i.e.

# quote the pattern so the shell passes \w and the escaped space to grep intact
bogoutil -d ~/.bogofilter/spamlist.db | \
grep -P '[^-_.aeiouy]{5}\w*\ 1$' | wc -l

Note that all the tokens in my database are casefolded.
Also, the separators - _ . can break up the letter sequence
(as in email addresses, IP addresses, etc).
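
For anyone who wants to try the heuristic outside grep, here is a minimal
Python sketch of the same test. It assumes the dump has one "token count"
pair per line, as the pattern above implies; the function names and the
sample tokens are made up for illustration and are not part of bogofilter.

```python
import re

# Five consecutive characters that are neither vowels (y counted as a vowel)
# nor the separators - _ . ; same character class as the grep pattern.
JUNK_RUN = re.compile(r"[^-_.aeiouy]{5}")

def looks_like_junk(token):
    """Return True if the casefolded token contains a 5-character vowel-free run."""
    return JUNK_RUN.search(token.lower()) is not None

def count_junk(lines, singletons_only=True):
    """Count junk-looking tokens in 'token count' lines (assumed dump layout)."""
    total = 0
    for line in lines:
        parts = line.split()
        if len(parts) < 2 or not parts[-1].isdigit():
            continue  # skip lines that do not fit the assumed layout
        token, count = parts[0], int(parts[-1])
        if singletons_only and count != 1:
            continue
        if looks_like_junk(token):
            total += 1
    return total

# Hypothetical sample data, not taken from a real database.
sample = [
    "gmurfoophead 1",  # no vowel-free run of 5, so not flagged
    "xkcdqzrt 1",      # flagged, singleton
    "strengths 4",     # flagged because of "ngths", a real word
    "john.smith 12",   # the "." breaks the run, so not flagged
    "qwrtpsdf 2",      # flagged, but not a singleton
]
print(count_junk(sample))                         # 1
print(count_junk(sample, singletons_only=False))  # 3
```

The sample shows both failure modes: a pronounceable junk token such as
gmurfoophead slips through, while a real word such as "strengths" is flagged.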

This detected 14516 "singleton" tokens out of a total of 72261

Most of the tokens looked pretty random to me

Looking for "junk" tokens that appeared in any number of messages, the count 
rose to 15967.

So if bogofilter ignored "junk" tokens, we would have lost around 1500 tokens 
that occur more than once (and so are probably useful) out of 72261.
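
The arithmetic behind that estimate, spelled out as a small Python check. The
three counts are the ones reported above; the variable names are only
illustrative.

```python
total_tokens = 72261     # tokens in spamlist.db
junk_singletons = 14516  # junk-pattern tokens seen exactly once
junk_any_count = 15967   # junk-pattern tokens with any count

# Junk-pattern tokens seen more than once: the ones we would lose.
junk_repeated = junk_any_count - junk_singletons

print(junk_repeated)                            # 1451
print(f"{junk_singletons / total_tokens:.1%}")  # 20.1%
print(f"{junk_repeated / total_tokens:.1%}")    # 2.0%
```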

Another thought: what if there was a "randomness" test applied to tokens
(this of course is language dependent)?
If the token is "random" *and* unregistered we could give it a different 
weighting, i.e. use a different robx parameter (0.7? 0.8?), so treating it 
as a spammy token.
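
A rough sketch of how that could look, assuming robx and robs are the x and s
in Robinson's formula f(w) = (s*x + n*p(w)) / (s + n), so that a token never
seen before (n = 0) scores exactly x. The consonant-run test stands in for the
"randomness" test, and token_score, junk_robx and all the default values are
hypothetical, chosen only to make the example readable.

```python
import re

# Stand-in "randomness" test: a run of 5 non-vowel, non-separator characters.
JUNK_RUN = re.compile(r"[^-_.aeiouy]{5}")

def token_score(token, n, p, robs=1.0, robx=0.5, junk_robx=0.75):
    """Robinson-style score for one token.

    n         -- number of messages the token has been seen in (0 = unregistered)
    p         -- raw spam probability of the token from the database
    robs      -- strength given to the prior (illustrative value)
    robx      -- prior score for an ordinary unknown token (illustrative value)
    junk_robx -- hypothetical spammier prior for random-looking unknown tokens
    """
    x = robx
    if n == 0 and JUNK_RUN.search(token.lower()):
        x = junk_robx  # random *and* unregistered: lean towards spam
    return (robs * x + n * p) / (robs + n)

print(token_score("xkcdqzrt", n=0, p=0.0))  # 0.75
print(token_score("hello", n=0, p=0.0))     # 0.5
```

A token that is already registered keeps its normal score even if it looks
random, so only unseen junk is pushed towards spam.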

Lots of issues with this - could it bias good messages,
e.g. ones which have unique message counts or unique email boundary separators?

Still, it is an option - it would mean that anti-filter text can be used for 
spam detection - hoist with their own petard !!


-- 
Peter Bishop 
pgb at adelard.com
pgb at csr.city.ac.uk





