invalid html warfare

David Relson relson at osagesoftware.com
Wed May 28 13:18:10 CEST 2003


At 12:53 AM 5/28/03, John McCain wrote:
>
>but our databases will gradually degrade with garbage data such as
><gmurfoophead>, assuming we are maintaining the training database.

As of 0.13.0 and the addition of Paul Graham's latest findings on parsing, 
bogofilter only tokenizes the innards of A, IMG, and FONT tags.  In 
text/html a bogus tag like <gmurfoophead> is discarded, i.e. no garbage 
enters the wordlist.
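For anyone curious how that filtering behaves, here's a rough Python sketch of the idea (illustrative only -- this is not bogofilter's C lexer, and the names TagFilter/tokenizable_text are made up):

```python
# Sketch: keep body text plus the attribute innards of <a>, <img>, and
# <font> tags; ignore every other tag, so a bogus tag like
# <gmurfoophead> contributes nothing to the wordlist.
from html.parser import HTMLParser

KEEP_ATTRS = {"a", "img", "font"}  # tags whose attributes get tokenized

class TagFilter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in KEEP_ATTRS:
            # Keep attribute values (href, src, color, ...) for tokenizing.
            self.chunks.extend(v for _, v in attrs if v)

    def handle_data(self, data):
        # Body text is always tokenizable.
        self.chunks.append(data)

def tokenizable_text(html):
    p = TagFilter()
    p.feed(html)
    p.close()
    return " ".join(p.chunks)
```

Feed it `<gmurfoophead>win</gmurfoophead>` and only "win" comes out; the tag name itself never reaches the tokenizer.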

At 03:53 AM 5/28/03, Gustaf Erikson wrote:
>What are the computational costs of running text/html messages through
>a parser (lynx) before processing? It wouldn't have to be very smart
>-- an HTML version of strings(1) should suffice. Messages containing
>just a GIF would be automatic spam...

I gather that, for some people, every CPU cycle is needed.  For them the 
costs would be too high.  That doesn't prevent you from trying the idea and 
seeing what happens.  Let us know how it goes, please.
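If someone does want to try it, a crude in-process stand-in for piping through lynx would be an "HTML strings(1)" like the sketch below, which also times itself so you can judge the per-message CPU cost (the names here are illustrative, not part of bogofilter):

```python
# Crude "strings(1) for HTML": drop all tags, collapse whitespace,
# then time it to get a feel for the per-message cost.
import re
import time

TAG_RE = re.compile(r"<[^>]*>")

def html_strings(html):
    """Replace every tag with a space and collapse runs of whitespace."""
    return " ".join(TAG_RE.sub(" ", html).split())

msg = "<html><body><font size=3>Buy <b>now</b>!</font></body></html>"
start = time.perf_counter()
for _ in range(10_000):
    html_strings(msg)
elapsed = time.perf_counter() - start
print(f"{elapsed / 10_000 * 1e6:.1f} microseconds per message")
```

Shelling out to a real lynx -dump per message would cost far more than this, since it adds a fork/exec on top of the parse.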

>How about a utility for scanning the database against the dict db and
>a user defined wordlist, presenting words that don't match so that
>they could be deleted from the spamlist. If you apply English language
>rules for number of consonants etc. it could be pretty smart in
>filtering nonsense.

The dict db could be used.  I suspect many domain names would get 
red-flagged, as would p0rn and v1agra.  Conceivably one could create tokens 
indicating how much nonsense is present, e.g. "nonsense:1", ... 5, ... 10, 
20, 50, 100, ...
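A rough sketch of that "nonsense:N" bucketing (the tiny inline word set just stands in for /usr/share/dict/words, and nonsense_token is a made-up name):

```python
# Count tokens absent from a dictionary and emit one bucketed
# pseudo-token, e.g. 7 unknown tokens -> "nonsense:5".
import bisect

DICTIONARY = {"buy", "now", "free", "offer", "the", "a"}  # stand-in for dict db
BUCKETS = [1, 5, 10, 20, 50, 100]

def nonsense_token(tokens):
    misses = sum(1 for t in tokens if t.lower() not in DICTIONARY)
    if misses == 0:
        return None
    # Round the miss count down to the nearest bucket boundary.
    i = bisect.bisect_right(BUCKETS, misses) - 1
    return f"nonsense:{BUCKETS[max(i, 0)]}"
```

The bucketing keeps the wordlist from filling up with one pseudo-token per possible count.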


At 06:13 AM 5/28/03, Peter Bishop wrote:

>Note all the tokens in my database are casefolded
>Also the separators - _ . can break up the letter sequence
>(as in email addresses, IP addresses, etc)
>
>This detected 14516 "singleton" tokens out of a total of 72261

Greg's tested what happens when singletons are discarded to shrink the 
wordlist size.  Bogofilter's accuracy went _way_ down.
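For anyone who wants to reproduce a count like Peter's 14516-of-72261, the singleton tally is a few lines; here a Counter stands in for a real wordlist dump (singleton_stats is an illustrative name):

```python
# Count "singleton" tokens -- those seen exactly once in the wordlist.
from collections import Counter

def singleton_stats(counts):
    """Return (number of tokens seen exactly once, total distinct tokens)."""
    singles = sum(1 for n in counts.values() if n == 1)
    return singles, len(counts)

counts = Counter(["foo", "bar", "foo", "baz", "qux", "foo"])
print(singleton_stats(counts))  # bar, baz, qux occur once
```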

>Another thought: what if there was a "randomness" test applied to tokens
>(this of course is language dependent)?
>If the token is "random" *and* unregistered we could give it a different
>weighting, i.e. use a different robx parameter (0.7? 0.8?), treating it
>as a spammy token.
>
>Lots of issues with this - could it bias good messages?
>e.g. messages which have unique message counts, unique email boundary
>separators.

Possibly CPU intensive as well.  On the other hand, if you find a way ...
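One cheap way to start experimenting: flag a token as random-looking if it has a long consonant run or mixes digits into letters, and give unregistered tokens that trip the test a spammier default.  Everything below is illustrative -- the thresholds are guesses, 0.415 is (as far as I recall) bogofilter's default robx, and y is counted as a vowel:

```python
# Cheap "randomness" heuristic for unregistered tokens.
import re

# Five or more consonants in a row (y counts as a vowel here).
CONSONANT_RUN = re.compile(r"[bcdfghjklmnpqrstvwxz]{5,}", re.I)
# Letters and digits mixed in one token, e.g. v1agra.
MIXED_DIGITS = re.compile(r"(?=.*[a-z])(?=.*\d)", re.I)

def looks_random(token):
    return bool(CONSONANT_RUN.search(token) or MIXED_DIGITS.search(token))

def default_score(token, registered):
    """Pick a default spamicity for tokens not in the wordlist."""
    if registered:
        return None  # use the trained score, not a default
    return 0.7 if looks_random(token) else 0.415
```

It is regex work per unknown token, so the CPU cost scales with how much anti-filter gibberish the spammers send -- which is at least the right people paying.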

>Still, it is an option - it would mean that anti-filter text can be used
>for spam detection - foist with their own petard!!

Poetic justice :-)

At 06:53 AM 5/28/03, Simon Huggins wrote:
>Rightly worthwhile analysts would grep /usr/share/dict/words to avoid
>a synchronized psychotic lynching before claiming such apocryphal
>results so lightly.

I thought we were going to hoist him with his own petard?  In modern terms, 
frag him :-)

>Postscript: count the number of real words with five non-vowels in this
>email - apologies for the forced wording :)

I learned the vowels as "aeiou and sometimes y".  Simon, sometimes, you do 
a great job!

David





More information about the Bogofilter mailing list