OT: Chunking the cruft - random lettered words

Wed Mar 17 13:31:52 CET 2004

Bob George wrote:
> Tom Allison wrote:
> 
>> [...]
>> I would try the perl script first and see how it pans out.
>> pipe it after bogofilter since bogofilter is already 99.9% effective 
>> and the theory we're testing here is that the remaining 0.1% can't 
>> spell worth a d@rn.
> 
> 
> Rather than incur the penalty of doing a spell-check on all the words in 
> such a message -- which will fail on the "random word" technique anyhow 
> -- many on the spamassassin list have had good luck with things like:

Last night, for the sake of having something to play with, I ran this 
against all my archived ham.  For reasons that will become apparent, I 
didn't bother running it against my spam archive.

From the body of each email only, I pulled in tokens by means of
@tokens = grep (/^[a-zA-Z]{3,}$/, split(/\b/, $_));
to give myself all purely alphabetic (no digits) tokens of three or more 
letters.  (This isn't exactly the code I used, since a %hash is much 
faster, but you get the idea.)  I didn't bother to decode MIME messages 
at all.
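
A minimal sketch of what the hash-based variant of that tokenizer might 
look like.  The sub name unique_tokens and the sample text are made up for 
illustration, and lowercasing the tokens is an assumption; the post does 
not say how case was handled.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the unique alphabetic tokens of three or
# more letters from one message body. The keys of the returned hash are
# the tokens, so duplicates collapse automatically.
sub unique_tokens {
    my ($body) = @_;
    my %seen;
    for my $token (grep { /^[a-zA-Z]{3,}$/ } split /\b/, $body) {
        $seen{lc $token} = 1;    # lowercasing is an assumption
    }
    return \%seen;
}

my $tokens = unique_tokens("Buy ch3ap meds now, now, NOW at xqzvb prices!");
print join(" ", sort keys %$tokens), "\n";    # prints: buy meds now prices xqzvb
```

Note that "ch3ap" is dropped because it contains a digit and "at" because 
it is shorter than three letters, which matches the filter quoted above.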

I then computed, for each message, the ratio of unique tokens found in my 
dictionary file to all unique tokens.  I did this as a comparison between 
two hashes, which is the easiest and fastest method that I could cobble 
together in perl in under 10 minutes.
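
The two-hash comparison could be sketched roughly as follows.  The sub 
name spelling_ratio and the tiny inline dictionary are hypothetical; a 
real run would presumably load the dictionary from a word list file, and 
returning undef for a message with no usable tokens is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper: fraction of a message's unique tokens that also
# appear in the dictionary. Both arguments are hash refs keyed by word.
sub spelling_ratio {
    my ($tokens, $dict) = @_;
    my $total = keys %$tokens;      # scalar context: number of unique tokens
    return undef unless $total;     # no usable tokens, so no ratio
    my $known = grep { exists $dict->{$_} } keys %$tokens;
    return $known / $total;
}

# Stand-in data; the real dictionary would come from a file.
my %dict   = map { $_ => 1 } qw(buy meds now prices);
my %tokens = map { $_ => 1 } qw(buy meds now prices xqzvb);
printf "%.2f\n", spelling_ratio(\%tokens, \%dict);    # prints 0.80
```

Each lookup is a single hash access, so the cost per message grows with 
the number of unique tokens rather than with the size of the dictionary.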

That said, it took about an hour to parse everything in just my ham.

For ham alone, the spelling ratio of good words to all words ranged from 
0.01 to 0.89.  Initially I'm not too impressed with this idea, since my 
suspicion is that most of the poor readings came from words that simply 
were not in my dictionary.  It would take a pretty huge dictionary to 
cover all the words, and wouldn't a "large enough" bogofilter wordlist 
sort of cover this for you anyway?

I didn't run the spam archive because I felt like getting some sleep 
instead.  But maybe this is useful information just the same.