How to avoid s p lit up wor ds?

Tue Jan 21 02:15:15 CET 2003

Chris Wilkes wrote:
> On Fri, Jan 17, 2003 at 04:23:42PM -0600, Karl Schmidt wrote:
> 
> 
>>Bogofilter could also keep track of the number of spaces, underlines etc 
>>- as a percent of the message as a statistical indicator. Or individual 
>>letters surrounded by anything as a percent of the message.
> 
> 
> I like that idea, but how are you going to put a number to that result?
> That in their email 10% of the words were one letter, 15% two, etc?
> Mash those percentages together to get "the number" and then see if it
> is spam?  It could work.

The idea of Bayesian filters is to combine the probabilities of several 
statistics. No one statistic tells it all. The statistics don't all 
have to be word frequencies.

A quick and dirty test would be to see what percent of the words show up 
in the dictionary. This would give yet another indicator. Some 
indicators only mean something in combination - a high number of 
misspelled words together with lots of programming command words would 
form a pattern indicating that the message is a program rather than spam.
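
As a rough illustration of the dictionary test, here is a minimal Python 
sketch. The function name, the tokenizing rule (runs of ASCII letters) 
and the tiny word list are all assumptions made up for the example; none 
of this is bogofilter code.

```python
import re

def dictionary_hit_rate(text, dictionary):
    """Return the fraction of alphabetic tokens in `text` found in `dictionary`.

    `dictionary` is any set of lowercase words (hypothetical, supplied
    by the caller). An empty message yields 0.0.
    """
    # Assumption: a "word" is a run of ASCII letters, compared in lowercase.
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in dictionary)
    return hits / len(words)

# Toy word list, for illustration only.
KNOWN = {"the", "quick", "brown", "fox", "buy", "now"}

normal_rate = dictionary_hit_rate("The quick brown fox", KNOWN)
split_rate = dictionary_hit_rate("b u y n o w", KNOWN)
```

With a real word list, a message made of split up words should score 
very low, which is what makes the rate usable as one more indicator.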

A meta level would take the word frequency probabilities plus several 
other message statistics, which together form patterns that can then be 
associated with spam or not spam as a statistical metric. We can't just 
look at a single number; we need to look at patterns of numbers that 
indicate spam.

A simple example: a mail that looks like possible spam from the current 
word statistics, AND has a high number of single letter words, AND is 
in HTML format, AND has a large percentage of capitalized letters. Those 
indicators should combine to flag it as spam.
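
The three extra indicators in that example could be measured along these 
lines. This is only a sketch: the function name, the dictionary keys and 
the crude HTML check (looking for a few common tags) are assumptions for 
illustration, not anything bogofilter provides.

```python
import re

def message_stats(body):
    """Compute three simple per-message indicators from the body text."""
    tokens = body.split()
    # Single letters standing alone, e.g. "s p l i t".
    single = sum(1 for t in tokens if len(t) == 1 and t.isalpha())
    letters = [c for c in body if c.isalpha()]
    caps = sum(1 for c in letters if c.isupper())
    return {
        "single_letter_ratio": single / len(tokens) if tokens else 0.0,
        "caps_ratio": caps / len(letters) if letters else 0.0,
        # Assumption: a handful of common tags is a good enough hint of HTML.
        "is_html": bool(re.search(r"<\s*(html|body|font|table)\b", body,
                                  re.IGNORECASE)),
    }

stats = message_stats("F R E E money")
```

Each value is a plain number or flag, so it can be tracked per message 
the same way a word count is.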

The problem is that as the number of indicator dimensions goes up, the 
computer time needed to detect patterns among them goes up exponentially. 
Thus we need to test which other statistics are worth the computer time 
in providing better filtering. A second level that takes the bogofilter 
score and combines it (just as bogofilter does with word frequencies) 
with other statistics and/or binary flags (is it in HTML?) could be the 
best compromise between computer time and effectiveness.
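
A second level like that could be sketched as below. The combining rule 
here is the naive Bayes formula from Paul Graham's "A Plan for Spam", 
prod(p) / (prod(p) + prod(1 - p)), chosen only for illustration; it is 
not necessarily the calculation bogofilter itself uses. The function 
name and the example probabilities are made up, and turning a raw 
statistic into a spam probability (by training on a corpus) is assumed 
to have happened already.

```python
import math

def combine(probabilities, eps=1e-6):
    """Combine per-indicator spam probabilities into one score in (0, 1).

    Treats the indicators as independent and works in log space so that
    many small factors do not underflow.
    """
    log_spam = 0.0
    log_ham = 0.0
    for p in probabilities:
        # Clamp so that a single 0.0 or 1.0 cannot decide the result alone.
        p = min(max(p, eps), 1.0 - eps)
        log_spam += math.log(p)
        log_ham += math.log(1.0 - p)
    # spam / (spam + ham), written as a numerically stable logistic.
    d = log_spam - log_ham
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)

# Hypothetical inputs: the word-based score, then the spam probabilities
# learned for "many single letter words" and "is in HTML".
score = combine([0.7, 0.9, 0.8])
```

Under the independence assumption the cost is linear in the number of 
indicators, which is the appeal of this compromise; the price is that it 
cannot see patterns that only exist between indicators.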

I know that such meta analysis is being used to determine what to stock 
in stores - the purchasing patterns that are most profitable are 
detected, and the items involved are stocked even if the sale of some of 
them results in zero profit by themselves.

Word frequency by itself is a powerful indicator, but add other 
statistics to word-frequencies and I think bogofilter will do the job 
for anything other than graphics file based spam.

Graphics file based spam is going to be next - and will use up even more 
of our bandwidth resources. At that point the spammers will have crossed 
the line into mounting a DoS attack. Ultimately, an authentication/whitelist 
system for email will be a necessity.