How to avoid s p lit up wor ds?
Karl Schmidt
karl at xtronics.com
Tue Jan 21 02:15:15 CET 2003
Chris Wilkes wrote:
> On Fri, Jan 17, 2003 at 04:23:42PM -0600, Karl Schmidt wrote:
>
>
>>Bogofilter could also keep track of the number of spaces, underlines etc
>>- as a percent of the message as a statistical indicator. Or individual
>>letters surrounded by anything as a percent of the message.
>
>
> I like that idea, but how are you going to put a number to that result?
> That in their email 10% of the words were one letter, 15% two, etc?
> Mash those percentages together to get "the number" and then see if it
> is spam? It could work.
The idea of Bayesian filters is to combine the probabilities of several
statistics. No one statistic tells it all, and the statistics don't all
have to be word frequencies.
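
To make "combine the probabilities" concrete, here is a minimal sketch of the simple naive Bayes combination, assuming the indicators are independent. The function name and the example values are made up for illustration, and this is not necessarily the exact calculation bogofilter performs.

```python
from math import prod

def combine_probabilities(probs, floor=0.01, ceiling=0.99):
    """Combine per-indicator spam probabilities, assuming independence.

    Each value is clamped so that a single 0.0 or 1.0 cannot force the
    result or cause a division by zero (the bounds are an assumption).
    """
    clamped = [min(max(p, floor), ceiling) for p in probs]
    spam = prod(clamped)
    ham = prod(1.0 - p for p in clamped)
    return spam / (spam + ham)

# Hypothetical indicators: a word-frequency score, a "one-letter words"
# score and a weak indicator pointing the other way.
indicators = [0.9, 0.8, 0.3]
score = combine_probabilities(indicators)
print(f"combined spam probability: {score:.3f}")
```

Any indicator that can be expressed as a probability can be fed into the same combination, whether or not it came from a word count.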
A quick and dirty test would be to see what percent of the words show up
in the dictionary. This would give yet another indicator. Some patterns
would only show up at the meta level: a high number of misspelled words
combined with lots of programming command words would indicate that the
message is a program rather than spam.
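
The dictionary test could be sketched roughly as follows. The tiny word set stands in for a real word list (on many Unix systems one could load something like /usr/share/dict/words, which is an assumption here), and the function name is hypothetical.

```python
import re

def dictionary_fraction(text, dictionary):
    """Return the fraction of alphabetic words in text found in dictionary."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    if not words:
        return 0.0
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)

# Stand-in dictionary for illustration only.
SAMPLE_DICTIONARY = {"buy", "cheap", "pills", "now", "the", "for"}

# Obfuscated words break into fragments that are not in the dictionary.
frac = dictionary_fraction("Buy ch3ap p i l l s now", SAMPLE_DICTIONARY)
print(f"fraction of dictionary words: {frac:.2f}")
```

A low fraction by itself proves nothing (source code would score low too), which is why it is only useful as one indicator among several.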
A meta level would take the word frequency probabilities and several
other message statistics, which together would form patterns that could
then be associated with spam or non-spam as a statistical metric.
We can't just look at a single number; we have to look at patterns of
numbers that indicate spam.
A simple example would be a mail that the current word statistics rate
as possible spam, that has a high number of single letter words AND is
in HTML format AND has a large percentage of capitalized letters. Those
indicators should combine to flag it as spam.
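
A rough sketch of that example, with the three indicators computed from the message body. The thresholds and function names are invented for illustration, and a hard AND rule is a simplification of a real statistical combination.

```python
def message_features(body, is_html):
    """Compute a few non-word-frequency statistics for a message body."""
    words = body.split()
    letters = [c for c in body if c.isalpha()]
    single = sum(1 for w in words if len(w) == 1 and w.isalpha())
    capitals = sum(1 for c in letters if c.isupper())
    return {
        "single_letter_fraction": single / len(words) if words else 0.0,
        "capital_fraction": capitals / len(letters) if letters else 0.0,
        "is_html": bool(is_html),
    }

def looks_suspicious(features, single_max=0.3, capital_max=0.5):
    """True only when all three indicators fire (thresholds are guesses)."""
    return (
        features["is_html"]
        and features["single_letter_fraction"] > single_max
        and features["capital_fraction"] > capital_max
    )

feats = message_features("F R E E M O N E Y now", is_html=True)
print(feats, looks_suspicious(feats))
```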
The problem is that as the number of indicator dimensions goes up, the
computer time needed to detect patterns among them goes up exponentially.
Thus we need to test which other statistics are worth the computer time
in terms of better filtering. A second level that takes the bogofilter
score and combines it (just as bogofilter does with word frequencies)
with other statistics and/or binary flags (is it in HTML?) could be the
best compromise between computer time and effectiveness.
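
One way such a second level might look: turn the word-frequency score and the other statistics into a handful of pseudo-tokens, then learn and combine their spam probabilities the same way one would for words. Everything here (class name, bucket boundaries, smoothing) is an assumption for the sake of the sketch, not an existing bogofilter feature.

```python
from collections import Counter
from math import prod

def meta_tokens(word_score, is_html, capital_fraction):
    """Turn message statistics into coarse pseudo-tokens (buckets are guesses)."""
    return [
        "score:high" if word_score > 0.5 else "score:low",
        f"html:{bool(is_html)}",
        "caps:high" if capital_fraction > 0.5 else "caps:low",
    ]

class MetaFilter:
    """Second-level filter that treats pseudo-tokens like words."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def train(self, tokens, is_spam):
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_probability(self, token):
        # Add-one smoothing keeps every probability strictly inside (0, 1).
        p_spam = (self.spam[token] + 1) / (self.n_spam + 2)
        p_ham = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_spam / (p_spam + p_ham)

    def score(self, tokens):
        probs = [self.token_probability(t) for t in tokens]
        spam = prod(probs)
        ham = prod(1.0 - p for p in probs)
        return spam / (spam + ham)

mf = MetaFilter()
mf.train(meta_tokens(0.9, True, 0.8), is_spam=True)
mf.train(meta_tokens(0.8, True, 0.7), is_spam=True)
mf.train(meta_tokens(0.1, False, 0.1), is_spam=False)
mf.train(meta_tokens(0.7, False, 0.2), is_spam=False)
spam_like = mf.score(meta_tokens(0.95, True, 0.9))
ham_like = mf.score(meta_tokens(0.2, False, 0.05))
print(f"{spam_like:.3f} {ham_like:.3f}")
```

Because each pseudo-token is scored on its own and the results are multiplied together, the work grows linearly with the number of indicators, at the price of assuming they are independent.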
I know that such meta analysis is being used to determine what to stock
in stores - the most profitable patterns of purchasing are detected, and
the items involved are stocked even if some of them yield zero profit by
themselves.
Word frequency by itself is a powerful indicator, but add other
statistics to word frequencies and I think bogofilter will do the job
for anything other than graphics-file-based spam.
Graphics-file-based spam is going to be next - and will use up even more
of our bandwidth resources. At that point the spammers will have crossed
the line into being a DoS attack. Ultimately, an authentication/whitelist
system for email will be a necessity.