How to avoid s p lit up wor ds?

Mon Jan 20 01:17:05 CET 2003

On Fri, Jan 17, 2003 at 04:23:42PM -0600, Karl Schmidt wrote:
> Well we could start running a OCR on any attached pictures, I'm thinking 
> of doing a whitelist set up where any message with attachment not on the 
> white list gets an auto response asking them to go to a web page to fill 
> out something to get on the whitelist - agreeing not to spam etc.   Just 
> worried that it could be used in an attempt to spam bounce.

Check out TMDA at http://tmda.net/ ... if you're not on a whitelist
you'll get an email back asking you to reply to it, confirming that
you're not an evil spammer and such.  It's all done through tags in the
Reply-To that TMDA sends out to unknown senders, with TMDA updating its
white/black lists accordingly.
Pretty neat.

> Bogofilter could also keep track of the number of spaces, underlines etc 
> - as a percent of the message as a statistical indicator. Or individual 
> letters surrounded by anything as a percent of the message.

I like that idea, but how are you going to put a number to that result?
That in their email 10% of the words were one letter, 15% two, etc?
Mash those percentages together to get "the number" and then see if it
is spam?  It could work.
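
To make the percentages concrete, here is a rough Python sketch of the
word-length breakdown.  The function name, the "letters only" definition
of a word, and the max_len cutoff are all my own assumptions, not
anything Bogofilter does today:

```python
import re
from collections import Counter

def word_length_fractions(text, max_len=3):
    """Return {length: fraction of words having that length} for 1..max_len.

    A "word" here is any run of ASCII letters (an assumption), so
    "F_a_t" counts as three one-letter words.
    """
    words = re.findall(r"[A-Za-z]+", text)
    if not words:
        return {n: 0.0 for n in range(1, max_len + 1)}
    counts = Counter(len(w) for w in words)
    return {n: counts[n] / len(words) for n in range(1, max_len + 1)}

fractions = word_length_fractions("Burn F_a_t,  Build  M u s c l e.")
print(fractions[1])  # share of one-letter words in the sample line
```

How to mash the fractions into "the number" is still open; one option
would be to hand them to the classifier as extra tokens rather than
picking weights by hand.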

However, things like people sending computer code could get a bad rank,
as they use a lot of single-letter variables.  Of course, in spam you're
not likely to see a "#define" either (is that 'define' after the lexer
gets through with it?).

A count of SHOUTING WORDS might be handy too.  I think the current idea
of lowercasing words is great and reduces the number of distinct words,
but maybe keeping track of all uppercased words might produce some
benefit.
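
A rough Python sketch of such a count, taken before the text is
lowercased.  The function name and the min_len cutoff (so that a lone
"A" or "I" doesn't count as shouting) are my own assumptions:

```python
import re

def shouting_ratio(text, min_len=2):
    """Fraction of words (at least min_len letters) written in all caps."""
    words = re.findall(r"[A-Za-z]+", text)
    candidates = [w for w in words if len(w) >= min_len]
    if not candidates:
        return 0.0
    shouted = sum(1 for w in candidates if w.isupper())
    return shouted / len(candidates)

print(shouting_ratio("A number count of SHOUTING WORDS might be handy"))
```

This has to run on the raw text, since the case information is gone
once the words have been lowercased.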

And hey, look at this spam I just got.  Like 95% of my spam, it's an
HTML one that I can't read without hitting a couple of keys (thank
you mutt!):
  Burn F_a_t,  Build  M u s c l e.
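
A line like that could probably be caught by looking for runs of single
letters joined by one space, underscore, or hyphen.  This is only a
sketch: the pattern and the function name are my own assumptions, and it
would also fire on things like "a b c" in source code:

```python
import re

# A single letter followed by one or more (separator + single letter)
# pairs, not touching any other letters on either side.
SPLIT_RUN = re.compile(r"(?<![A-Za-z])[A-Za-z](?:[ _-][A-Za-z])+(?![A-Za-z])")

def find_split_words(line):
    """Return the split-up words found in line, with separators removed."""
    return [re.sub(r"[ _-]", "", m.group(0)) for m in SPLIT_RUN.finditer(line)]

print(find_split_words("Burn F_a_t,  Build  M u s c l e."))
```

The rejoined words could be counted as an indicator, or fed back in as
ordinary tokens.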

Chris