spaced out spam words

Fri Jun 9 23:53:15 CEST 2006

On Fri, 09 Jun 2006 07:38:09 -0400
Jason A. Smith wrote:

> On Fri, 2006-06-09 at 07:05, David Relson wrote:
> > Correct!  You are showing the result of processing "For example here
> > are ...".  I was showing some _examples_ of double-word tokens.
> > 
> > Now all I need is time to find my old patches, apply, and test
> > them...
> 
> So this patch will handle the often requested multi-word feature and
> deal with these spaced out words better.  Will it collapse/replace
> multiple white-spaces (spaces, tabs and maybe newlines) with '+'
> before adding to/checking the database?  What about html spam that
> often places tags, spaces & newlines between single letters, such
> that when displayed by an html viewer, still clearly shows the
> spammer's message?  Will the bogofilter parser collapse white-spaces,
> tags and newlines allowing it to combine the spaced out words in html
> spam?
> 
> ~Jason

Hi Jason,

Bogofilter treats white space as a token delimiter -- behavior that's
built into the lexer_v3.l file.  White space collapsing as you suggest
could, I suspect, be done with a preprocessor.
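
Purely as an illustration of the preprocessor idea (this is not
bogofilter code; the function name and the replacement character are
made up for the example), a whitespace-collapsing filter might look
like this in Python:

```python
import re

def collapse_whitespace(text, replacement=" "):
    """Hypothetical preprocessor step: squeeze every run of spaces,
    tabs and newlines into a single replacement string."""
    # \s+ matches one or more consecutive whitespace characters.
    return re.sub(r"\s+", replacement, text.strip())

# Collapse mixed spaces/tabs/newlines, either to one space or to '+'.
print(collapse_whitespace("t   h\ti\n s"))        # t h i s
print(collapse_whitespace("t   h\ti\n s", "+"))   # t+h+i+s
```

Something like this would run over the message text before it reaches
the lexer; it does nothing about HTML tags.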

HTML is a complex issue.  There are lots of tricks possible, for
example bogus tags and putting single letters in each cell of a
table.  HTML also allows "camouflaged" text (think white on white)
that a human won't see but a computer program will.  I'm unaware of
algorithms for successfully dealing with camo.

To summarize what's likely:

  - configurable maximum and minimum token lengths (as distinct from
    the current fixed minimum of 3 and the maximum defined by
    pre-processor symbol MAXTOKENLEN, which defaults to 30).
  - multi-word processing.  For example a count of 2 and input of
    "this is a sentence" would generate tokens like "this" "is"
    "this*is" "a" "is*a" "sentence" "a*sentence" ...

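A rough Python sketch of the multi-word idea, just to make the token
order concrete.  It is not taken from my patches; the function name
and parameters are invented for the example, and it assumes a minimum
token length of 1 so that short words survive:

```python
def multiword_tokens(text, count=2, min_len=1, joiner="*"):
    """Hypothetical helper: emit each word, followed by the joined
    runs of 2..count consecutive words that end at that word."""
    # Split on white space and drop words shorter than min_len.
    words = [w for w in text.split() if len(w) >= min_len]
    tokens = []
    for i, word in enumerate(words):
        tokens.append(word)
        for n in range(2, count + 1):
            start = i - n + 1
            if start >= 0:
                tokens.append(joiner.join(words[start:i + 1]))
    return tokens

print(multiword_tokens("this is a sentence"))
# ['this', 'is', 'this*is', 'a', 'is*a', 'sentence', 'a*sentence']
```

Raising count to 3 would additionally emit triples such as
"this*is*a".
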
Spaced-out HTML words are likely to appear as single-letter tokens and
multi-word tokens.  Thus "t h i s" is likely to generate "t", "h",
"t*h", "i", "h*i", "s", "i*s" (assuming a minimum token length of 1 and
a multiplicity count of 2).

Not yet determined is whether there should be separate maximums for
single-word tokens and for multi-word tokens.  If the single-word
maximum is 30 (like now), a double token could be up to 61 chars
(30 + 1 + 30, counting the separator), a triple up to 92, etc.
Alternatively, if there's a system-wide maximum of, say, 15, then
"ohio", "river", and "ohio*river" would be fine, but combining
"mississippi" and "river" (17 chars) would be truncated (or discarded).

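The length arithmetic for the two options can be checked with a few
lines of Python.  Again this is only a sketch; the helper names and
the limit of 15 are assumptions for the example, not settled design:

```python
def joined_length(word_lengths, joiner_len=1):
    """Length of a multi-word token: the words plus one joiner
    between each adjacent pair."""
    return sum(word_lengths) + joiner_len * (len(word_lengths) - 1)

def fits(words, max_len=15, joiner="*"):
    """True if the joined token is within a system-wide maximum."""
    return len(joiner.join(words)) <= max_len

# Per-word maximum of 30: worst-case double and triple tokens.
print(joined_length([30, 30]), joined_length([30, 30, 30]))  # 61 92

# System-wide maximum of 15.
print(fits(["ohio", "river"]))         # True  (10 chars)
print(fits(["mississippi", "river"]))  # False (17 chars)
```
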
What sort of flexibility are people interested in?  Anybody have the
time and energy to work on it?

Regards,

David