suggestions/requests

Thu Jan 23 00:16:48 CET 2003

"Dew-Jones, Malcolm MSER:EX" <Malcolm.DewJones at gems5.gov.bc.ca> writes:

> The following ideas come to mind, and while it seems a bit much to suggest
> them without providing code, well, I'll suggest them anyway.
>
> To track the suggested statistics, use pseudo words.  I think you're
> already using this to track the message count, but with suitably
> chosen "words" the idea can be used for many other things as well.
> Each item to be tracked would have a predefined set of pseudo words
> that would record that item.
>
> Some things cannot be tracked as-is.  Instead you generate one of a
> fixed number of pseudo words that encompass all the values within a
> small number of keywords.  Appropriate ranges would be found by trial
> and error.
>
> e.g. a document length can be saved using one of a predefined number of
> ranges

Is document length actually indicative?

>
> 	"document length in bytes 1-100"
> 	"document length in words 1-10"

> The ranges could possibly overlap to avoid edge cases, in which
> case the calculated value would be saved twice, using both appropriate
> ranges.

Fuzzy match? Hmm. How many bins do we want? Or, more sophisticated: what
distribution (mathematically) do we assume applies? What model?

> 	e.g. overlapping pseudo words for shouting	 
> 		"shouting 0-10%"
> 		"shouting 5-15%"
> 		"shouting 11-20%"
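
To make the overlapping-range idea concrete, here is a minimal sketch in
Python (not bogofilter's C code). The function names, the bin width of 10
and the step of 5 are assumptions chosen for illustration; the ranges in
the quoted example are not perfectly regular, and real ranges would have to
be found by testing.

```python
def overlapping_bins(value, width=10, step=5, maximum=100):
    """Return every inclusive (lo, hi) range that contains value.

    Ranges start every `step` units and are `width` wide, so with the
    default settings each value falls into two or three ranges.
    """
    ranges = []
    lo = 0
    while lo < maximum:
        hi = min(lo + width, maximum)
        if lo <= value <= hi:
            ranges.append((lo, hi))
        lo += step
    return ranges


def shouting_pseudo_words(text):
    """Pseudo words for the share of upper-case letters among all letters."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return []
    percent = 100.0 * sum(c.isupper() for c in letters) / len(letters)
    return ["shouting %d-%d%%" % (lo, hi)
            for lo, hi in overlapping_bins(percent)]
```

A value near a bin edge is recorded under each range that contains it, so
two messages at 9% and 11% shouting still share a pseudo word.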

That's something SpamAssassin does already. Personally, I don't think
bogofilter should duplicate that.

> The pseudo words are generated once at the end of the document, after the 
> main while(get_token) loop, but otherwise would be used just like any other 
> word, except that during lookups, the weighting of some of the pseudo words 
> should perhaps be different than a regular word - sensible values for the 
> weighting would be based on testing.
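
As a rough illustration of generating pseudo words after the token loop,
here is a Python sketch (not bogofilter's C code). The bin edges, the
regular expression and the function names are hypothetical, and the sketch
leaves out any special weighting of pseudo words during lookup.

```python
import re

# Hypothetical bin edges; real ranges would be found by trial and error.
BYTE_BINS = [(1, 100), (101, 1000), (1001, 10000)]
WORD_BINS = [(1, 10), (11, 100), (101, 1000)]


def bin_label(prefix, value, bins):
    """Return e.g. 'document length in words 1-10', or None if out of range."""
    for lo, hi in bins:
        if lo <= value <= hi:
            return "%s %d-%d" % (prefix, lo, hi)
    return None


def tokenize_with_pseudo_words(message):
    """Collect ordinary tokens first, then append the length pseudo words."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z0-9$'-]+", message)]
    pseudo = [
        bin_label("document length in bytes",
                  len(message.encode("utf-8")), BYTE_BINS),
        bin_label("document length in words", len(tokens), WORD_BINS),
    ]
    return tokens + [p for p in pseudo if p is not None]
```

Because the pseudo words contain spaces, they cannot collide with any token
the regular expression produces, so they can live in the same word list.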

The simpler approach to deal with this would be:

1. if a word has only some characters upper-case, lower-case it
   (StudlyCaps need to be taken into account).
2. if a word has all or most characters upper-case (say, more than half),
   leave the case as is.

We're currently folding everything to lower case, discarding information.
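
The two rules above could be sketched like this in Python (not bogofilter's
C code). The name fold_case and the exact "more than half of the letters"
threshold are assumptions, and the sketch simply folds StudlyCaps-style
words to lower case, which is only one way of taking them into account.

```python
def fold_case(word):
    """Lower-case a word unless more than half of its letters are upper-case."""
    letters = [c for c in word if c.isalpha()]
    upper = sum(c.isupper() for c in letters)
    if letters and 2 * upper > len(letters):
        return word          # mostly upper-case: keep the shouting visible
    return word.lower()      # includes StudlyCaps-style mixed case
```

With this rule "FREE" and "Free" become distinct tokens, while "Free" and
"free" still share one entry in the word list.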

> Use Judy arrays to track the count of all words in each document being 
> checked - at the end generate a pseudo word describing the repetitiveness 
> of the document.

Judy is history for bogofilter. We're using Gyepi's wordhash function;
it's more portable and performs similarly.
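
Whatever the container, the repetitiveness idea only needs per-message word
counts. A minimal Python sketch, with collections.Counter standing in for
the hash table; the function name, the definition of "repetitive" (share of
tokens that repeat an earlier token) and the bin width are assumptions:

```python
from collections import Counter


def repetitiveness_pseudo_word(tokens, width=10):
    """Bin the share of tokens that repeat an earlier token in the message."""
    if not tokens:
        return None
    counts = Counter(tokens)                 # per-message word counts
    repeats = len(tokens) - len(counts)      # occurrences beyond the first
    percent = 100 * repeats // len(tokens)
    # Bins are half-open at the top: 50% lands in "50-60%", not "40-50%".
    lo = min(percent // width * width, 100 - width)
    return "repetitiveness %d-%d%%" % (lo, lo + width)
```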

> Keep a third word list file, consisting of valid (english or whatever)
> words, and for each message, count the existence of words that are not
> recognized based on the dictionary.  Save the proportion of such words as a
> percentage of the number of words.  
>
> 	"unknown words 1-10%"
> 	"unknown words 11-20%"	... etc...  and/or an absolute count

I read English, German and French, so a fairly large part of my mail would
cause false negatives unless the software a) properly figured out the
language and b) matched against it. I'm not sure if it's worth the effort.

Coined words are usually strong indicators...
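
For what it's worth, one crude way to combine the two steps is to score a
message against several dictionaries and keep the best match. This Python
sketch is only an assumption about how that might look; the function name
and the in-memory dictionaries are hypothetical, and real word lists would
be far larger.

```python
def unknown_word_percent(tokens, dictionaries):
    """Return (language, percent unknown) for the best-matching dictionary.

    `dictionaries` maps a language name to a set of lower-case words.
    """
    if not tokens or not dictionaries:
        return None
    best = None
    for language, words in sorted(dictionaries.items()):
        unknown = sum(1 for t in tokens if t.lower() not in words)
        percent = 100 * unknown // len(tokens)
        if best is None or percent < best[1]:
            best = (language, percent)
    return best
```

The resulting percentage could then be binned into a pseudo word such as
"unknown words 11-20%", as in the quoted suggestion.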

-- 
Matthias Andree