suggestions/requests

Dew-Jones, Malcolm MSER:EX Malcolm.DewJones at gems5.gov.bc.ca
Wed Jan 22 23:48:37 CET 2003


Hello.	

The following ideas come to mind, and while it seems a bit much to suggest
them without providing code, well, I'll suggest them anyway.  

To track the suggested statistics, use pseudo words.  I think you're already
using this to track the message count, but with suitably chosen "words" the 
idea can be used for many other things as well.  Each item to be tracked
would have a predefined set of pseudo words that would record that item.

Some things cannot be tracked as-is.  Instead you generate one of a fixed
number of pseudo words, each covering a range of values, so that every
possible value maps onto one of a small number of keywords.
Appropriate ranges would be found by trial and error.

e.g. a document length can be saved using one of a predefined number of
ranges

	"document length in bytes 1-100"
	"document length in bytes 101-500"
	"document length in bytes 501-1000"

	"document length in words 1-10"
	"document length in words 11-100"	... etc ...

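A minimal sketch of the range idea, in Python only for brevity.  The helper
name range_pseudo_word, the boundary list and the open-ended "1001+" label
are all assumptions made for illustration; real boundaries would come from
trial and error.

```python
# Hypothetical upper bounds for each range; real values would be tuned by testing.
BYTE_BOUNDS = [100, 500, 1000]

def range_pseudo_word(label, value, bounds):
    """Return the pseudo word for the first range whose upper bound covers value."""
    low = 1
    for high in bounds:
        if value <= high:
            return f"{label} {low}-{high}"
        low = high + 1
    # Assumption: values above the last bound share one open-ended pseudo word.
    return f"{label} {low}+"

print(range_pseudo_word("document length in bytes", 342, BYTE_BOUNDS))
# document length in bytes 101-500
```
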
The ranges could possibly overlap to avoid edge cases, in which
case the calculated value would be saved twice, using both appropriate
ranges

	e.g. overlapping pseudo words for shouting
		"shouting 0-10%"
		"shouting 5-15%"
		"shouting 11-20%" 

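With overlapping ranges the lookup returns a list rather than a single
pseudo word.  A rough sketch, using the three shouting ranges from the
example and a hypothetical helper name:

```python
# The three ranges from the example, as (low, high) percentages.
SHOUTING_RANGES = [(0, 10), (5, 15), (11, 20)]

def overlapping_pseudo_words(label, value, ranges):
    """Return one pseudo word for every range that contains value."""
    return [f"{label} {low}-{high}%" for low, high in ranges if low <= value <= high]

print(overlapping_pseudo_words("shouting", 12, SHOUTING_RANGES))
# ['shouting 5-15%', 'shouting 11-20%']
```

A value of 12% sits in two ranges, so it is saved twice; a value of 3% sits
in only one.
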
For each message you generate all appropriate pseudo words, and then add one
to each pseudo word in the word list.  For example the pseudo word
"document length in bytes 101-500" would count the number of documents that
fell in that length range.
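
The counting step might look roughly like this.  The Counter is only a
stand-in for the real word list, and record_pseudo_words is a made-up name,
not an existing function.

```python
from collections import Counter

def record_pseudo_words(word_list, pseudo_words):
    """Add one to the count of every pseudo word generated for one message."""
    for word in pseudo_words:
        word_list[word] += 1

word_list = Counter()  # stand-in for the real, persistent word list
record_pseudo_words(word_list, ["document length in bytes 101-500", "shouting 0-10%"])
record_pseudo_words(word_list, ["document length in bytes 101-500", "shouting 11-20%"])
print(word_list["document length in bytes 101-500"])  # 2
```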

The pseudo words are generated once at the end of the document, after the 
main while(get_token) loop, but otherwise would be used just like any other 
word, except that during lookups, the weighting of some of the pseudo words 
should perhaps be different than a regular word - sensible values for the 
weighting would be based on testing.

I think the values suggested below could be efficiently tracked via simple 
counters during the scanning.  

-1- 

It sounds like you're recognizing html sections and mime parts.  Within
html sections, unrecognized tags within words should be counted, and at the
end of the document their existence saved as a pseudo word.

	e.g.	"In addition to our cur<posuerimus>rency report" 

should produce the token "currency" and increment a global counter to be
used later.  

At the end of the document, generate a standard pseudo word such as 

	"unknown tags embedded in words"

		or perhaps with a range

	"unknown tags embedded in words 1-5 tags"
	"unknown tags embedded in words 5-10 tags"
	"unknown tags embedded in words 1-5% of words"
	"unknown tags embedded in words 5-10% of words"

Perhaps count the number of known tags in the same way, so "cur<bold>rency"
would produce "currency" and a count of the bold tag.
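
A rough sketch of stripping and counting embedded tags, assuming a tag is
"embedded" when word characters touch it on both sides.  The KNOWN_TAGS set
and the function name are invented for the example, and a real scanner
would do this inside the lexer rather than with a regular expression.

```python
import re

KNOWN_TAGS = {"b", "bold", "i", "u", "font"}  # hypothetical and far from complete

# A tag counts as embedded when a word character sits directly on each side of it.
EMBEDDED_TAG = re.compile(r"(?<=\w)<\s*/?\s*([A-Za-z][A-Za-z0-9]*)[^>]*>(?=\w)")

def strip_embedded_tags(text):
    """Remove tags embedded in words; return (clean_text, known_count, unknown_count)."""
    counts = {"known": 0, "unknown": 0}

    def replace(match):
        kind = "known" if match.group(1).lower() in KNOWN_TAGS else "unknown"
        counts[kind] += 1
        return ""  # drop the tag so the two word halves join up

    clean = EMBEDDED_TAG.sub(replace, text)
    return clean, counts["known"], counts["unknown"]

print(strip_embedded_tags("In addition to our cur<posuerimus>rency report"))
# ('In addition to our currency report', 0, 1)
```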

-1.5- 

Within html sections, unrecognized tags that are not embedded in a word
should also be counted, and their existence saved at the end of a document
in a similar manner to the above.  The text itself within the tag would be
ignored.

(In general, anything that affects what the user sees should be counted for
what it is, but anything that will not be seen by the user would be counted
based on the "existence" of the artifact, not the thing itself.)

-2- 

Save the percentage of shouting as one of a set of predefined pseudo words.
(Perhaps you could count the shouting while folding the word to lower case).


	pseudo word for word list = "shouting 0-10%" 
	pseudo word for word list = "shouting 11-20%" 
	pseudo word for word list = "shouting 21-30%" 
	pseudo word for word list = "shouting 31-40%" 
	pseudo word for word list = "shouting 41-50%" 
	pseudo word for word list = "shouting 51-60%" 
	pseudo word for word list = "shouting 61-70%" ...etc...  
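
Counting while folding could be sketched like this.  The definition of a
"shouted" word (more than one character, all letters upper case) is my
assumption, as are the function names.

```python
def fold_token(token, stats):
    """Fold a token to lower case, counting it as shouted if it was all upper case."""
    stats["words"] += 1
    # Assumption: a word shouts when it has more than one character
    # and every cased letter in it is upper case.
    if len(token) > 1 and token.isupper():
        stats["shouted"] += 1
    return token.lower()

def shouting_pseudo_word(stats):
    """Map the percentage of shouted words onto the ranges listed above."""
    percent = 100 * stats["shouted"] // max(stats["words"], 1)
    if percent <= 10:
        return "shouting 0-10%"
    low = (percent - 1) // 10 * 10 + 1
    return f"shouting {low}-{low + 9}%"

stats = {"words": 0, "shouted": 0}
folded = [fold_token(t, stats) for t in "BUY NOW and save BIG money today ok friend please".split()]
print(shouting_pseudo_word(stats))  # shouting 21-30%
```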

-3- 

Same thing for other easily measured style elements.

average length of sentence (guess this from the average distance between
punctuation marks that are not next to each other)

	pseudo word = "sentence length 0-10 characters"
	pseudo word = "sentence length 0-10 words"
	pseudo word = "sentence length 11-20 characters"
	pseudo word = "sentence length 11-20 words"	... etc ...  
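
One way to guess the sentence length, as a sketch only.  It assumes that
".", "!" and "?" are the marks that end a sentence, and it treats a run of
adjacent marks such as "!!!" as a single break.

```python
import re

def sentence_lengths(text):
    """Return (average characters, average words) per guessed sentence."""
    # Splitting on runs of marks means adjacent marks count as one break.
    pieces = [piece.strip() for piece in re.split(r"[.!?]+", text)]
    pieces = [piece for piece in pieces if piece]
    if not pieces:
        return 0.0, 0.0
    chars = sum(len(piece) for piece in pieces) / len(pieces)
    words = sum(len(piece.split()) for piece in pieces) / len(pieces)
    return chars, words

chars, words = sentence_lengths("Buy now!!! It is free. Really?")
print(words)  # 2.0
```

The two averages would then be mapped onto the "characters" and "words"
pseudo words above in the same way as any other range.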

length of document in characters and/or words 

number of multipart sections 

proportion of sizes of the different multiparts (genuine attachments are 
typically much larger than the text portion of a message).

maximum length of text within html tags

count the quantity of "unreasonably" long words 

percentage of vowels, digits, punctuation

percentage of white space

the maximum length of a run of white space

	in particular track the two above items in the subject line 
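
The character-class counts can all come out of a single pass.  A sketch,
with assumed names, that treats "aeiou" as the vowels and uses Python's
string.punctuation as the punctuation set:

```python
import string

def char_stats(text):
    """Return percentages of vowels, digits, punctuation and white space,
    plus the longest run of white space, from one pass over text."""
    counts = {"vowels": 0, "digits": 0, "punctuation": 0, "whitespace": 0}
    longest = run = 0
    for ch in text:
        if ch.isspace():
            counts["whitespace"] += 1
            run += 1
            longest = max(longest, run)
            continue
        run = 0
        if ch.lower() in "aeiou":
            counts["vowels"] += 1
        elif ch.isdigit():
            counts["digits"] += 1
        elif ch in string.punctuation:
            counts["punctuation"] += 1
    total = max(len(text), 1)  # avoid dividing by zero on empty input
    stats = {name: 100.0 * n / total for name, n in counts.items()}
    stats["max_whitespace_run"] = longest
    return stats

print(char_stats("ae 12   !!")["max_whitespace_run"])  # 3
```

Calling it once on the body and once on the subject line alone would give
the subject its own white space figures.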

Use Judy arrays to track the count of all words in each document being 
checked - at the end generate a pseudo word describing the repetitiveness 
of the document.  
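
As a sketch of the repetitiveness idea, with a Python Counter standing in
for the Judy array.  The measure (share of tokens that repeat an earlier
token) and the "repeated words" pseudo word are assumptions of mine.

```python
from collections import Counter

def repetitiveness_pseudo_word(tokens):
    """Describe what share of a document's tokens repeat an earlier token."""
    counts = Counter(tokens)  # word -> occurrences, the role a Judy array would play
    repeated = len(tokens) - len(counts)
    percent = 100 * repeated // max(len(tokens), 1)
    if percent <= 10:
        return "repeated words 0-10%"
    low = (percent - 1) // 10 * 10 + 1
    return f"repeated words {low}-{low + 9}%"

print(repetitiveness_pseudo_word("buy now buy now buy cheap".split()))
# repeated words 41-50%
```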

-4- 

Keep a third word list file, consisting of valid (English or whatever)
words, and for each message, count the words that are not found in that
dictionary.  Save the proportion of such words as a percentage of the
number of words.

	"unknown words 1-10%"
	"unknown words 11-20%"	... etc...  and/or an absolute count


Especially examine these words in the subject line.  

	"unknown words in subject 0"
	"unknown words in subject 1"
	"unknown words in subject 2"
	"unknown words in subject 3 or greater"	... etc...	

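A sketch of the dictionary check.  The tiny DICTIONARY set stands in for
the third word list file, the function name is made up, and emitting no
body pseudo word when nothing is unknown is my assumption (the ranges above
start at 1%).

```python
DICTIONARY = {"cheap", "free", "money", "your", "get", "now"}  # tiny stand-in

def unknown_word_pseudo_words(body_tokens, subject_tokens, dictionary):
    """Return pseudo words describing how many tokens are missing from dictionary."""
    pseudo_words = []
    unknown = sum(1 for word in body_tokens if word.lower() not in dictionary)
    if unknown:
        # Round up so that even a single unknown word lands in the 1-10% range.
        percent = -(-100 * unknown // len(body_tokens))
        low = (percent - 1) // 10 * 10 + 1
        pseudo_words.append(f"unknown words {low}-{low + 9}%")
    in_subject = sum(1 for word in subject_tokens if word.lower() not in dictionary)
    label = str(in_subject) if in_subject < 3 else "3 or greater"
    pseudo_words.append(f"unknown words in subject {label}")
    return pseudo_words

print(unknown_word_pseudo_words("get yuor free m0ney now".split(),
                                "fr33 m0ney".split(), DICTIONARY))
# ['unknown words 31-40%', 'unknown words in subject 2']
```
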





More information about the bogofilter-dev mailing list