suggestions/requests

Dew-Jones, Malcolm MSER:EX Malcolm.DewJones at gems5.gov.bc.ca
Tue Jan 28 23:02:24 CET 2003


A summary of my thoughts about some responses, in no particular order.  I
hope they don't sound argumentative; if they do, be assured that is not
intended.  

Eric S. Raymond [esr at thyrsus.com] wrote:
>This is a good idea, but IMO it doesn't belong in bogofilter itself.
>Bogofiter should stick to doing one thing -- Bayesian analysis of 
>presented features -- and doing it well.

Ultimately, a person can look at spam and know it's spam by looking at the
contents of a message and understanding it.  When I read a message I
understand the words being used, but I also observe such things as whether
the text is highlighted, or flashing, or repetitive.  My thought is to
create a filter that "reads" the mail just as I do.  

My criterion would be to test those things that can be easily and
efficiently measured within the general framework of the existing
procedure, which scans through the document using flex, pulls out tokens of
data, and then looks up their frequencies in a list.  

As for simplicity, I agree, but a single program that measures a set of
metrics is itself simpler than running two programs, especially if both
programs have to do similar work (i.e. scan through the document
recognizing various types of tokens, and then do some kind of lookup to use
probabilities associated with the collected metrics). 
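
For example, here is a toy version of that single pass in C.  Reading
whitespace-separated tokens from stdin stands in for the flex scanner, and
a shouting counter rides along in the same loop that would be doing the
wordlist lookups; everything here is an illustrative stand-in, not
bogofilter's actual code.

    #include <stdio.h>
    #include <ctype.h>

    int main(void)
    {
        char tok[256];
        long n_tokens = 0, n_upper = 0;

        while (scanf("%255s", tok) == 1) {
            int i, has_upper = 0, has_lower = 0;
            for (i = 0; tok[i]; i++) {
                if (isupper((unsigned char)tok[i])) has_upper = 1;
                if (islower((unsigned char)tok[i])) has_lower = 1;
            }
            n_tokens++;
            if (has_upper && !has_lower)
                n_upper++;          /* metric counter rides along */
            /* the real program would do its wordlist lookup on tok here */
        }
        printf("tokens=%ld shouted=%ld\n", n_tokens, n_upper);
        return 0;
    }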

Matthias Andree [matthias.andree at gmx.de] wrote:
>> e.g. a document length can be saved using one of a predefined number of
>> ranges
>Is this indicative?

I do not know, though I plan on doing some tests of this and other
measurements with the spam I see in our mail.  

In general, I would not wish to make assumptions about what is indicative. 
If it can be easily and efficiently measured, then measure it and let the
Bayesian statistical probabilities decide whether to use it or not. 

>Hum. How many bins do we want? 

Some trial and error would initially be involved in deciding what works
reasonably.  Also, the ranges could easily be set as compile-time options
that any one site or user could play with to find what works well for them,
if they desired to do so.  
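
As a sketch of what such a compile-time option could look like (the edges
and the token format below are placeholders, not a proposal for specific
values):

    #include <stdio.h>

    /* range edges as a compile-time option; override with e.g.
       -DLEN_EDGES='{50,200,1000,5000}' */
    #ifndef LEN_EDGES
    #define LEN_EDGES { 100, 500, 2000, 10000 }
    #endif

    static const long edges[] = LEN_EDGES;
    #define NEDGES (sizeof edges / sizeof edges[0])

    static void length_pseudoword(long len, char *buf, size_t n)
    {
        size_t i;
        long lo = 0;

        for (i = 0; i < NEDGES && len > edges[i]; i++)
            lo = edges[i] + 1;
        if (i < NEDGES)
            snprintf(buf, n, "length %ld-%ld", lo, edges[i]);
        else
            snprintf(buf, n, "length %ld+", lo);
    }

    int main(void)
    {
        char buf[64];
        length_pseudoword(1234, buf, sizeof buf);
        puts(buf);              /* prints: length 501-2000 */
        return 0;
    }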

>> 	e.g. overlapping pseudo words for shouting	 
>> 		"shouting 0-10%"
>That's something SpamAssassin does already. Personally, I don't think
>bogofilter should duplicate that.

Why not?  I don't want to use SpamAssassin, so there is nothing duplicated
for me, and this is certainly a stylistic element of the text of the
document; one that could be easily and efficiently measured by bogofilter.
It is also one that may be best measured with an understanding of the word
structure of the document, which is part of what bogofilter already has to
do.  
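
As a sketch, assuming the main loop has already counted all-upper-case
tokens (the decile buckets are a guess to be tuned, and the overlapping
variant would simply emit a second token with buckets shifted by half a
width):

    #include <stdio.h>

    /* turn a shouting ratio into a pseudo word like "shouting 0-10%" */
    static void shouting_pseudoword(long n_upper, long n_tokens,
                                    char *buf, size_t n)
    {
        long pct = n_tokens ? 100 * n_upper / n_tokens : 0;
        long lo = (pct >= 100) ? 90 : (pct / 10) * 10;
        snprintf(buf, n, "shouting %ld-%ld%%", lo, lo + 10);
    }

    int main(void)
    {
        char buf[32];
        shouting_pseudoword(7, 100, buf, sizeof buf);
        puts(buf);              /* prints: shouting 0-10% */
        return 0;
    }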

>> The pseudo words are generated once at the end of the document, after the
>> main while(get_token) loop, but otherwise would be used just like any other
>> word, except that during lookups, the weighting of some of the pseudo words
>> should perhaps be different than a regular word - sensible values for the
>> weighting would be based on testing.

>The simpler approach to deal with this would be:
>
>1. if a word only has some characters upper-case, lower case it
>   (StudlyCaps need to be taken into account).
>2. if a word has all characters upper-case (say more than half), leave
>   case as is.

You mention a simpler approach, but my description of the pseudo words was
not limited to just measuring shouting.  It was the general technique that
would be used for any and all such measurements - convert the measurement
into a pseudo word, and then do a lookup on the frequency of that pseudo
word, exactly the same as if it were any other ordinary word. 
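
To illustrate with made-up counts: the pseudo word is just another string
in the word list, so the identical lookup code serves both.  (Everything
below is a toy, not bogofilter's real storage; the counts are invented.)

    #include <stdio.h>
    #include <string.h>

    struct entry { const char *word; int spam, ham; };

    /* toy wordlist: the pseudo word sits in the same table */
    static const struct entry list[] = {
        { "viagra",               97,  1 },
        { "meeting",               2, 80 },
        { "unknown words 11-20%", 60,  5 },
    };

    static double spamicity(const char *tok)
    {
        size_t i;
        for (i = 0; i < sizeof list / sizeof list[0]; i++)
            if (strcmp(list[i].word, tok) == 0)
                return (double)list[i].spam
                       / (list[i].spam + list[i].ham);
        return 0.5;             /* unseen token: neutral */
    }

    int main(void)
    {
        /* looked up exactly like any ordinary word */
        printf("%.2f\n", spamicity("unknown words 11-20%"));
        return 0;
    }

A per-token weighting for pseudo words would then just scale the value this
lookup returns.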

>> We're currently folding everything to lower case, discarding information.

The following was sent to me, and is interesting and possibly relevant.  

>Malcolm,
>
>I just saw this.  It's a new discussion by the creator of the Bayesian
>approach to spam filtering.
> http://paulgraham.com/better.html


>> Keep a third word list file, consisting of valid (english or whatever)
>> words, and for each message, count the existence of words that are not
>> recognized based on the dictionary.  Save the proportion of such words as a
>> percentage of the number of words.  
>>
>> 	"unknown words 1-10%"
>> 	"unknown words 11-20%"	... etc...  and/or an absolute count

>I can read English, German, French and a fairly large part would cause
>false negatives

No, for several reasons.

-1- for example, if I have just an English dictionary, then English spam
containing random text would likely generate an unknown word count of about
10-20% (pseudo word "unknown words 11-20%").  This pseudo word would have a
highish count in the spam list, and a low count in the ham list.  

However, if you send me a mail message in, say, German, then it would end
up with an unknown word count of around 90%, generating the pseudo word
"unknown words 90-100%".  This pseudo word would not have a high count in
the spam list.	

The result would be that the English dictionary would help detect English
spam in English mail, and would make no difference for non-English mail.  

-2- if you commonly use multiple languages then simply combine a set of
dictionaries into one master dictionary.  The random text at the end of a
subject line such as "Hi there XcXCVfg67FH" is still going to look random
no matter what languages you have in your list.  

-3- ultimately it doesn't matter anyway.  If your use of languages means
that this metric doesn't help you, then bogofilter will end up not using
it, which is no worse than today.  For those people for whom it does make a
difference, it will end up being used.  

>I'm not sure if it's worth the effort.  

I do not know if it is either, but it doesn't look like that much
programmer effort.  There are various word lists available that could be
used virtually as-is as dictionaries, and the code to do the word lookup
into that list would be virtually identical to the current word lookups. 
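
For instance, with a toy in-memory dictionary standing in for a real word
list such as /usr/share/dict/words (linear search keeps the sketch short;
real code would reuse the existing hashed lookups):

    #include <stdio.h>
    #include <string.h>

    static const char *dict[] = { "hello", "there", "friend" };

    static int known(const char *w)
    {
        size_t i;
        for (i = 0; i < sizeof dict / sizeof dict[0]; i++)
            if (strcmp(dict[i], w) == 0)
                return 1;
        return 0;
    }

    int main(void)
    {
        char tok[256];
        long n = 0, unknown = 0;

        while (scanf("%255s", tok) == 1) {
            n++;
            if (!known(tok))
                unknown++;
        }
        if (n) {
            long pct = 100 * unknown / n;
            long lo = (pct >= 100) ? 90 : (pct / 10) * 10;
            printf("unknown words %ld-%ld%%\n", lo, lo + 10);
        }
        return 0;
    }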

Matthias Andree wrote:
>Judy is history for bogofilter.

Of course the particular technique for efficiently counting word
frequencies within documents doesn't matter to me at all; Judy was just
mentioned to show the idea had at least one workable technique.  

Thanks for listening. 



