testing parsing changes

David Relson relson at osagesoftware.com
Sat Nov 8 19:18:15 CET 2003


On Sat, 08 Nov 2003 18:30:52 +0100
Boris 'pi' Piwinger <3.14 at logic.univie.ac.at> wrote:

> David Relson <relson at osagesoftware.com> wrote:
> 
> >Modification D:
> >	Recognize <!DOCTYPE HTML PUBLIC.*> as the beginning of html
> >	text.
> >
> >Modification T:
> >	Accept two character tokens, e.g. "AB", "sp", ...
> 
> So not exactly my patch which also allows numbers?
> 
> >To establish a baseline result, parts 2, 3, and 4 of the spam
> >messages were scored using bogofilter's default parameters
> >(spam_cutoff=0.95, min_dev=0.100, robs=0.010, robx=0.415).  
> 
> Full training that is. Not the way to get most significant
> tokens.

Full training is good.  In this case, half the test messages are being
used for training and half are for testing.

FWIW, I have test results that show full training does as well as (or
better than) partial training.  Whether full training or train-on-error
does better seems to depend on the messages being used.  Neither
technique is (without question) superior.
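
For anyone who wants the distinction spelled out, here is a rough
Python sketch (not bogofilter code; the classifier object and its
train()/score() methods are hypothetical):

    # Full training registers every message; train-on-error registers a
    # message only when the current wordlists would have misclassified it.

    def full_training(classifier, messages):
        for msg, is_spam in messages:
            classifier.train(msg, is_spam)

    def train_on_error(classifier, messages, spam_cutoff=0.95):
        for msg, is_spam in messages:
            scored_as_spam = classifier.score(msg) >= spam_cutoff
            if scored_as_spam != is_spam:
                classifier.train(msg, is_spam)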


> >The numbers of false
> >negatives are printed for each of the 3 parts, as well as a total
> >count.  These numbers provide an indication of how accurately
> >bogofilter scores spam (though without an indication of the ham
> >scoring).
> 
> Without results for ham this doesn't say much. Actually,
> false positives are way more important. As I described in my
> mail about my 2-byte-token/numeric-token test, most added
> tokens (pure training on error) were significant, i.e., they
> contribute to the calculation. Many of them go to the ham side,
> and reducing false positives is IMHO even more useful.
> 
> So my question for you is: Do you get many significant
> tokens?
> 
> So most interesting for me is looking at unmodified
> parameters (i.e., not using your target). Are both false
> positives and false negatives reduced?
> 
> Also interesting: What happens if you take your real
> parameters, not the standard? This would show what really
> happens.

As stated, this is a test to determine whether the lexer changes
contribute to improved parsing or not.  Most bogofilter users use the
default parameters, so calling them "not real" is a mistake.
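
Those defaults are the values listed above; in a bogofilter.cf they
would look something like this (exact option spellings may vary across
bogofilter versions):

    spam_cutoff=0.95
    min_dev=0.100
    robs=0.010
    robx=0.415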


> >The more interesting results are found next, using the following
> >method: The ham messages are scored and the results are sorted.  A
> >target cutoff of 0.25% (of the messages, i.e. 52 for test 1 and 59
> >for test 2) is used to find the cutoff value that gives 0.25% false
> >positives.  This cutoff value is then used in scoring the 3 sets of
> >spam to see how many of them are scored below the cutoff, i.e. how
> >many false negatives occur using the cutoff value.
> 
> Also here you don't give false positives. The target is not
> guaranteed to work as expected, it can be a bit off due to
> several messages with the same score (I have observed that
> in tests). Also your target is way too high for my taste: it
> is 1 false positive in 400 messages (in my case that would
> be every other day!).

pi,

I _do_ give false positive counts.  One of the test parameters is
setting the false positive target (for ham) at 0.25% of the number of
ham messages.  That count is used (along with the scores for the ham
messages) to determine the cutoff value.  Once the cutoff value is
determined, the spam messages are all scored and the number of false
negatives is reported.
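
In other words, the procedure is roughly the following (a Python
sketch, not the actual test scripts; the lists of scores are assumed
inputs):

    def cutoff_for_fp_target(ham_scores, target_rate=0.0025):
        # Number of ham messages allowed at or above the cutoff at 0.25%,
        # e.g. 52 for test 1 and 59 for test 2.
        allowed_fp = max(1, int(len(ham_scores) * target_rate))
        ranked = sorted(ham_scores, reverse=True)
        # Put the cutoff at the allowed_fp-th highest ham score; ties can
        # push the actual false positive count slightly off target.
        return ranked[allowed_fp - 1]

    def false_negatives(spam_scores, cutoff):
        # Spam scoring below the cutoff is missed, i.e. a false negative.
        return sum(1 for s in spam_scores if s < cutoff)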

The test did not include your "numeric" change.  Once you start allowing
a digit at the beginning of a token, then values like "110.2.43" go into
the wordlist.  When I did a quick experiment with that lexer change, the
quantity of numeric tokens was large and the tokens didn't appear to be
important.  For thoroughness, I _will_ run the test (but not until after
the weekend).
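
To illustrate what that change does, here are Python regexes standing
in for the actual flex rules (the patterns are simplified guesses, not
the real ones):

    import re

    text = 'client at 110.2.43 sent sp AB 2nd'

    # Tokens must start with a letter (roughly the current behaviour).
    letters_first = re.findall(r'\b[A-Za-z][A-Za-z0-9._-]+\b', text)

    # Tokens may also start with a digit (the "numeric" change).
    digits_allowed = re.findall(r'\b[A-Za-z0-9][A-Za-z0-9._-]+\b', text)

    print(letters_first)   # ['client', 'at', 'sent', 'sp', 'AB']
    print(digits_allowed)  # adds '110.2.43' and '2nd' to the list above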

David



