Sunday test results

David Relson relson at osagesoftware.com
Mon Feb 17 16:45:42 CET 2003


At 09:56 AM 2/17/03, Greg Louis wrote:

>On 20030216 (Sun) at 17:47:36 -0500, David Relson wrote:
> > Hi Greg,
> >
> > Finally!  Today's test results.  For training I used 2 months of my data
> > (Oct & Nov) and for testing I used the other 2 months (Dec & Jan).  With
> > the 2-2 split and the date being 02/16, I've called the results
> > test.0216.22.tgz.
>
>I'm still highly suspicious about all those 0.5's, but there seems to
>be no getting around it: you have a small number (less than three
>percent) of spam in your test corpus that are distributed all across
>the spamicity-score spectrum.  The composition of your incoming email
>stream seems to be just plain _different_ from mine.  Unfortunately,
>the numbers are small; but still:

I think I'll take a look at some of the 0.5's, i.e. isolate those
messages and examine them individually.

One known characteristic of my spam is that many messages arrive in
duplicate, because two of my users are on the same spam list(s).  It seems
that of the 1650 spam messages in January, 600 had duplicate subjects.  I
suspect the distrib script is often putting one copy in the training set
and the other in the test set, which has to have _some_ effect on the
results.  I suppose I could try eliminating the 600 duplicates from the
test set and seeing what the results look like.
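
A rough way to count them, assuming one message per file in a
maildir-style folder (spam.jan/ here is just a stand-in for wherever the
January spam actually lives):

    # count distinct Subject: lines that occur more than once
    grep -h '^Subject:' spam.jan/* | sort | uniq -d | wc -l

That counts each duplicated subject once, so if the duplicates come in
pairs, the 600 duplicate messages should show up as about 300 distinct
subjects.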

FWIW, I computed counts on my wordlists.  I was curious how many subj:*
tokens I have and how they relate to the total counts.  Here are the results:

[relson at osage runex.dr]$ for i in test.?.d/*.db ; do \
      j=`basename $i .db`.txt ; \
      bogoutil -d $i > $j ; \
      echo $j "  " `wc -l < $j` "  " `grep '^subj:' $j | wc -l` ; \
    done
                 words  subj:  tagging
goodlist.txt    128562      0  no-tag
spamlist.txt     45114      0  no-tag
goodlist.txt    133166   4874  subj-tag
spamlist.txt     47639   3083  subj-tag
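
In other words, with tagging on, subj: tokens are a few percent of each
list.  Quick arithmetic on the counts above (percentages, nothing more):

    awk 'BEGIN { printf "good: %.1f%%  spam: %.1f%%\n",
                 100*4874/133166, 100*3083/47639 }'

which prints good: 3.7%  spam: 6.5% -- so the spamlist is proportionally
about twice as rich in subject tokens as the goodlist.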



>read.table("/root/scratch/gl.0216.22/parms.tbl",
>   col.names=c("tagging", "md", "cutoff", "run", "fp", "fn")) ->dr
>dr
>    tagging    md cutoff run fp fn
>1    notag 0.025    0.5   0  4 17
>2    notag 0.025    0.5   1  4 17
>3    notag 0.050    0.5   0  4 17
>4    notag 0.050    0.5   1  4 17
>5    notag 0.075    0.5   0  4 16
>6    notag 0.075    0.5   1  4 15
>7    notag 0.100    0.5   0  4 15
>8    notag 0.100    0.5   1  4 14
>9    notag 0.125    0.5   0  4 14
>10   notag 0.125    0.5   1  4 13
>11     tag 0.025    0.5   0  4 13
>12     tag 0.025    0.5   1  4 16
>13     tag 0.050    0.5   0  4 13
>14     tag 0.050    0.5   1  4 16
>15     tag 0.075    0.5   0  4 12
>16     tag 0.075    0.5   1  4 15
>17     tag 0.100    0.5   0  4 11
>18     tag 0.100    0.5   1  4 11
>19     tag 0.125    0.5   0  4  9
>20     tag 0.125    0.5   1  4 14

Since fn for tag is typically 4 lower than fn for notag (and 4 is about
25% of the notag value), it looks like tagging is distinctly helpful.
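
A quick check of that, averaging fn within each tagging group straight
from parms.tbl (assuming the six-column, headerless layout Greg's
read.table call implies):

    awk '{ sum[$1] += $6; n[$1]++ }
         END { for (t in sum) printf "%s: mean fn = %.1f\n", t, sum[t]/n[t] }' parms.tbl

which gives 15.5 for notag and 13.0 for tag over the ten runs each.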


>dr$tagging <- factor(dr$tagging)
>dr$md <- factor(dr$md)
>attach(dr)
>draov <- aov(fn ~ tagging + md + tagging*md)
>summary(draov)
>
>             Df Sum Sq Mean Sq F value  Pr(>F)
>tagging      1 31.250  31.250 11.3636 0.00711 **
>md           4 39.500   9.875  3.5909 0.04599 *
>tagging:md   4  1.500   0.375  0.1364 0.96509
>Residuals   10 27.500   2.750
>
>Tagging makes a difference, mindev a slight difference (almost too
>small to be significant, but still...) and there's no apparent
>interaction between the two.  In your case the runs aren't a factor so
>I treated them as replication.
>
>Increasing min_dev helps your discrimination and hurts mine; put
>another way, in your training corpus, the low-deviation tokens are poor
>in information characteristic of spam or nonspam, whereas in mine there
>must be useful information in that group.  How does that come about?  I
>have no idea.  The fact that in my case the tagging makes a positive
>difference _only_ with low min_dev suggests the possibility that my
>useful low-deviation tokens are to some extent associated with headers,
>but that does little to explain the differences from your results.

Perplexing ...  We need more hypotheses that we can test ...
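
For anyone following along: as I understand it, min_dev drops tokens
whose score lies within min_dev of 0.5 before the scores are combined.
A sketch of that filter, assuming a hypothetical scores.txt of
token/score pairs, one per line:

    # keep only tokens whose score deviates from 0.5 by at least md
    awk -v md=0.1 '{ dev = ($2 >= 0.5) ? $2 - 0.5 : 0.5 - $2
                     if (dev >= md) print }' scores.txt

Raising md throws away more of the near-0.5 tokens, which evidently
helps my corpus and hurts Greg's.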





