Random Thoughts from coffee

Fri May 21 13:29:01 CEST 2004

On Fri, 21 May 2004 07:06:36 -0400
Tom Allison wrote:

> I'm sitting here poring over my daily dose of spam and coffee and
> came up with a thought and it actually applies to bogofilter.
> 
> Someone mentioned the idea of using a filter like:
> #:0
> #* ^X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000
> #.spam/
> 
> To eliminate the "certain" spam from any examination, implying that 
> everything <1.000 needed human review.
> 
> Similarly I guess I could run a filter for 0.000000 to do the same
> thing.
> 
> Then the coffee kicked in...
> 
> I've got a ternary configuration and it seems that my Ham is pretty 
> solid and my Spam isn't so solid.  So I'm not really in a position to 
> just skip any kind of review on the categories of ham/spam and start 
> archiving them without some kind of quarantine/review process to 
> manually ensure I'm not getting any mistakes in there.
> 
> But there does seem to be a value of spamicity, above which I would
> not review.  I would probably start with >0.999999 and slowly move
> down from there.  The idea being that the closer to 0.500 one comes in
> their score for either Spam or Ham, the more attentive I should be
> because of the increased likelihood that I'll have a false reading.
> 
> In order to sort this I could present this by pushing a prefix in
> every Subject line with repetitions of a character "X" for each 0.1
> increment of spamicity.  I suppose with enough caffeine something like
> this could be achieved using procmail...  In other terms I'm trying to
> push all my mail into one of ~10 buckets (0.1 increments) with each
> bucket becoming a separate folder that would get more/less scrutiny
> than its predecessor.  (procmail could filter on 0.1, 0.2, 0.3 very
> nicely too)

Procmail can certainly do that.  Matching on "spamicity=0.N", where N is
the first decimal digit, would put copies into the 10 buckets, and you
could check them periodically.
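
As a rough illustration of that "spamicity=0.N" match, here is a minimal
sketch in Python rather than procmail.  It assumes the default header
format quoted in this thread; the function name bucket_folder and the
"bucket-N" folder names are made up for the example.  Note that treating
1.000000 separately gives eleven folders rather than ten.

```python
import re

# Matches the default header seen in this thread, e.g.
# "X-Bogosity: Unsure, tests=bogofilter, spamicity=0.742948, version=0.17.5"
SPAMICITY = re.compile(r"^X-Bogosity:.*\bspamicity=([01])\.(\d)", re.MULTILINE)


def bucket_folder(headers: str) -> str:
    """Return a (hypothetical) folder name from the first decimal digit."""
    m = SPAMICITY.search(headers)
    if m is None:
        return "bucket-unknown"  # no X-Bogosity header found
    if m.group(1) == "1":
        return "bucket-10"       # spamicity=1.000000, the "certain" spam
    return f"bucket-{m.group(2)}"
```

The digit is taken from the header text itself, the same way a procmail
condition would see it, so no floating-point rounding is involved.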

I'm presently using ham_cutoff=0.400000 and spam_cutoff=0.501000.  All
the ham and unsure mail is looked at and the spam is pretty much ignored.
Having extra buckets wouldn't be of value to me.
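
For reference, the three-way split those two cutoffs produce can be
sketched as below.  This is only an illustration in Python, not
bogofilter's code; in particular, treating a score exactly equal to a
cutoff as ham or spam (rather than unsure) is an assumption.

```python
def classify(spamicity: float,
             ham_cutoff: float = 0.400000,
             spam_cutoff: float = 0.501000) -> str:
    """Three-way classification using the cutoffs mentioned above."""
    if spamicity >= spam_cutoff:   # assumption: boundary counts as spam
        return "spam"
    if spamicity <= ham_cutoff:    # assumption: boundary counts as ham
        return "ham"
    return "unsure"
```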

> In a very ugly sense, I'm trying to break up a ternary system into a
> ... decanary (base ten, not decadent, Latin was 20+ years ago so
> forgive me) with many graduations for filtering email to better catch
> that grey area of spam.
> 
> As an example:  I currently have a set of Unsure Spam that has scores
> of:
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500145, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.540586, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.590084, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.728257, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.742948, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.758808, version=0.17.5

Bogofilter provides a lot of user flexibility in formatting the
X-Bogosity line.  I suspect most of the flexibility isn't used, but
that's a different subject :-)  Look at the "SPAM_HEADER_NAME" and
"Format of SPAM_HEADER" sections of bogofilter.cf.example.  You can have
lines like:

  X-Bogosity: H 0.000145 0.17.5
  X-Bogosity: U 0.400145 0.17.5
  X-Bogosity: U 0.500145 0.17.5
  X-Bogosity: U 0.600145 0.17.5
  X-Bogosity: S 0.900145 0.17.5

which should make filtering into 10 buckets easy.
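
To show why the shorter lines are easier to work with, here is a small
Python sketch that splits such a line into its three fields and picks
the bucket.  The names Bogosity, parse_terse and bucket are invented for
the example, and it assumes exactly the "tag score version" layout shown
above.

```python
from typing import NamedTuple


class Bogosity(NamedTuple):
    tag: str        # H, U or S in the lines above
    spamicity: str  # kept as text, e.g. "0.600145"
    version: str


def parse_terse(line: str) -> Bogosity:
    """Split 'X-Bogosity: U 0.600145 0.17.5' into its three fields."""
    name, _, value = line.partition(":")
    if name.strip() != "X-Bogosity":
        raise ValueError(f"not an X-Bogosity line: {line!r}")
    tag, spamicity, version = value.split()
    return Bogosity(tag, spamicity, version)


def bucket(b: Bogosity) -> int:
    """First digit after the decimal point; 1.000000 maps to 10."""
    whole, _, frac = b.spamicity.partition(".")
    return 10 if whole == "1" else int(frac[0])
```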

> So I can't very well adjust my spam_cutoff accordingly, or I'll have a
> lot of false positives (is that the correct term for misclassified
> ham?).

Correct.

> But I could push these into different "buckets" easily enough and save
> some filtering.
> 
> What I'm getting at is this:
> 
> Does it make sense to subdivide the different cutoff regions into 
> smaller segments to aid human review?  Even if sorting was applied,
> it would effectively segment the email.

It's an experiment whose value only you can determine since it's your
time that's at stake.

> Is this something that might be appropriate for inclusion into 
> bogofilter?  How?

Extending the formatting rules to include a "decade" output wouldn't be
hard.  However, it seems like overkill.
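
In the meantime the same "decade" value can be computed outside
bogofilter.  The Python sketch below works from the spamicity text in
the header and builds the repeated-"X" Subject prefix described earlier
in the thread.  The function names and the bracketed prefix layout are
only illustrative choices.

```python
from decimal import Decimal


def decade(spamicity: str) -> int:
    """Number of whole 0.1 steps in a score such as '0.742948'."""
    # Decimal keeps the arithmetic exact; int() truncates toward zero.
    return int(Decimal(spamicity) * 10)


def tag_subject(subject: str, spamicity: str, mark: str = "X") -> str:
    """Prefix the subject with one mark per 0.1 of spamicity."""
    n = decade(spamicity)
    return f"[{mark * n}] {subject}" if n else subject
```

With the sample scores above, tag_subject("Hello", "0.540586") gives
"[XXXXX] Hello", and a score below 0.1 leaves the subject untouched.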

> Or should I just knuckle down with procmail for 6 months and come back
> when I have some more hard evidence?  (I'll probably do that today
> anyways)

Go for the evidence :-)
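
One simple way to collect that evidence is to count how many messages
land in each 0.1 bucket.  The Python sketch below does this for the six
Unsure scores quoted earlier; the tally function is a made-up helper,
and real use would read the scores from saved mail instead of a list.

```python
from collections import Counter
from decimal import Decimal

# The six Unsure scores quoted earlier in this thread.
SCORES = ["0.500145", "0.540586", "0.590084",
          "0.728257", "0.742948", "0.758808"]


def tally(scores):
    """Count scores per 0.1 bucket (bucket 5 covers 0.5 up to 0.6)."""
    return Counter(int(Decimal(s) * 10) for s in scores)


counts = tally(SCORES)
for b in sorted(counts):
    print(f"{b / 10:.1f}: {counts[b]}")
```

For these six messages the tally is three in the 0.5 bucket and three in
the 0.7 bucket.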

David