Random Thoughts from coffee

Fri May 21 13:06:36 CEST 2004

I'm sitting here poring over my daily dose of spam and coffee and came 
up with a thought that actually applies to bogofilter.

Someone mentioned the idea of using a filter like:
#:0
#* ^X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000
#.spam/

To eliminate the "certain" spam from any examination, implying that 
everything <1.000 needed human review.

Similarly, I guess I could run a filter for 0.000000 to do the same thing.

Then the coffee kicked in...

I've got a ternary configuration and it seems that my Ham is pretty 
solid and my Spam isn't so solid.  So I'm not really in a position to 
just skip any kind of review on the categories of ham/spam and start 
archiving them without some kind of quarantine/review process to 
manually ensure I'm not getting any mistakes in there.

But there does seem to be a value of spamicity, above which I would not 
review.  I would probably start with >0.999999 and slowly move down from 
there.  The idea being that the closer a score comes to 0.500, whether 
for Spam or Ham, the more attentive I should be because of the 
increased likelihood that I'll have a false reading.
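
A rough sketch of that review rule in Python, just to make it concrete.
The names needs_review and review_order are made up for illustration,
the 0.999999 starting value is the one mentioned above, and ordering by
distance from 0.5 is only my reading of "more attentive", not anything
bogofilter does itself:

```python
from decimal import Decimal

# Assumed starting point: anything scoring above this is not reviewed.
NO_REVIEW_ABOVE = Decimal("0.999999")


def needs_review(spamicity: str,
                 no_review_above: Decimal = NO_REVIEW_ABOVE) -> bool:
    """Return True unless the score is above the no-review threshold."""
    return Decimal(spamicity) <= no_review_above


def review_order(scores):
    """Sort score strings so the most doubtful (closest to 0.5) come first."""
    return sorted(scores, key=lambda s: abs(Decimal(s) - Decimal("0.5")))
```

Lowering NO_REVIEW_ABOVE a little at a time would correspond to "slowly
moving down" as confidence in the high scores grows.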

To sort this, I could push a prefix onto every Subject line, repeating 
a character "X" once for each 0.1 increment of spamicity.  I suppose 
with enough caffeine something like this could be achieved using 
procmail...  In other words, I'm trying to push all my mail into one of 
~10 buckets (0.1 increments), with each bucket becoming a separate 
folder that would get more/less scrutiny than its predecessor.  
(procmail could filter on 0.1, 0.2, 0.3 very nicely too)
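
To show what I mean by buckets, here is a minimal Python sketch rather
than the procmail version.  It assumes the X-Bogosity header format
shown in this message; bucket_for and tag_subject are hypothetical
helper names, and the "[XXX] " bracket style for the prefix is just one
possible choice:

```python
import re
from decimal import Decimal

# Pull the spamicity value out of an X-Bogosity header line.
SPAMICITY_RE = re.compile(r"^X-Bogosity:.*\bspamicity=([01]\.\d+)",
                          re.MULTILINE)


def bucket_for(headers: str) -> int:
    """Return the 0.1-wide bucket (0..10) for the message's spamicity."""
    match = SPAMICITY_RE.search(headers)
    if match is None:
        raise ValueError("no X-Bogosity spamicity found")
    # Decimal avoids binary floating-point surprises at bucket edges.
    return int(Decimal(match.group(1)) * 10)


def tag_subject(subject: str, bucket: int) -> str:
    """Prefix the subject with one 'X' per 0.1 of spamicity."""
    return "[" + "X" * bucket + "] " + subject
```

With this, a score of 0.728257 lands in bucket 7 and a score of exactly
1.000000 gets a bucket of its own (10), which keeps the "certain" spam
apart from everything else.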

In a very ugly sense, I'm trying to break up a ternary system into a ... 
decanary (base ten, not decadent, Latin was 20+ years ago so forgive me) 
with many gradations for filtering email to better catch that grey area 
of spam.

As an example:  I currently have a set of Unsure Spam that has scores of:
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500145, version=0.17.5
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.540586, version=0.17.5
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.590084, version=0.17.5
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.728257, version=0.17.5
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.742948, version=0.17.5
X-Bogosity: Unsure, tests=bogofilter, spamicity=0.758808, version=0.17.5

So I can't very well adjust my spam_cutoff accordingly; I'd end up with 
a lot of false positives (is that the correct term for mis-classified ham?).

But I could push these into different "buckets" easily enough and save 
some filtering.

What I'm getting at is this:

Does it make sense to subdivide the different cutoff regions into 
smaller segments to aid human review?  Even plain sorting by score 
would effectively segment the email.

Is this something that might be appropriate for inclusion in 
bogofilter?  If so, how?

Or should I just knuckle down with procmail for 6 months and come back 
when I have some more hard evidence?  (I'll probably do that today anyway.)

Thoughts?