Random Thoughts from coffee
David Relson
relson at osagesoftware.com
Fri May 21 13:29:01 CEST 2004
On Fri, 21 May 2004 07:06:36 -0400
Tom Allison wrote:
> I'm sitting here poring over my daily dose of spam and coffee and
> came up with a thought and it actually applies to bogofilter.
>
> Someone mentioned the idea of using a filter like:
> #:0
> #* ^X-Bogosity: Yes, tests=bogofilter, spamicity=1.000000
> #.spam/
>
> To eliminate the "certain" spam from any examination, implying that
> everything <1.000 needed human review.
>
> Similarly I guess I could run a filter for 0.000000 to do the same
> thing.
>
> Then the coffee kicked in...
>
> I've got a ternary configuration and it seems that my Ham is pretty
> solid and my Spam isn't so solid. So I'm not really in a position to
> just skip any kind of review on the categories of ham/spam and start
> archiving them without some kind of quarantine/review process to
> manually ensure I'm not getting any mistakes in there.
>
> But there does seem to be a value of spamicity, above which I would
> not review. I would probably start with >0.999999 and slowly move
> down from there. The idea being that the closer to 0.500 one comes in
> their score for either Spam or Ham, the more attentive I should be
> because of the increased likelihood that I'll have a false reading.
>
> In order to sort this I could present this by pushing a prefix in
> every Subject line with repetitions of a character "X" for each 0.1
> increment of spamicity. I suppose with enough caffeine something like
> this could be achieved using procmail... In other terms I'm trying to
> push all my mail into one of ~10 buckets (0.1 increments) with each
> bucket becoming a separate folder that would get more/less scrutiny
> than its predecessor. (procmail could filter on 0.1, 0.2, 0.3 very
> nicely too)
Procmail can certainly do that. "spamicity=0.N" would easily put copies
into the 10 buckets and you could check them periodically.
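
As a rough illustration of the bucketing idea (not a procmail recipe, and
not part of bogofilter), here is a minimal Python sketch. It assumes the
default verbose header format quoted above, reads the first decimal digit
after "spamicity=" the same way a "spamicity=0.N" match would, and puts
1.000000 in the top bucket. The names bucket() and subject_prefix() are
made up for this example.

```python
import re

# Matches the score in headers such as
# "X-Bogosity: Unsure, tests=bogofilter, spamicity=0.728257, version=0.17.5"
SPAMICITY_RE = re.compile(r"spamicity=([01])\.(\d)")


def bucket(header: str) -> int:
    """Return a bucket number 0..9 taken from the first decimal digit."""
    match = SPAMICITY_RE.search(header)
    if match is None:
        raise ValueError("no spamicity value found in header")
    whole, tenth = match.groups()
    # Assumption: a score of 1.000000 goes in the top bucket with the 0.9s.
    if whole == "1":
        return 9
    return int(tenth)


def subject_prefix(header: str) -> str:
    """Return one 'X' per 0.1 of spamicity, as suggested in the quoted text."""
    return "X" * bucket(header)
```

With the sample scores quoted further down, a message scoring 0.728257
would land in bucket 7 and get the prefix "XXXXXXX".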
I'm presently using ham_cutoff=0.400000 and spam_cutoff=0.501000. All
the ham and unsure is looked at and the spam is pretty much ignored.
Having extra buckets wouldn't be of value to me.
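
For reference, the three-way split those two cutoffs produce can be
sketched as below. This is only an illustration in Python, not
bogofilter's code. It assumes that a score at or above spam_cutoff counts
as spam and a score at or below ham_cutoff counts as ham; check the
boundary behaviour against the bogofilter documentation before relying on
it. The function name classify() is made up for this example.

```python
def classify(score: float,
             ham_cutoff: float = 0.400000,
             spam_cutoff: float = 0.501000) -> str:
    """Return "Spam", "Ham" or "Unsure" for a spamicity score.

    Assumption: both cutoffs are inclusive (>= spam_cutoff is spam,
    <= ham_cutoff is ham); everything in between is unsure.
    """
    if score >= spam_cutoff:
        return "Spam"
    if score <= ham_cutoff:
        return "Ham"
    return "Unsure"
```

With these cutoffs the unsure band is narrow: only scores strictly between
0.4 and 0.501 end up there.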
> In a very ugly sense, I'm trying to break up a ternary system into a
> ... decanary (base ten, not decadent, Latin was 20+ years ago so
> forgive me) with many graduations for filtering email to better catch
> that grey area of spam.
>
> As an example: I currently have a set of Unsure Spam that has scores
> of:
>
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.500145, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.540586, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.590084, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.728257, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.742948, version=0.17.5
> X-Bogosity: Unsure, tests=bogofilter, spamicity=0.758808, version=0.17.5
Bogofilter provides a lot of user flexibility in formatting the
X-Bogosity line. I suspect most of the flexibility isn't used, but
that's a different subject :-) Look at the "SPAM_HEADER_NAME" and
"Format of SPAM_HEADER" sections of bogofilter.cf.example. You can have
lines like:
X-Bogosity: H 0.000145 0.17.5
X-Bogosity: U 0.400145 0.17.5
X-Bogosity: U 0.500145 0.17.5
X-Bogosity: U 0.600145 0.17.5
X-Bogosity: S 0.900145 0.17.5
which should make filtering into 10 buckets easy.
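
To show why the short format is easy to filter on, here is a small Python
sketch that maps such a line to a folder name. It assumes the header looks
exactly like the examples above (class letter, score, version), and the
folder names "bucket-0" through "bucket-9" are invented for this example.

```python
import re

# Matches short-format lines such as "X-Bogosity: U 0.500145 0.17.5"
SHORT_RE = re.compile(r"^X-Bogosity: ([HUS]) ([01])\.(\d)")


def folder_for(header: str) -> str:
    """Return a hypothetical folder name based on the first decimal digit."""
    match = SHORT_RE.match(header)
    if match is None:
        raise ValueError("not a short-format X-Bogosity line")
    _letter, whole, tenth = match.groups()
    # Assumption: a score of 1.000000 shares the top folder with the 0.9s.
    digit = 9 if whole == "1" else int(tenth)
    return f"bucket-{digit}"
```

The five example lines above would go to bucket-0, bucket-4, bucket-5,
bucket-6 and bucket-9 respectively.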
> So I can't very well adjust my spam_cutoff accordingly, I'll have a
> lot of false-positives (Is that the correct term for mis-classified
> ham?).
Correct.
> But I could push these into different "buckets" easily enough and save
> some filtering.
>
> What I'm getting at is this:
>
> Does it make sense to subdivide the different cutoff regions into
> smaller segments to aid human review? Even if sorting was applied,
> it would effectively segment the email.
It's an experiment whose value only you can determine since it's your
time that's at stake.
> Is this something that might be appropriate for inclusion into
> bogofilter? How?
Extending the formatting rules to include a "decade" output wouldn't be
hard. However, it seems like overkill.
> Or should I just knuckle down with procmail for 6 months and come back
> when I have some more hard evidence? (I'll probably do that today
> anyways)
Go for the evidence :-)
David