relson at osagesoftware.com
Wed Mar 3 14:50:02 EST 2004
On Wed, 03 Mar 2004 11:41:57 -0800, Greg McCann wrote:
> On 3/3/2004 at 1:49 PM David Relson <relson at osagesoftware.com> wrote:
> >True. Ignoring tokens (via ignore lists) is different from ignoring
> >lines. What ideas have you on this? So far, "ignore 'X-ABC:' lines"
> >and "ignore 1st n ABC:" lines have been suggested. What else?
> I have a funny problem with my scoring that an ignore wordlist would
> probably help. Email headers (at least with sendmail) always contain
> the current date. My ham and spam corpuses (corpi?) are all from
> recent email and my spam corpus, which gets automatically updated from
> spamtrap addresses, is updated much more frequently than my ham, with
> about 1200 new spam every day.
> The unexpected consequence is that every time the month changes, the
> abbreviation for the current month instantly gets a very high spam
> score until I manually throw some more ham at it. Here's one from
> "rcvd:Mar" 3790 0.000000 0.029003 0.999998 +
> In this case, an ignore wordlist would probably be more useful than
> ignoring lines, since the "Received:" lines that contain the date also
> contain lots of other useful information, like the sender domain and
> IP address.
Sounds like it would indeed help you (and possibly other newbies).
Since I've run bogofilter through a full set of months, it wouldn't help me.
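
To make the "ignore wordlist" idea concrete, here is a minimal sketch of dropping month tokens before scoring. It is an illustration only, not bogofilter's implementation: the names IGNORE and drop_ignored are hypothetical, and the sample tokens are made up rather than taken from real tokenizer output.

```python
# Hypothetical ignore wordlist: the twelve "rcvd:" month tokens.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
IGNORE = {"rcvd:" + month for month in MONTHS}


def drop_ignored(tokens, ignore=IGNORE):
    """Return the tokens that are not on the ignore list, in order."""
    return [tok for tok in tokens if tok not in ignore]


# Illustrative tokens such as one Received: line might yield (assumed values).
tokens = ["rcvd:from", "rcvd:example.com", "rcvd:Wed", "rcvd:Mar", "rcvd:2004"]
kept = drop_ignored(tokens)
print(kept)
```

Only the month token is discarded, so the rest of the "Received:" line (domain, IP address and so on) still contributes to the score.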
FYI, this is what I have:
[relson at osage bogofilter]$ bogoutil -p $BOGOFILTER_DIR rcvd:Jan rcvd:Feb
rcvd:Mar rcvd:Apr rcvd:May rcvd:Jun rcvd:Jul rcvd:Aug rcvd:Sep
            spam   good    Fisher
rcvd:Jan    6378   8999  0.469933
rcvd:Feb    2734   4388  0.438005
rcvd:Mar    3252   5266  0.435817
rcvd:Apr    3311   4717  0.467527
rcvd:May    4295   4987  0.518607
rcvd:Jun    4630   3478  0.624794
rcvd:Jul    2887   3948  0.477728
rcvd:Aug    3299   4685  0.468317
rcvd:Sep    6373   6239  0.560969
rcvd:Oct    4670   8247  0.414633
rcvd:Nov    7869   9985  0.496423
rcvd:Dec   10706  11200  0.544566
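
For anyone wondering why a new month scores so high while the numbers above hover near 0.5, here is a rough sketch of a Robinson-style per-token score. Everything in it is an assumption for illustration: the function name token_score, the message totals of 10000, and the robs/robx values (check your own bogofilter.cf for the real ones). It also assumes that the 3790 in Greg's line is a spam count with no ham sightings.

```python
def token_score(spam_count, good_count, spam_msgs, good_msgs,
                robs=0.0178, robx=0.52):
    """Robinson-style smoothed spam probability f(w) for one token.

    robs and robx are assumed values for the prior's weight and the
    prior itself; spam_msgs and good_msgs are the corpus sizes.
    """
    n = spam_count + good_count
    if n == 0:
        return robx  # token never seen: fall back to the prior
    spam_rate = spam_count / spam_msgs
    good_rate = good_count / good_msgs
    p = spam_rate / (spam_rate + good_rate)
    # Blend the observed ratio with the prior, weighted by sightings.
    return (robs * robx + n * p) / (robs + n)


# A month token seen only in spam so far (hypothetical corpus sizes).
new_month = token_score(3790, 0, 10000, 10000)

# A token seen equally often in both corpora (hypothetical counts).
balanced = token_score(3000, 3000, 10000, 10000)

print(f"{new_month:.6f} {balanced:.6f}")
```

With zero ham sightings the observed ratio is 1 regardless of corpus size, so the score is pinned just under 1.0 until some ham containing the token is registered. Once both corpora have seen the month, the score settles toward the middle, as in the table above.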