Newbie Q

Tue Oct 15 03:24:31 CEST 2002

Michele Bariani wrote:
> On Sunday 13 October 2002 12:54, Tom Allison wrote:
> 
>>How do I train it?
>>
>>Right now it's mostly wrong simply because it has no known history
>>to work from.
>>
>>How can I beef up its experience level?
> 
> 
> I've used a little Perl script to feed it with a "ham" mbox file and then a 
> "spam" one (I've been collecting messages for months to experiment on them 
> 8-). This way I've got some initial tuning on my own mail stream.
> I can send you the script off-list if you like, it's not to be used as an 
> example of good Perl ;-) but works ok.
> 
> 
> 
>>Which brings me to what is probably a much harder question: Is there
>>some way that I can have a client email a spam mail back and have
>>the mail used for correcting a bogofilter setting?  Right now I'm
>>not really sure how to accomplish this.
> 
> 
> I've been talking about this with a friend of mine, the idea would be to have 
> a single keystroke/button that adds a new (personalized) header to the 
> message and sends it back to the server. The rules on the server would see 
> the header and understand the message is a false positive/negative that needs 
> correction (and not a new one).
> 
> Michele
> 
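
For anyone who wants to try the mbox-feeding approach Michele describes, here
is a rough Python sketch of the idea. It is not Michele's Perl script, which
is not shown in this thread. It assumes bogofilter is on the PATH and uses only
the -s (spam) and -n (non-spam) switches that appear in the recipes below. The
function names and the injectable "runner" argument are made up for
illustration.

```python
import mailbox
import subprocess


def register(raw_message: bytes, flag: str) -> None:
    """Pipe one raw message to bogofilter; flag is "-s" (spam) or "-n" (ham)."""
    # Assumes a `bogofilter` executable is available on the PATH.
    subprocess.run(["bogofilter", flag], input=raw_message, check=True)


def feed_mbox(path: str, flag: str, runner=register) -> int:
    """Register every message in an mbox file and return how many were fed.

    `runner` is a hypothetical hook so the loop can be exercised without
    bogofilter installed.
    """
    box = mailbox.mbox(path, create=False)  # do not create a missing file
    count = 0
    try:
        for key in box.iterkeys():
            runner(box.get_bytes(key), flag)  # raw bytes of one message
            count += 1
    finally:
        box.close()
    return count
```

Something like feed_mbox("ham.mbox", "-n") followed by
feed_mbox("spam.mbox", "-s") would then give the initial tuning; the two file
names are placeholders for your own collections.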

Here's what I decided upon for now:

-----------------------------------
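# Tag every message with SpamAssassin first.  A filter recipe needs no
# lockfile, and procmail cannot derive an implicit one from a pipe anyway.
:0fw
| /usr/bin/spamc -f

# Disabled for now: let bogofilter tag the message itself (passthrough mode).
#:0fw
#| $BOGOFILTER -p -v

# Register SpamAssassin's verdict with bogofilter: -s = spam, -n = non-spam.
# $BOGOFILTER is assumed to be set earlier in the rcfile.  The lockfile is
# named explicitly (the name itself is arbitrary) because procmail cannot
# infer one from a piped command.
:0wc: bogofilter.lock
* ^X-Spam-Status: Yes
| $BOGOFILTER -s

:0wc: bogofilter.lock
* ^X-Spam-Status: No
| $BOGOFILTER -n

# Anything addressed to the bait account is treated as spam.  Note that such
# mail has already matched one of the two recipes above as well.
:0wc: bogofilter.lock
* ^TO_asdf
| $BOGOFILTER -s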
------------------------------------

I'm using SpamAssassin to set the criteria for spam.
I also created a dummy account, user=asdf, that I've posted in all the usual 
stupid places to put an email address (web pages, usenet posts in 
alt.business.*, and so on).

Between these two, I hope to get a decent feed of what makes spam.

Meanwhile, after looking into some discussions about Bayesian filters and 
some of the assumptions that were made in the original articles, I thought I 
would try gathering some statistics on words in general email content to see 
what happens with that.

A question on statistics: for now I was going to take the simple approach of 
taking each word ( =~ /(\w{3,})/) and counting how many times it shows up 
across an email file.

Are there any suggestions on how to do this better or more effectively?  I've 
noticed that the headers might be pretty useless, as words like From, 
Received, and To tend to hit the top of the list all the time.
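
To make the counting idea concrete, here is a minimal Python sketch of it,
rather than the Perl I would actually write. It applies the same \w{3,}
pattern to every match in a message, lowercases the words, and can optionally
count only the text/plain body so that header names drop out of the totals.
The function names and the skip_headers switch are invented for illustration,
and the charset fallback is an assumption, not a tested recommendation.

```python
import re
from collections import Counter
from email import message_from_string

WORD = re.compile(r"\w{3,}")  # same pattern as above: runs of 3+ word chars


def body_text(raw: str) -> str:
    """Return the decoded text/plain parts of one message, headers excluded."""
    msg = message_from_string(raw)
    parts = []
    for part in msg.walk():
        if part.get_content_type() != "text/plain":
            continue
        payload = part.get_payload(decode=True)  # bytes, or None for containers
        if not payload:
            continue
        charset = part.get_content_charset() or "ascii"
        try:
            parts.append(payload.decode(charset, errors="replace"))
        except LookupError:  # unknown charset name: fall back (assumption)
            parts.append(payload.decode("latin-1"))
    return "\n".join(parts)


def count_words(raw_messages, skip_headers: bool = True) -> Counter:
    """Count lowercased words across messages given as raw strings."""
    counts = Counter()
    for raw in raw_messages:
        text = body_text(raw) if skip_headers else raw
        counts.update(word.lower() for word in WORD.findall(text))
    return counts
```

Comparing count_words(msgs).most_common(20) with
count_words(msgs, skip_headers=False).most_common(20) should show how much of
the top of the list is header noise.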

-- 
Women wish to be loved without a why or a wherefore; not because they are
pretty, or good, or well-bred, or graceful, or intelligent, but because
they are themselves.
		-- Amiel

For summary digest subscription: bogofilter-digest-subscribe at aotto.com