Idea for improving the learning stage
mouss
mlist.only at free.fr
Fri Sep 7 23:53:06 CEST 2007
David Relson wrote:
> On Fri, 7 Sep 2007 10:14:56 +0000 (UTC)
> Andrew wrote:
>
>> On Thu, 6 Sep 2007 21:33:42 -0400,
>> David Relson <relson at osagesoftware.com> wrote:
>>
>>> The intelligence you suggest belongs in a script driving bogofilter.
>>> With claws-mail I have two actions "classify as spam" and "classify
>>> as ham". These actions forward the messages to special addresses
>>> on my mail server and procmail spots the messages and passes them
>>> to a reclassify script. The reclassify script looks at the
>>> forwarding address and the message's X-Bogosity line then invokes
>>> bogofilter with appropriate flags. For example, since "X-Bogosity:
>>> Spam" plus "forward as ham" indicates a "False Positive", bogofilter
>>> gets run with "-S -n". Note that all the decision making is
>>> _outside_ of bogofilter.
>>
>> So how could an external script tell bogofilter to "ignore the
>> subject" or "ignore the body" ?
>>
>>
>> Regards,
>> Andrew
>
> Bogofilter doesn't have such capabilities, nor does it need them. If
> you want part of a message to be excluded, a copy of the message needs
> to be created without that part. Tools that you should consider are
> formail, awk, and grep.
>
> formail is a very powerful tool for working with email messages. Read
> its man page.
>
> grep can be used for simple exclusion tasks. For example, to exclude
> only the subject:
>
> grep -v ^Subject: < message | bogofilter ...
[body only]
Isn't "Subject" itself a token, so that removing it means it is no longer
neutral? I mean, suppose you remove the Subject line from a thousand spam
messages: "Subject" may then become a ham sign, which it should not be.
[subject only]
And if you only train on the subject, you will miss the spammy body tokens.
It would be more interesting to "duplicate" the message and train
multiple times: once with body+subject and once with the subject alone.
However, one should then also train each ham message N times (N>=2) to
avoid skewing the filter.
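
A rough sketch of what I mean, in the spirit of the "script driving
bogofilter" approach. The helper names (subject_only, train_twice) and the
BOGOFILTER variable are made up for illustration; the only things taken from
bogofilter itself are the -s (register as spam) and -n (register as ham)
flags. Unlike the grep one-liner, the awk filter stops at the end of the
header section, so a body line that happens to start with "Subject:" is left
alone.

```shell
#!/bin/sh
# Hypothetical helpers; not part of bogofilter.
# BOGOFILTER may be overridden, e.g. to point at a wrapper script.
BOGOFILTER=${BOGOFILTER:-bogofilter}

# Read a message on stdin and print only its Subject header
# (including folded continuation lines), followed by the empty
# line that terminates the header section.
subject_only() {
    awk '
        /^$/            { exit }                    # end of headers
        /^[Ss]ubject:/  { keep = 1; print; next }   # the Subject header
        /^[ \t]/        { if (keep) print; next }   # folded continuation
                        { keep = 0 }                # any other header
    '
    echo
}

# train_twice FLAG FILE
# Register FILE once as a whole (body+subject) and once more
# as a subject-only copy, using the same bogofilter flag both times.
train_twice() {
    flag=$1 msg=$2
    "$BOGOFILTER" "$flag" < "$msg" &&
    subject_only < "$msg" | "$BOGOFILTER" "$flag"
}
```

Usage would be "train_twice -s spam.msg" for spam and "train_twice -n
ham.msg" for ham. Running ham through the same function keeps the number of
registrations per message equal on both sides, which is the point of the
N>=2 remark above.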