Using BF for scoring text on other types of polarities?

Wed Jun 24 22:47:34 CEST 2009

On Wed, Jun 24, 2009 at 03:42:16PM -0400, jkinz at kinz.org wrote:
>Are there any tools or documents, emails etc.. that give any
>hints about how you can do this? 
>
>Just fyi- I'm planning on using it on text, not protein data 
>or anything like that.  :)
>
>I hope to simply be able to train it using a collection of text
>files, one group would be ham (having attribute X). the other
>would be the spam group (not having attribute X).
>
>Attribute X would be a topic like C programming, or political
>discussions, things I can clearly define as being in or out (with
>some unsures.. :) ) 

I use bogofilter to additionally calculate an incoming email's 
"archivicity" (the likelihood that I will want to save it in my 
permanent email archive, excluding messages from machines, "uplifting" 
email forwards from one of my sisters, etc).

I just use -c to select a different config file, which specifies a 
different wordlist file, a different email header, different labels 
instead of "Ham" and "Spam", etc.
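
In script form, the idea is roughly the following. This is a Python 
sketch, not what I actually run: the function names are made up for 
illustration, and the only things it takes from bogofilter are the -c 
option and the exit codes as I understand them from the man page 
(0 spam, 1 nonspam, 2 unsure, anything else an error).

```python
import subprocess

# bogofilter's exit codes when classifying a message (per its man page).
EXIT_MEANING = {0: "spam", 1: "nonspam", 2: "unsure"}


def build_command(config_path):
    """Return the argv for classifying one message with an alternate config."""
    # -c names the config file bogofilter should read.
    return ["bogofilter", "-c", config_path]


def interpret_exit(code):
    """Translate a bogofilter exit code into a class name."""
    if code not in EXIT_MEANING:
        raise RuntimeError(f"bogofilter failed with exit code {code}")
    return EXIT_MEANING[code]


def classify(message_bytes, config_path):
    """Feed one message to bogofilter on stdin and return its class name."""
    result = subprocess.run(build_command(config_path), input=message_bytes)
    return interpret_exit(result.returncode)
```

Calling classify(raw_message, "config.archive") then scores the message 
against whatever wordlist that config names, leaving the normal spam 
database untouched.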

One word of caution: bogofilter's classification command line arguments 
(-n and -s) are very clearly for nonspam versus spam classification 
(versus another Bayesian tool, dbacl, where you have to specify the 
category explicitly), and that has profound usage implications.  Let me 
illustrate.

When I set this up initially, I wanted messages that I intended to 
keep in my archive to have a high numerical "archivicity", so a message 
I was certain to want in the archive would have a reported bogosity of 
1.00.  So I set up my tools to classify those messages with -s (OK, 
that's weird), and changed the "Spam" label to "Yes".  Vice versa for 
messages I was certain to not want: classify with -n, and change "Ham" 
to "No".

As long as the tools were handling it, everything would be fine, but a 
couple of times a year I'd try to classify something by hand, and I 
would *always* think "I want to retain this message, so classify with 
-n", which would change the statistics 100% in the wrong direction.

So finally I threw away my existing database, changed all my tools to 
switch -n and -s, and relabeled things so a message with a high numeric 
bogosity means that I probably *don't* want to keep it, and it's labeled 
"X-Archive: No".  The numbers don't make sense, but it keeps me from 
messing up my database.
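
Another way to guard against the same mistake is to never type -n or -s 
by hand and let a small wrapper pick the flag from the decision.  A 
minimal Python sketch, assuming my final arrangement (keep = -n, 
discard = -s); the function and dictionary names are hypothetical:

```python
# Which registration flag goes with which human decision: "keep" plays
# the role of nonspam (-n), "discard" plays the role of spam (-s).
FLAG_FOR_DECISION = {"keep": "-n", "discard": "-s"}


def training_command(decision, config_path):
    """Build the bogofilter argv that registers a message for `decision`."""
    try:
        flag = FLAG_FOR_DECISION[decision]
    except KeyError:
        raise ValueError(
            f"decision must be 'keep' or 'discard', not {decision!r}"
        ) from None
    return ["bogofilter", "-c", config_path, flag]
```

The message itself would be fed on stdin, as usual.  Anything other than 
"keep" or "discard" is rejected rather than guessed at, so a typo can't 
quietly train the database the wrong way.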

I've experimented with dbacl, which supports comparing a message to 
multiple categories (eg, email that might be work, personal, or 
hobby-related), in addition to not being tuned for spam vs ham.  It 
looks good, but my existing setup satisfies my current desires, so I 
just haven't taken the time to play with it very much.

I've attached my config.archive.  I hope this helps.

Ed
-------------- next part --------------
#
db_cachesize=2
robs=0.0100
min_dev=0.314
robx=0.409081
sp_esf=0.013363
ns_esf=0.003171
spam_cutoff=0.501122    # for 0.20% fp (1); expect 1.00% fn (5).
ham_cutoff=0.3

#### WORDLIST: define additional word lists
#
#	char type: 'r' (regular) or 'i' (ignore)
#	char *name: name of list, e.g. "system", "user", "ignore"
#	char *path: absolute path to file or
#	            file name (relative to bogofilter_dir)
#	int  order - once found, skip higher numbered lists
#
wordlist r,archivewords,archivewords.db,1

#### SPAM_HEADER_NAME
#
#	used in reporting spamicity and
#	in removing already existing headers
#
spam_header_name=X-Archive

#### Format of spamicity output
#
# for two-state output the third entry is not needed and not used
#
spamicity_tags = No, Yes, Unsure
spamicity_formats = %0.6f, %0.6f, %0.6f

#### Format of SPAM_HEADER
#
#	formatting characters:
#
#	    h - spam_header_name, e.g. "X-Bogosity"
#
#	    c - classification, e.g. Yes/No, Spam/Ham/Unsure, +/-/?
#
#	    D - date, fixed ISO-8601 format for Universal Time ("GMT")
#
#	    e - spamicity as 'e' format
#	    f - spamicity as 'f' format
#	    g - spamicity as 'g' format
#
#	    A - IP address (from first Received: statement having one)
#		Not guaranteed to be the originating address of the message.
#	    I - Message ID
#	    Q - Queue ID (from first id tag found in Received: headers)
#
#	    l - logging tag (from '-l' option)
#
#	    o - spam_cutoff, ex. cutoff=%o
#
#	    p - spamicity value
#	    d - if ham or unsure, the spamicity
#		if spam, difference of spamicity from 1.0
#
#	    r - runtype
#	        w - word count
#	        m - message count
#
#	    u - username - this will either be the login from getlogin(),
#			   if that is empty, the pw_name obtained from
#			   the password database, or the user id
#			   prefixed by #, for instance, #1003
#
#	    v - version
#
#    customizable messages:
#
#	header_format - the "X-Bogosity" line that '-p' adds to
#		the message header and '-v' outputs.
#	terse_format - an abbreviated form of header_format;
#		selected by command line option '-t'
#	log_header_format - written to syslog by '-u' option
#		when classifying messages.
#	log_update_format - written to syslog by '-u' option
#		when registering messages.
#
#
header_format = %h: %c, tests=bogofilter, archivicity=%p, version=%v
#terse_format = %1.1c %f
log_header_format = %h: %c, archivicity=%p, version=%v
#log_update_format = register-%r, %w words, %m messages
##log_header_format = %h: %c, spamicity=%f, ipaddr=%A, queueID=%Q, msgID=%I, version=%v
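
With the header_format above, each scored message should carry a line 
of the form "X-Archive: <label>, tests=bogofilter, archivicity=<score>, 
version=<version>".  A short Python sketch for pulling that apart 
downstream; the function name and the sample values in the usage note 
are invented for illustration:

```python
import re

# Mirrors the header_format above:
#   %h: %c, tests=bogofilter, archivicity=%p, version=%v
# with spam_header_name=X-Archive and the three spamicity_tags.
HEADER_RE = re.compile(
    r"^X-Archive: (?P<label>Yes|No|Unsure), tests=bogofilter, "
    r"archivicity=(?P<score>[0-9.]+), version=(?P<version>\S+)$"
)


def parse_archive_header(line):
    """Return (label, score, version), or None if the line doesn't match."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        return None
    return match.group("label"), float(match.group("score")), match.group("version")
```

For a made-up header such as 
"X-Archive: No, tests=bogofilter, archivicity=0.987654, version=1.2.0" 
this gives the label "No", the score as a float, and the version string.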