0.93.6 vs 0.94.0

Matthias Andree matthias.andree at gmx.de
Fri Feb 25 20:50:37 EST 2005


".rp" <printer at moveupdate.com> writes:

> Well as one of those still using 0.1x , I would say that it would be very 
> usefull if the ReleaseNotes where made available separate from the whole 
> dang package AND if it was better written for those who are not super 
> technical manual readers. (I read the notes for .91.x and was so confused 
> and bewildred by the language and layout that it simply froze me out).

May I ask you to review the attached RELEASE.NOTES file, taken from a
late 0.93.X version, and let me know (on the list) what areas are
unclear, too technical, too little instruction, too terse, too verbose,
and so on?

Thanks in advance,

-- 
Matthias Andree
-------------- next part --------------
INCOMPATIBLE CHANGES IN BOGOFILTER 0.93
=======================================

Summary for the hasty
---------------------

YOU MUST ADJUST YOUR SCRIPTS IF YOU ARE EVALUATING "X-Bogosity" HEADERS!

YOU MAY NEED TO ADJUST YOUR SCRIPTS IF YOU ARE PARSING 'bogofilter -V' OUTPUT!

WHEN USING BERKELEY DB (DEFAULT), NFS NO LONGER WORKS AND
YOU   M U S T   READ doc/README.db AND POSSIBLY CONFIGURE THE DATABASE!


Defaults changed
----------------

Bogofilter's defaults have been changed.  It now operates in tri-state
mode and will classify messages as Spam, Ham, or Unsure.

If you're checking messages for "X-Bogosity: Yes" or "X-Bogosity: No",
you _need_ to change your checks.  Use "X-Bogosity: Spam" and
"X-Bogosity: Ham" instead of the old forms.  Also, checking for
"X-Bogosity: Unsure" and putting those messages in a separate folder (or
mailbox) will give you an excellent set of messages for training, as
"Unsure" messages are messages that bogofilter has too little
information to classify (with certainty) as spam or ham.


Berkeley DB switched to Transactional Data Store
------------------------------------------------

Bogofilter will now use the Berkeley DB Transactional Data Store when
compiled with Berkeley DB as the data base engine (the default).

This means the Berkeley DB directory can no longer reside on a networked
or otherwise shared file system (such as NFS, AFS, Coda).

When using BerkeleyDB 4.1 - 4.3, it is recommended that you dump and
load the data bases to add checksums, for enhanced reliablity. See
section 2.2 in doc/README.db for details.

This means that bogofilter programs now exhibit the A C I D traits:
changes are atomic (all-or-nothing); the data base is always consistent;
changes are always isolated from each other; and all changes that are
acknowledged are durable.

Bogofilter can support multiple writers at the same time, mixed freely
with simultaneous readers, and the data base will not be corrupted by
application or system crashes, except when the disk drive gets damaged.

Note that this requires that the operating system and disk drive
maintain proper write order on the disk, and that both be honest about
synchronous I/O completion.

Note also that this causes bogofilter to write additional "log" files
to its ~/.bogofilter (or other) home directory.  The log files need to
be archived or deleted periodically.

For detailed instructions, be sure to _read_ doc/README.db and check the
BerkeleyDB documentation.

As a backwards compatibility option, for instance when space and I/O
bandwidth are tight, it is possible to use the old non-transactional,
non-concurrent Berkeley DB Data Store, which can only register messages
when there are NO scoring processes at all and that may not be able
recover from application or system crashes.

These benefits are not available when bogofilter is compiled to use the
TDB or QDBM data bases.


Berkeley DB version strings changed
-----------------------------------

Bogofilter will now return the BerkeleyDB's actual DB_VERSION_STRING
in the output of 'bogofilter -V'. The OLD format was:

    Database: BerkeleyDB (4.3.21)

The NEW format is:

    Database: Sleepycat Software: Berkeley DB 4.3.21: (November  8, 2004)

You may need to adjust your scripts.


QDBM database format changed to B+ trees
----------------------------------------

The QDBM database format has been changed from hash tables to B+
trees, i.e. from the Depot API to the Villa API.  This results in
significantly better performance, i.e. faster speed.  Unfortunately,
the two modes are incompatible, so upgrading to 0.93 requires running
a special command to convert the database once:

bogoQDBMupgrade wordlist.qdbm wordlist.tmp wordlist.qdbm.old

If this command didn't print anything, everything has gone well and it
has left your old data base in wordlist.qdbm.old.

NOTE: bogoQDBMupgrade needs qdbm-1.7.23 or newer to build.

Bogotune option parsing changes
-------------------------------

In bogotune 0.93.2 and newer, you must repeat the -n or -s option as
prefix for the mailbox.

Example: bogotune -n good1 good2 -s bad1 bad2 ...
will be: bogotune -n good1 -n good2 -s bad1 -s bad2


MAJOR CHANGES IN BOGOFILTER 0.93
================================

Bogofilter 0.93.3 and newer add support for SQLite 3.0.8 (older versions
do not work). It isn't nearly as fast as Berkeley DB but uses only one
permanent and one transient file (hence less maintenance work) and is
supposed to be proof against application and system crashes.

Bogofilter 0.93.1 and newer add support for Berkeley DB 4.3.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.92
=======================================

NOTE: the formatting parameters have changed,
      '%A' is now the message's IP address.
      '%I' is now the Message-ID.
      '%Q' is now the Queue-ID.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.17
=======================================

Support for --enable-deprecated-code (see the 0.16 release notes)
has been removed. If you've run 0.16.X without that switch, nothing
changes for you.

Support for Berkeley DB 3.0 was removed in bogofilter 0.17.3
as a side effect of adding Concurrent Database support.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.16
=======================================

With the 0.16.0 release, a number of features have been deprecated.  The
relevant code is bracketed by "#ifdef ENABLE_DEPRECATED_CODE" and
"#endif" statements.  The default build will not include the deprecated
features.  For those who still need these features, configure option
"--enable-deprecated-code" exists to allow them to be turned on.

THIS MAY REQUIRE MAJOR CHANGES TO YOUR CONFIGURATION OR SCRIPTS!

The following list is supposed to be complete.  Let us know if we've
omitted anything. We shall try to provide workarounds and migration
paths whenever possible.


1) Scoring algorithms
---------------------

Bogofilter will support only the Robinson-Fisher algorithm, commonly
called the "Fisher algorithm".  The Graham algorithm and Robinson
geometric-mean algorithm, a.k.a. Robinson algorithm, have been
deprecated.


2) Wordlist support
-------------------

Bogofilter will now support only the combined wordlist, i.e.
wordlist.db, which contains both the ham and spam counts for each token.
The older, separate wordlists (spamlist.db and goodlist.db) are no
longer supported.

The bogoupgrade program can still be used to merge the separate
databases for you.  Type "bogoupgrade -d /you/wordlist/directory/".

Ignore lists, i.e. ignorelist.db, are also being deprecated.  The ignore
list feature has never been thoroughly tested and is not used (as far as
we know).


3) Command line switches
------------------------

Bogofilter will no longer support the switches listed in this section.
If used, bogofilter will print an error message and exit.

  Scoring related switches:

    -g - select Graham algorithm
    -r - select Robinson Geometric-Mean algorithm
    -f - select Robinson-Fisher algorithm

    see section 1 above

    -2 - set binary classification mode
    -3 - set ternary classification mode

    Bogofilter will use binary mode if ham_cutoff is zero and will use
    ternary mode (Yes, No, Unsure) if ham_cutoff in non-zero and less
    than spam_cutoff.

  Wordlist modes:

    -W   - use combined wordlist  for spam and ham tokens
    -WW  - use separate wordlists for spam and ham tokens

    Bogofilter will always operate in combined mode now.

  Backwards compatible token generation switches:

    -Pi and -PI - ignore_case
    -Pt and -PT - tokenize_html_tags
    -Pc and -PC - strict_check
    -Pd and -PD - degen_enabled
    -Pf and -PF - first_match

    Note: Since last May, the default values for these switches
    have been:

    ignore_case         disabled
    tokenize_html_tags  enabled
    strict_check        disabled
    degen_enabled       disabled
    first_match         disabled

    There will be no change in the default values.


4) Configuration options
------------------------

The following configuration options (for the above switches) are
deprecated:

    algorithm

    wordlist
    wordlist_mode

    ignore_case
    tokenize_html_tags
    tokenize_html_script
    header_degen
    degen_enabled
    first_match

The following configuration options (which don't correspond to
switches) are deprecated:

    thresh_stats
    thresh_rtable

Note:  Bogofilter will print a warning message if it sees any of
these options, but will run fine anyhow.


5) Miscellany
-------------

The user formatted SPAM_HEADER will no longer support format
specification "%a" (for algorithm) since bogofilter now has only one
algorithm.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.15
=======================================

Since release 0.15.9, bogofilter no longer allows to disable algorithms,
which has never been supported well.

Since release 0.15.4, all header line tokens are now tagged as:

	Subject:      subj:
	To:           to:
	From:         from:
	Return-Path:  rtrn:
	Received:     rcvd:   ***new***
	any other:    head:   ***new***

Because existing wordlists don't have "head:???" tokens, the new tokens
won't be found in the wordlist and bogofilter's accuracy will go down.

To correct this you can do one of the following things:

1 - Use the new "-H" (for header-degen) option when scoring messages.
This option tells bogofilter to check the wordlist twice for each header
token - once for "head:xyz" and a second time for "xyz".  The ham and
spam counts are added together to give a cumulative result.

Note that, with bogofilter 0.15.4 and later, during message
registration, "head:xyz" tokens are added to the wordlist (for the
header lines).  The "-H" option is only applied during scoring.

The "-H" option is meant for temporary usage to cover the period while
bogofilter goes from having no "head:xyz" tokens in the wordlist to the
time when there are enough such tokens to score messages effectively.
After a few weeks, or perhaps months, of registering messages with the
new bogofilter, use of the "-H" option can end and bogofilter will use
the newly added "head:xyz" tokens.

2 - Retrain bogofilter with whatever ham and spam you have available.
This will create "header:xyz" tokens and allow the new, more effective
header tagging to be used to fullest advantage.


MAJOR CHANGES IN BOGOFILTER 0.15
================================

With release 0.15, bogofilter's code for processing multiple messages
has been rewritten.  In addition to understanding mbox format files,
bogofilter now understands maildirs and MH folders.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.14
=======================================

The exit codes returned by bogofilter have been expanded.  They are:

	Spam   = 0 -- unchanged
	Ham    = 1 -- unchanged
	Unsure = 2 -- *NEW*
	Error  = 3 -- *CHANGED*


MAJOR CHANGES IN BOGOFILTER 0.14
================================

Bogofilter 0.14 now supports TDB (Trivial Data base).

Instead of separate wordlists for spam and ham tokens, bogofilter can
now use a single combined, wordlist that stores both all tokens.
In the combined wordlist each token contains two counts - for spam and
ham.  The name of the new file is wordlist.db.

However, this change broke the early versions (up to and including
0.14.2) of bogofilter. You should use at least bogofilter 0.14.3.

Bogofilter will check in $BOGOFILTER_DIR and use the wordlist(s) that
are there.  If wordlist.db is present, bogofilter will use the combined
mode.  If wordlist.db is not present, but both spamlist.db and
goodlist.db are present, bogofilter will use the separate wordlist mode.
If no wordlists are present, bogofilter will create wordlist.db and use
it.

Command line switches '-W' and '-WW' can be used to tell bogofilter the
mode you want.  Also config file options "wordlist_mode=combined" and
"wordlist_mode=separate" can be used.

Upgrading from an old bogofilter environment with its two wordlists
(spamlist.db and goodlist.db) to the new 0.14.x environment with its
single, combined wordlist.db involves 3 main steps - dumping the current
spamlist.db and goodlist.db files, formatting that output, and then
loading the data into a new file wordlist.db.  The script "bogoupgrade" is
included with bogofilter and performs the task.  Use command
"bogoupgrade -d /path/to/your/wordlists" to do the upgrade.  After
running it, your BOGOFILTER_DIR will contain all 3 database files.  When
started, bogofilter checks for wordlist.db and will use it.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.13
=======================================

With release 0.13, bogofilter's parsing has changed.  As background,
Paul Graham has done work to improve the results of his bayesian filter
and has published them in "Better Bayesian Filtering" at
http://www.paulgraham.com/better.html.  He found the following
definition of a token to be beneficial:

       1. Case is preserved.

       2. Exclamation points are constituent characters.

       3. Periods and commas are constituents if they occur between two
	  digits. This lets me get ip addresses and prices intact.

       4. A price range like $20-25 yields two tokens, $20 and $25.

       5. Tokens that occur within the To, From, Subject, and Return-Path
	  lines, or within urls, get marked accordingly.

Bogofilter has always done #3 and has tagged for Subject lines for a
while.  Its parser now does all of these things.  Several command line
switches and config file options have been added to allow enabling or
disabling them.  Here are the new switches and options:

       -Pi/-PI	ignore_case		default - disabled
       -Ph/-PH	header_line_markup 	default - enabled
       -Pt/-PT	tokenize_html_tags 	default - enabled

The options can be enabled using the lower case switch or disabled using
the upper case switch.

When header_line_markup_is enabled, tokens in To:, From:, Subject:, and
Return-Path: lines are prefixed by "to:", "from:", "subj:", and "rtrn:"
respectively.

When tokenize_html_tags_is enabled, tokens in A, IMG, and FONT tags are
scored while classifying the message.

NOTE: To take full advantage of these changes, additional training of
bogofilter is necessary.  Here's why:

With bogofilter's use of upper and lower case, the wordlists won't match
as many words as before.  For example, "From" and "from" both used to
match "from", but this is no longer the case.  As additional training is
done, words like these will be added to the wordlists and bogofilter
will have a larger number of distinct tokens to use when classifying
messages.  This will improve its classification accuracy.

Similarly, the use of header_line_markup will tokenize "Subject: great
p0rn site" as "subj:great", "subj:p0rn", and "subj:site".  At first
these tokens won't be recognized, so bogofilter won't use them to score
the message.  After being trained, bogofilter will have these additional
tokens to aid in the classification process.

########################################################################

MAJOR CHANGES IN BOGOFILTER 0.12
================================

Directory bogofilter/tuning has been added and contains scripts for
running tuning experiments as described in the new HOWTO. See file
bogofilter/tuning/README for more information.

Bogofilter's man page and help message describe the many command line
switches.  They have been divided into groups (help, classification,
registration, general, algorithm, parameter, and info) in both places.

Bogofilter 0.12.0 has three new command line switches for rapidly
scoring large numbers of messages.  These "bulk mode" switches are
especially useful for the tuning process.  The new switches are:

    -M - allows scoring all the messages in a mbox formatted file.  If
    used with "-v", an X-Bogosity line is printed as each message is
    scored.  Using the "-t" (terse) option is recommended to reduce the
    amount of output.

    -B - allows scoring of multiple message files, with each file
    containing a single message.  With this option, bogofilter expects the
    file names to be at the end of the command line.  If used with "-v",
    the file name is included in each printed line.  Using "-t" is
    recommended.

    -b - allows scoring of multiple message files, with each file
    containing a single message.  With this option, bogofilter reads the
    file names from stdin.  This option can be used with maildirs, as in
    "ls Maildir/* | bogofilter -b ..."  If used with "-v", the file name
    is included in each printed line.  Using "-t" is recommended.

New script bogolex.sh converts an email to a special file format that
contains the information needed by bogofilter to score the email.  Its
use speeds up the message scoring done by the tuning scripts.  The
script is described in more detail in bogofilter/tuning/README.

########################################################################

INCOMPATIBLE CHANGES IN BOGOFILTER 0.11
=======================================

Command line flags
------------------

The meaning of command line flags '-S' and '-N' was changed in version
0.11.0.  Previously '-S' meant to unregister a message from the spam
wordlist and register the message in the non-spam wordlist and '-N'
meant to unregister from non-spam and register as spam.

Each of the flags now performs a single action.

	'-S' unregisters a message from the spam wordlist and
	'-N' unregisters a message from the non-spam wordlist.

To duplicate the old (compound) actions, it is necessary to use two
options - an unregister option ('-S' or '-N') and a register option
('-s' or '-n').

To duplicate the effect of the old '-S' option, use '-N -s'.  To
duplicate the effect of the old '-N' option, use '-S -n'.  The order of
the options doesn't matter and they can be concatenated, as in '-Sn' and
'-sN'.


Config file processing
----------------------

The code to process config files now checks numeric values for validity.
It complains when it detects something wrong.  In particular, double
precision values are no longer allowed to have a terminal 'f'.  For
example "spam_cutoff=0.95f" will generate a messages.


MAJOR CHANGES IN BOGOFILTER 0.11
================================

New parameter query option
--------------------------

Using options "-q -v" in a bogofilter command line will run the
query_config() function and will display bogofilter's various parameter
values.  This can be very useful in finding the reason for an unexpected
message classification.

########################################################################
-------------- next part --------------
_______________________________________________
Bogofilter-dev mailing list
Bogofilter-dev at bogofilter.org
http://www.bogofilter.org/mailman/listinfo/bogofilter-dev


More information about the Bogofilter-dev mailing list