README.ext3

Greg Louis glouis at dynamicro.on.ca
Fri Jan 31 22:26:33 CET 2003


I thought there was a big problem with bogofilter-0.10 because of its
database bloat, which on my system translated to terribly slow database
access.  Turns out that's sort of true, but it's not nearly so bad if
you don't use a journalling filesystem.  I did a moderately rigorous
comparison, and as a result, here's a draft for yet another README.*
file that might be useful:

-------8<--------------
From the performance standpoint, it seems to be a bad idea to keep
your bogofilter database files in an ext3 filesystem mounted in its
default mode.  Use ext2, or at least mount the filesystem with the
data=writeback option.  If a crash occurs in ext3's writeback mode,
old data may appear in the database files, and it may be necessary to
restore from a backup, just as might be the case without journalling;
only the internal filesystem integrity is guaranteed.  But each of the
other two modes carries a heavy performance penalty: in the tests
below, building the database took roughly 7.6 times as long as on ext2
in data=ordered mode, and roughly 3.8 times as long in data=journal
mode.
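
As a rough sketch of the switch to writeback mode, the helper below
only prints the commands rather than running them, so it is safe to
try as an ordinary user.  The function name writeback_cmds is made up
for this illustration, and the device and mount point are the ones
from the tests below; substitute your own.

```shell
# Print (do not execute) the commands that would remount a filesystem
# as ext3 in data=writeback mode.  writeback_cmds is a hypothetical
# helper; run the printed commands as root once they look right.
writeback_cmds() {
    dev=$1    # block device holding the bogofilter database
    mnt=$2    # mount point of that filesystem
    printf 'umount %s\n' "$mnt"
    printf 'mount -t ext3 -o data=writeback %s %s\n' "$dev" "$mnt"
}

writeback_cmds /dev/scd1 /xtrn
```

To make the setting permanent, the same option can go in the options
field of the filesystem's /etc/fstab entry, for instance
"defaults,data=writeback"; the rest of that line depends on your
system.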

Here's a comparison, performed on a machine with a 400MHz PII, 128 MB
of RAM, and an elderly SCSI hard disk that I've been using for scratch.
The db file was created from a 200-MB spam corpus that resides on
another drive in an ext3 partition.  Database reads aren't impacted
quite as badly as writes; classifying a message with the database files
on ext3 in ordered mode takes me about four times as long as it does
when the .db files are on ext2.

1.  With ext2

# umount /xtrn
# mount -t ext2 /dev/scd1 /xtrn
# rm -f /xtrn/db/*
# time /lighter/usr/bin/bogo10 -d /xtrn/db -v -s <spam_corpus 
# 5868782 words, 14502 messages

real    3m41.225s	user    2m33.020s	sys     0m36.660s


2.  With ext3 in the normal data=ordered mode

# umount /xtrn
# mount -t ext3 /dev/scd1 /xtrn
# rm -f /xtrn/db/*
# time /lighter/usr/bin/bogo10 -d /xtrn/db -v -s <spam_corpus 
# 5868782 words, 14502 messages

real    27m59.297s	user    2m33.750s	sys     0m46.310s


3.  With ext3 in the data=journal mode (all data are committed to the
journal prior to being written into the main file system)

# umount /xtrn
# mount -t ext3 -o data=journal /dev/scd1 /xtrn
# rm -f /xtrn/db/*
# time /lighter/usr/bin/bogo10 -d /xtrn/db -v -s <spam_corpus 
# 5868782 words, 14502 messages

real    14m11.143s	user    2m34.430s	sys     0m45.170s


4.  With ext3 in the data=writeback mode (data are written back
"lazily" to the main file system, perhaps after the metadata have been
committed to the journal).

# umount /xtrn
# mount -t ext3 -o data=writeback /dev/scd1 /xtrn
# rm -f /xtrn/db/*
# time /lighter/usr/bin/bogo10 -d /xtrn/db -v -s <spam_corpus
# 5868782 words, 14502 messages

real    4m2.027s	user    2m34.460s	sys     0m43.640s
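
To put the four "real" times side by side, here is a small sketch that
converts them to seconds and divides each by the ext2 figure.  The
function names to_seconds and slowdown are invented for this example;
the timings are the ones reported above.

```shell
# Convert a time(1)-style value such as "3m41.225s" to seconds.
to_seconds() {
    echo "$1" | LC_ALL=C awk -F'[ms]' '{ printf "%.3f\n", $1 * 60 + $2 }'
}

# Print how many times longer the first timing is than the second.
slowdown() {
    LC_ALL=C awk -v a="$(to_seconds "$1")" -v b="$(to_seconds "$2")" \
        'BEGIN { printf "%.1f\n", a / b }'
}

ext2=3m41.225s
echo "ordered:   $(slowdown 27m59.297s "$ext2") times ext2"
echo "journal:   $(slowdown 14m11.143s "$ext2") times ext2"
echo "writeback: $(slowdown 4m2.027s "$ext2") times ext2"
```

This gives roughly 7.6 for data=ordered, 3.8 for data=journal, and 1.1
for data=writeback, which is why writeback is the only ext3 mode that
comes close to ext2 here.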

------------8<--------------------
-- 
| G r e g  L o u i s          | gpg public key:      |
|   http://www.bgl.nu/~glouis |   finger greg at bgl.nu |
| Help free our mailboxes. Include                   |
|        http://wecanstopspam.org in your signature. |



