Randomly split a mailbox in two (random_split_mbox.py)

Arcady Genkin agenkin-lst-bogofilter at thpoon.com
Fri Nov 5 22:02:52 CET 2004


Greetings:

While training bogofilter, I needed a way to split two huge
mailboxes, one of spam and one of ham, into two heaps each: one for
the full initial training and the other for training on error.  I
wanted to split randomly, specifying what proportion of the messages
ends up in each heap.  Since I didn't find a better way of doing it,
I wrote a rather trivial Python script, random_split_mbox.py, which
can be run like this:

  random_split_mbox.py ~/mail/spam /tmp/spam1 /tmp/spam2 0.7

This randomly picks out (roughly) 70% of all messages from ~/mail/spam
and puts them into /tmp/spam1, putting the rest into /tmp/spam2.
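
The behaviour described above can be sketched in a few lines using
the mailbox module from the Python standard library.  This is only an
illustration of the idea and not the contents of random_split_mbox.py
itself: the function name, its parameters, and the optional rng
argument are assumptions made for the example.

```python
import mailbox
import random


def random_split_mbox(src_path, first_path, second_path, fraction, rng=None):
    """Copy each message of the mbox at src_path into one of two mboxes.

    Hypothetical helper, not the original script.  Each message goes to
    first_path with probability `fraction`, otherwise to second_path.
    Returns the pair (messages in first, messages in second).
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    # create=False: fail loudly if the source mailbox does not exist.
    src = mailbox.mbox(src_path, create=False)
    first = mailbox.mbox(first_path)
    second = mailbox.mbox(second_path)
    n_first = n_second = 0
    try:
        for message in src:
            # Independent draw per message, so the split is only
            # approximately `fraction`, never guaranteed to be exact.
            if rng.random() < fraction:
                first.add(message)
                n_first += 1
            else:
                second.add(message)
                n_second += 1
    finally:
        # close() flushes pending writes to disk.
        first.close()
        second.close()
        src.close()
    return n_first, n_second
```

With a sketch like this, calling random_split_mbox(src, out1, out2, 0.7)
would mirror the command line above.  Because every message is drawn
independently, the first heap gets about 70% of the messages on
average rather than exactly 70%, which matches the "(roughly)" above.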

I think that this script can be useful for others.  It can be
downloaded from here:

  http://www.cdf.toronto.edu/~agenkin/downloads/random_split_mbox.py

Hope this helps,
-- 
Arcady Genkin : CDF Systems Administrator
http://www.cdf.toronto.edu/~agenkin/contact.html


