token age format

David Relson relson at osagesoftware.com
Mon Dec 16 03:35:12 CET 2002


At 09:14 PM 12/15/02, Matthias Andree wrote:

>David Relson <relson at osagesoftware.com> writes:
>
> > I've been thinking about dates and formats.  There are at least three
> > different things going on.
> >
> > 1 - internal format - how dates are stored in wordlists
> > 2 - external format - how dates are dumped/loaded by bogoutil
> > 3 - ages - how the user says "discard tokens older than X"
> >
> > While time_t may be good for #1, it's not good for #2 or #3.
> >
> > For #2, something human readable like yyyymmdd is more useful than the
> > time_t equivalent.  For today, the two values would be 20021215 and
> > 1040001069.
> >
> > For #3, using "days" as the unit of measurement is good.  "Discard
> > tokens older than 100 days" is easier than its time_t equivalent
> > "discard tokens older than 8,640,000 seconds".
> >
> > "A Plan for Spam" was published in August and /.'ed on August 16.  ESR
> > started bogofilter around then (0.2 is dated Aug 22 and its oldest file
> > is dated Aug 18 05:51).  Bogofilter could assign an August 2002 date to
> > tokens without dates and wouldn't be too far off.
>
>#2 and #3 are user interface issues. The data base has already suffered
>from endianness problems; let's not make it suffer again, this time from
>choosing the wrong internal representation. Also, let's not make our
>lives harder than need be: use existing tools. And please let's not
>store time_t directly, but convert it to a string, in hex (slightly
>faster) or decimal representation. We must also be prepared for systems
>switching to a 64-bit time_t; we're going to lose big time when that
>happens and we read 32 bits into the wrong half of the time_t...

I was hoping to have less impact on the database rather than more.  My
initial idea was to use a single byte for age, knowing that it would roll
over in about 8.5 months (256 days).  I'm not particularly happy with
using a 32-bit long but went ahead with it for simplicity.  A string would
take even more space, which is unsatisfactory.  Using a long to represent
YYYYMMDD in the database makes import and export easy and won't roll over
for thousands of years.  The one place where the format is hard to use is
the aging code.  In the present implementation, discarding tokens based on
age is part of bogoutil's dump/load capabilities.  Stated differently, it
is part of an offline operation.
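
To illustrate why YYYYMMDD is awkward in the aging code: subtracting two
such values does not give a day count, so each date has to be converted
first.  A minimal sketch in C, assuming the date is packed into a long as
described above.  The function names (ymd_to_days, token_age_days,
token_is_expired) are hypothetical and not part of bogofilter, and the
conversion is the common civil-calendar day-count formula, not anything
bogoutil does today:

```c
#include <assert.h>

/* Hypothetical helper: days since 1970-01-01 for a Gregorian date
 * packed as YYYYMMDD (e.g. 20021215).  No time zone is involved. */
static long ymd_to_days(long yyyymmdd)
{
    long y = yyyymmdd / 10000;
    long m = (yyyymmdd / 100) % 100;
    long d = yyyymmdd % 100;

    assert(m >= 1 && m <= 12 && d >= 1 && d <= 31);

    /* Count January and February as part of the previous year, so the
     * shifted year starts in March and the leap day comes last. */
    if (m <= 2)
        y -= 1;

    long era = (y >= 0 ? y : y - 399) / 400;                   /* 400-year cycle */
    long yoe = y - era * 400;                                  /* [0, 399] */
    long doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1; /* [0, 365] */
    long doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;          /* [0, 146096] */

    return era * 146097 + doe - 719468;
}

/* Hypothetical helper: age of a token in whole days. */
static long token_age_days(long today, long token_date)
{
    return ymd_to_days(today) - ymd_to_days(token_date);
}

/* Hypothetical helper: nonzero if the token is older than max_age_days. */
static int token_is_expired(long today, long token_date, long max_age_days)
{
    return token_age_days(today, token_date) > max_age_days;
}
```

With these, token_age_days(20021215, 20020816) gives 121, so a token dated
August 16 would be discarded under a 100-day limit.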

>Printing the date to the user is a matter of strftime or
>something. Reading an age from the command line can well happen as a
>count of days; you just turn that into a reference time_t by doing
>"time(NULL) - 86400 * age_in_days" and compare the token age against this.
>
>There is no need to put a human-readable format into the data base that
>requires us to write our own set of tools when time_t tools are available
>in every libc.

I apologize for the lack of clarity in my message. #1 is the internal
wordlist format.  #2 is the external human-readable format output by
"bogoutil -d" and read by "bogoutil -l".
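
For comparison, a rough sketch of the time_t route Matthias describes,
with the yyyymmdd form produced only at dump time.  The helper names
(age_cutoff, format_yyyymmdd) are made up for illustration and this is not
bogoutil code.  Using gmtime() so the result does not depend on the local
time zone is an assumption of the sketch:

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

#define SECONDS_PER_DAY 86400L

/* Hypothetical helper: reference time for "discard tokens older than
 * age_in_days", relative to `now`.  Tokens stamped before the returned
 * value would be discarded. */
static time_t age_cutoff(time_t now, long age_in_days)
{
    return now - (time_t)(SECONDS_PER_DAY * age_in_days);
}

/* Hypothetical helper: render a time_t as "yyyymmdd" in UTC.
 * Returns the number of characters written (8), or 0 on failure. */
static size_t format_yyyymmdd(time_t t, char *buf, size_t buflen)
{
    assert(buf != NULL);

    struct tm *utc = gmtime(&t);
    if (utc == NULL)
        return 0;

    /* strftime returns 0 if the result does not fit in buflen bytes. */
    return strftime(buf, buflen, "%Y%m%d", utc);
}
```

age_cutoff(time(NULL), 100) yields the reference time for "discard tokens
older than 100 days".  Note that 1040001069 formats as 20021216 in UTC;
the 20021215 quoted earlier presumably reflects local time.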

David




