[hypermail] hashed filenames patch committed to CVS

From: Jose Kahan <jose.kahan_at_w3.org_at_hypermail-project.org>
Date: Wed, 25 Sep 2002 17:51:21 +0200
Message-ID: <20020925155121.GB6891_at_inrialpes.fr>


Hello,

I just committed a patch that adds a new option to hypermail. It replaces the sequential numbering used for filenames with a name derived from a hash of the mail properties. This lets you decouple the filenames from the archiving order, which is quite useful if you are rebuilding an archive and the source files changed (you deleted something, or something else arrived and wasn't archived).

I'm using a 32-bit FNV-1 hash function [1] and giving it the msgid and the From date as input. This will hopefully give each message a unique hash name. If the 2^32 hash space isn't wide enough, we can always move to 64 bits... time will tell.
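To make the idea concrete, here is a minimal sketch of a 32-bit FNV-1 hash applied to a msgid and a date and formatted as an 8-hex-digit name. It is not the code from src/fnv: the names fnv1_32() and hashed_name() are hypothetical, and hashing the msgid first and then the date is an assumption, since the exact way the two inputs are combined is not described here.

```c
#include <stdint.h>
#include <stdio.h>

/* Standard 32-bit FNV parameters. */
#define FNV1_32_OFFSET_BASIS 2166136261u
#define FNV1_32_PRIME        16777619u

/* FNV-1: multiply by the prime first, then XOR in the next byte.
 * Pass a previous return value as `hash` to chain several inputs. */
static uint32_t fnv1_32(const char *s, uint32_t hash)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p) {
        hash *= FNV1_32_PRIME;   /* wraps modulo 2^32 */
        hash ^= *p++;
    }
    return hash;
}

/* Hypothetical helper: hash the msgid, then the date (assumed order),
 * and write the result as 8 lowercase hex digits plus a NUL into out. */
static void hashed_name(const char *msgid, const char *date, char out[9])
{
    uint32_t h = fnv1_32(msgid, FNV1_32_OFFSET_BASIS);

    h = fnv1_32(date, h);
    snprintf(out, 9, "%08x", (unsigned)h);
}
```

Calling hashed_name() with a message's msgid and From date fills a 9-byte buffer with a name of the same shape as the filenames quoted in this message; the actual values will only match hypermail's if the inputs are combined the same way.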

In practice, filenames now have 4 more chars. Here is an excerpt from one of my archive directories:

04affc9c.html 4123c74a.html 714eba2c.html att-ea1bb52c/

This is not much harder to quote. I had previously considered using a SHA or an MD5 hash function, but the resulting filename was too long for quoting.

All the code is available in CVS. To use it, you'll have to rerun configure as follows:

	cd hypermail
	autoconf (so that you can get the new options)
	./configure --enable-libfnv
	make

This will build you a hypermail with the correct options and link it to the fnv library (which is in src/fnv).

The name of the new option is nonsequential, with a command line shortcut of -N. Turn it on and build a test archive to see the differences.

In more detail, the change was quite straightforward. I changed all the places where a msgnum value was used to create links or make filenames so that they go through a function. Depending on your hypermail options, this function (file.c:message_name()) returns either the msgnum formatted in %.04d or the message hash name.
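As an illustration of that dispatch, here is a minimal sketch. It is not the real file.c:message_name(): the struct, the function name message_name_sketch() and its parameters are hypothetical, and the real function works on hypermail's own message structures.

```c
#include <stdio.h>

/* Hypothetical, simplified stand-in for hypermail's per-message record. */
struct msg_entry {
    int msgnum;            /* sequential archive number */
    const char *hashname;  /* e.g. "04affc9c", or NULL if none was computed */
};

/* Write the base filename (without ".html") into buf and return buf.
 * `nonsequential` mirrors the new option: non-zero selects the hash name. */
static const char *message_name_sketch(const struct msg_entry *m,
                                       int nonsequential,
                                       char *buf, size_t buflen)
{
    if (nonsequential && m->hashname != NULL)
        snprintf(buf, buflen, "%s", m->hashname);
    else
        snprintf(buf, buflen, "%.4d", m->msgnum);  /* zero-padded, at least 4 digits */
    return buf;
}
```

Routing every caller through one function like this means the choice between the two naming schemes is made in a single place, which is what keeps the change small.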

Another file, called for the moment "messageindex", keeps track of the messages that we have in the archive. That's necessary as there are no heuristics to find the current files in the archive. I use the info in "messageindex" to build an internal table that relates a msgno to each of the hashed messages. That was all that was needed. There's no option yet to configure the "messageindex" name. This can easily be added if needed.
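A rough sketch of such an in-memory table follows. The on-disk format of "messageindex" is not described in this message, so the one-entry-per-line "msgnum name" layout parsed here is purely an assumption, and all the type and function names are hypothetical.

```c
#include <stdio.h>
#include <stdlib.h>

#define HASHNAME_LEN 8  /* 32-bit hash written as 8 hex digits */

struct index_entry {
    int msgnum;
    char name[HASHNAME_LEN + 1];
};

struct msg_index {
    struct index_entry *entries;
    size_t count;
    size_t capacity;
};

/* Parse one line in the ASSUMED "msgnum name" format and append it.
 * Returns 0 on success, -1 on a malformed line or allocation failure. */
static int index_add_line(struct msg_index *idx, const char *line)
{
    struct index_entry e;

    if (sscanf(line, "%d %8s", &e.msgnum, e.name) != 2)
        return -1;
    if (idx->count == idx->capacity) {
        size_t newcap = idx->capacity ? idx->capacity * 2 : 64;
        struct index_entry *tmp = realloc(idx->entries, newcap * sizeof *tmp);

        if (tmp == NULL)
            return -1;
        idx->entries = tmp;
        idx->capacity = newcap;
    }
    idx->entries[idx->count++] = e;
    return 0;
}

/* Linear lookup from msgnum to hashed name; returns NULL if absent. */
static const char *index_name_for(const struct msg_index *idx, int msgnum)
{
    for (size_t i = 0; i < idx->count; i++)
        if (idx->entries[i].msgnum == msgnum)
            return idx->entries[i].name;
    return NULL;
}
```

A zero-initialized struct msg_index is a valid empty table; feed it each line read from the index file and release idx.entries with free() when done. The whole table lives in memory, one entry per archived message.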

The cost of this feature is that we need to store the messageindex table in memory. If you have thousands and thousands of messages, it may be an issue. On the other hand, we already store so many things in memory that, at that scale, it wouldn't be possible to use hypermail anyway, with or without this option. There may be some slowdown too, both from the hash computation and because we now go through a function to get the message name rather than reading it directly from a structure. There may be some optimizations to do, but this is only a first implementation. Experience will tell us if more work is needed.

I hope you find this new feature as useful as I do.

I took advantage of this commit to fix the make tgz rule. It wasn't working anymore (at least for me). I also updated the FILES file.

I tried to test it and take into account all possible side effects on hypermail. With my current set of options, it works without a hitch. However, hypermail has become much more complex than it was some time ago. I am not sure if this option will work with all the other features, such as GDBM files and so on.

Before, it was easy to understand the code just from the comments, but it's now easy to get lost without some doc :(

Cheers,

-jose

[1] http://www.isthe.com/chongo/tech/comp/fnv/index.html

Received on Mon 30 Sep 2002 06:12:49 PM GMT
