At 02:22 PM 01/24/02 -0800, Diwakar Kannan wrote:
>Hi
>
>I need some ideas on how to hook up a search engine to hypermail archives.
I can offer two suggestions. One would be to use swish-e (http://swish-e.org) together with its CGI script and the script it includes for parsing hypermail archives.
The other option is a script I use to manage a number of hypermail archives which also adds a search box to the archive lists, and manages reindexing and so on. It needs to be updated for archives that are split into directories.
If you want reasonably untested code, and unedited docs I can offer up this script:
> pod2text mail_archive.pl
NAME
mail_archive.pl - Creates and indexes a hypermail archive
SYNOPSIS
mail_archive.pl [options]
Options:
    --mode=[create|update|index]
    --chdir=dir       cd to "dir"
    --config=file     lists config (default: lists.conf)
    --help            brief docs
    --man             full documentation
    --test            show what would happen
    --verbose
DESCRIPTION
mail_archive.pl creates and updates hypermail archives, and is designed
to manage a number of different lists (all configured within a single
configuration file). It is also designed to assist in indexing the
archives for searching with swish-e. Swish-e and its associated files
must be installed.
This script is not designed for use with a very large number of lists, or where there's a high volume of email traffic, due to the startup costs of perl.
If you are reading this with the --man option, you might find the formatting better if you run
perldoc mail_archive.pl
The program can be run in one of three modes:
create mode

In create mode the program scans the list configuration file (lists.conf by default) and creates an archive directory for each defined list that doesn't already exist. A hypermail configuration file is written to this directory, and a symlink is created to the search CGI script.

If a mailbox directory is defined in the config file (lists.conf), all mailbox files are imported into the newly created hypermail archive. The mailbox directory is defined by the mbox_dir setting:

    mbox_dir = /path/to/my/mailboxes

No recursion is done when reading the mailbox directory. If a mailbox file ends in .gz the file will be passed through `gzip(1)' with the `-dc' flags.

By default, the program looks for mailbox files that match the regular expression:

    ^(\d{6,6})(?:\.gz)?$

That is, it's expecting mbox files to look like:

    200112
    200111.gz

The pattern used to match files can be defined by the mbox_match configuration setting. If every file in `mbox_dir' is a mailbox, you can use a pattern to match all files:

    mbox_match = .

Capture parentheses can be used to capture a *numeric* substring. This string is used for sorting the mailbox files when reading in messages (to help put the messages in order by date). The default pattern of:

    ^(\d{6,6})(?:\.gz)?$

will extract the six digits (year plus month) and use them for sorting. If the captured substring is not numeric (or no capture is used), the file is assigned a sort value of zero. A warning will be issued if the captured substring is not numeric. When two files have the same numeric sort value they are sorted by file name.

Once created, the archive will be indexed by swish-e.

Example:

    cd ~/archives
    ./mail_archive.pl --mode=create --verbose

update mode

Update mode is used to read a *single* message from stdin, and route it to the correct archive. This makes configuring with procmail simple:

    MAILDIR=$ARCHIVE_DIR
    :0w
    | ./mail_archive.pl --mode=update

The mail_archive.pl program will return a non-zero exit status on messages that are not delivered to a defined mailing list. A non-zero return will cause procmail to continue processing the message. This allows mail for undefined lists to be delivered normally.

You can avoid this setup and instead direct mail to the archive via an aliases file, which is the more standard approach, but this setup allows one command to manage all lists. It is not designed for a very large number of very high volume lists.

index mode

Index mode is used to reindex the archive with swish-e. This allows the use of cron to better control how often the archives are checked for re-indexing. Only archives that have been added to since the last indexing will be indexed again. For example, to check every ten minutes:

    0,10,20,30,40,50 * * * * ./mail_archive.pl --mode=index --chdir=$HOME/archives
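The mailbox ordering described under create mode can be sketched as follows. This is only an illustration in Python, not the script's actual Perl code; the function name order_mailboxes and the sample file names are made up for the example.

```python
import re

# Default pattern from the documentation: six digits, optional .gz suffix.
DEFAULT_MBOX_MATCH = r"^(\d{6,6})(?:\.gz)?$"


def order_mailboxes(filenames, pattern=DEFAULT_MBOX_MATCH):
    """Return the matching file names in the order they would be imported."""
    regex = re.compile(pattern)
    keyed = []
    for name in filenames:
        match = regex.search(name)
        if match is None:
            continue  # not a mailbox file according to the pattern
        # Use the first capture group, if the pattern defines one.
        captured = match.group(1) if regex.groups >= 1 else None
        if captured is not None and captured.isdecimal():
            sort_value = int(captured)
        else:
            # Missing or non-numeric capture: sort value zero.
            sort_value = 0
        keyed.append((sort_value, name))
    # Tuples compare by numeric value first, then by file name on ties.
    return [name for _, name in sorted(keyed)]


print(order_mailboxes(["200112", "200111.gz", "notes.txt", "200201"]))
# ['200111.gz', '200112', '200201']
```

With `mbox_match = .' every file matches and nothing is captured, so all files get the sort value zero and end up ordered by file name alone.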
INSTALLATION
Create a top-level directory. All the individual list archives will be
created below this directory. The idea is that all paths can then be
relative which makes relocating the archives easy.
For the sake of discussion, we will call the top-level directory:
~/archives
You must also have a reasonably current version of hypermail installed.
You will need a 2.1-dev or later version of swish-e (http://swish-e.org). It's recommended to build swish-e with both zlib and libxml2 support, but neither is required. For example:
    cd
    lwp-download http://swish-e.org/<foo>/<name of swish-e tarball>.tar.gz
    gzip -dc <name of swish-e tarball>.tar.gz | tar xof -
    cd swish-e-<version>
    ./configure --with-zlib --with-libxml2
    make
    make test
In the top-level directory place the following files and directories:
swish-e

Copy the swish-e binary from the swish-e/src directory. This needs to be executable by you and by the web server process. 0755 perms should work.

    cp ~/swish-e-<version>/src/swish-e .
    chmod 0755 swish-e

swish.cgi

This is the CGI script included with the swish-e distribution, located in the swish-e/example directory. Again, it must be executable by the web server process.

    cp ~/swish-e-<version>/example/swish.cgi .
    chmod 0755 swish.cgi

Open up swish.cgi in your editor and make sure the first line of the program points to the location of perl.

modules directory

The swish.cgi script needs a few modules to operate. Copy the modules directory from the swish-e distribution to the ~/archives directory. For example:

    cd ~/archives
    cp -rp ~/swish-e-<version>/example/modules .

These files need read access by the web server.

index_hypermail.pl

Copy the index_hypermail.pl program from the swish-e distribution. This program parses the hypermail formatted messages.

    cd ~/archives
    cp ~/swish-e/prog-bin/index_hypermail.pl .

mail_archive.pl

Place this program (mail_archive.pl) in your top-level directory as well (e.g. ~/archives). Run this program with the --mode=create option:

    chmod 755 mail_archive.pl
    ./mail_archive.pl --mode=create

It will create a few support files if they do not already exist:

    lists.conf        - configuration file for your lists
    swish-e.conf      - configuration file used by swish-e
    indexheader.html  - hypermail template file
    msgheader.html    - hypermail template file

By default, it is expected that swish-e is compiled with libxml2. If this is not the case, then you MUST edit swish-e.conf. Change these lines:

    IndexContents HTML2 .html
    StoreDescription HTML2 <body> 100000

to:

    IndexContents HTML .html
    StoreDescription HTML <body> 100000

It is also HIGHLY recommended that you build swish-e with zlib support for compression of the stored descriptions in the swish-e index.

Now you are ready to use the mail_archive.pl program.
AUTOMATIC UPDATES
Adding new messages to the archive
Before defining your lists in the lists.conf file, you may want to enable automatic updates.
When the mail_archive.pl program is run with the `update' mode, it reads a single message from stdin, and tries to match it up with one of the active lists in the archive. If no match is found, the program returns a non-zero exit status.
For example, if all your mail is processed by procmail, you can add this to your .procmailrc file:
    :0w
    | $HOME/archives/mail_archive.pl --mode=update --chdir=$HOME/archives
Each incoming message will be passed through the mail_archive.pl program, and passed on to hypermail if a list is matched. If no active list is matched, the program exits with a non-zero exit status, and procmail will continue processing.
After this is set up you can define lists in your lists.conf file. A list is activated when you run the program in create mode after defining a new list or lists.
Reindexing the archive
You will want to reindex the archive when new messages are added to keep the swish-e index up to date. Add the following to your crontab:
0,10,20,30,40,50 * * * * cd $HOME/archives && ./mail_archive.pl --mode=index
or the same:
0,10,20,30,40,50 * * * * $HOME/archives/mail_archive.pl --mode=index --chdir=$HOME/archives
Then every ten minutes the program will be run and it will look for any swish-e indexes that need to be updated.
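The documentation does not say how the program decides that an archive has been added to since the last indexing. One plausible way, shown here only as a sketch, is to compare file modification times against the index file. The function name needs_reindex is hypothetical, and it is an assumption that each archive directory holds its own index named index.swish-e (swish-e's default index file name).

```python
import os


def needs_reindex(archive_dir, index_name="index.swish-e"):
    """Return True if any file in archive_dir is newer than the index file."""
    index_path = os.path.join(archive_dir, index_name)
    if not os.path.exists(index_path):
        return True  # never indexed
    index_mtime = os.path.getmtime(index_path)
    for entry in os.scandir(archive_dir):
        # Skip the index itself and its companion files, and anything
        # that is not a regular file.
        if entry.name.startswith(index_name) or not entry.is_file():
            continue
        if os.path.getmtime(entry.path) > index_mtime:
            return True  # something was added after the last indexing
    return False
```

A cron-driven run would then call this check for each archive directory and start swish-e only for the directories where it returns True.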
LIST CONFIGURATION
The lists.conf configuration file defines all your lists. You may define
as many lists as you like. After defining a new list (or lists) run with
the `--mode=create' option to create the new list. Only new lists are
operated on when running in create mode.
Note: You will probably want to have list messages delivered to the mail_archive.pl program before actually creating a new list with the `--mode=create' option to avoid missing any messages. This is discussed above.
The format of the configuration file is described in the configuration file itself.
A configuration file template should have been created automatically in the INSTALLATION section above, but if the configuration file does not exist, simply create a new config by running this program:
mail_archive.pl --mode=create
This creates the default configuration file lists.conf.
Or, to specify a configuration file:
mail_archive.pl --mode=create --config=mylists.conf
Open the configuration file with your favorite editor, and define your lists.
Blank lines and lines that begin with a "#" are ignored.
The configuration file contains a section for every defined list. Sections are defined by placing the description of the list in brackets, followed by configuration settings. Leading white space may be used.
    #------------- pigs -------------------------------
    [ Pig Lovers List Archive ]
        list_email      = pig-lovers_at_piggiesweare.com
        archive_dir     = bacon
        strip_subject   = [Pigs Discussion]
        mbox_dir        = /path/to/mbox/pigs
        mbox_match      = ^pigs(\d{6,6})$
        hypermail_opts  = gmtime=On, showhtml=1
        header_order    = List-Post To Cc
Not all config options are shown above, and not all are required. You can have as many sections as you like. Other than `hypermail_opts', you may not repeat a config option in a section.
You can disable a list simply by placing a ! at the start of the list name:
    # Disable for now
    [! Pig Lovers List Archive ]
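The file format described above (bracketed section names, "#" comments, a leading "!" to disable a list, and `hypermail_opts' as the only repeatable option) is simple enough to sketch a reader for. The following Python is an illustration only, not the script's own parser; parse_lists_conf, the returned structure, and the example.com addresses are all made up for the example.

```python
def parse_lists_conf(text):
    """Parse lists.conf-style text into a list of section dicts."""
    sections = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # leading white space is allowed
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            current = {
                "name": name.lstrip("!").strip(),
                "disabled": name.startswith("!"),
                "settings": {},
            }
            sections.append(current)
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        settings = current["settings"]
        if key == "hypermail_opts":
            # The one option that may be repeated: collect every value.
            settings.setdefault(key, []).append(value)
        elif key in settings:
            raise ValueError(f"{key} repeated in [{current['name']}]")
        else:
            settings[key] = value
    return sections


SAMPLE = """
#------------- pigs -------------------------------
[ Pig Lovers List Archive ]
    list_email = pig-lovers@example.com
    archive_dir = bacon
    strip_subject = [Pigs Discussion]
    hypermail_opts = gmtime=On, showhtml=1
    hypermail_opts = spamprotect = On

# Disable for now
[! Old List ]
    list_email = old@example.com
"""

for section in parse_lists_conf(SAMPLE):
    print(section["name"], section["disabled"], section["settings"])
```

Only the first "=" on a line separates the option name from its value, so values such as `gmtime=On, showhtml=1' pass through untouched.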
List Configuration Options
list_email (required)

`list_email' is the email address of the specific list. It's used as the name of the archive directory unless `archive_dir' is defined, and is used for matching new messages up with the correct list (for routing a new mail message to the correct list) unless `match_string' is defined. See `match_string' for how the matching works.

archive_dir (optional)

Defines the hypermail archive directory. This should be a relative directory (e.g. relative to ~/archives in the examples above).

match_string (optional)

String used to match an incoming message to a list. If this is not set, `list_email' is used.

All the match strings for all the lists are sorted from longest to shortest, then each string is matched with a case-insensitive regular expression against the mail headers. By default the headers are searched in this order:

    List-Post:
    To:
    Cc:

This list can be changed by the `header_order' setting. Currently, Received headers are ignored.

header_order (optional)

Defines the order in which headers are checked for the match string. Case is not important. Do not end the headers with ':'.

    header_order = List-Name To Cc

mbox_dir (optional)

If specified, files listed in this directory will be used to initialize the list's archive. See above for more information.

mbox_match (optional)

Defines the perl regular expression to use to match against file names in the `mbox_dir' directory. See above for more information.

strip_subject (optional)

This simply passes on the setting to hypermail.

hypermail_opts (optional)

Defines parameters that are passed on directly to the hypermail configuration file for this list. The settings must be separated by a comma. This setting may be repeated on more than one line.

By default, the settings used are:

    showhtml = 0
    deleted = X-blabla
    gmtime = On
    warn_surpressions = On

Any setting you specify will override these settings. Example:

    hypermail_opts = gmtime=On, showhtml=1
    hypermail_opts = spamprotect = On

hmrc (optional)

Hypermail config file to use. No need to change this. The default is to use .hmrc in the list's directory.
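The matching rules for `match_string' and `header_order' can be illustrated with a short sketch. This is Python written for the example, not the script's Perl; route_message and the example.com lists are hypothetical. Two details are assumptions, since the documentation does not settle them: the match string is treated as literal text rather than as a regular expression of its own, and each list's headers are tried in turn before moving on to the next list.

```python
import re
from email import message_from_string

DEFAULT_HEADER_ORDER = ["List-Post", "To", "Cc"]

LISTS = [
    {"list_email": "lovers@example.com"},
    {"list_email": "pig-lovers@example.com", "archive_dir": "bacon"},
]


def match_string_for(entry):
    """match_string if set, otherwise list_email."""
    return entry.get("match_string") or entry["list_email"]


def route_message(raw_message, lists):
    """Return the archive directory for the message, or None if no list matches."""
    msg = message_from_string(raw_message)
    # Longest match strings first, so the more specific address wins.
    candidates = sorted(lists, key=lambda e: len(match_string_for(e)), reverse=True)
    for entry in candidates:
        # Assumption: literal, case-insensitive substring match.
        pattern = re.compile(re.escape(match_string_for(entry)), re.IGNORECASE)
        for header in entry.get("header_order", DEFAULT_HEADER_ORDER):
            for value in msg.get_all(header, []):
                if pattern.search(str(value)):
                    # list_email names the directory unless archive_dir is set.
                    return entry.get("archive_dir") or entry["list_email"]
    return None


print(route_message("To: PIG-Lovers@Example.com\n\nhello\n", LISTS))
# bacon
```

Sorting longest first matters here because "lovers@example.com" is a substring of "pig-lovers@example.com"; without it, mail for the second list could be filed under the first. In update mode a result of None corresponds to the non-zero exit status that lets procmail carry on with normal delivery.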
To test your new configuration additions:
./mail_archive.pl --mode=create --verbose --test
which will display what would happen. To actually create the list(s) run:
./mail_archive.pl --mode=create --verbose
WEB SETUP
Ok, it's not completely automatic.
It's up to you how to link the archives to your web site.
One suggestion:
    cd /usr/local/apache/htdocs
    mkdir archives
    cd archives
    ln -s $HOME/archives/somelist
    ln -s $HOME/archives/otherlist
    ...
AUTHOR
Bill Moseley - moseley_at_hank.org
--
Bill Moseley
mailto:moseley_at_hank.org

Received on Fri 25 Jan 2002 03:39:51 AM GMT