Re: [hypermail] Search engine for hypermail archives

From: Bill Moseley
Date: Thu, 24 Jan 2002 17:29:58 -0800

At 02:22 PM 01/24/02 -0800, Diwakar Kannan wrote:
>I need some ideas on how to hook up a search engine to hypermail archives.

I can offer two suggestions. One would be to use swish-e, with its CGI script and the included script for parsing hypermail archives.

The other option is a script I use to manage a number of hypermail archives. It also adds a search box to the archive lists, and manages reindexing and so on. It still needs to be updated for archives that are split into directories.

If you want reasonably untested code and unedited docs, I can offer up this script:

> pod2text

NAME - Creates and indexes a hypermail archive

SYNOPSIS [options]

            --chdir=dir    cd to "dir"
            --config=file  lists config (default: lists.conf)
            --help         brief docs 
            --man          full documentation
            --test         show what would happen

DESCRIPTION
    This script creates and updates hypermail archives, and is designed
    to manage a number of different lists (all configured within a single
    configuration file). It is also designed to assist in indexing the
    archives for searching with swish-e. Swish-e and its associated files
    must be installed.

    This script is not designed for use with a very large number of lists,
    or where there's a high volume of email traffic, due to the startup
    costs of perl.

    If you are reading this with the --man option, you might find the
    formatting better if you run


  The program can be run in one of three modes:

    create mode

       In create mode the program scans the list configuration file
       (lists.conf by default) and creates an archive directory for each
       defined list that doesn't already exist. A hypermail configuration
       file is written to this directory, and a symlink is created to the
       search CGI script.
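
       As a rough illustration of the create step, here is a minimal Python
       sketch (not the script's own Perl code) that only plans the actions,
       in the spirit of the --test option. The function name plan_create is
       hypothetical, and the name of the symlink inside each archive
       directory (swish.cgi) is an assumption. The .hmrc file name comes
       from the hmrc setting described later.

```python
import os


def plan_create(top_dir, archive_dirs, cgi_name="swish.cgi"):
    """Return the actions create mode might take, without performing them.

    archive_dirs is a list of directory names relative to top_dir.
    cgi_name is an assumed name for the symlink to the search CGI script.
    """
    actions = []
    for archive_dir in archive_dirs:
        path = os.path.join(top_dir, archive_dir)
        if os.path.isdir(path):
            continue  # only lists without an existing archive are created
        actions.append(("mkdir", path))
        actions.append(("write", os.path.join(path, ".hmrc")))
        actions.append(("symlink", os.path.join(path, cgi_name)))
    return actions
```

       Returning a list of planned actions keeps the dry run and the real
       run on the same code path; a caller would either print the list or
       carry it out.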

       If a mailbox directory is defined in the config file (lists.conf) all
       mailbox files are imported into the newly created hypermail archive.
       The mailbox directory is defined by the mbox_dir setting:

           mbox_dir = /path/to/my/mailboxes

       No recursion is done when reading the mailbox directory. If a mailbox
       file ends in .gz the file will be passed through `gzip(1)' with the
       `-dc' flags.
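
       The same idea can be sketched in Python. Note the swapped-in
       technique: the script pipes the file through gzip(1), while this
       sketch uses Python's gzip module instead. The function name
       open_mailbox is hypothetical.

```python
import gzip


def open_mailbox(path):
    """Open a mailbox file for reading bytes.

    Files ending in .gz are decompressed on the fly; everything else is
    opened as-is.
    """
    if path.endswith(".gz"):
        return gzip.open(path, "rb")
    return open(path, "rb")
```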

       By default, the program looks for mailbox files that match the
       regular expression:


       That is, it's expecting mbox files to look like:


       The pattern used to match files can be defined by the mbox_match
       configuration setting. If every file in `mbox_dir' is a mailbox, you
       can use a pattern to match all files:

           mbox_match = .

       Capturing parentheses can be used to capture a *numeric* substring.
       This string is used for sorting the mailbox files when reading in
       messages (to help put the messages in numeric order by date). The
       default pattern of:


       Will extract the six digits (year plus month) and use them for
       sorting.

       If the captured pattern is not numeric (or not used), then the file
       name will be assigned the number zero with regard to sorting. A
       warning will be issued if the captured pattern is not numeric. When
       two files have the same numeric sort value they will be sorted by
       file name.
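
       The sorting rules above can be sketched in Python as follows. This
       is an illustration only, not the script's Perl code. The function
       name sort_mailboxes is hypothetical, and the default pattern used
       here is the one from the sample list configuration further down, not
       necessarily the script's built-in default.

```python
import re
import warnings


def sort_mailboxes(filenames, pattern=r"^pigs(\d{6,6})$"):
    """Select mailbox files matching pattern and order them for import.

    The first capture group, if numeric, is the primary sort key.
    A missing or non-numeric capture sorts as zero. Ties fall back to
    the file name.
    """
    regex = re.compile(pattern)
    keyed = []
    for name in filenames:
        match = regex.search(name)
        if not match:
            continue  # not a mailbox file
        captured = match.group(1) if regex.groups else None
        if captured is not None and re.fullmatch(r"[0-9]+", captured):
            number = int(captured)
        else:
            if captured is not None:
                warnings.warn(f"non-numeric sort key in {name!r}")
            number = 0
        keyed.append((number, name))
    # Tuples sort by number first, then by name.
    return [name for number, name in sorted(keyed)]
```

       With `mbox_match = .' there is no capture group, so every file gets
       the sort value zero and the result is ordered by file name alone.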

       Once created, the archive will be indexed by swish-e.


           cd ~/archives
           ./ --create --verbose

    update mode
       Update mode is used to read a *single* message from stdin, and route
       it to the correct archive. This makes configuring with procmail easy:

           :0w
           | ./ --mode=update

       The program will return a non-zero exit status on
       messages that are not delivered to a defined mailing list. A non-zero
       return will cause procmail to continue processing for the message.
       This allows non-defined mail to be delivered normally.

       You can avoid this setup and take the more standard approach of
       directing mail to the archive via an aliases file, but this setup
       allows one command to manage all lists.

       This setup is not designed for a very large number of very high
       volume lists.

    index mode
       Index mode is used to reindex the archive with swish-e. This allows
       the use of cron to better control how often the archives are checked
       for re-indexing. Only archives that have been added to since the last
       indexing will be indexed again.

       For example, to check every ten minutes:

          0,10,20,30,40,50 * * * * ./ --mode=index

INSTALLATION
    Create a top-level directory. All the individual list archives will be
    created below this directory. The idea is that all paths can then be
    relative, which makes relocating the archives easy.

    For the sake of discussion, we will call the top-level directory:


    You must also have a reasonably current version of hypermail installed.

    You will need a 2.1-dev or later version of swish-e.
    It's recommended to build swish-e with both zlib and libxml2 support,
    but neither is required. For example:

        lwp-download<foo>/<name of swish-e tarball>.tar.gz
        gzip -dc <name of swish-e tarball>.tar.gz | tar xof -
        cd swish-e-<version>
        ./configure --with-zlib --with-libxml2
        make test

    In the top-level directory place the following files and directories:


    swish-e
       Copy the swish-e binary from the swish-e/src directory. This needs to
       be executable by you and by the web server process. 0755 perms should
       be sufficient.
           cp ~/swish-e-<version>/src/swish-e .
           chmod 0755 swish-e

    swish.cgi
       This is the CGI script included with the swish-e distribution,
       located in the swish-e/example directory. Again, it must be executable
       by the web server process.

           cp ~/swish-e-<version>/example/swish.cgi .
           chmod 0755 swish.cgi

       Open up swish.cgi in your editor and make sure the first line of the
       program points to the location of perl.

    modules directory.
       The swish.cgi script needs a few modules to operate. Copy the modules
       directory from the swish-e distribution to the ~/archives directory.
       For example:

           cd ~/archives
           cp -rp ~/swish-e-<version>/example/modules .

       These files need read access by the web server.

       Copy the program from the swish-e distribution.
       This program parses the hypermail formatted messages.

           cd ~/archives
           cp ~/swish-e/prog-bin/ .

       Place this program also in your top-level directory.

       Run this program with the --mode=create option:

           chmod 755
           ./ --mode=create

       It will create a few support files if they do not already exist:

           lists.conf          - configuration file for your lists
           swish-e.conf        - configuration file used by swish-e.
           indexheader.html    - hypermail template file
           msgheader.html      - hypermail template file

       By default, it is expected that swish-e is compiled with libxml2. If
       this is not the case, then you MUST edit swish-e.conf:

       Change these lines:

           IndexContents HTML2 .html
           StoreDescription HTML2 <body> 100000

       to:

           IndexContents HTML .html
           StoreDescription HTML <body> 100000

       It is also HIGHLY recommended that you build swish-e with zlib
       support for compression of the stored descriptions in the swish-e
       index.
       Now you are ready to use the program.

  Adding new messages to the archive

    Before defining your lists in the lists.conf file, you may want to
    enable automatic updates.

    When the program is run in `update' mode, it reads a single message
    from stdin, and tries to match it up with one of the active lists in
    the archive. If no match is found, the program returns a non-zero exit
    status.

    For example, if all your mail is processed by procmail, you can add this
    to your .procmailrc file:

        :0w
        | $HOME/archives/ --mode=update --chdir=$HOME/archives

    Each incoming message will be passed through the program, and passed
    on to hypermail if a list is matched. If no list is matched and active,
    the program exits with a non-zero exit status, and procmail will
    continue processing.

    After this is set up you can define lists in your lists.conf file. The
    list will be activated when you run the program in create mode after
    defining a new list or lists.

  Reindexing the archive

    You will want to reindex the archive when new messages are added to keep
    the swish-e index up to date. Add the following to your crontab:

       0,10,20,30,40,50 * * * * cd $HOME/archives && ./ --mode=index

    or, equivalently:

      0,10,20,30,40,50 * * * * $HOME/archives/ --mode=index --chdir=$HOME/archives

    Then every ten minutes the program will be run and it will look for any
    swish-e indexes that need to be updated.
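
    The text does not say how the program decides that an index is out of
    date. One plausible approach, shown here purely as an assumption, is to
    compare file modification times. In this Python sketch the function
    name needs_reindex and the idea of a single index file per archive are
    both hypothetical.

```python
import os


def needs_reindex(archive_dir, index_file):
    """Return True if the archive looks newer than its index.

    Assumption: an archive needs reindexing when the index file is
    missing or when any file directly inside archive_dir has a newer
    modification time than the index file. Subdirectories are not
    scanned.
    """
    if not os.path.exists(index_file):
        return True
    index_path = os.path.abspath(index_file)
    index_mtime = os.path.getmtime(index_file)
    for entry in os.scandir(archive_dir):
        if not entry.is_file():
            continue
        if os.path.abspath(entry.path) == index_path:
            continue  # the index itself does not count as new mail
        if entry.stat().st_mtime > index_mtime:
            return True
    return False
```

    A check like this is cheap, which is what makes running it from cron
    every ten minutes reasonable: most runs would find nothing to do.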

    The lists.conf configuration file defines all your lists. You may define
    as many lists as you like. After defining a new list (or lists) run with
    the `--mode=create' option to create the new list. Only new lists are
    operated on when running in create mode.

    Note: You will probably want to have list messages delivered to the
    program before actually creating a new list with the `--mode=create'
    option to avoid missing any messages. This is discussed above.

    The format of the configuration file is described in the configuration
    file itself.

    A configuration file template should have been created automatically in
    the INSTALLATION section above, but if the configuration file does not
    exist, simply create a new config by running this program with
    --mode=create.

    This creates the default configuration file lists.conf.

    Or, to specify a configuration file, run it with
    --mode=create --config=mylists.conf.

    Open the configuration file with your favorite editor, and define your
    lists.

    Blank lines and lines that begin with a "#" are ignored.

    The configuration file contains a section for every defined list.
    Sections are defined by placing the description of the list in brackets,
    followed by configuration settings. Leading white space may be used.

        #------------- pigs -------------------------------

        [ Pig Lovers List Archive ]
            list_email      =
            archive_dir     = bacon
            strip_subject   = [Pigs Discussion]
            mbox_dir        = /path/to/mbox/pigs
            mbox_match      = ^pigs(\d{6,6})$
            hypermail_opts  = gmtime=On, showhtml=1
            header_order    = List-Post To Cc

    Not all config options are shown above, and not all are required. You
    can have as many sections as you like. Other than `hypermail_opts', you
    may not repeat a config option in a section.

    You can disable a list simply by placing a ! at the start of the list
    name:

        # Disable for now
        [! Pig Lovers List Archive ]
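
    To make the file format concrete, here is a minimal Python sketch of a
    parser that follows the rules described above. It is an illustration,
    not the script's actual parser. The function name parse_lists_conf and
    the shape of the returned data are hypothetical.

```python
def parse_lists_conf(text):
    """Parse lists.conf-style text into a list of section dictionaries."""
    lists = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()  # leading white space is allowed
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            current = {
                "name": name.lstrip("!").strip(),
                "disabled": name.startswith("!"),  # "[! name ]" disables
                "settings": {},
            }
            lists.append(current)
            continue
        if current is None or "=" not in line:
            raise ValueError(f"unexpected line: {raw!r}")
        key, value = (part.strip() for part in line.split("=", 1))
        settings = current["settings"]
        if key == "hypermail_opts":
            # The only setting that may appear more than once.
            settings.setdefault(key, []).append(value)
        elif key in settings:
            raise ValueError(f"{key} repeated in [{current['name']}]")
        else:
            settings[key] = value
    return lists
```

    Splitting on the first "=" only is what lets a value such as
    `gmtime=On, showhtml=1' survive intact, and a value like
    `[Pigs Discussion]' is not mistaken for a section header because the
    line starts with the setting name.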

  List Configuration Options

    list_email (required)

       `list_email' is the email address of the specific list. It's used as
       the name of the archive directory unless `archive_dir' is defined,
       and is used for matching new messages up with the correct list (for
       routing a new mail message to the correct list) unless `match_string'
       is defined. See `match_string' for how the matching works.

    archive_dir (optional)
       Defines the hypermail archive directory. This should be a relative
       directory (e.g. relative to ~/archives in the examples above).

    match_string (optional)
       String used to match an incoming message to a list. If this is not set
       `list_email' is used.

       All the match strings for all the lists are sorted from longest to
       shortest strings, then the string is matched with a case-insensitive
       regular expression against the mail headers.

       By default the headers are searched in this order:


       This list can be changed by the `header_order' setting.

       Currently, Received headers are ignored.

    header_order (optional)
       Defines the order headers are checked for the match string. Case is
       not important. Do not end the headers with ':'.

           header_order    = List-Name To Cc
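
       The routing described under `match_string' and `header_order' can
       be sketched in Python like this. Several details are assumptions:
       the match string is treated as literal text rather than as a
       pattern, headers are the outer loop, and the default header order
       shown is taken from the sample configuration, not from the script's
       built-in default. The function name route_message and the
       example.org addresses in the usage note are hypothetical.

```python
import re
from email import message_from_string


def route_message(raw_message, match_strings,
                  header_order=("List-Post", "To", "Cc")):
    """Return the archive for the first matching list, or None.

    match_strings maps each list's match string to its archive directory.
    """
    msg = message_from_string(raw_message)
    # Longest strings first, so a more specific address wins over a
    # shorter one that happens to be contained in it.
    ordered = sorted(match_strings, key=len, reverse=True)
    for header in header_order:
        for value in msg.get_all(header, []):
            for string in ordered:
                if re.search(re.escape(string), value, re.IGNORECASE):
                    return match_strings[string]
    return None
```

       For example, with lists for pigs@example.org and
       dev-pigs@example.org, a message addressed to dev-pigs@example.org
       contains both strings, and the longest-first ordering routes it to
       the second list. A caller would turn a None result into the
       non-zero exit status that lets procmail continue processing.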

    mbox_dir (optional)
       If specified, files listed in this directory will be used to
       initialize the list's archive. See above for more information.

    mbox_match (optional)
       Defines the perl regular expression to use to match against file
       names in the `mbox_dir' directory. See above for more information.

    strip_subject (optional)
       This simply passes on the setting to hypermail.

    hypermail_opts (optional)
       Define parameters that are passed on directly to the hypermail
       configuration file for this list. The settings must be separated by a
       comma. This setting may be repeated on more than one line.

       By default, the settings used are:

           showhtml = 0
           deleted  = X-blabla
           gmtime   = On
           warn_surpressions = On

       Any setting you specify will override these settings.


           hypermail_opts  = gmtime=On, showhtml=1
           hypermail_opts  = spamprotect = On
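
       A minimal Python sketch of how such settings could be merged over
       the defaults listed above. This is an illustration only; the
       function name merge_hypermail_opts is hypothetical, and the sketch
       assumes that no hypermail value itself contains a comma.

```python
DEFAULTS = {
    "showhtml": "0",
    "deleted": "X-blabla",
    "gmtime": "On",
    "warn_surpressions": "On",
}


def merge_hypermail_opts(option_lines, defaults=DEFAULTS):
    """Merge repeated hypermail_opts lines over the default settings."""
    merged = dict(defaults)
    for line in option_lines:
        for item in line.split(","):  # settings are comma separated
            if not item.strip():
                continue
            key, _, value = item.partition("=")
            merged[key.strip()] = value.strip()  # later values override
    return merged
```

       Calling it with the two example lines above would turn showhtml on,
       add spamprotect, and leave the other defaults untouched.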

    hmrc (optional)
       Hypermail config file to use. No need to change this. The default is
       to use .hmrc in the list's directory.

    To test your new configuration additions:

        ./ --mode=create --verbose --test

    which will display what will happen. To actually create the list(s) run:

        ./ --mode=create --verbose

    Ok, it's not completely automatic.

    It's up to you how to link the archives to your web site.

    One suggestion:

        cd /usr/local/apache/htdocs
        mkdir archives
        cd archives
        ln -s $HOME/archives/somelist
        ln -s $HOME/archives/otherlist

    Bill Moseley

Bill Moseley
Received on Fri 25 Jan 2002 03:39:51 AM GMT

This archive was generated by hypermail 2.2.0 : Thu 22 Feb 2007 07:33:54 PM GMT