Re: [hypermail] Search engine for hypermail archives

From: Bill Moseley <moseley_at_hank.org_at_hypermail-project.org>
Date: Thu, 24 Jan 2002 17:29:58 -0800
Message-Id: <3.0.3.32.20020124172958.022b4e24_at_pop3.hank.org>


At 02:22 PM 01/24/02 -0800, Diwakar Kannan wrote:
>Hi
>
>I need some ideas on how to hook up a search engine to hypermail archives.

I can offer two suggestions. One would be to use swish-e (http://swish-e.org) with its CGI script and the included script for parsing hypermail archives.

The other option is a script I use to manage a number of hypermail archives which also adds a search box to the archive lists, and manages reindexing and so on. It needs to be updated for archives that are split into directories.

If you want reasonably untested code and unedited docs, I can offer up this script:

> pod2text mail_archive.pl

NAME
    mail_archive.pl - Creates and indexes a hypermail archive

SYNOPSIS
    mail_archive.pl [options]

        Options:
            --mode=[create|update|index]
            --chdir=dir    cd to "dir"
            --config=file  lists config (default: lists.conf)
            --help         brief docs 
            --man          full documentation
            --test         show what would happen
            --verbose

DESCRIPTION
    mail_archive.pl creates and updates hypermail archives, and is designed
    to manage a number of different lists (all configured within a single
    configuration file). It is also designed to assist in indexing the
    archives for searching with swish-e. Swish-e and its associated files
    must be installed.

    This script is not designed for use with a very large number of lists,
    or where there's a high volume of email traffic, due to the startup
    costs of perl.

    If you are reading this with the --man option, you might find the
    formatting better if you run:

        perldoc mail_archive.pl

  The program can be run in one of three modes:

    create mode

       In create mode the program scans the list configuration file
       (lists.conf by default) and creates an archive directory for each
       defined list that doesn't already exist. A hypermail configuration
       file is written to this directory, and a symlink is created to the
       search CGI script.

       If a mailbox directory is defined in the config file (lists.conf) all
       mailbox files are imported into the newly created hypermail archive.
       The mailbox directory is defined by the mbox_dir setting:

           mbox_dir = /path/to/my/mailboxes
    
       No recursion is done when reading the mailbox directory. If a mailbox
       file ends in .gz the file will be passed through `gzip(1)' with the
       `-dc' flags.

       By default, the program looks for mailbox files that match the
       regular expression:

           ^(\d{6,6})(?:\.gz)?$

       That is, it's expecting mbox files to look like:

           200112
           200111.gz

       The pattern used to match files can be defined by the mbox_match
       configuration setting. If every file in `mbox_dir' is a mailbox, you
       can use a pattern to match all files:

           mbox_match = .

       Capturing parentheses can be used to capture a *numeric* substring.
       This string is used for sorting the mailbox files when reading in
       messages (to help put the messages in numeric order by date). The
       default pattern of:

           ^(\d{6,6})(?:\.gz)?$

       will extract the six digits (year plus month) and use them for
       sorting.

       If the captured pattern is not numeric (or not used), then the file
       name will be assigned the number zero with regard to sorting. A
       warning will be issued if the captured pattern is not numeric. When
       two files have the same numeric sort value they will be sorted by
       file name.

       Once created, the archive will be indexed by swish-e.

       Example:

           cd ~/archives
           ./mail_archive.pl --mode=create --verbose
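
       The file selection and sort order described above can be sketched
       in Python. This only illustrates the documented rules and is not
       taken from mail_archive.pl (which is Perl); the function name
       sort_mailboxes is made up for the example, and the warning for
       non-numeric captures is left out.

```python
import re

# Default pattern quoted in the documentation above.
DEFAULT_MBOX_MATCH = r"^(\d{6,6})(?:\.gz)?$"


def sort_mailboxes(filenames, pattern=DEFAULT_MBOX_MATCH):
    """Return the matching file names in the order they would be read.

    Hypothetical helper: files that do not match the pattern are skipped,
    the first captured group (if numeric) is the sort key, and ties are
    broken by file name.
    """
    regex = re.compile(pattern)
    keyed = []
    for name in filenames:
        match = regex.search(name)
        if match is None:
            continue  # not treated as a mailbox file
        captured = match.group(1) if regex.groups else None
        # A missing or non-numeric capture sorts as zero.
        if captured and re.fullmatch(r"[0-9]+", captured):
            number = int(captured)
        else:
            number = 0
        keyed.append((number, name))
    # Tuples sort by number first, then by file name.
    return [name for _, name in sorted(keyed)]
```

       With the default pattern, a directory holding 200112, 200111.gz
       and README would yield 200111.gz followed by 200112, and README
       would be skipped.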

    update mode
       Update mode is used to read a *single* message from stdin, and route
       it to the correct archive. This makes configuring with procmail
       simple:

           MAILDIR=$ARCHIVE_DIR
           :0 w
           | ./mail_archive.pl --mode=update

       The mail_archive.pl program will return a non-zero exit status on
       messages that are not delivered to a defined mailing list. A non-zero
       return will cause procmail to continue processing for the message.
       This allows non-defined mail to be delivered normally.

       You can avoid this setup and take the more standard approach of
       directing mail to the archive via an aliases file, but this setup
       allows one command to manage all lists.

       This setup is not designed for a very large number of very high
       volume lists.

    index mode
       Index mode is used to reindex the archive with swish-e. This allows
       the use of cron to better control how often the archives are checked
       for re-indexing. Only archives that have been added to since the last
       indexing will be indexed again.

       For example, to check every ten minutes:

          0,10,20,30,40,50 * * * * ./mail_archive.pl --mode=index --chdir=$HOME/archives
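
       The documentation does not say how the script decides that an
       archive has been added to since the last indexing. One plausible
       approach, shown here purely as an assumption, is to compare file
       modification times. The index file name index.swish-e and the
       function name needs_reindex are hypothetical.

```python
import os


def needs_reindex(archive_dir, index_name="index.swish-e"):
    """Return True if any .html file is newer than the index file.

    Assumption for illustration only: the real script may track changes
    in a different way. A missing index counts as "needs indexing".
    """
    index_path = os.path.join(archive_dir, index_name)
    if not os.path.exists(index_path):
        return True
    indexed_at = os.path.getmtime(index_path)
    for entry in os.scandir(archive_dir):
        if entry.is_file() and entry.name.endswith(".html"):
            if entry.stat().st_mtime > indexed_at:
                return True
    return False
```

       A cron job could call such a check for each archive directory and
       run the indexer only where it returns True.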

INSTALLATION
    Create a top-level directory. All the individual list archives will be
    created below this directory. The idea is that all paths can then be
    relative which makes relocating the archives easy.

    For the sake of discussion, we will call the top-level directory:

        ~/archives

    You must also have a reasonably current version of hypermail installed.

    You will need a 2.1-dev or later version of swish-e (http://swish-e.org).
    It's recommended to build swish-e with both zlib and libxml2 support,
    but neither is required. For example:

        cd
        lwp-download http://swish-e.org/<foo>/<name of swish-e tarball>.tar.gz
        gzip -dc <name of swish-e tarball>.tar.gz | tar xof -
        cd swish-e-<version>
        ./configure --with-zlib --with-libxml2
        make
        make test
    

    In the top-level directory place the following files and directories:

    swish-e

       Copy the swish-e binary from the swish-e/src directory. This needs to
       be executable by you and by the web server process. 0755 perms should
       work.

           cp ~/swish-e-<version>/src/swish-e .
           chmod 0755 swish-e
    
    swish.cgi
       This is the CGI script included with the swish-e distribution,
       located in the swish-e/example directory. Again, it must be
       executable by the web server process.

           cp ~/swish-e-<version>/example/swish.cgi .
           chmod 0755 swish.cgi
    
       Open up swish.cgi in your editor and make sure the first line of the
       program points to the location of perl.

    modules directory
       The swish.cgi script needs a few modules to operate. Copy the modules
       directory from the swish-e distribution to the ~/archives directory.
       For example

           cd ~/archives
           cp -rp ~/swish-e-<version>/example/modules .

       These files must be readable by the web server.

    index_hypermail.pl
       Copy the index_hypermail.pl program from the swish-e distribution.
       This program parses the hypermail formatted messages.

           cd ~/archives
           cp ~/swish-e-<version>/prog-bin/index_hypermail.pl .

    mail_archive.pl
       Also place this program (mail_archive.pl) in your top-level directory
       (e.g. ~/archives).

       Run this program with the --mode=create option:

           chmod 755 mail_archive.pl
           ./mail_archive.pl --mode=create

       It will create a few support files if they do not already exist:

           lists.conf          - configuration file for your lists
           swish-e.conf        - configuration file used by swish-e
           indexheader.html    - hypermail template file
           msgheader.html      - hypermail template file

       By default, it is expected that swish-e is compiled with libxml2. If
       this is not the case, then you MUST edit swish-e.conf:

       Change these lines:

           IndexContents HTML2 .html
           StoreDescription HTML2 <body> 100000

       to:

           IndexContents HTML .html
           StoreDescription HTML <body> 100000

       It is also HIGHLY recommended that you build swish-e with zlib
       support for compression of the stored descriptions in the swish-e
       index.

       Now you are ready to use the mail_archive.pl program.

AUTOMATIC UPDATES
  Adding new messages to the archive

    Before defining your lists in the lists.conf file, you may want to
    enable automatic updates.

    When the mail_archive.pl program is run with the `update' mode, it reads
    a single message from stdin, and tries to match it up with one of the
    active lists in the archive. If no match is found, the program returns a
    non-zero exit status.

    For example, if all your mail is processed by procmail, you can add this
    to your .procmailrc file:

        :0 w
        | $HOME/archives/mail_archive.pl --mode=update --chdir=$HOME/archives

    Each incoming message will be passed through the mail_archive.pl
    program, and passed on to hypermail if a list is matched. If no active
    list is matched, the program exits with a non-zero exit status, and
    procmail will continue processing.

    After this is set up you can define lists in your lists.conf file. A
    list will be activated when you run the program in create mode after
    defining a new list or lists.

  Reindexing the archive

    You will want to reindex the archive when new messages are added to keep
    the swish-e index up to date. Add the following to your crontab:

       0,10,20,30,40,50 * * * * cd $HOME/archives && ./mail_archive.pl --mode=index

    or, equivalently:

      0,10,20,30,40,50 * * * * $HOME/archives/mail_archive.pl --mode=index --chdir=$HOME/archives

    Then every ten minutes the program will be run and it will look for any
    swish-e indexes that need to be updated.

LIST CONFIGURATION
    The lists.conf configuration file defines all your lists. You may define
    as many lists as you like. After defining a new list (or lists) run with
    the `--mode=create' option to create the new list. Only new lists are
    operated on when running in create mode.

    Note: You will probably want to have list messages delivered to the
    mail_archive.pl program before actually creating a new list with the
    `--mode=create' option to avoid missing any messages. This is discussed
    above.

    The format of the configuration file is described in the configuration
    file itself.

    A configuration file template should have been created automatically in
    the INSTALLATION section above, but if the configuration file does not
    exist, simply create a new config by running this program:

       mail_archive.pl --mode=create

    This creates the default configuration file lists.conf.

    Or, to specify a configuration file:

       mail_archive.pl --mode=create --config=mylists.conf

    Open the configuration file with your favorite editor, and define your
    lists.

    Blank lines and lines that begin with a "#" are ignored.

    The configuration file contains a section for every defined list.
    Sections are defined by placing the description of the list in brackets,
    followed by configuration settings. Leading white space may be used.

        #------------- pigs -------------------------------

        [ Pig Lovers List Archive ]
        
            list_email      = pig-lovers_at_piggiesweare.com
            archive_dir     = bacon
            strip_subject   = [Pigs Discussion]
            mbox_dir        = /path/to/mbox/pigs
            mbox_match      = ^pigs(\d{6,6})$
            hypermail_opts  = gmtime=On, showhtml=1
            header_order    = List-Post To Cc

    Not all config options are shown above, and not all are required. You
    can have as many sections as you like. Other than `hypermail_opts', you
    may not repeat a config option in a section.

    You can disable a list simply by placing a ! at the start of the list
    name:

        # Disable for now
        [! Pig Lovers List Archive ]
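
    As a rough illustration of the file format described above, here is a
    small Python reader for lists.conf-style text. It is a sketch built only
    from the rules stated in this section (comments, bracketed sections, the
    ! prefix, repeatable hypermail_opts), not the parser used by
    mail_archive.pl. The function name and the shape of the returned
    dictionaries are assumptions.

```python
def parse_lists_conf(text):
    """Parse lists.conf-style text into a list of section dictionaries.

    Hypothetical sketch: each section becomes a dict with "name",
    "disabled", "settings" and "hypermail_opts" keys.
    """
    sections = []
    current = None
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()  # leading white space is allowed
        if not line or line.startswith("#"):
            continue  # blank lines and comments are ignored
        if line.startswith("[") and line.endswith("]"):
            name = line[1:-1].strip()
            current = {
                "name": name.lstrip("!").strip(),
                "disabled": name.startswith("!"),  # "[! name ]" disables
                "settings": {},
                "hypermail_opts": {},
            }
            sections.append(current)
            continue
        key, sep, value = line.partition("=")
        key, value = key.strip(), value.strip()
        if not sep or current is None:
            raise ValueError(f"line {lineno}: unexpected text: {raw!r}")
        if key == "hypermail_opts":
            # Comma-separated name=value pairs; may appear on several lines.
            for item in value.split(","):
                opt, _, opt_value = item.partition("=")
                current["hypermail_opts"][opt.strip()] = opt_value.strip()
        elif key in current["settings"]:
            raise ValueError(f"line {lineno}: {key} repeated in section")
        else:
            current["settings"][key] = value
    return sections
```

    Splitting each line on the first "=" only keeps values such as the
    mbox_match pattern or "gmtime=On, showhtml=1" intact.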

  List Configuration Options

    list_email (required)

       `list_email' is the email address of the specific list. It's used as
       the name of the archive directory unless `archive_dir' is defined,
       and is used for matching new messages up with the correct list (for
       routing a new mail message to the correct list) unless `match_string'
       is defined. See `match_string' for how the matching works.

    archive_dir (optional)
       Defines the hypermail archive directory. This should be a relative
       directory (e.g. relative to ~/archives in the examples above).

    match_string (optional)
       String used to match an incoming message to a list. If this is not set
       `list_email' is used.

       All the match strings for all the lists are sorted from longest to
       shortest strings, then the string is matched with a case-insensitive
       regular expression against the mail headers.

       By default the headers are searched in this order:

           List-Post:
           To:
           Cc:

       This list can be changed by the `header_order' setting.

       Currently, Received headers are ignored.

    header_order (optional)
       Defines the order headers are checked for the match string. Case is
       not important. Do not end the headers with ':'.

           header_order    = List-Name To Cc

    mbox_dir (optional)
       If specified, files listed in this directory will be used to
       initialize the list's archive. See above for more information.

    mbox_match (optional)
       Defines the perl regular expression to use to match against file
       names in the `mbox_dir' directory. See above for more information.

    strip_subject (optional)
       This simply passes on the setting to hypermail.

    hypermail_opts (optional)
       Define parameters that are passed on directly to the hypermail
       configuration file for this list. The settings must be separated by a
       comma. This setting may be repeated on more than one line.

       By default, the settings used are:

           showhtml = 0
           deleted  = X-blabla
           gmtime   = On
           warn_surpressions = On

       Any setting you specify will override these settings.

       Example:

           hypermail_opts  = gmtime=On, showhtml=1
           hypermail_opts  = spamprotect = On

    hmrc (optional)
       Hypermail config file to use. No need to change this. The default is
       to use .hmrc in the list's directory.
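
    The matching rules described under match_string and header_order can be
    sketched in Python as follows. This is an illustration under stated
    assumptions, not the script's actual code: the match string is treated
    as literal text (escaped before matching), each list is tried in turn
    against its headers in order, and the names route and
    DEFAULT_HEADER_ORDER are invented for the example.

```python
import re
from email import message_from_string

# Default header search order given in the documentation above.
DEFAULT_HEADER_ORDER = ("List-Post", "To", "Cc")


def route(message_text, lists):
    """Return the name of the first list that matches, or None.

    `lists` is a sequence of dicts with "name" and "list_email" keys and
    optional "match_string" and "header_order" keys (hypothetical layout).
    """
    msg = message_from_string(message_text)

    def match_string(entry):
        # match_string falls back to list_email when it is not set.
        return entry.get("match_string") or entry["list_email"]

    # Longest strings first, so a more specific string is tried before a
    # shorter one that it happens to contain.
    ordered = sorted(lists, key=lambda e: len(match_string(e)), reverse=True)
    for entry in ordered:
        pattern = re.compile(re.escape(match_string(entry)), re.IGNORECASE)
        for header in entry.get("header_order", DEFAULT_HEADER_ORDER):
            for value in msg.get_all(header, []):
                if pattern.search(value):
                    return entry["name"]
    return None
```

    A caller in update mode would exit with a non-zero status when route
    returns None, which lets procmail continue with normal delivery.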

    To test your new configuration additions:

        ./mail_archive.pl --mode=create --verbose --test

    which will display what will happen. To actually create the list(s) run:

        ./mail_archive.pl --mode=create --verbose

WEB SETUP
    Ok, it's not completely automatic.

    It's up to you how to link the archives to your web site.

    One suggestion:

        cd /usr/local/apache/htdocs
        mkdir archives
        cd archives
        ln -s $HOME/archives/somelist
        ln -s $HOME/archives/otherlist
        ...

AUTHOR
    Bill Moseley - moseley_at_hank.org

-- 
Bill Moseley
mailto:moseley_at_hank.org
Received on Fri 25 Jan 2002 03:39:51 AM GMT

This archive was generated by hypermail 2.2.0 : Thu 22 Feb 2007 07:33:54 PM GMT