Threading by subject

From: Tom von Alten <>
Date: Fri, 7 May 1999 10:41:11 -0600
Message-ID: <002001be98a8$6a74f4d0$>

In the "Duplicate message ids" thread, Daniel Stenberg wrote:
> Yes (hypermail v1.02) did (try to thread messages with certain
> modified patterns of the subject). Although it did mess around
> too much in the original strings (i.e it wrote zero-bytes etc
> in them) for me to be able to keep them. When I cleaned up the
> thread and hash mess I never got around to add code in my
> newly written threadprint.c that checks for replies in other ways than
> In-Reply-To headers.

Threading messages whose subjects MUAs modify in predictable ways was one of the first things I undertook with our hypermail v1+ implementation, although I did it in a wrapper script rather than within hypermail.

We're still using the v1+ code and a wrapper script, but it does not properly handle multiple messages arriving close together. I'm hoping to move to v2.x soon and do away with the external bits, to take advantage of the many improvements that have been made.

However, the threading we have is not something we want to lose, and there are too many ways for the In-Reply-To approach to fail. (To name a few: broken MUAs; replying to a message sent to multiple archives, or to an archive and cc's; the sender choosing to start a fresh message, copy the subject and quote as needed; recomposition of a "reply" by the sender.)

Our conceptual approach may be of interest. It's done with a shell script and a variety of Unix tools, so I don't think the code itself is worth posting.

The process is:

  1. Remove any combination of defined subject prefixes, regardless of case and nesting. We do "re:" and "betr.:", with any bracketed number. "fw:" should be in there, too, but I decided early on to skip that, and never went back and added it in. As Daniel pointed out (and our inclusion of "Betr." shows), there's an element of localization involved.
  2. This leaves a string I called the "thread" (as opposed to "subject").
  3. To speed processing, all the thread strings are saved in a file. The new candidate is compared, independent of case, against that file to see if there are any matches. If one is found, the subject is changed to "Re: $thread", where "$thread" is the canonical version from the file.
  4. If no match is found, and some prefixes were stripped, change the subject to the de-prefixed version (preserving whatever case was used in the source message). This new thread string is added to the thread file.
  5. Pass the possibly modified message into hypermail, where it will be threaded based on an exact match, or Re: + match.
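
The steps above can be sketched roughly as follows. This is a minimal Python sketch, not the shell script described in this message: the function names, the prefix pattern (including how a bracketed number is written, as in "Re[2]:") and the in-memory dict standing in for the thread file are all assumptions made for illustration. It also assumes that every new thread string is recorded, including one from a message that had no prefix at all.

```python
import re

# Assumed prefix pattern: "re" or "betr.", an optional bracketed number
# such as "Re[2]:", then one or more colons. "fw:" is left out on purpose,
# matching the description in the message.
PREFIX = re.compile(r"^\s*(?:re|betr\.?)\s*(?:\[\d+\])?\s*:+\s*", re.IGNORECASE)


def strip_prefixes(subject):
    """Steps 1-2: peel off nested prefixes; return (thread, was_stripped)."""
    thread = subject.strip()
    stripped = False
    while True:
        m = PREFIX.match(thread)
        if m is None:
            return thread, stripped
        thread = thread[m.end():]
        stripped = True


def rewrite_subject(subject, threads):
    """Steps 3-4: `threads` maps a lowercased thread to its canonical form.

    It stands in for the thread file (hypothetical in-memory version).
    """
    thread, stripped = strip_prefixes(subject)
    key = thread.lower()
    if key in threads:
        # Case-independent match: reuse the canonical spelling.
        return "Re: " + threads[key]
    # No match: remember this thread (assumption: also when unprefixed).
    threads[key] = thread
    return thread if stripped else subject
```

With an initially empty `threads` dict, a first message "Threading by subject" passes through unchanged, and a later "RE: re[2]: threading BY subject" would be rewritten to "Re: Threading by subject" before being handed to hypermail (step 5).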

Obviously, the approach within hypermail has to be different, but I think hypermail already takes the trouble to read all the messages (headers?) from the archive, so there wouldn't be a significant performance penalty.
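
As a rough illustration of why the penalty should be small, here is a hedged Python sketch of a single pass over messages that are already in memory. It is not hypermail's code (hypermail is written in C); the message layout (dicts with "id", "subject" and an optional "in_reply_to"), the function names, and the choice to attach a subject match to the first message of its thread are all assumptions.

```python
import re

# Same assumed prefix pattern as a wrapper script might use.
PREFIX = re.compile(r"^\s*(?:re|betr\.?)\s*(?:\[\d+\])?\s*:+\s*", re.IGNORECASE)


def thread_key(subject):
    """Strip nested prefixes and lowercase, giving a case-independent key."""
    s = subject.strip()
    while True:
        m = PREFIX.match(s)
        if m is None:
            return s.lower()
        s = s[m.end():]


def assign_parents(messages):
    """Return {message id: parent id or None} for messages in arrival order.

    In-Reply-To wins when it names a message already seen; otherwise the
    subject-derived thread key is used as a fallback.
    """
    known_ids = set()
    first_by_thread = {}  # thread key -> id of the first message seen
    parents = {}
    for msg in messages:
        key = thread_key(msg["subject"])
        parent = msg.get("in_reply_to")
        if parent not in known_ids:
            parent = first_by_thread.get(key)
        parents[msg["id"]] = parent
        known_ids.add(msg["id"])
        first_by_thread.setdefault(key, msg["id"])
    return parents
```

Each message costs one dictionary lookup on top of the header parsing that has to happen anyway, which is the basis for expecting no significant slowdown.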

_____________ Hewlett-Packard Computer Peripherals Bristol Tom von Alten

          This posting is for informational purposes only.
          It is not a statement of the Hewlett-Packard Co.
Received on Fri 07 May 1999 06:40:56 PM GMT
