proposed new linkquotes option

From: Peter C. McCluskey <>
Date: Fri, 17 Sep 1999 15:12:44 -0700
Message-Id: <>

(Daniel Stenberg) writes:
>On Fri, 10 Sep 1999, Peter C. McCluskey wrote:
>> I currently have about 1700 lines of code in new modules devoted to the
>> linkquotes option
>Wow, that's a lot of code. Could you take a moment to describe the
>functionality of that feature? With a focus on implementation issues and

 Here's an overview of the more important features of the new modules.

 At the start of any section of quoted text in new messages, if hashreplynumlookup returns a match, it reads through the indicated archive file for an exact match.
 If that fails, it calls the search_for_quote function (described below).

 If either approach finds the source of the quoted text, the quoted message is rewritten to add an <A NAME="nnnnqlinkm">...</A> around the quoted text, where nnnn is the number of the quoting message, and m is a number which distinguishes multiple links between the same pair of messages.  The first line of the quoted text in the quoting message (or set_quote_link_string if specified) is then output as a link to that anchor.
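
 As a rough illustration of the anchor naming described above, here is a minimal sketch. The helper name format_quote_anchor is hypothetical and is not taken from the new modules, and zero-padding the message number to four digits is an assumption based on the "nnnn" in the description.

```c
#include <stdio.h>

/*
 * Hypothetical helper (not from the linkquotes modules): builds an anchor
 * name of the form "nnnnqlinkm" into buf.
 * Assumption: nnnn is the quoting message number zero-padded to 4 digits.
 * Returns 0 on success, -1 if buf is too small.
 */
int format_quote_anchor(char *buf, size_t buflen,
                        int quoting_msgnum, int link_index)
{
    int n = snprintf(buf, buflen, "%04dqlink%d", quoting_msgnum, link_index);

    return (n >= 0 && (size_t)n < buflen) ? 0 : -1;
}
```

 For message 42 and link 1 this gives the name 0042qlink1, which would go in the <A NAME="..."> attribute wrapped around the quoted text.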

 After each call to printbody that links such quotes finishes, replylist is updated to indicate a definite reply relationship if it didn't list this pair of messages, or listed them as "maybereply".  The latest message is also rewritten to replace any "Maybe in reply to:"s with a single "In reply to:".


 This module saves message body text in the following tree structure intended for fast text search:

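struct bigram_list
{
	const struct body *bp;		/* where the pair occurs */
	short offset;
	struct bigram_list *next;	/* next place the same pair occurs */
};

struct bigram_tree_entry
{
	struct bigram_tree_entry *left;
	struct bigram_tree_entry *right;
	struct bigram_list list;	/* each place this pair is found */
	BIGRAM_TYPE bigram1;		/* first word of the pair */
	BIGRAM_TYPE bigram2;		/* second word of the pair */
};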

 BIGRAM_TYPE is an integral type (currently unsigned long) which represents a word.
 Each word (a sequence of alphanumeric chars; 20 common words are excluded) is converted to a BIGRAM_TYPE value, with each unique word being assigned a different number. Non-alphanumeric characters are discarded.  Each node in the tree represents a pair of words that occur consecutively in the archive, with a list indicating each place the pair is found.
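
 To make the word-to-number step concrete, here is a minimal sketch of one way it could work. Everything in it is an assumption for illustration: the names text_to_ids, word_id and is_common, the fixed-size linear lookup table, the lowercasing of words, and the short stop list (the post does not say which 20 words are excluded). The real modules may do this quite differently.

```c
#include <ctype.h>
#include <string.h>

typedef unsigned long BIGRAM_TYPE;

#define MAX_WORDS    256
#define MAX_WORD_LEN 32

/* Illustrative stop list only; the actual 20 excluded words are not given. */
static const char *const common_words[] = { "the", "a", "of", "and", "to" };

static char word_table[MAX_WORDS][MAX_WORD_LEN];
static size_t word_count;

static int is_common(const char *w)
{
    size_t i;

    for (i = 0; i < sizeof common_words / sizeof common_words[0]; i++)
        if (strcmp(w, common_words[i]) == 0)
            return 1;
    return 0;
}

/* Returns the number for w, assigning the next unused one (starting at 1)
 * the first time a word is seen.  Returns 0 if the table is full. */
static BIGRAM_TYPE word_id(const char *w)
{
    size_t i;

    for (i = 0; i < word_count; i++)
        if (strcmp(word_table[i], w) == 0)
            return (BIGRAM_TYPE)(i + 1);
    if (word_count == MAX_WORDS)
        return 0;
    strcpy(word_table[word_count], w);  /* w holds at most MAX_WORD_LEN-1 chars */
    return (BIGRAM_TYPE)(++word_count);
}

/* Splits text into alphanumeric words, drops common words, and stores one
 * number per remaining word in ids[].  Words longer than MAX_WORD_LEN-1
 * chars are truncated.  Returns how many numbers were stored. */
size_t text_to_ids(const char *text, BIGRAM_TYPE *ids, size_t max_ids)
{
    char word[MAX_WORD_LEN];
    size_t len = 0, n = 0;
    const char *p;

    for (p = text; ; p++) {
        unsigned char c = (unsigned char)*p;

        if (c != '\0' && isalnum(c)) {
            if (len < MAX_WORD_LEN - 1)
                word[len++] = (char)tolower(c);   /* lowercasing is assumed */
        } else {
            if (len > 0) {                        /* end of a word */
                word[len] = '\0';
                if (!is_common(word) && n < max_ids) {
                    BIGRAM_TYPE id = word_id(word);
                    if (id != 0)
                        ids[n++] = id;
                }
                len = 0;
            }
            if (c == '\0')
                break;
        }
    }
    return n;
}
```

 With this sketch, "The quick fox, the QUICK dog!" becomes the numbers 1, 2, 1, 3, and the consecutive pairs (1,2), (2,1) and (1,3) are what would each get a node in the tree.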

 A call to analyze_headers shortly after calling parsemail fills the bigram structure with text from all new messages plus old messages reread in loadoldheaders as limited by set_searchbackmsgnum.

 Then during printbody, the following function searches for the best match

int search_for_quote(char *search_line, const char *exact_line, int max_msgnum,
                     String_Match *match_info);

in messages numbered less than max_msgnum, mainly by converting search_line to BIGRAM_TYPE values and searching for the location with the most consecutive BIGRAM_TYPE values that match.
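
 The "most consecutive matches" idea can be sketched with a naive scan over two arrays of word numbers. This is not how search_for_quote is implemented: the tree of word pairs exists precisely so that candidate locations can be looked up instead of scanned. The function name longest_common_run and its interface are hypothetical.

```c
#include <stddef.h>

typedef unsigned long BIGRAM_TYPE;

/*
 * Hypothetical brute-force stand-in for the tree lookup: returns the length
 * of the longest run of consecutive word numbers that appears in both query
 * and body, and stores the position in body where that run starts in
 * *best_offset (0 if nothing matches).
 */
size_t longest_common_run(const BIGRAM_TYPE *query, size_t qlen,
                          const BIGRAM_TYPE *body, size_t blen,
                          size_t *best_offset)
{
    size_t best = 0, i, j;

    *best_offset = 0;
    for (i = 0; i < blen; i++) {
        for (j = 0; j < qlen; j++) {
            size_t k = 0;

            /* extend the run while both sequences keep agreeing */
            while (i + k < blen && j + k < qlen &&
                   body[i + k] == query[j + k])
                k++;
            if (k > best) {
                best = k;
                *best_offset = i;
            }
        }
    }
    return best;
}
```

 The scan is far too slow for a whole archive, but it shows what "best match" means here: the candidate whose longest matching run of words is longest wins.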


 The find_quote_prefix function looks through a message body to decide what prefix is most likely being used to indicate quoted text. It looks from the start of each line up to the first alphanumeric char, and counts how many times each unique prefix occurs. A bias is added in favor of prefixes containing '>'. The prefix with the highest count is selected. If there is a tie, the longest prefix is used. There are also provisions for counting partial matches, and for often deciding there is no quoted text if no prefix occurs more than once.
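
 Here is a simplified sketch of that heuristic, under several assumptions: the function name guess_quote_prefix and its interface are hypothetical, the bias for '>' is taken to be +1 on the count (the post gives no value), partial matches are not handled, and a prefix seen only once is always rejected.

```c
#include <ctype.h>
#include <string.h>

#define MAX_PREFIXES   32
#define MAX_PREFIX_LEN 16

struct prefix_count {
    char text[MAX_PREFIX_LEN];
    int count;
};

/*
 * Hypothetical simplification of the prefix heuristic.  Copies the most
 * likely quote prefix into out (at least MAX_PREFIX_LEN bytes) and
 * returns 1, or returns 0 if no prefix occurs more than once.
 */
int guess_quote_prefix(const char *const *lines, size_t nlines, char *out)
{
    struct prefix_count table[MAX_PREFIXES];
    size_t ntable = 0, i, j;
    int best = -1, best_score = 0;

    for (i = 0; i < nlines; i++) {
        size_t len = 0;

        /* the prefix is everything before the first alphanumeric char */
        while (lines[i][len] != '\0' &&
               !isalnum((unsigned char)lines[i][len]))
            len++;
        if (len == 0 || len >= MAX_PREFIX_LEN)
            continue;               /* no prefix, or too long for this sketch */

        for (j = 0; j < ntable; j++)
            if (strlen(table[j].text) == len &&
                memcmp(table[j].text, lines[i], len) == 0)
                break;
        if (j == ntable) {          /* first time this prefix is seen */
            if (ntable == MAX_PREFIXES)
                continue;
            memcpy(table[j].text, lines[i], len);
            table[j].text[len] = '\0';
            table[j].count = 0;
            ntable++;
        }
        table[j].count++;
    }

    for (j = 0; j < ntable; j++) {
        int score;

        if (table[j].count < 2)
            continue;               /* seen only once: not treated as quoting */
        /* assumed bias: +1 for prefixes containing '>' */
        score = table[j].count + (strchr(table[j].text, '>') != NULL ? 1 : 0);
        if (best < 0 || score > best_score ||
            (score == best_score &&
             strlen(table[j].text) > strlen(table[best].text))) {
            best = (int)j;          /* higher score wins; ties go to the longer */
            best_score = score;
        }
    }
    if (best < 0)
        return 0;
    strcpy(out, table[best].text);
    return 1;
}
```

 In a body where two lines start with "> " and the other lines have no repeated prefix, this picks "> ". If "> " and ">> " are equally frequent, the longer ">> " is chosen.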

Peter McCluskey          | Critmail ( | Accept nothing less to archive your mailing list
Received on Sat 18 Sep 1999 12:14:20 AM GMT
