HTML Filter

From: Byron C. Darrah <bdarr_at_sse.FU.HAC.COM_at_hypermail-project.org>
Date: Mon, 30 Nov 1998 14:42:21 -0800 (PST)
Message-Id: <199811302242.OAA20335_at_pepperoni.pizza.hac.com>

I just released a small change to the file below. I added a bunch of tags to the config file. The HTML filter now understands 50 HTML tags, which should enable it to work pretty well on just about any HTML input.

--Byron



Date: Mon, 30 Nov 1998 11:12:52 -0800 (PST)
From: "Byron C. Darrah" <bdarr_at_sed.hac.com>

Alright, I put together a little something that I think will make a good start for an HTML filter. You can download it from:

     http://www.cs.ucla.edu/~darrah/html_filter.tgz

Here's a little description of how it works:

  1. Comments, SGML commands, and unrecognized HTML tags are removed.
  2. Unmatched close tags are removed.
  3. The list of recognized tags is configurable, in a header file called filter_config.h.
  4. Recognized tags can be suppressed, i.e., removed.
  5. Recognized tags which are containers can cause all contained text to be suppressed.
  6. Close tags are generated for unclosed containers.
  7. In the case of 2 or 6, a comment is emitted into the output, denoting the problem.
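
To make the seven rules above concrete, here is a minimal sketch in C. It is not the code from html_filter.tgz: the rule table, the names (tag_rule_t, find_rule, emit, html_filter), the wording of the emitted comments, and the fixed-size output buffer are all assumptions made for illustration. In particular, the table only stands in for whatever filter_config.h actually contains.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 32

typedef enum { KEEP, DROP_TAG, DROP_CONTENTS } action_t;
typedef struct { const char *name; int container; action_t action; } tag_rule_t;

/* Hypothetical stand-in for the table in filter_config.h (rule 3). */
static const tag_rule_t rules[] = {
    { "b", 1, KEEP },        { "i", 1, KEEP },
    { "p", 0, KEEP },        { "br", 0, KEEP },
    { "font", 1, DROP_TAG }, { "script", 1, DROP_CONTENTS },
};

static const tag_rule_t *find_rule(const char *name) {
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (strcmp(rules[i].name, name) == 0) return &rules[i];
    return NULL;
}

/* Bounded append: copies only what fits and keeps out NUL-terminated. */
static void emit(char *out, size_t cap, const char *s, size_t n) {
    size_t used = strlen(out);
    if (n > cap - 1 - used) n = cap - 1 - used;
    memcpy(out + used, s, n);
    out[used + n] = '\0';
}

/* Rules 6 and 7: generate a close tag and flag it with a comment. */
static void close_generated(char *out, size_t cap, const char *name) {
    char note[64];
    snprintf(note, sizeof note, "</%s><!-- added close tag -->", name);
    emit(out, cap, note, strlen(note));
}

void html_filter(const char *in, char *out, size_t cap) {
    const tag_rule_t *stack[MAX_DEPTH];  /* open containers that were kept */
    const tag_rule_t *hidden = NULL;     /* set inside a DROP_CONTENTS container */
    int depth = 0;
    char note[64];

    assert(out != NULL && cap > 0);
    out[0] = '\0';
    while (*in) {
        if (*in != '<') {                        /* plain text */
            if (!hidden) emit(out, cap, in, 1);
            in++;
            continue;
        }
        if (strncmp(in, "<!--", 4) == 0) {       /* rule 1: comments */
            const char *e = strstr(in + 4, "-->");
            in = e ? e + 3 : in + strlen(in);
            continue;
        }
        const char *tag = in, *end = strchr(in, '>');
        if (!end) break;                         /* unterminated tag: drop the rest */
        const char *p = tag + 1;
        int closing = (*p == '/');
        if (closing) p++;
        char name[16];
        size_t n = 0;
        while (p < end && isalnum((unsigned char)*p) && n < sizeof name - 1)
            name[n++] = (char)tolower((unsigned char)*p++);
        name[n] = '\0';                          /* empty for SGML commands like <!DOCTYPE> */
        in = end + 1;

        const tag_rule_t *r = find_rule(name);
        if (hidden) {                            /* rule 5: skip until the matching close */
            if (closing && r == hidden) hidden = NULL;
            continue;
        }
        if (!r || r->action == DROP_TAG) continue;   /* rules 1 and 4 */
        if (r->action == DROP_CONTENTS) { if (!closing) hidden = r; continue; }
        if (!closing) {
            if (r->container) {
                if (depth == MAX_DEPTH) continue;    /* too deep to track: drop the tag */
                stack[depth++] = r;
            }
            emit(out, cap, tag, (size_t)(end - tag) + 1);
            continue;
        }
        int k = depth - 1;
        while (k >= 0 && stack[k] != r) k--;
        if (k < 0) {                             /* rules 2 and 7: unmatched close tag */
            snprintf(note, sizeof note, "<!-- unmatched </%s> removed -->", name);
            emit(out, cap, note, strlen(note));
            continue;
        }
        while (depth - 1 > k) close_generated(out, cap, stack[--depth]->name);
        depth = k;
        emit(out, cap, tag, (size_t)(end - tag) + 1);
    }
    while (depth > 0) close_generated(out, cap, stack[--depth]->name);   /* rule 6 at end of input */
}
```

With this table, calling html_filter on `<b>Hi <blink>there</i>` drops the unknown blink tag, replaces the stray `</i>` with a comment, and appends a generated `</b>` followed by its own comment. The sketch truncates output that does not fit in the caller's buffer, which is only a placeholder for a growable string.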

The current version has a very small list of recognized tags. We need to expand that.

In order to guarantee no buffer overflows, the html_filter uses the dynamic_strings_t module that I offered to Kent (by way of this mailing list) a while back. So I think the current unreleased Landfield beta version of hypermail probably already has this module in it.
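
For readers unfamiliar with the idea, the following is a rough sketch of how a growable string avoids overflows: every append checks the remaining capacity and reallocates before copying. It does not reproduce the dynamic_strings_t interface, which is not shown in this message. The names dstr_t, dstr_append, and dstr_free are hypothetical, as are the starting capacity and the doubling policy.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable string; zero-initialize before first use. */
typedef struct {
    char  *data;  /* NUL-terminated once non-NULL */
    size_t len;   /* bytes used, excluding the terminator */
    size_t cap;   /* bytes allocated */
} dstr_t;

/* Append n bytes of s, growing the buffer first if needed.
 * Returns 0 on success, -1 on failure (the string is left unchanged). */
int dstr_append(dstr_t *d, const char *s, size_t n) {
    assert(d != NULL && s != NULL);
    if (n > SIZE_MAX - d->len - 1) return -1;   /* size arithmetic would wrap */
    size_t need = d->len + n + 1;               /* +1 for the terminator */
    if (need > d->cap) {
        size_t cap = d->cap ? d->cap : 16;
        while (cap < need) {
            if (cap > SIZE_MAX / 2) { cap = need; break; }
            cap *= 2;                           /* doubling keeps appends cheap on average */
        }
        char *p = realloc(d->data, cap);
        if (!p) return -1;
        d->data = p;
        d->cap = cap;
    }
    memcpy(d->data + d->len, s, n);
    d->len += n;
    d->data[d->len] = '\0';
    return 0;
}

void dstr_free(dstr_t *d) {
    free(d->data);
    d->data = NULL;
    d->len = d->cap = 0;
}
```

A caller would start from `dstr_t d = {0};`, append each piece of filtered output with dstr_append, and release the buffer with dstr_free when done.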

If you want to integrate this filter with a version of hypermail (or other program) that uses different code for handling arbitrary length strings, you may want to either change html_filter or hypermail so that they use the same code for this.

This is my first cut at this, so there may be bugs :-).

--Byron Darrah
Received on Tue 01 Dec 1998 12:45:55 AM GMT

This archive was generated by hypermail 2.2.0 : Thu 22 Feb 2007 07:33:50 PM GMT