[hypermail] XHTML commit done

From: Jose Kahan <jose.kahan_at_w3.org_at_hypermail-project.org>
Date: Thu, 10 Apr 2003 18:58:40 +0200
Message-ID: <20030410165840.GC25347_at_inrialpes.fr>


Hi folks,

I just committed my changes to upgrade hypermail to XHTML 1.0 Strict. This is part of my work on adding WAI enhancements to hypermail. This first commit only concerns the upgrade to XHTML. The next commits will be for the WAI enhancements once they're ready.

Before doing the commit, I added the tag "Before_XHTML" to the source code so that, in case of problems, we can more easily find out what went wrong.

I made three kinds of changes. Users who have used HTML templates to customize their archives should upgrade them to the XHTML syntax in order to have valid documents. Here's a list of the most common changes I made:

  1. Well-formed documents (respecting the XML syntax):
    • All elements need to have an end tag. <li>something is now <li>something</li>
    • Single elements like <br> become <br />
    • All attributes need to have a value. If there was no value before, they take the name of the attribute. <option selected> is now <option selected="selected">
    • The -- sequence is forbidden inside a comment.
  2. Valid XHTML documents (according to the strict DTD):
    • <ul><li><ul><li>something</ul></ul> has become <ul><li><ul><li>something</li></ul></li></ul>
    • It's invalid to have an empty <ul>. <ul></ul> has become <ul><li style="display: none"></li></ul>
    • The <u> (underline) tag has been deprecated. We were only using it in tables. I removed it.
  3. Charset problems (these apply to XHTML, HTML, and XML alike):

Many mail clients declare one charset (often ISO-8859-1) but include characters from another charset (often WinLatin1) in the message body. (Note that there is a way to combine charsets in the headers, which we already take care of.) I noticed this problem while validating the XHTML changes. To get this working, I added some code so that WinLatin1 characters are encoded as the corresponding Unicode entities.

For example, in an ISO-8859-1 document, the 0x80 character is invalid. I assume that it belongs to WinLatin1 and convert it to &#x20AC; (the euro sign), which is the equivalent entity. This character can now happily live inside an ISO-8859-1 document.
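The conversion described above can be sketched roughly as follows. This is only an illustration in Python, not hypermail's actual C code: the function name `winlatin1_to_entities` is hypothetical, Python's `cp1252` codec stands in for WinLatin1, and replacing the few bytes that WinLatin1 leaves undefined with `?` is an assumption made here. Escaping of `&`, `<`, and `>` is assumed to happen elsewhere.

```python
def winlatin1_to_entities(raw: bytes) -> str:
    """Decode bytes declared as ISO-8859-1, turning the 0x80-0x9F range
    (invalid in ISO-8859-1 HTML, but printable in WinLatin1) into
    numeric character references."""
    out = []
    for byte in raw:
        if 0x80 <= byte <= 0x9F:
            try:
                # Assume the byte was really meant as WinLatin1 (cp1252).
                ch = bytes([byte]).decode("cp1252")
            except UnicodeDecodeError:
                # A handful of bytes (e.g. 0x81) are undefined in cp1252.
                out.append("?")
                continue
            out.append(f"&#x{ord(ch):04X};")
        else:
            # Every other byte maps directly to the same code point.
            out.append(bytes([byte]).decode("iso-8859-1"))
    return "".join(out)


print(winlatin1_to_entities(b"Price: 100 \x80"))  # Price: 100 &#x20AC;
```

Because the result contains only character references for the problematic range, it can be written into a page that still declares ISO-8859-1.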

In order to achieve the above, I modified the API of some functions so that we can pass the value of the charset and do the conversion when needed.

When a message has no charset, I assumed it was ISO-8859-1.

What won't work yet is an archive whose messages belong to different charsets. Let's suppose that each subject is written with a different charset. We can no longer say that the subject.html index belongs to ISO-8859-1 or to any other charset without transcoding the characters. I only added transcoding for WinLatin1. A longer-term solution would be to move to UTF-8, which solves those problems.
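As a rough illustration of the longer-term UTF-8 idea (this is not something the commit does), each subject could be decoded with its own declared charset and re-encoded as UTF-8 before it goes into a shared index. The function name `subject_to_utf8` is hypothetical, and falling back to ISO-8859-1 for unknown charset names is an assumption that mirrors the default used for messages with no charset.

```python
from typing import Optional


def subject_to_utf8(raw: bytes, charset: Optional[str]) -> bytes:
    """Transcode one subject line to UTF-8 so that subjects from
    different charsets can share a single index page."""
    try:
        # No declared charset: assume ISO-8859-1, as for whole messages.
        text = raw.decode(charset or "iso-8859-1", errors="replace")
    except LookupError:
        # Unknown charset name: fall back to ISO-8859-1 (an assumption).
        text = raw.decode("iso-8859-1")
    return text.encode("utf-8")


# Two subjects in different charsets end up in one consistent encoding.
subjects = [
    (b"caf\xe9", None),
    ("\u043f\u0440\u0438\u0432\u0435\u0442".encode("koi8-r"), "koi8-r"),
]
index_lines = [subject_to_utf8(raw, cs) for raw, cs in subjects]
```

Once every line is UTF-8, the index page can declare a single charset regardless of where each message came from.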

The only drawback of having moved to XHTML is that when we parse a generated XHTML message that mixes charsets in the wrong way, the parser, if it's a conforming XML parser, will complain about invalid characters. HTML parsers should complain too, but browsers have lots of fallbacks to hide this error from users.
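The strictness of an XML parser is easy to reproduce. The sketch below uses Python's standard `xml.etree.ElementTree` as a stand-in for "a conforming XML parser". The sample documents are made up for illustration: both declare UTF-8, but the second one contains a raw ISO-8859-1 byte (0xE9), which is one way of mixing charsets wrongly.

```python
import xml.etree.ElementTree as ET


def is_well_formed(document: bytes) -> bool:
    """Return True if an XML parser accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


# Correctly encoded: the e-acute is stored as UTF-8, as declared.
good = '<?xml version="1.0" encoding="UTF-8"?><p>caf\u00e9</p>'.encode("utf-8")

# Mixed charsets: declared UTF-8, but the body holds an ISO-8859-1 byte.
bad = b'<?xml version="1.0" encoding="UTF-8"?><p>caf\xe9</p>'

print(is_well_formed(good))  # True
print(is_well_formed(bad))   # False
```

A browser handling the same bytes as text/html would typically display something rather than stop, which is why the problem tends to go unnoticed there.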

All in all, this shouldn't affect users of this XHTMLized hypermail, as it's backwards compatible with HTML browsers. As long as your documents are served with the text/html MIME type, things will work as usual for you. There are some workarounds I can add if more problems arise. For example, we can suppress the XML prologue and just handle the document as an HTML one. Let's wait and see how this turns out.


Some links:

    In particular, look at the "differences with HTML4" section:     http://www.w3.org/TR/xhtml1/#diffs

    I found it quite nifty and useful during the conversion.

Hope this is helpful! Send in your bug reports and comments.

-jose

Received on Thu 10 Apr 2003 07:05:42 PM GMT

This archive was generated by hypermail 2.2.0 : Thu 22 Feb 2007 07:33:54 PM GMT