
Re: [idn] URL encoding in html page



Martin Duerst <duerst@w3.org> wrote:

> HTML defines href or source as CDATA, which means just about 'anything
> goes'.

href and src are defined in the DTD as %URI;, which in turn is defined as
CDATA, but section 6.4 of the HTML spec says:

    This specification uses the term URI as defined in [URI]

    URIs are represented in the DTD by the parameter entity %URI;.

[URI] links to RFC-2396.  So the href and src values are indeed
constrained by RFC-2396, even though this constraint cannot be expressed
in the DTD.

> > Either the web designer uses an HTML editor, in which case the
> > editor should know the HTML syntax rules and convert to/from the
> > local charset as needed,
>
> I don't know exactly what you mean by 'convert to/from'.

I mean that HTML editors should know that only a certain set of ASCII
characters are allowed in the href and src attributes, and if the user
tries to put other characters in there, the HTML editor should convert
the string to something that conforms to the HTML spec (if it knows how
to do such a conversion without changing the intended meaning) or at
least it should alert the user that they are violating the HTML spec.

Wouldn't you agree that an HTML editor that puts non-ASCII characters
into href and src attributes, in plain violation of the HTML spec,
without even alerting the user, is not quite living up to the name "HTML
editor"?  Because the thing it's editing isn't really HTML anymore.

> You seem to imply that the HTML editor is checking or should check
> the URI syntax, and that they could be upgraded to add a legacy
> encoding->Unicode->ACE conversion.

Regardless of whether today's HTML editors already check URI syntax,
they could be upgraded to convert IRIs input by the user into URIs in
the document, and that conversion could involve ToASCII.
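
As a rough illustration of what such a conversion might look like, here
is a minimal Python sketch.  It is not a definitive algorithm: the
function names are made up for this example, Python's built-in "idna"
codec stands in for ToASCII, and userinfo in the authority is ignored.

```python
from urllib.parse import urlsplit, urlunsplit


def escape_non_ascii(text: str) -> str:
    """Replace each non-ASCII character with %HH escapes of its UTF-8 bytes."""
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


def iri_to_uri(iri: str) -> str:
    """Hypothetical editor helper: turn user input into an ASCII-only URI.

    Assumption: the input uses generic syntax (scheme://host/path...).
    The host goes through the "idna" codec (a stand-in for ToASCII);
    every other component gets UTF-8 %HH escaping.
    """
    parts = urlsplit(iri)
    host = parts.hostname or ""
    netloc = host.encode("idna").decode("ascii") if host else ""
    if parts.port is not None:
        netloc += f":{parts.port}"
    return urlunsplit((
        parts.scheme,
        netloc,
        escape_non_ascii(parts.path),
        escape_non_ascii(parts.query),
        escape_non_ascii(parts.fragment),
    ))
```

The point of splitting first is that the host and the rest of the URI
need different treatment: an ACE form for the host, %HH escapes
elsewhere.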

> If you enter hppt:something, should the editor tell you this is an
> error?

No, because it's not an error.  A warning might be helpful, or might be
annoying, just like spell-checking.

> Similarly, even if the HTML editor does the very limited checks that
> RFC 2396 allows, this doesn't get it very far. For example, while
> http: uses generic URI syntax, mailto: uses opaque syntax, so there is
> no general way to know where in an URI there is a domain name. This
> includes the case that a domain name is sent as a parameter.

True, the editor can't be expected to know the syntax for every URI
scheme.  For the ones it does know, it can fully check the syntax.  For
any URI that begins with / or scheme:/, it can check the generic URI
syntax (which is quite detailed).  For opaque URIs with unknown schemes,
it can at least check for characters disallowed in all URIs.
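
A minimal Python sketch of that tiered approach, under some loud
assumptions: the set of "known" schemes and all names below are
hypothetical, and only the weakest tier (characters disallowed in every
URI, per my reading of RFC-2396's excluded characters) is actually
implemented.

```python
import re

# Hypothetical: schemes whose full syntax this editor claims to know.
KNOWN_SCHEMES = {"http", "ftp", "mailto", "news", "telnet"}

_SCHEME = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")

# Control characters, space, non-ASCII, and the "delims" and "unwise"
# characters that may not appear unescaped anywhere in a URI reference.
_DISALLOWED = re.compile(r'[^\x21-\x7E]|[<>"{}|\\^\[\]`]')

# A "%" that is not followed by two hex digits is a malformed escape.
_BAD_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def pick_check(uri: str) -> str:
    """Decide which level of checking applies to this URI."""
    m = _SCHEME.match(uri)
    if m and m.group(1).lower() in KNOWN_SCHEMES:
        return "scheme-specific"
    rest = uri[m.end():] if m else uri
    if rest.startswith("/"):
        return "generic"          # begins with / or scheme:/
    return "characters-only"      # opaque URI, unknown scheme


def find_disallowed(uri: str) -> list:
    """Return the characters (or bad escapes) not allowed in any URI."""
    problems = [m.group() for m in _DISALLOWED.finditer(uri)]
    problems += ["%" for _ in _BAD_ESCAPE.finditer(uri)]
    return problems
```

An editor could run find_disallowed on everything and then add the
stricter checks only where pick_check says they apply.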

I think that gets you pretty far in practice.  The vast majority of
opaque URIs could also be handled if the editor knows just a few schemes
(mailto, and maybe news and telnet).

> HTML 4 already says what a browser should do if it finds a
> non-ASCII character in an URI.  Please see
> http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
> Appendix B.2.1: Non-ASCII characters in URI attribute values
> (this is supported by all major browsers in newer versions)

It says:

    Although URIs do not contain non-ASCII values (see [URI], section   
    2.1) authors sometimes specify them in attribute values expecting   
    URIs (i.e., defined with %URI; in the DTD).  For instance, the      
    following href value is illegal:                                    
                                                                                
    <A href="http://foo.org/Håkon">...</A>

    We recommend that user agents adopt the following convention for
    handling non-ASCII characters in such cases:

     1. Represent each character in UTF-8 (see [RFC2279]) as one or more
        bytes.

     2. Escape these bytes with the URI escaping mechanism (i.e.,       
        by converting each byte to %HH, where HH is the hexadecimal     
        notation of the byte value).                                    

    This procedure results in a syntactically legal URI (as defined in
    [RFC1738], section 2.2 or [RFC2141], section 2) that is independent
    of the character encoding to which the HTML document carrying the
    URI may have been transcoded.
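
The two quoted steps can be sketched in a few lines of Python.  This is
only an illustration; the function name is invented here and does not
come from the HTML spec.

```python
def escape_non_ascii(uri_text: str) -> str:
    """Apply the B.2.1 convention to an attribute value.

    Step 1: represent each non-ASCII character as UTF-8 bytes.
    Step 2: escape each of those bytes as %HH.
    ASCII characters are passed through unchanged.
    """
    out = []
    for ch in uri_text:
        if ord(ch) < 128:
            out.append(ch)
        else:
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)


print(escape_non_ascii("http://foo.org/Håkon"))
# prints http://foo.org/H%C3%A5kon
```

Note that the sketch, like the quoted procedure, treats every part of
the value the same way, whether it is in the path or in the host.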

RFC-1738 section 2.2 seems to say that %HH escapes are allowed in the
host part of a URI, but the BNF in section 5 clearly prohibits %HH
escapes in the host part.  And RFC-2396, which is a normative reference
of the HTML spec and which updates RFC-1738, is unambiguous in its
prohibition of %HH escapes in the host part.
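
To make the conflict concrete, here is a rough Python transcription of
the RFC-2396 host grammar (hostname or IPv4address) as a regular
expression.  The names are mine, and the regex is my reading of the
BNF, so treat it as a sketch rather than a reference implementation.

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum
_DOMAINLABEL = r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# toplabel    = alpha | alpha *( alphanum | "-" ) alphanum
_TOPLABEL = r"[A-Za-z](?:[A-Za-z0-9-]*[A-Za-z0-9])?"
# hostname    = *( domainlabel "." ) toplabel [ "." ]
_HOSTNAME = rf"(?:{_DOMAINLABEL}\.)*{_TOPLABEL}\.?"
# IPv4address = 1*digit "." 1*digit "." 1*digit "." 1*digit
_IPV4 = r"[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+"

_HOST = re.compile(rf"(?:{_HOSTNAME}|{_IPV4})")


def is_rfc2396_host(host: str) -> bool:
    """True if the string matches the RFC-2396 'host' production."""
    return _HOST.fullmatch(host) is not None


print(is_rfc2396_host("foo.org"))         # prints True
print(is_rfc2396_host("h%C3%A5kon.org"))  # prints False
```

There is simply no place for "%" in the host production, so a host
produced by the B.2.1 escaping cannot match it.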

Therefore, it is not true that "this procedure results in a
syntactically legal URI".  I suggest that the procedure be amended so
that it does result in a syntactically legal URI.  That would not be
a serious change in the HTML spec, because the procedure doesn't even
apply to valid HTML; it applies only to invalid HTML.

AMC