
Re: [idn] URL encoding in html page



At 23:40 02/03/26 +0000, Adam M. Costello wrote:
>"David Leung (Neteka Inc.)" <david@neteka.com> wrote:
>
> > If we use ACE as the URL links then all web designers in the world
> > needs to be retrained for ACE conversion.
>
>Either the web designer uses an HTML editor, in which case the editor
>should know the HTML syntax rules and convert to/from the local charset
>as needed,

I don't know exactly what you mean by 'convert to/from'. Many
HTML editors these days work in Unicode internally, and convert
to whatever encoding the author chooses on input/output.
This conversion is independent of HTML syntax, except for
the use (or not) of numeric character references (&#xHHHH;,...),
and the <meta> charset information.
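To illustrate the point about output encodings, here is a minimal Python sketch. The sample string and variable names are assumptions chosen for illustration, not taken from any particular editor. It shows the same internal Unicode text written out either directly in UTF-8 or in ASCII with numeric character references.

```python
# Internal representation: Unicode text (U+00FC is u-umlaut).
text = "D\u00fcrst"

# Output in UTF-8: the character is written directly as bytes.
utf8_bytes = text.encode("utf-8")

# Output in ASCII: characters outside the charset become decimal
# numeric character references.
ascii_bytes = text.encode("ascii", errors="xmlcharrefreplace")

# The hexadecimal form (&#xHHHH;) built by hand.
hex_ncr = "".join(c if ord(c) < 128 else f"&#x{ord(c):X};" for c in text)

print(utf8_bytes)
print(ascii_bytes)
print(hex_ncr)
```

None of this requires the editor to understand what an attribute value means; it is purely a character-level conversion.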

Some HTML editors may work in a local encoding throughout,
and then they don't do any conversion.

You seem to imply that the HTML editor is checking or should check
the URI syntax, and that it could be upgraded to add a legacy
encoding->Unicode->ACE conversion.
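For concreteness, the conversion chain itself can be sketched in Python as below. Two assumptions are worth stating loudly: the domain name is made up, and the ACE used is the Punycode-based form that Python's built-in "idna" codec implements (RFC 3490), which may not be the ACE under discussion in this thread.

```python
# Hypothetical domain name as bytes in a legacy encoding (ISO-8859-1).
legacy = b"b\xfccher.example"

# Step 1: legacy encoding -> Unicode.
unicode_name = legacy.decode("iso-8859-1")

# Step 2: Unicode -> ACE, using Python's built-in IDNA (RFC 3490) codec.
ace = unicode_name.encode("idna")

print(unicode_name)
print(ace)
```

The conversion is easy once you have the domain name in hand; the difficulty is knowing which part of an attribute value is a domain name.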

First, I'm not aware of HTML editors actually doing such syntax
checks. Second, there is the very fundamental problem that they
won't be able to do such checks. If you enter hppt:something,
should the editor tell you this is an error? It has no idea
whether hppt is a legal URI scheme or not. Similarly, even if
the HTML editor does the very limited checks that RFC 2396
allows, this doesn't get it very far. For example, while
http: uses generic URI syntax, mailto: uses opaque syntax,
so there is no general way to know where in a URI there is
a domain name. This includes the case where a domain name
is sent as a parameter.
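A rough Python sketch of the problem, using the standard urllib.parse module. The helper find_host and the example URIs are assumptions made up for illustration. A generic parser can report a host only where the generic syntax exposes one.

```python
from urllib.parse import urlsplit

def find_host(uri):
    """Return the host that generic URI parsing exposes, or None."""
    return urlsplit(uri).hostname

examples = [
    "http://www.example.org/path",        # generic syntax: host is visible
    "mailto:user@example.org",            # opaque: domain is inside the path
    "hppt:something",                     # unknown scheme, parses without error
    "http://example.org/r?to=other.example.net",  # second domain in a parameter
]

for uri in examples:
    print(uri, "->", find_host(uri))
```

The parser reports no host for the mailto: URI, raises no complaint about hppt:, and has no way to tell that the query string in the last example carries another domain name.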

>or the web designer uses a text editor, in which case the web
>designer is taking responsibility for knowing and obeying the HTML/URI
>syntax rules, one of which is that href and src attributes contain only
>ASCII characters.
>
>Maybe future HTML/URI specs will allow non-ASCII characters in href and
>src attributes, but it's not obvious how to do that without breaking
>deployed browsers, and that discussion is for another forum.

HTML 4 already says what a browser should do if it finds a
non-ASCII character in a URI. Please see
http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
Appendix B.2.1: Non-ASCII characters in URI attribute values
(this is supported by all major browsers in newer versions)
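The procedure recommended there (represent each non-ASCII character in UTF-8, then escape each resulting byte as %HH) can be sketched in Python as follows. The function name escape_non_ascii is an assumption for illustration, and this is only a sketch of the idea, not what any particular browser does.

```python
from urllib.parse import quote

def escape_non_ascii(uri_value):
    """Escape each non-ASCII character as the %HH form of its UTF-8 bytes.

    ASCII characters, including existing %HH escapes, are left untouched.
    """
    return "".join(
        c if ord(c) < 128 else quote(c, encoding="utf-8")
        for c in uri_value
    )

print(escape_non_ascii("http://www.w3.org/People/D\u00fcrst"))
```

Applied to the address in my signature, this yields the escaped form shown there.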

For further information, e.g. on other W3C specs, see also:
http://www.w3.org/International/O-URL-and-ident.html

Of course this doesn't solve the problem of how to handle
IDNs in URIs automatically, but it provides a clear direction.


Regards,    Martin.


#-#-#  Martin J. Du"rst, I18N Activity Lead, World Wide Web Consortium
#-#-#  mailto:duerst@w3.org   http://www.w3.org/People/D%C3%BCrst