
Re: [idn] URL encoding in html page



Hello Adam,

At 23:52 02/03/28 +0000, Adam M. Costello wrote:
>Martin Duerst <duerst@w3.org> wrote:

> > You seem to imply that the HTML editor is checking or should check
> > the URI syntax, and that they could be upgraded to add a legacy
> > encoding->Unicode->ACE conversion.
>
>Regardless of whether today's HTML editors already check URI syntax,
>they could be upgraded to convert IRIs input by the user into URIs in
>the document, and that conversion could involve ToASCII.

In some cases, that might work, but in the general case, it won't.


> > If you enter hppt:something, should the editor tell you this is an
> > error?
>
>No, because it's not an error.  A warning might be helpful, or might be
>annoying, just like spell-checking.

So if you enter hppt:KANJI.KANJI.jp, what should the HTML editor do?
How does it know whether KANJI.KANJI.jp is a domain name or not?
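
To illustrate the problem, here is a minimal Python sketch of what a
generic parser can see. The helper name find_host is hypothetical, and
urllib.parse only stands in for whatever parser an editor might use.
Only the hierarchical form (scheme://host/...) exposes a host; the
mistyped hppt: example and the opaque mailto: URI do not:

```python
from urllib.parse import urlsplit

def find_host(uri):
    """Return the host that generic URI syntax exposes, or None if none is found."""
    # hostname is only populated when the URI has a "//authority" part.
    return urlsplit(uri).hostname

# Hierarchical URI: the host sits in a well-defined place.
print(find_host("http://www.example.jp/index.html"))

# Unknown scheme without "//": the rest is just an opaque string.
print(find_host("hppt:KANJI.KANJI.jp"))

# mailto: is opaque too, although it clearly contains a domain name.
print(find_host("mailto:user@example.jp"))
```

Only the first call returns a host; the other two return None.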


> > Similarly, even if the HTML editor does the very limited checks that
> > RFC 2396 allows, this doesn't get it very far. For example, while
> > http: uses generic URI syntax, mailto: uses opaque syntax, so there is
> > no general way to know where in an URI there is a domain name. This
> > includes the case that a domain name is sent as a parameter.
>
>True, the editor can't be expected to know the syntax for every URI
>scheme.  For the ones it does know, it can fully check the syntax.  For
>any URI that begins with / or scheme:/, it can check the generic URI
>syntax (which is quite detailed).

This would mean that the URI syntax cannot be updated in any way.
There are some advantages in careful checking at some places,
and there are some disadvantages. For URI syntax, the general
tendency is to not try to check too much. The generic syntax
is there to allow things like relative URI resolution,...


>For opaque URIs with unknown schemes,
>it can at least check for characters disallowed in all URIs.
>
>I think that gets you pretty far in practice.  The vast majority of
>opaque URIs could also be handled if the editor knows just a few schemes
>(mailto, and maybe news and telnet).

In practice, an HTML editor could indeed do such things.
And for domain names, it may indeed be reasonable to
have the editor use ToASCII when possible, at least for
a certain time.
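
As a rough sketch of what such an editor could do, assuming Python and
its built-in "idna" codec (which implements the RFC 3490 ToASCII
operation) as a stand-in: the function name iri_host_to_ascii is
hypothetical, and the code only handles hosts that the generic syntax
exposes. Anything else is passed through unchanged, which is exactly
the limitation discussed above.

```python
from urllib.parse import urlsplit, urlunsplit

def iri_host_to_ascii(iri):
    """Apply ToASCII to the host of a hierarchical IRI; leave everything else alone."""
    parts = urlsplit(iri)
    host = parts.hostname
    if host is None or host.isascii():
        # No host visible to the generic syntax, or nothing to convert.
        return iri
    # The "idna" codec applies ToASCII to each dot-separated label.
    ace = host.encode("idna").decode("ascii")
    netloc = ace
    if parts.port is not None:
        netloc += ":" + str(parts.port)
    if parts.username is not None:
        userinfo = parts.username
        if parts.password is not None:
            userinfo += ":" + parts.password
        netloc = userinfo + "@" + netloc
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))

print(iri_host_to_ascii("http://b\u00fccher.example/a?b=c"))
print(iri_host_to_ascii("mailto:user@b\u00fccher.example"))
```

The http: host is converted to its xn-- form, while the mailto: address
is returned untouched because the sketch cannot locate a domain name in
an opaque URI.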


> > HTML 4 already says what a browser should do if it finds a
> > non-ASCII character in an URI.  Please see
> > http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1
> > Appendix B.2.1: Non-ASCII characters in URI attribute values
> > (this is supported by all major browsers in newer versions)
>
>It says:

...

>     This procedure results in a syntactically legal URI (as defined in
>     [RFC1738], section 2.2 or [RFC2141], section 2) that is independent
>     of the character encoding to which the HTML document carrying the
>     URI may have been transcoded.

>And RFC-2396, which is a normative reference
>of the HTML spec and which updates RFC-1738, is unambiguous in its
>prohibition of %HH escapes in the host part.
>
>Therefore, it is not true that "this procedure results in a
>syntactically legal URI".

Well, indeed this sentence is a bit too general. For example,
a space in a src attribute would not lead to a legal URI.
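
For reference, the procedure in Appendix B.2.1 amounts to roughly the
following Python sketch (the function name escape_non_ascii is
hypothetical). Each non-ASCII character is represented in UTF-8 and
each resulting byte is written as %HH; ASCII characters, including a
space, are left as they are:

```python
def escape_non_ascii(uri_attr):
    """Escape non-ASCII characters as %HH of their UTF-8 bytes; keep ASCII as is."""
    out = []
    for ch in uri_attr:
        if ord(ch) < 128:
            # ASCII characters are not touched, even ones illegal in a URI.
            out.append(ch)
        else:
            out.append("".join("%{:02X}".format(b) for b in ch.encode("utf-8")))
    return "".join(out)

print(escape_non_ascii("http://b\u00fccher.example/"))
print(escape_non_ascii("my picture.png"))
```

The first result carries %HH escapes in the host part, which is the case
objected to above; the second still contains a space.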

>I suggest that the procedure be amended so
>that it does result in a syntactically legal URI.

How do you want to do that? Just saying 'convert host names
with ToASCII' doesn't do the job.

Regards,    Martin.