
[idn] Re: IURI questions



(I'm not sure whether this reached the idn list on Thu, 02 Mar 2000, so I am
reposting it; apologies if you've seen it already.)

Aaron Irvine wrote:

> > > > * hex-encoded characters in URLs.  I just tried surfing to
> > > > www.%79%61%68%6f%6f.com, and on IE5, it takes me to www.yahoo.com, but
> > > > Netscape Navigator 4.6 can't find the server.
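
For illustration, the decoding that IE5 appears to apply to the host name can be sketched as follows. This is only a guess at the observed behaviour, not a statement of what any browser actually implements; the helper name `unescape_host` is hypothetical.

```python
from urllib.parse import unquote

def unescape_host(host: str) -> str:
    """Decode %hh escapes in a host name as UTF-8, then lowercase it.

    Hypothetical helper: it mirrors the behaviour observed in IE5 above,
    not a documented browser algorithm.
    """
    return unquote(host, encoding="utf-8", errors="strict").lower()

print(unescape_host("www.%79%61%68%6f%6f.com"))  # prints "www.yahoo.com"
```

Whether a resolver should accept such a name at all is exactly the open question in this thread.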
> >
> > It's interesting that it works! The question is whether it should.
> >
> > Larry
> > --
> > http://larry.masinter.net
>
> Hi all,
>
> Yes I believe it should work.
>
> I think:
> that human-visible escaped Unicode (typed into browsers, read out in radio
> adverts, etc., maybe in hrefs too) should be consistent with URI-path escaped
> Unicode (i.e. %hh-escaped UTF-8),
> and that URI authorities like www.%79%61%68%6f%6f.com [works in IE5] and
> schemes like k%C3%A1va [RFC 2324] are IMHO the correct way to _present URIs_
> to end users;
> however, within the net we have to _encode URIs_, given:
> scheme        = alpha *( alpha | digit | "+" | "-" | "." ) ;[RFC 2396]
> domainlabel   = alphanum | alphanum *( alphanum | "-" ) alphanum  ;[RFC 2396]
> labels are limited to 63 'septets' each and a DNS name to 255 'septets',
> there is possibly a desire not to change the DNS infrastructure (immediately),
> and I also note that
> double hyphens and triple hyphens are allowed but rarely (never?) used in
> practice, hence free for our use...
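
The constraints above can be checked mechanically. The following is a minimal sketch that assumes the two RFC 2396 productions quoted above and the 63/255 length limits; the function names are hypothetical, and the stricter RFC 2396 `toplabel` rule (must start with a letter) is deliberately not enforced.

```python
import re

# domainlabel = alphanum | alphanum *( alphanum | "-" ) alphanum   [RFC 2396]
DOMAINLABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
# scheme = alpha *( alpha | digit | "+" | "-" | "." )              [RFC 2396]
SCHEME = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*")

MAX_LABEL = 63   # per-label limit quoted above
MAX_NAME = 255   # whole-name limit quoted above

def is_valid_label(label: str) -> bool:
    """True if the label fits the domainlabel grammar and the length limit."""
    return len(label) <= MAX_LABEL and DOMAINLABEL.fullmatch(label) is not None

def is_valid_hostname(name: str) -> bool:
    """True if every dot-separated label is valid and the name is short enough."""
    return len(name) <= MAX_NAME and all(
        is_valid_label(label) for label in name.split(".")
    )

# Interior runs of hyphens are legal, so hyphen-based encodings pass unchanged,
# while %hh escapes do not fit the grammar.
print(is_valid_hostname("C--59--3uf.fr"))   # True
print(is_valid_hostname("%C5%93uf.fr"))     # False
```

This is why a hyphen-based form can travel through existing infrastructure while the %hh form cannot: "%" is simply not in the label alphabet.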
>
> So at the very top of the stack, use %hh-escaped UTF-8.  But deeper, somehow
> use the hyphen to encode characters above ASCII.  One possibility I suggest
> here:
> * triple-hyphened UTF-5 for when a scheme/username/domainlabel contains one or
> more characters above Latin extended B
> * double-hyphened UTF-8 otherwise
> where:
> * triple-hyphened UTF-5 means convert to UTF-5, then insert "---" after the
> first letter
> * double-hyphened UTF-8 means convert %XY to "X--Y"
> * note that a bare (trailing) hyphen never occurs in these
> * in the unlikely event that the original contains -- (or ---), this is
> encoded as "----2" (or "----3")
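
The rules above can be sketched in a few lines. This is one possible reading, not a definitive implementation: it assumes UTF-5 writes each code point as hex digits with the first digit shifted into G..V, that "above Latin extended B" means a code point above U+024F, and that names are split into labels on "." and "@". All function names are hypothetical, and the "----2"/"----3" escape for literal hyphen runs is left out.

```python
import re

LATIN_EXTENDED_B_END = 0x024F    # last code point of the Latin Extended-B block
UTF5_LEAD = "GHIJKLMNOPQRSTUV"   # lead digits for first-nibble values 0..15

def utf5(text: str) -> str:
    """Assumed UTF-5: hex digits per code point, first digit shifted to G..V."""
    out = []
    for ch in text:
        digits = format(ord(ch), "X")
        out.append(UTF5_LEAD[int(digits[0], 16)] + digits[1:])
    return "".join(out)

def triple_hyphen_utf5(label: str) -> str:
    """Convert to UTF-5, then insert "---" after the first letter."""
    encoded = utf5(label)
    return encoded[0] + "---" + encoded[1:]

def double_hyphen_utf8(label: str) -> str:
    """Keep ASCII as-is; write each non-ASCII UTF-8 byte XY as "X--Y"."""
    out = []
    for ch in label:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            for byte in ch.encode("utf-8"):
                hh = format(byte, "02X")
                out.append(hh[0] + "--" + hh[1])
    return "".join(out)

def encode_label(label: str) -> str:
    """Pick the encoding for one label according to the rule above."""
    if all(ord(ch) < 0x80 for ch in label):
        return label                      # pure ASCII needs no encoding
    if any(ord(ch) > LATIN_EXTENDED_B_END for ch in label):
        return triple_hyphen_utf5(label)
    return double_hyphen_utf8(label)

def encode_name(name: str) -> str:
    """Encode label by label, keeping the "." and "@" separators."""
    return "".join(encode_label(part) for part in re.split(r"([.@])", name))

print(encode_name("\u0153uf.fr"))   # prints "C--59--3uf.fr"
```

Under these assumptions the sketch can be compared directly against the hand-worked examples that follow.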
>
> Examples:
>
> nihongo.jp
> M---5E5M72COA9E.jp       (is in triple-hyphened UTF-5; note translation done
> on per label basis)
>
> www.{alpha=\u3B1}{beta=\u3B2}.gr
> www.J---B1JB2.gr
>
> {oe=\u0153}uf.fr
> For universal typing: %C5%93uf.fr
> For the network itself: C--59--3uf.fr (rather than H---53N5M6.fr)
>
> feli{^c=\u0109}ulo
> For universal typing: feli%C4%89ulo (%66%65%6C%69%C4%89%75%6C%6F is also
> allowed)
> For the network itself: feliC--48--9ulo (rather than the longer
> M---6M5MCM9H09N5MCMF)
>
> ridanta-feli{^c=\u0109}ulo@{oe=\u0153}uf.fr
> ridanta-feliC--48--9ulo@C--59--3uf.fr
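
Going the other way, a receiver has to recognise the two forms and undo them. The sketch below is a rough illustration under stated assumptions: a label starting with a G..V letter followed by "---" is taken to be triple-hyphened UTF-5, any "X--Y" pair of uppercase hex digits is taken to be one UTF-8 byte, and the "----2"/"----3" escape is not handled. The function names are hypothetical.

```python
import re

def utf5_decode(text: str) -> str:
    """Decode assumed UTF-5: a G..V digit starts a new code point."""
    chars = []
    value = None
    for c in text:
        if "G" <= c <= "V":
            if value is not None:
                chars.append(chr(value))
            value = ord(c) - ord("G")          # lead digit carries 0..15
        else:
            value = value * 16 + int(c, 16)    # continuation hex digit
    if value is not None:
        chars.append(chr(value))
    return "".join(chars)

def double_hyphen_decode(label: str) -> str:
    """Turn every "X--Y" pair back into the byte 0xXY, then decode as UTF-8."""
    out = bytearray()
    pos = 0
    for m in re.finditer(r"([0-9A-F])--([0-9A-F])", label):
        out += label[pos:m.start()].encode("ascii")
        out.append(int(m.group(1) + m.group(2), 16))
        pos = m.end()
    out += label[pos:].encode("ascii")
    return out.decode("utf-8")

def decode_label(label: str) -> str:
    """Detect which form a label uses and decode it."""
    m = re.fullmatch(r"([G-V])---([0-9A-V]*)", label)
    if m:
        return utf5_decode(m.group(1) + m.group(2))
    return double_hyphen_decode(label)

def decode_name(name: str) -> str:
    """Decode label by label, keeping the "." and "@" separators."""
    return "".join(decode_label(part) for part in re.split(r"([.@])", name))

print(decode_name("C--59--3uf.fr") == "\u0153uf.fr")   # prints True
```

Note that detection relies purely on the hyphen patterns, so an existing ASCII name that already contains such a pattern would be misread; that is the case the "----2"/"----3" rule is meant to cover.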
>
> (BTW, will toplabel ever need Unicode?  If .store, .web, etc. arrive, then yes.)
> (BTW, rather than these two methods, could we just use double-hyphened UTF-5,
> or would this not be compact enough for Latin languages?)
>
> Comments welcome please.  Regards,
> Aaron Irvine
> (Belfast, Northern Ireland)
> --
>
> -----------------------------------------------------
> Aaron Irvine
>   mailto:airvine@corp.phone.com
> -----------------------------------------------------
