1964 lines
66 KiB
Plaintext
1964 lines
66 KiB
Plaintext
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Network Working Group A. Costello
|
|||
|
Request for Comments: 3492 Univ. of California, Berkeley
|
|||
|
Category: Standards Track March 2003
|
|||
|
|
|||
|
|
|||
|
Punycode: A Bootstring encoding of Unicode
|
|||
|
for Internationalized Domain Names in Applications (IDNA)
|
|||
|
|
|||
|
Status of this Memo
|
|||
|
|
|||
|
This document specifies an Internet standards track protocol for the
|
|||
|
Internet community, and requests discussion and suggestions for
|
|||
|
improvements. Please refer to the current edition of the "Internet
|
|||
|
Official Protocol Standards" (STD 1) for the standardization state
|
|||
|
and status of this protocol. Distribution of this memo is unlimited.
|
|||
|
|
|||
|
Copyright Notice
|
|||
|
|
|||
|
Copyright (C) The Internet Society (2003). All Rights Reserved.
|
|||
|
|
|||
|
Abstract
|
|||
|
|
|||
|
Punycode is a simple and efficient transfer encoding syntax designed
|
|||
|
for use with Internationalized Domain Names in Applications (IDNA).
|
|||
|
It uniquely and reversibly transforms a Unicode string into an ASCII
|
|||
|
string. ASCII characters in the Unicode string are represented
|
|||
|
literally, and non-ASCII characters are represented by ASCII
|
|||
|
characters that are allowed in host name labels (letters, digits, and
|
|||
|
hyphens). This document defines a general algorithm called
|
|||
|
Bootstring that allows a string of basic code points to uniquely
|
|||
|
represent any string of code points drawn from a larger set.
|
|||
|
Punycode is an instance of Bootstring that uses particular parameter
|
|||
|
values specified by this document, appropriate for IDNA.
|
|||
|
|
|||
|
Table of Contents
|
|||
|
|
|||
|
1. Introduction...............................................2
|
|||
|
1.1 Features..............................................2
|
|||
|
1.2 Interaction of protocol parts.........................3
|
|||
|
2. Terminology................................................3
|
|||
|
3. Bootstring description.....................................4
|
|||
|
3.1 Basic code point segregation..........................4
|
|||
|
3.2 Insertion unsort coding...............................4
|
|||
|
3.3 Generalized variable-length integers..................5
|
|||
|
3.4 Bias adaptation.......................................7
|
|||
|
4. Bootstring parameters......................................8
|
|||
|
5. Parameter values for Punycode..............................8
|
|||
|
6. Bootstring algorithms......................................9
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 1]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
6.1 Bias adaptation function.............................10
|
|||
|
6.2 Decoding procedure...................................11
|
|||
|
6.3 Encoding procedure...................................12
|
|||
|
6.4 Overflow handling....................................13
|
|||
|
7. Punycode examples.........................................14
|
|||
|
7.1 Sample strings.......................................14
|
|||
|
7.2 Decoding traces......................................17
|
|||
|
7.3 Encoding traces......................................19
|
|||
|
8. Security Considerations...................................20
|
|||
|
9. References................................................21
|
|||
|
9.1 Normative References.................................21
|
|||
|
9.2 Informative References...............................21
|
|||
|
A. Mixed-case annotation.....................................22
|
|||
|
B. Disclaimer and license....................................22
|
|||
|
C. Punycode sample implementation............................23
|
|||
|
Author's Address.............................................34
|
|||
|
Full Copyright Statement.....................................35
|
|||
|
|
|||
|
1. Introduction
|
|||
|
|
|||
|
[IDNA] describes an architecture for supporting internationalized
|
|||
|
domain names. Labels containing non-ASCII characters can be
|
|||
|
represented by ACE labels, which begin with a special ACE prefix and
|
|||
|
contain only ASCII characters. The remainder of the label after the
|
|||
|
prefix is a Punycode encoding of a Unicode string satisfying certain
|
|||
|
constraints. For the details of the prefix and constraints, see
|
|||
|
[IDNA] and [NAMEPREP].
|
|||
|
|
|||
|
Punycode is an instance of a more general algorithm called
|
|||
|
Bootstring, which allows strings composed from a small set of "basic"
|
|||
|
code points to uniquely represent any string of code points drawn
|
|||
|
from a larger set. Punycode is Bootstring with particular parameter
|
|||
|
values appropriate for IDNA.
|
|||
|
|
|||
|
1.1 Features
|
|||
|
|
|||
|
Bootstring has been designed to have the following features:
|
|||
|
|
|||
|
* Completeness: Every extended string (sequence of arbitrary code
|
|||
|
points) can be represented by a basic string (sequence of basic
|
|||
|
code points). Restrictions on what strings are allowed, and on
|
|||
|
length, can be imposed by higher layers.
|
|||
|
|
|||
|
* Uniqueness: There is at most one basic string that represents a
|
|||
|
given extended string.
|
|||
|
|
|||
|
* Reversibility: Any extended string mapped to a basic string can
|
|||
|
be recovered from that basic string.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 2]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
* Efficient encoding: The ratio of basic string length to extended
|
|||
|
string length is small. This is important in the context of
|
|||
|
domain names because RFC 1034 [RFC1034] restricts the length of a
|
|||
|
domain label to 63 characters.
|
|||
|
|
|||
|
* Simplicity: The encoding and decoding algorithms are reasonably
|
|||
|
simple to implement. The goals of efficiency and simplicity are
|
|||
|
at odds; Bootstring aims at a good balance between them.
|
|||
|
|
|||
|
* Readability: Basic code points appearing in the extended string
|
|||
|
are represented as themselves in the basic string (although the
|
|||
|
main purpose is to improve efficiency, not readability).
|
|||
|
|
|||
|
Punycode can also support an additional feature that is not used by
|
|||
|
the ToASCII and ToUnicode operations of [IDNA]. When extended
|
|||
|
strings are case-folded prior to encoding, the basic string can use
|
|||
|
mixed case to tell how to convert the folded string into a mixed-case
|
|||
|
string. See appendix A "Mixed-case annotation".
|
|||
|
|
|||
|
1.2 Interaction of protocol parts
|
|||
|
|
|||
|
Punycode is used by the IDNA protocol [IDNA] for converting domain
|
|||
|
labels into ASCII; it is not designed for any other purpose. It is
|
|||
|
explicitly not designed for processing arbitrary free text.
|
|||
|
|
|||
|
2. Terminology
|
|||
|
|
|||
|
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
|
|||
|
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
|
|||
|
document are to be interpreted as described in BCP 14, RFC 2119
|
|||
|
[RFC2119].
|
|||
|
|
|||
|
A code point is an integral value associated with a character in a
|
|||
|
coded character set.
|
|||
|
|
|||
|
As in the Unicode Standard [UNICODE], Unicode code points are denoted
|
|||
|
by "U+" followed by four to six hexadecimal digits, while a range of
|
|||
|
code points is denoted by two hexadecimal numbers separated by "..",
|
|||
|
with no prefixes.
|
|||
|
|
|||
|
The operators div and mod perform integer division; (x div y) is the
|
|||
|
quotient of x divided by y, discarding the remainder, and (x mod y)
|
|||
|
is the remainder, so (x div y) * y + (x mod y) == x. Bootstring uses
|
|||
|
these operators only with nonnegative operands, so the quotient and
|
|||
|
remainder are always nonnegative.
|
|||
|
|
|||
|
The break statement jumps out of the innermost loop (as in C).
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 3]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
An overflow is an attempt to compute a value that exceeds the maximum
|
|||
|
value of an integer variable.
|
|||
|
|
|||
|
3. Bootstring description
|
|||
|
|
|||
|
Bootstring represents an arbitrary sequence of code points (the
|
|||
|
"extended string") as a sequence of basic code points (the "basic
|
|||
|
string"). This section describes the representation. Section 6
|
|||
|
"Bootstring algorithms" presents the algorithms as pseudocode.
|
|||
|
Sections 7.1 "Decoding traces" and 7.2 "Encoding traces" trace the
|
|||
|
algorithms for sample inputs.
|
|||
|
|
|||
|
The following sections describe the four techniques used in
|
|||
|
Bootstring. "Basic code point segregation" is a very simple and
|
|||
|
efficient encoding for basic code points occurring in the extended
|
|||
|
string: they are simply copied all at once. "Insertion unsort
|
|||
|
coding" encodes the non-basic code points as deltas, and processes
|
|||
|
the code points in numerical order rather than in order of
|
|||
|
appearance, which typically results in smaller deltas. The deltas
|
|||
|
are represented as "generalized variable-length integers", which use
|
|||
|
basic code points to represent nonnegative integers. The parameters
|
|||
|
of this integer representation are dynamically adjusted using "bias
|
|||
|
adaptation", to improve efficiency when consecutive deltas have
|
|||
|
similar magnitudes.
|
|||
|
|
|||
|
3.1 Basic code point segregation
|
|||
|
|
|||
|
All basic code points appearing in the extended string are
|
|||
|
represented literally at the beginning of the basic string, in their
|
|||
|
original order, followed by a delimiter if (and only if) the number
|
|||
|
of basic code points is nonzero. The delimiter is a particular basic
|
|||
|
code point, which never appears in the remainder of the basic string.
|
|||
|
The decoder can therefore find the end of the literal portion (if
|
|||
|
there is one) by scanning for the last delimiter.
|
|||
|
|
|||
|
3.2 Insertion unsort coding
|
|||
|
|
|||
|
The remainder of the basic string (after the last delimiter if there
|
|||
|
is one) represents a sequence of nonnegative integral deltas as
|
|||
|
generalized variable-length integers, described in section 3.3. The
|
|||
|
meaning of the deltas is best understood in terms of the decoder.
|
|||
|
|
|||
|
The decoder builds the extended string incrementally. Initially, the
|
|||
|
extended string is a copy of the literal portion of the basic string
|
|||
|
(excluding the last delimiter). The decoder inserts non-basic code
|
|||
|
points, one for each delta, into the extended string, ultimately
|
|||
|
arriving at the final decoded string.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 4]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
At the heart of this process is a state machine with two state
|
|||
|
variables: an index i and a counter n. The index i refers to a
|
|||
|
position in the extended string; it ranges from 0 (the first
|
|||
|
position) to the current length of the extended string (which refers
|
|||
|
to a potential position beyond the current end). If the current
|
|||
|
state is <n,i>, the next state is <n,i+1> if i is less than the
|
|||
|
length of the extended string, or <n+1,0> if i equals the length of
|
|||
|
the extended string. In other words, each state change causes i to
|
|||
|
increment, wrapping around to zero if necessary, and n counts the
|
|||
|
number of wrap-arounds.
|
|||
|
|
|||
|
Notice that the state always advances monotonically (there is no way
|
|||
|
for the decoder to return to an earlier state). At each state, an
|
|||
|
insertion is either performed or not performed. At most one
|
|||
|
insertion is performed in a given state. An insertion inserts the
|
|||
|
value of n at position i in the extended string. The deltas are a
|
|||
|
run-length encoding of this sequence of events: they are the lengths
|
|||
|
of the runs of non-insertion states preceeding the insertion states.
|
|||
|
Hence, for each delta, the decoder performs delta state changes, then
|
|||
|
an insertion, and then one more state change. (An implementation
|
|||
|
need not perform each state change individually, but can instead use
|
|||
|
division and remainder calculations to compute the next insertion
|
|||
|
state directly.) It is an error if the inserted code point is a
|
|||
|
basic code point (because basic code points were supposed to be
|
|||
|
segregated as described in section 3.1).
|
|||
|
|
|||
|
The encoder's main task is to derive the sequence of deltas that will
|
|||
|
cause the decoder to construct the desired string. It can do this by
|
|||
|
repeatedly scanning the extended string for the next code point that
|
|||
|
the decoder would need to insert, and counting the number of state
|
|||
|
changes the decoder would need to perform, mindful of the fact that
|
|||
|
the decoder's extended string will include only those code points
|
|||
|
that have already been inserted. Section 6.3 "Encoding procedure"
|
|||
|
gives a precise algorithm.
|
|||
|
|
|||
|
3.3 Generalized variable-length integers
|
|||
|
|
|||
|
In a conventional integer representation the base is the number of
|
|||
|
distinct symbols for digits, whose values are 0 through base-1. Let
|
|||
|
digit_0 denote the least significant digit, digit_1 the next least
|
|||
|
significant, and so on. The value represented is the sum over j of
|
|||
|
digit_j * w(j), where w(j) = base^j is the weight (scale factor) for
|
|||
|
position j. For example, in the base 8 integer 437, the digits are
|
|||
|
7, 3, and 4, and the weights are 1, 8, and 64, so the value is 7 +
|
|||
|
3*8 + 4*64 = 287. This representation has two disadvantages: First,
|
|||
|
there are multiple encodings of each value (because there can be
|
|||
|
extra zeros in the most significant positions), which is inconvenient
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 5]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
when unique encodings are needed. Second, the integer is not self-
|
|||
|
delimiting, so if multiple integers are concatenated the boundaries
|
|||
|
between them are lost.
|
|||
|
|
|||
|
The generalized variable-length representation solves these two
|
|||
|
problems. The digit values are still 0 through base-1, but now the
|
|||
|
integer is self-delimiting by means of thresholds t(j), each of which
|
|||
|
is in the range 0 through base-1. Exactly one digit, the most
|
|||
|
significant, satisfies digit_j < t(j). Therefore, if several
|
|||
|
integers are concatenated, it is easy to separate them, starting with
|
|||
|
the first if they are little-endian (least significant digit first),
|
|||
|
or starting with the last if they are big-endian (most significant
|
|||
|
digit first). As before, the value is the sum over j of digit_j *
|
|||
|
w(j), but the weights are different:
|
|||
|
|
|||
|
w(0) = 1
|
|||
|
w(j) = w(j-1) * (base - t(j-1)) for j > 0
|
|||
|
|
|||
|
For example, consider the little-endian sequence of base 8 digits
|
|||
|
734251... Suppose the thresholds are 2, 3, 5, 5, 5, 5... This
|
|||
|
implies that the weights are 1, 1*(8-2) = 6, 6*(8-3) = 30, 30*(8-5) =
|
|||
|
90, 90*(8-5) = 270, and so on. 7 is not less than 2, and 3 is not
|
|||
|
less than 3, but 4 is less than 5, so 4 is the last digit. The value
|
|||
|
of 734 is 7*1 + 3*6 + 4*30 = 145. The next integer is 251, with
|
|||
|
value 2*1 + 5*6 + 1*30 = 62. Decoding this representation is very
|
|||
|
similar to decoding a conventional integer: Start with a current
|
|||
|
value of N = 0 and a weight w = 1. Fetch the next digit d and
|
|||
|
increase N by d * w. If d is less than the current threshold (t)
|
|||
|
then stop, otherwise increase w by a factor of (base - t), update t
|
|||
|
for the next position, and repeat.
|
|||
|
|
|||
|
Encoding this representation is similar to encoding a conventional
|
|||
|
integer: If N < t then output one digit for N and stop, otherwise
|
|||
|
output the digit for t + ((N - t) mod (base - t)), then replace N
|
|||
|
with (N - t) div (base - t), update t for the next position, and
|
|||
|
repeat.
|
|||
|
|
|||
|
For any particular set of values of t(j), there is exactly one
|
|||
|
generalized variable-length representation of each nonnegative
|
|||
|
integral value.
|
|||
|
|
|||
|
Bootstring uses little-endian ordering so that the deltas can be
|
|||
|
separated starting with the first. The t(j) values are defined in
|
|||
|
terms of the constants base, tmin, and tmax, and a state variable
|
|||
|
called bias:
|
|||
|
|
|||
|
t(j) = base * (j + 1) - bias,
|
|||
|
clamped to the range tmin through tmax
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 6]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
The clamping means that if the formula yields a value less than tmin
|
|||
|
or greater than tmax, then t(j) = tmin or tmax, respectively. (In
|
|||
|
the pseudocode in section 6 "Bootstring algorithms", the expression
|
|||
|
base * (j + 1) is denoted by k for performance reasons.) These t(j)
|
|||
|
values cause the representation to favor integers within a particular
|
|||
|
range determined by the bias.
|
|||
|
|
|||
|
3.4 Bias adaptation
|
|||
|
|
|||
|
After each delta is encoded or decoded, bias is set for the next
|
|||
|
delta as follows:
|
|||
|
|
|||
|
1. Delta is scaled in order to avoid overflow in the next step:
|
|||
|
|
|||
|
let delta = delta div 2
|
|||
|
|
|||
|
But when this is the very first delta, the divisor is not 2, but
|
|||
|
instead a constant called damp. This compensates for the fact
|
|||
|
that the second delta is usually much smaller than the first.
|
|||
|
|
|||
|
2. Delta is increased to compensate for the fact that the next delta
|
|||
|
will be inserting into a longer string:
|
|||
|
|
|||
|
let delta = delta + (delta div numpoints)
|
|||
|
|
|||
|
numpoints is the total number of code points encoded/decoded so
|
|||
|
far (including the one corresponding to this delta itself, and
|
|||
|
including the basic code points).
|
|||
|
|
|||
|
3. Delta is repeatedly divided until it falls within a threshold, to
|
|||
|
predict the minimum number of digits needed to represent the next
|
|||
|
delta:
|
|||
|
|
|||
|
while delta > ((base - tmin) * tmax) div 2
|
|||
|
do let delta = delta div (base - tmin)
|
|||
|
|
|||
|
4. The bias is set:
|
|||
|
|
|||
|
let bias =
|
|||
|
(base * the number of divisions performed in step 3) +
|
|||
|
(((base - tmin + 1) * delta) div (delta + skew))
|
|||
|
|
|||
|
The motivation for this procedure is that the current delta
|
|||
|
provides a hint about the likely size of the next delta, and so
|
|||
|
t(j) is set to tmax for the more significant digits starting with
|
|||
|
the one expected to be last, tmin for the less significant digits
|
|||
|
up through the one expected to be third-last, and somewhere
|
|||
|
between tmin and tmax for the digit expected to be second-last
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 7]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
(balancing the hope of the expected-last digit being unnecessary
|
|||
|
against the danger of it being insufficient).
|
|||
|
|
|||
|
4. Bootstring parameters
|
|||
|
|
|||
|
Given a set of basic code points, one needs to be designated as the
|
|||
|
delimiter. The base cannot be greater than the number of
|
|||
|
distinguishable basic code points remaining. The digit-values in the
|
|||
|
range 0 through base-1 need to be associated with distinct non-
|
|||
|
delimiter basic code points. In some cases multiple code points need
|
|||
|
to have the same digit-value; for example, uppercase and lowercase
|
|||
|
versions of the same letter need to be equivalent if basic strings
|
|||
|
are case-insensitive.
|
|||
|
|
|||
|
The initial value of n cannot be greater than the minimum non-basic
|
|||
|
code point that could appear in extended strings.
|
|||
|
|
|||
|
The remaining five parameters (tmin, tmax, skew, damp, and the
|
|||
|
initial value of bias) need to satisfy the following constraints:
|
|||
|
|
|||
|
0 <= tmin <= tmax <= base-1
|
|||
|
skew >= 1
|
|||
|
damp >= 2
|
|||
|
initial_bias mod base <= base - tmin
|
|||
|
|
|||
|
Provided the constraints are satisfied, these five parameters affect
|
|||
|
efficiency but not correctness. They are best chosen empirically.
|
|||
|
|
|||
|
If support for mixed-case annotation is desired (see appendix A),
|
|||
|
make sure that the code points corresponding to 0 through tmax-1 all
|
|||
|
have both uppercase and lowercase forms.
|
|||
|
|
|||
|
5. Parameter values for Punycode
|
|||
|
|
|||
|
Punycode uses the following Bootstring parameter values:
|
|||
|
|
|||
|
base = 36
|
|||
|
tmin = 1
|
|||
|
tmax = 26
|
|||
|
skew = 38
|
|||
|
damp = 700
|
|||
|
initial_bias = 72
|
|||
|
initial_n = 128 = 0x80
|
|||
|
|
|||
|
Although the only restriction Punycode imposes on the input integers
|
|||
|
is that they be nonnegative, these parameters are especially designed
|
|||
|
to work well with Unicode [UNICODE] code points, which are integers
|
|||
|
in the range 0..10FFFF (but not D800..DFFF, which are reserved for
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 8]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
use by the UTF-16 encoding of Unicode). The basic code points are
|
|||
|
the ASCII [ASCII] code points (0..7F), of which U+002D (-) is the
|
|||
|
delimiter, and some of the others have digit-values as follows:
|
|||
|
|
|||
|
code points digit-values
|
|||
|
------------ ----------------------
|
|||
|
41..5A (A-Z) = 0 to 25, respectively
|
|||
|
61..7A (a-z) = 0 to 25, respectively
|
|||
|
30..39 (0-9) = 26 to 35, respectively
|
|||
|
|
|||
|
Using hyphen-minus as the delimiter implies that the encoded string
|
|||
|
can end with a hyphen-minus only if the Unicode string consists
|
|||
|
entirely of basic code points, but IDNA forbids such strings from
|
|||
|
being encoded. The encoded string can begin with a hyphen-minus, but
|
|||
|
IDNA prepends a prefix. Therefore IDNA using Punycode conforms to
|
|||
|
the RFC 952 rule that host name labels neither begin nor end with a
|
|||
|
hyphen-minus [RFC952].
|
|||
|
|
|||
|
A decoder MUST recognize the letters in both uppercase and lowercase
|
|||
|
forms (including mixtures of both forms). An encoder SHOULD output
|
|||
|
only uppercase forms or only lowercase forms, unless it uses mixed-
|
|||
|
case annotation (see appendix A).
|
|||
|
|
|||
|
Presumably most users will not manually write or type encoded strings
|
|||
|
(as opposed to cutting and pasting them), but those who do will need
|
|||
|
to be alert to the potential visual ambiguity between the following
|
|||
|
sets of characters:
|
|||
|
|
|||
|
G 6
|
|||
|
I l 1
|
|||
|
O 0
|
|||
|
S 5
|
|||
|
U V
|
|||
|
Z 2
|
|||
|
|
|||
|
Such ambiguities are usually resolved by context, but in a Punycode
|
|||
|
encoded string there is no context apparent to humans.
|
|||
|
|
|||
|
6. Bootstring algorithms
|
|||
|
|
|||
|
Some parts of the pseudocode can be omitted if the parameters satisfy
|
|||
|
certain conditions (for which Punycode qualifies). These parts are
|
|||
|
enclosed in {braces}, and notes immediately following the pseudocode
|
|||
|
explain the conditions under which they can be omitted.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 9]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Formally, code points are integers, and hence the pseudocode assumes
|
|||
|
that arithmetic operations can be performed directly on code points.
|
|||
|
In some programming languages, explicit conversion between code
|
|||
|
points and integers might be necessary.
|
|||
|
|
|||
|
6.1 Bias adaptation function
|
|||
|
|
|||
|
function adapt(delta,numpoints,firsttime):
|
|||
|
if firsttime then let delta = delta div damp
|
|||
|
else let delta = delta div 2
|
|||
|
let delta = delta + (delta div numpoints)
|
|||
|
let k = 0
|
|||
|
while delta > ((base - tmin) * tmax) div 2 do begin
|
|||
|
let delta = delta div (base - tmin)
|
|||
|
let k = k + base
|
|||
|
end
|
|||
|
return k + (((base - tmin + 1) * delta) div (delta + skew))
|
|||
|
|
|||
|
It does not matter whether the modifications to delta and k inside
|
|||
|
adapt() affect variables of the same name inside the
|
|||
|
encoding/decoding procedures, because after calling adapt() the
|
|||
|
caller does not read those variables before overwriting them.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 10]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
6.2 Decoding procedure
|
|||
|
|
|||
|
let n = initial_n
|
|||
|
let i = 0
|
|||
|
let bias = initial_bias
|
|||
|
let output = an empty string indexed from 0
|
|||
|
consume all code points before the last delimiter (if there is one)
|
|||
|
and copy them to output, fail on any non-basic code point
|
|||
|
if more than zero code points were consumed then consume one more
|
|||
|
(which will be the last delimiter)
|
|||
|
while the input is not exhausted do begin
|
|||
|
let oldi = i
|
|||
|
let w = 1
|
|||
|
for k = base to infinity in steps of base do begin
|
|||
|
consume a code point, or fail if there was none to consume
|
|||
|
let digit = the code point's digit-value, fail if it has none
|
|||
|
let i = i + digit * w, fail on overflow
|
|||
|
let t = tmin if k <= bias {+ tmin}, or
|
|||
|
tmax if k >= bias + tmax, or k - bias otherwise
|
|||
|
if digit < t then break
|
|||
|
let w = w * (base - t), fail on overflow
|
|||
|
end
|
|||
|
let bias = adapt(i - oldi, length(output) + 1, test oldi is 0?)
|
|||
|
let n = n + i div (length(output) + 1), fail on overflow
|
|||
|
let i = i mod (length(output) + 1)
|
|||
|
{if n is a basic code point then fail}
|
|||
|
insert n into output at position i
|
|||
|
increment i
|
|||
|
end
|
|||
|
|
|||
|
The full statement enclosed in braces (checking whether n is a basic
|
|||
|
code point) can be omitted if initial_n exceeds all basic code points
|
|||
|
(which is true for Punycode), because n is never less than initial_n.
|
|||
|
|
|||
|
In the assignment of t, where t is clamped to the range tmin through
|
|||
|
tmax, "+ tmin" can always be omitted. This makes the clamping
|
|||
|
calculation incorrect when bias < k < bias + tmin, but that cannot
|
|||
|
happen because of the way bias is computed and because of the
|
|||
|
constraints on the parameters.
|
|||
|
|
|||
|
Because the decoder state can only advance monotonically, and there
|
|||
|
is only one representation of any delta, there is therefore only one
|
|||
|
encoded string that can represent a given sequence of integers. The
|
|||
|
only error conditions are invalid code points, unexpected end-of-
|
|||
|
input, overflow, and basic code points encoded using deltas instead
|
|||
|
of appearing literally. If the decoder fails on these errors as
|
|||
|
shown above, then it cannot produce the same output for two distinct
|
|||
|
inputs. Without this property it would have been necessary to re-
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 11]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
encode the output and verify that it matches the input in order to
|
|||
|
guarantee the uniqueness of the encoding.
|
|||
|
|
|||
|
6.3 Encoding procedure
|
|||
|
|
|||
|
let n = initial_n
|
|||
|
let delta = 0
|
|||
|
let bias = initial_bias
|
|||
|
let h = b = the number of basic code points in the input
|
|||
|
copy them to the output in order, followed by a delimiter if b > 0
|
|||
|
{if the input contains a non-basic code point < n then fail}
|
|||
|
while h < length(input) do begin
|
|||
|
let m = the minimum {non-basic} code point >= n in the input
|
|||
|
let delta = delta + (m - n) * (h + 1), fail on overflow
|
|||
|
let n = m
|
|||
|
for each code point c in the input (in order) do begin
|
|||
|
if c < n {or c is basic} then increment delta, fail on overflow
|
|||
|
if c == n then begin
|
|||
|
let q = delta
|
|||
|
for k = base to infinity in steps of base do begin
|
|||
|
let t = tmin if k <= bias {+ tmin}, or
|
|||
|
tmax if k >= bias + tmax, or k - bias otherwise
|
|||
|
if q < t then break
|
|||
|
output the code point for digit t + ((q - t) mod (base - t))
|
|||
|
let q = (q - t) div (base - t)
|
|||
|
end
|
|||
|
output the code point for digit q
|
|||
|
let bias = adapt(delta, h + 1, test h equals b?)
|
|||
|
let delta = 0
|
|||
|
increment h
|
|||
|
end
|
|||
|
end
|
|||
|
increment delta and n
|
|||
|
end
|
|||
|
|
|||
|
The full statement enclosed in braces (checking whether the input
|
|||
|
contains a non-basic code point less than n) can be omitted if all
|
|||
|
code points less than initial_n are basic code points (which is true
|
|||
|
for Punycode if code points are unsigned).
|
|||
|
|
|||
|
The brace-enclosed conditions "non-basic" and "or c is basic" can be
|
|||
|
omitted if initial_n exceeds all basic code points (which is true for
|
|||
|
Punycode), because the code point being tested is never less than
|
|||
|
initial_n.
|
|||
|
|
|||
|
In the assignment of t, where t is clamped to the range tmin through
|
|||
|
tmax, "+ tmin" can always be omitted. This makes the clamping
|
|||
|
calculation incorrect when bias < k < bias + tmin, but that cannot
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 12]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
happen because of the way bias is computed and because of the
|
|||
|
constraints on the parameters.
|
|||
|
|
|||
|
The checks for overflow are necessary to avoid producing invalid
|
|||
|
output when the input contains very large values or is very long.
|
|||
|
|
|||
|
The increment of delta at the bottom of the outer loop cannot
|
|||
|
overflow because delta < length(input) before the increment, and
|
|||
|
length(input) is already assumed to be representable. The increment
|
|||
|
of n could overflow, but only if h == length(input), in which case
|
|||
|
the procedure is finished anyway.
|
|||
|
|
|||
|
6.4 Overflow handling
|
|||
|
|
|||
|
For IDNA, 26-bit unsigned integers are sufficient to handle all valid
|
|||
|
IDNA labels without overflow, because any string that needed a 27-bit
|
|||
|
delta would have to exceed either the code point limit (0..10FFFF) or
|
|||
|
the label length limit (63 characters). However, overflow handling
|
|||
|
is necessary because the inputs are not necessarily valid IDNA
|
|||
|
labels.
|
|||
|
|
|||
|
If the programming language does not provide overflow detection, the
|
|||
|
following technique can be used. Suppose A, B, and C are
|
|||
|
representable nonnegative integers and C is nonzero. Then A + B
|
|||
|
overflows if and only if B > maxint - A, and A + (B * C) overflows if
|
|||
|
and only if B > (maxint - A) div C, where maxint is the greatest
|
|||
|
integer for which maxint + 1 cannot be represented. Refer to
|
|||
|
appendix C "Punycode sample implementation" for demonstrations of
|
|||
|
this technique in the C language.
|
|||
|
|
|||
|
The decoding and encoding algorithms shown in sections 6.2 and 6.3
|
|||
|
handle overflow by detecting it whenever it happens. Another
|
|||
|
approach is to enforce limits on the inputs that prevent overflow
|
|||
|
from happening. For example, if the encoder were to verify that no
|
|||
|
input code points exceed M and that the input length does not exceed
|
|||
|
L, then no delta could ever exceed (M - initial_n) * (L + 1), and
|
|||
|
hence no overflow could occur if integer variables were capable of
|
|||
|
representing values that large. This prevention approach would
|
|||
|
impose more restrictions on the input than the detection approach
|
|||
|
does, but might be considered simpler in some programming languages.
|
|||
|
|
|||
|
In theory, the decoder could use an analogous approach, limiting the
|
|||
|
number of digits in a variable-length integer (that is, limiting the
|
|||
|
number of iterations in the innermost loop). However, the number of
|
|||
|
digits that suffice to represent a given delta can sometimes
|
|||
|
represent much larger deltas (because of the adaptation), and hence
|
|||
|
this approach would probably need integers wider than 32 bits.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 13]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Yet another approach for the decoder is to allow overflow to occur,
|
|||
|
but to check the final output string by re-encoding it and comparing
|
|||
|
to the decoder input. If and only if they do not match (using a
|
|||
|
case-insensitive ASCII comparison) overflow has occurred. This
|
|||
|
delayed-detection approach would not impose any more restrictions on
|
|||
|
the input than the immediate-detection approach does, and might be
|
|||
|
considered simpler in some programming languages.
|
|||
|
|
|||
|
In fact, if the decoder is used only inside the IDNA ToUnicode
|
|||
|
operation [IDNA], then it need not check for overflow at all, because
|
|||
|
ToUnicode performs a higher level re-encoding and comparison, and a
|
|||
|
mismatch has the same consequence as if the Punycode decoder had
|
|||
|
failed.
|
|||
|
|
|||
|
7. Punycode examples
|
|||
|
|
|||
|
7.1 Sample strings
|
|||
|
|
|||
|
In the Punycode encodings below, the ACE prefix is not shown.
|
|||
|
Backslashes show where line breaks have been inserted in strings too
|
|||
|
long for one line.
|
|||
|
|
|||
|
The first several examples are all translations of the sentence "Why
|
|||
|
can't they just speak in <language>?" (courtesy of Michael Kaplan's
|
|||
|
"provincial" page [PROVINCIAL]). Word breaks and punctuation have
|
|||
|
been removed, as is often done in domain names.
|
|||
|
|
|||
|
(A) Arabic (Egyptian):
|
|||
|
u+0644 u+064A u+0647 u+0645 u+0627 u+0628 u+062A u+0643 u+0644
|
|||
|
u+0645 u+0648 u+0634 u+0639 u+0631 u+0628 u+064A u+061F
|
|||
|
Punycode: egbpdaj6bu4bxfgehfvwxn
|
|||
|
|
|||
|
(B) Chinese (simplified):
|
|||
|
u+4ED6 u+4EEC u+4E3A u+4EC0 u+4E48 u+4E0D u+8BF4 u+4E2D u+6587
|
|||
|
Punycode: ihqwcrb4cv8a8dqg056pqjye
|
|||
|
|
|||
|
(C) Chinese (traditional):
|
|||
|
u+4ED6 u+5011 u+7232 u+4EC0 u+9EBD u+4E0D u+8AAA u+4E2D u+6587
|
|||
|
Punycode: ihqwctvzc91f659drss3x8bo0yb
|
|||
|
|
|||
|
(D) Czech: Pro<ccaron>prost<ecaron>nemluv<iacute><ccaron>esky
|
|||
|
U+0050 u+0072 u+006F u+010D u+0070 u+0072 u+006F u+0073 u+0074
|
|||
|
u+011B u+006E u+0065 u+006D u+006C u+0075 u+0076 u+00ED u+010D
|
|||
|
u+0065 u+0073 u+006B u+0079
|
|||
|
Punycode: Proprostnemluvesky-uyb24dma41a
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 14]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
(E) Hebrew:
|
|||
|
u+05DC u+05DE u+05D4 u+05D4 u+05DD u+05E4 u+05E9 u+05D5 u+05D8
|
|||
|
u+05DC u+05D0 u+05DE u+05D3 u+05D1 u+05E8 u+05D9 u+05DD u+05E2
|
|||
|
u+05D1 u+05E8 u+05D9 u+05EA
|
|||
|
Punycode: 4dbcagdahymbxekheh6e0a7fei0b
|
|||
|
|
|||
|
(F) Hindi (Devanagari):
|
|||
|
u+092F u+0939 u+0932 u+094B u+0917 u+0939 u+093F u+0928 u+094D
|
|||
|
u+0926 u+0940 u+0915 u+094D u+092F u+094B u+0902 u+0928 u+0939
|
|||
|
u+0940 u+0902 u+092C u+094B u+0932 u+0938 u+0915 u+0924 u+0947
|
|||
|
u+0939 u+0948 u+0902
|
|||
|
Punycode: i1baa7eci9glrd9b2ae1bj0hfcgg6iyaf8o0a1dig0cd
|
|||
|
|
|||
|
(G) Japanese (kanji and hiragana):
|
|||
|
u+306A u+305C u+307F u+3093 u+306A u+65E5 u+672C u+8A9E u+3092
|
|||
|
u+8A71 u+3057 u+3066 u+304F u+308C u+306A u+3044 u+306E u+304B
|
|||
|
Punycode: n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa
|
|||
|
|
|||
|
(H) Korean (Hangul syllables):
|
|||
|
u+C138 u+ACC4 u+C758 u+BAA8 u+B4E0 u+C0AC u+B78C u+B4E4 u+C774
|
|||
|
u+D55C u+AD6D u+C5B4 u+B97C u+C774 u+D574 u+D55C u+B2E4 u+BA74
|
|||
|
u+C5BC u+B9C8 u+B098 u+C88B u+C744 u+AE4C
|
|||
|
Punycode: 989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5j\
|
|||
|
psd879ccm6fea98c
|
|||
|
|
|||
|
(I) Russian (Cyrillic):
|
|||
|
U+043F u+043E u+0447 u+0435 u+043C u+0443 u+0436 u+0435 u+043E
|
|||
|
u+043D u+0438 u+043D u+0435 u+0433 u+043E u+0432 u+043E u+0440
|
|||
|
u+044F u+0442 u+043F u+043E u+0440 u+0443 u+0441 u+0441 u+043A
|
|||
|
u+0438
|
|||
|
Punycode: b1abfaaepdrnnbgefbaDotcwatmq2g4l
|
|||
|
|
|||
|
(J) Spanish: Porqu<eacute>nopuedensimplementehablarenEspa<ntilde>ol
|
|||
|
U+0050 u+006F u+0072 u+0071 u+0075 u+00E9 u+006E u+006F u+0070
|
|||
|
u+0075 u+0065 u+0064 u+0065 u+006E u+0073 u+0069 u+006D u+0070
|
|||
|
u+006C u+0065 u+006D u+0065 u+006E u+0074 u+0065 u+0068 u+0061
|
|||
|
u+0062 u+006C u+0061 u+0072 u+0065 u+006E U+0045 u+0073 u+0070
|
|||
|
u+0061 u+00F1 u+006F u+006C
|
|||
|
Punycode: PorqunopuedensimplementehablarenEspaol-fmd56a
|
|||
|
|
|||
|
(K) Vietnamese:
|
|||
|
T<adotbelow>isaoh<odotbelow>kh<ocirc>ngth<ecirchookabove>ch\
|
|||
|
<ihookabove>n<oacute>iti<ecircacute>ngVi<ecircdotbelow>t
|
|||
|
U+0054 u+1EA1 u+0069 u+0073 u+0061 u+006F u+0068 u+1ECD u+006B
|
|||
|
u+0068 u+00F4 u+006E u+0067 u+0074 u+0068 u+1EC3 u+0063 u+0068
|
|||
|
u+1EC9 u+006E u+00F3 u+0069 u+0074 u+0069 u+1EBF u+006E u+0067
|
|||
|
U+0056 u+0069 u+1EC7 u+0074
|
|||
|
Punycode: TisaohkhngthchnitingVit-kjcr8268qyxafd2f1b9g
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 15]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
The next several examples are all names of Japanese music artists,
|
|||
|
song titles, and TV programs, just because the author happens to have
|
|||
|
them handy (but Japanese is useful for providing examples of single-
|
|||
|
row text, two-row text, ideographic text, and various mixtures
|
|||
|
thereof).
|
|||
|
|
|||
|
(L) 3<nen>B<gumi><kinpachi><sensei>
|
|||
|
u+0033 u+5E74 U+0042 u+7D44 u+91D1 u+516B u+5148 u+751F
|
|||
|
Punycode: 3B-ww4c5e180e575a65lsy2b
|
|||
|
|
|||
|
(M) <amuro><namie>-with-SUPER-MONKEYS
|
|||
|
u+5B89 u+5BA4 u+5948 u+7F8E u+6075 u+002D u+0077 u+0069 u+0074
|
|||
|
u+0068 u+002D U+0053 U+0055 U+0050 U+0045 U+0052 u+002D U+004D
|
|||
|
U+004F U+004E U+004B U+0045 U+0059 U+0053
|
|||
|
Punycode: -with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
|
|||
|
|
|||
|
(N) Hello-Another-Way-<sorezore><no><basho>
|
|||
|
U+0048 u+0065 u+006C u+006C u+006F u+002D U+0041 u+006E u+006F
|
|||
|
u+0074 u+0068 u+0065 u+0072 u+002D U+0057 u+0061 u+0079 u+002D
|
|||
|
u+305D u+308C u+305E u+308C u+306E u+5834 u+6240
|
|||
|
Punycode: Hello-Another-Way--fc4qua05auwb3674vfr0b
|
|||
|
|
|||
|
(O) <hitotsu><yane><no><shita>2
|
|||
|
u+3072 u+3068 u+3064 u+5C4B u+6839 u+306E u+4E0B u+0032
|
|||
|
Punycode: 2-u9tlzr9756bt3uc0v
|
|||
|
|
|||
|
(P) Maji<de>Koi<suru>5<byou><mae>
|
|||
|
U+004D u+0061 u+006A u+0069 u+3067 U+004B u+006F u+0069 u+3059
|
|||
|
u+308B u+0035 u+79D2 u+524D
|
|||
|
Punycode: MajiKoi5-783gue6qz075azm5e
|
|||
|
|
|||
|
(Q) <pafii>de<runba>
|
|||
|
u+30D1 u+30D5 u+30A3 u+30FC u+0064 u+0065 u+30EB u+30F3 u+30D0
|
|||
|
Punycode: de-jg4avhby1noc0d
|
|||
|
|
|||
|
(R) <sono><supiido><de>
|
|||
|
u+305D u+306E u+30B9 u+30D4 u+30FC u+30C9 u+3067
|
|||
|
Punycode: d9juau41awczczp
|
|||
|
|
|||
|
The last example is an ASCII string that breaks the existing rules
|
|||
|
for host name labels. (It is not a realistic example for IDNA,
|
|||
|
because IDNA never encodes pure ASCII labels.)
|
|||
|
|
|||
|
(S) -> $1.00 <-
|
|||
|
u+002D u+003E u+0020 u+0024 u+0031 u+002E u+0030 u+0030 u+0020
|
|||
|
u+003C u+002D
|
|||
|
Punycode: -> $1.00 <--
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 16]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
7.2 Decoding traces
|
|||
|
|
|||
|
In the following traces, the evolving state of the decoder is shown
|
|||
|
as a sequence of hexadecimal values, representing the code points in
|
|||
|
the extended string. An asterisk appears just after the most
|
|||
|
recently inserted code point, indicating both n (the value preceeding
|
|||
|
the asterisk) and i (the position of the value just after the
|
|||
|
asterisk). Other numerical values are decimal.
|
|||
|
|
|||
|
Decoding trace of example B from section 7.1:
|
|||
|
|
|||
|
n is 128, i is 0, bias is 72
|
|||
|
input is "ihqwcrb4cv8a8dqg056pqjye"
|
|||
|
there is no delimiter, so extended string starts empty
|
|||
|
delta "ihq" decodes to 19853
|
|||
|
bias becomes 21
|
|||
|
4E0D *
|
|||
|
delta "wc" decodes to 64
|
|||
|
bias becomes 20
|
|||
|
4E0D 4E2D *
|
|||
|
delta "rb" decodes to 37
|
|||
|
bias becomes 13
|
|||
|
4E3A * 4E0D 4E2D
|
|||
|
delta "4c" decodes to 56
|
|||
|
bias becomes 17
|
|||
|
4E3A 4E48 * 4E0D 4E2D
|
|||
|
delta "v8a" decodes to 599
|
|||
|
bias becomes 32
|
|||
|
4E3A 4EC0 * 4E48 4E0D 4E2D
|
|||
|
delta "8d" decodes to 130
|
|||
|
bias becomes 23
|
|||
|
4ED6 * 4E3A 4EC0 4E48 4E0D 4E2D
|
|||
|
delta "qg" decodes to 154
|
|||
|
bias becomes 25
|
|||
|
4ED6 4EEC * 4E3A 4EC0 4E48 4E0D 4E2D
|
|||
|
delta "056p" decodes to 46301
|
|||
|
bias becomes 84
|
|||
|
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 4E2D 6587 *
|
|||
|
delta "qjye" decodes to 88531
|
|||
|
bias becomes 90
|
|||
|
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 * 4E2D 6587
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 17]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Decoding trace of example L from section 7.1:
|
|||
|
|
|||
|
n is 128, i is 0, bias is 72
|
|||
|
input is "3B-ww4c5e180e575a65lsy2b"
|
|||
|
literal portion is "3B-", so extended string starts as:
|
|||
|
0033 0042
|
|||
|
delta "ww4c" decodes to 62042
|
|||
|
bias becomes 27
|
|||
|
0033 0042 5148 *
|
|||
|
delta "5e" decodes to 139
|
|||
|
bias becomes 24
|
|||
|
0033 0042 516B * 5148
|
|||
|
delta "180e" decodes to 16683
|
|||
|
bias becomes 67
|
|||
|
0033 5E74 * 0042 516B 5148
|
|||
|
delta "575a" decodes to 34821
|
|||
|
bias becomes 82
|
|||
|
0033 5E74 0042 516B 5148 751F *
|
|||
|
delta "65l" decodes to 14592
|
|||
|
bias becomes 67
|
|||
|
0033 5E74 0042 7D44 * 516B 5148 751F
|
|||
|
delta "sy2b" decodes to 42088
|
|||
|
bias becomes 84
|
|||
|
0033 5E74 0042 7D44 91D1 * 516B 5148 751F
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 18]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
7.3 Encoding traces
|
|||
|
|
|||
|
In the following traces, code point values are hexadecimal, while
|
|||
|
other numerical values are decimal.
|
|||
|
|
|||
|
Encoding trace of example B from section 7.1:
|
|||
|
|
|||
|
bias is 72
|
|||
|
input is:
|
|||
|
4ED6 4EEC 4E3A 4EC0 4E48 4E0D 8BF4 4E2D 6587
|
|||
|
there are no basic code points, so no literal portion
|
|||
|
next code point to insert is 4E0D
|
|||
|
needed delta is 19853, encodes as "ihq"
|
|||
|
bias becomes 21
|
|||
|
next code point to insert is 4E2D
|
|||
|
needed delta is 64, encodes as "wc"
|
|||
|
bias becomes 20
|
|||
|
next code point to insert is 4E3A
|
|||
|
needed delta is 37, encodes as "rb"
|
|||
|
bias becomes 13
|
|||
|
next code point to insert is 4E48
|
|||
|
needed delta is 56, encodes as "4c"
|
|||
|
bias becomes 17
|
|||
|
next code point to insert is 4EC0
|
|||
|
needed delta is 599, encodes as "v8a"
|
|||
|
bias becomes 32
|
|||
|
next code point to insert is 4ED6
|
|||
|
needed delta is 130, encodes as "8d"
|
|||
|
bias becomes 23
|
|||
|
next code point to insert is 4EEC
|
|||
|
needed delta is 154, encodes as "qg"
|
|||
|
bias becomes 25
|
|||
|
next code point to insert is 6587
|
|||
|
needed delta is 46301, encodes as "056p"
|
|||
|
bias becomes 84
|
|||
|
next code point to insert is 8BF4
|
|||
|
needed delta is 88531, encodes as "qjye"
|
|||
|
bias becomes 90
|
|||
|
output is "ihqwcrb4cv8a8dqg056pqjye"
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 19]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Encoding trace of example L from section 7.1:
|
|||
|
|
|||
|
bias is 72
|
|||
|
input is:
|
|||
|
0033 5E74 0042 7D44 91D1 516B 5148 751F
|
|||
|
basic code points (0033, 0042) are copied to literal portion: "3B-"
|
|||
|
next code point to insert is 5148
|
|||
|
needed delta is 62042, encodes as "ww4c"
|
|||
|
bias becomes 27
|
|||
|
next code point to insert is 516B
|
|||
|
needed delta is 139, encodes as "5e"
|
|||
|
bias becomes 24
|
|||
|
next code point to insert is 5E74
|
|||
|
needed delta is 16683, encodes as "180e"
|
|||
|
bias becomes 67
|
|||
|
next code point to insert is 751F
|
|||
|
needed delta is 34821, encodes as "575a"
|
|||
|
bias becomes 82
|
|||
|
next code point to insert is 7D44
|
|||
|
needed delta is 14592, encodes as "65l"
|
|||
|
bias becomes 67
|
|||
|
next code point to insert is 91D1
|
|||
|
needed delta is 42088, encodes as "sy2b"
|
|||
|
bias becomes 84
|
|||
|
output is "3B-ww4c5e180e575a65lsy2b"
|
|||
|
|
|||
|
8. Security Considerations
|
|||
|
|
|||
|
Users expect each domain name in DNS to be controlled by a single
|
|||
|
authority. If a Unicode string intended for use as a domain label
|
|||
|
could map to multiple ACE labels, then an internationalized domain
|
|||
|
name could map to multiple ASCII domain names, each controlled by a
|
|||
|
different authority, some of which could be spoofs that hijack
|
|||
|
service requests intended for another. Therefore Punycode is
|
|||
|
designed so that each Unicode string has a unique encoding.
|
|||
|
|
|||
|
However, there can still be multiple Unicode representations of the
|
|||
|
"same" text, for various definitions of "same". This problem is
|
|||
|
addressed to some extent by the Unicode standard under the topic of
|
|||
|
canonicalization, and this work is leveraged for domain names by
|
|||
|
Nameprep [NAMEPREP].
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 20]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
9. References
|
|||
|
|
|||
|
9.1 Normative References
|
|||
|
|
|||
|
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate
|
|||
|
Requirement Levels", BCP 14, RFC 2119, March 1997.
|
|||
|
|
|||
|
9.2 Informative References
|
|||
|
|
|||
|
[RFC952] Harrenstien, K., Stahl, M. and E. Feinler, "DOD Internet
|
|||
|
Host Table Specification", RFC 952, October 1985.
|
|||
|
|
|||
|
[RFC1034] Mockapetris, P., "Domain Names - Concepts and
|
|||
|
Facilities", STD 13, RFC 1034, November 1987.
|
|||
|
|
|||
|
[IDNA] Faltstrom, P., Hoffman, P. and A. Costello,
|
|||
|
"Internationalizing Domain Names in Applications
|
|||
|
(IDNA)", RFC 3490, March 2003.
|
|||
|
|
|||
|
[NAMEPREP] Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
|
|||
|
Profile for Internationalized Domain Names (IDN)", RFC
|
|||
|
3491, March 2003.
|
|||
|
|
|||
|
[ASCII] Cerf, V., "ASCII format for Network Interchange", RFC
|
|||
|
20, October 1969.
|
|||
|
|
|||
|
[PROVINCIAL] Kaplan, M., "The 'anyone can be provincial!' page",
|
|||
|
http://www.trigeminal.com/samples/provincial.html.
|
|||
|
|
|||
|
[UNICODE] The Unicode Consortium, "The Unicode Standard",
|
|||
|
http://www.unicode.org/unicode/standard/standard.html.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 21]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
A. Mixed-case annotation
|
|||
|
|
|||
|
In order to use Punycode to represent case-insensitive strings,
|
|||
|
higher layers need to case-fold the strings prior to Punycode
|
|||
|
encoding. The encoded string can use mixed case as an annotation
|
|||
|
telling how to convert the folded string into a mixed-case string for
|
|||
|
display purposes. Note, however, that mixed-case annotation is not
|
|||
|
used by the ToASCII and ToUnicode operations specified in [IDNA], and
|
|||
|
therefore implementors of IDNA can disregard this appendix.
|
|||
|
|
|||
|
Basic code points can use mixed case directly, because the decoder
|
|||
|
copies them verbatim, leaving lowercase code points lowercase, and
|
|||
|
leaving uppercase code points uppercase. Each non-basic code point
|
|||
|
is represented by a delta, which is represented by a sequence of
|
|||
|
basic code points, the last of which provides the annotation. If it
|
|||
|
is uppercase, it is a suggestion to map the non-basic code point to
|
|||
|
uppercase (if possible); if it is lowercase, it is a suggestion to
|
|||
|
map the non-basic code point to lowercase (if possible).
|
|||
|
|
|||
|
These annotations do not alter the code points returned by decoders;
|
|||
|
the annotations are returned separately, for the caller to use or
|
|||
|
ignore. Encoders can accept annotations in addition to code points,
|
|||
|
but the annotations do not alter the output, except to influence the
|
|||
|
uppercase/lowercase form of ASCII letters.
|
|||
|
|
|||
|
Punycode encoders and decoders need not support these annotations,
|
|||
|
and higher layers need not use them.
|
|||
|
|
|||
|
B. Disclaimer and license
|
|||
|
|
|||
|
Regarding this entire document or any portion of it (including the
|
|||
|
pseudocode and C code), the author makes no guarantees and is not
|
|||
|
responsible for any damage resulting from its use. The author grants
|
|||
|
irrevocable permission to anyone to use, modify, and distribute it in
|
|||
|
any way that does not diminish the rights of anyone else to use,
|
|||
|
modify, and distribute it, provided that redistributed derivative
|
|||
|
works do not contain misleading author or version information.
|
|||
|
Derivative works need not be licensed under similar terms.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 22]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
C. Punycode sample implementation
|
|||
|
|
|||
|
/*
|
|||
|
punycode.c from RFC 3492
|
|||
|
http://www.nicemice.net/idn/
|
|||
|
Adam M. Costello
|
|||
|
http://www.nicemice.net/amc/
|
|||
|
|
|||
|
This is ANSI C code (C89) implementing Punycode (RFC 3492).
|
|||
|
|
|||
|
*/
|
|||
|
|
|||
|
|
|||
|
/************************************************************/
|
|||
|
/* Public interface (would normally go in its own .h file): */
|
|||
|
|
|||
|
#include <limits.h>
|
|||
|
|
|||
|
enum punycode_status {
|
|||
|
punycode_success,
|
|||
|
punycode_bad_input, /* Input is invalid. */
|
|||
|
punycode_big_output, /* Output would exceed the space provided. */
|
|||
|
punycode_overflow /* Input needs wider integers to process. */
|
|||
|
};
|
|||
|
|
|||
|
#if UINT_MAX >= (1 << 26) - 1
|
|||
|
typedef unsigned int punycode_uint;
|
|||
|
#else
|
|||
|
typedef unsigned long punycode_uint;
|
|||
|
#endif
|
|||
|
|
|||
|
enum punycode_status punycode_encode(
|
|||
|
punycode_uint input_length,
|
|||
|
const punycode_uint input[],
|
|||
|
const unsigned char case_flags[],
|
|||
|
punycode_uint *output_length,
|
|||
|
char output[] );
|
|||
|
|
|||
|
/* punycode_encode() converts Unicode to Punycode. The input */
|
|||
|
/* is represented as an array of Unicode code points (not code */
|
|||
|
/* units; surrogate pairs are not allowed), and the output */
|
|||
|
/* will be represented as an array of ASCII code points. The */
|
|||
|
/* output string is *not* null-terminated; it will contain */
|
|||
|
/* zeros if and only if the input contains zeros. (Of course */
|
|||
|
/* the caller can leave room for a terminator and add one if */
|
|||
|
/* needed.) The input_length is the number of code points in */
|
|||
|
/* the input. The output_length is an in/out argument: the */
|
|||
|
/* caller passes in the maximum number of code points that it */
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 23]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
/* can receive, and on successful return it will contain the */
|
|||
|
/* number of code points actually output. The case_flags array */
|
|||
|
/* holds input_length boolean values, where nonzero suggests that */
|
|||
|
/* the corresponding Unicode character be forced to uppercase */
|
|||
|
/* after being decoded (if possible), and zero suggests that */
|
|||
|
/* it be forced to lowercase (if possible). ASCII code points */
|
|||
|
/* are encoded literally, except that ASCII letters are forced */
|
|||
|
/* to uppercase or lowercase according to the corresponding */
|
|||
|
/* uppercase flags. If case_flags is a null pointer then ASCII */
|
|||
|
/* letters are left as they are, and other code points are */
|
|||
|
/* treated as if their uppercase flags were zero. The return */
|
|||
|
/* value can be any of the punycode_status values defined above */
|
|||
|
/* except punycode_bad_input; if not punycode_success, then */
|
|||
|
/* output_size and output might contain garbage. */
|
|||
|
|
|||
|
enum punycode_status punycode_decode(
|
|||
|
punycode_uint input_length,
|
|||
|
const char input[],
|
|||
|
punycode_uint *output_length,
|
|||
|
punycode_uint output[],
|
|||
|
unsigned char case_flags[] );
|
|||
|
|
|||
|
/* punycode_decode() converts Punycode to Unicode. The input is */
|
|||
|
/* represented as an array of ASCII code points, and the output */
|
|||
|
/* will be represented as an array of Unicode code points. The */
|
|||
|
/* input_length is the number of code points in the input. The */
|
|||
|
/* output_length is an in/out argument: the caller passes in */
|
|||
|
/* the maximum number of code points that it can receive, and */
|
|||
|
/* on successful return it will contain the actual number of */
|
|||
|
/* code points output. The case_flags array needs room for at */
|
|||
|
/* least output_length values, or it can be a null pointer if the */
|
|||
|
/* case information is not needed. A nonzero flag suggests that */
|
|||
|
/* the corresponding Unicode character be forced to uppercase */
|
|||
|
/* by the caller (if possible), while zero suggests that it be */
|
|||
|
/* forced to lowercase (if possible). ASCII code points are */
|
|||
|
/* output already in the proper case, but their flags will be set */
|
|||
|
/* appropriately so that applying the flags would be harmless. */
|
|||
|
/* The return value can be any of the punycode_status values */
|
|||
|
/* defined above; if not punycode_success, then output_length, */
|
|||
|
/* output, and case_flags might contain garbage. On success, the */
|
|||
|
/* decoder will never need to write an output_length greater than */
|
|||
|
/* input_length, because of how the encoding is defined. */
|
|||
|
|
|||
|
/**********************************************************/
|
|||
|
/* Implementation (would normally go in its own .c file): */
|
|||
|
|
|||
|
#include <string.h>
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 24]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
/*** Bootstring parameters for Punycode ***/
|
|||
|
|
|||
|
enum { base = 36, tmin = 1, tmax = 26, skew = 38, damp = 700,
|
|||
|
initial_bias = 72, initial_n = 0x80, delimiter = 0x2D };
|
|||
|
|
|||
|
/* basic(cp) tests whether cp is a basic code point: */
|
|||
|
#define basic(cp) ((punycode_uint)(cp) < 0x80)
|
|||
|
|
|||
|
/* delim(cp) tests whether cp is a delimiter: */
|
|||
|
#define delim(cp) ((cp) == delimiter)
|
|||
|
|
|||
|
/* decode_digit(cp) returns the numeric value of a basic code */
|
|||
|
/* point (for use in representing integers) in the range 0 to */
|
|||
|
/* base-1, or base if cp is does not represent a value. */
|
|||
|
|
|||
|
static punycode_uint decode_digit(punycode_uint cp)
|
|||
|
{
|
|||
|
return cp - 48 < 10 ? cp - 22 : cp - 65 < 26 ? cp - 65 :
|
|||
|
cp - 97 < 26 ? cp - 97 : base;
|
|||
|
}
|
|||
|
|
|||
|
/* encode_digit(d,flag) returns the basic code point whose value */
|
|||
|
/* (when used for representing integers) is d, which needs to be in */
|
|||
|
/* the range 0 to base-1. The lowercase form is used unless flag is */
|
|||
|
/* nonzero, in which case the uppercase form is used. The behavior */
|
|||
|
/* is undefined if flag is nonzero and digit d has no uppercase form. */
|
|||
|
|
|||
|
static char encode_digit(punycode_uint d, int flag)
|
|||
|
{
|
|||
|
return d + 22 + 75 * (d < 26) - ((flag != 0) << 5);
|
|||
|
/* 0..25 map to ASCII a..z or A..Z */
|
|||
|
/* 26..35 map to ASCII 0..9 */
|
|||
|
}
|
|||
|
|
|||
|
/* flagged(bcp) tests whether a basic code point is flagged */
|
|||
|
/* (uppercase). The behavior is undefined if bcp is not a */
|
|||
|
/* basic code point. */
|
|||
|
|
|||
|
#define flagged(bcp) ((punycode_uint)(bcp) - 65 < 26)
|
|||
|
|
|||
|
/* encode_basic(bcp,flag) forces a basic code point to lowercase */
|
|||
|
/* if flag is zero, uppercase if flag is nonzero, and returns */
|
|||
|
/* the resulting code point. The code point is unchanged if it */
|
|||
|
/* is caseless. The behavior is undefined if bcp is not a basic */
|
|||
|
/* code point. */
|
|||
|
|
|||
|
static char encode_basic(punycode_uint bcp, int flag)
|
|||
|
{
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 25]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
bcp -= (bcp - 97 < 26) << 5;
|
|||
|
return bcp + ((!flag && (bcp - 65 < 26)) << 5);
|
|||
|
}
|
|||
|
|
|||
|
/*** Platform-specific constants ***/
|
|||
|
|
|||
|
/* maxint is the maximum value of a punycode_uint variable: */
|
|||
|
static const punycode_uint maxint = -1;
|
|||
|
/* Because maxint is unsigned, -1 becomes the maximum value. */
|
|||
|
|
|||
|
/*** Bias adaptation function ***/
|
|||
|
|
|||
|
static punycode_uint adapt(
|
|||
|
punycode_uint delta, punycode_uint numpoints, int firsttime )
|
|||
|
{
|
|||
|
punycode_uint k;
|
|||
|
|
|||
|
delta = firsttime ? delta / damp : delta >> 1;
|
|||
|
/* delta >> 1 is a faster way of doing delta / 2 */
|
|||
|
delta += delta / numpoints;
|
|||
|
|
|||
|
for (k = 0; delta > ((base - tmin) * tmax) / 2; k += base) {
|
|||
|
delta /= base - tmin;
|
|||
|
}
|
|||
|
|
|||
|
return k + (base - tmin + 1) * delta / (delta + skew);
|
|||
|
}
|
|||
|
|
|||
|
/*** Main encode function ***/
|
|||
|
|
|||
|
enum punycode_status punycode_encode(
|
|||
|
punycode_uint input_length,
|
|||
|
const punycode_uint input[],
|
|||
|
const unsigned char case_flags[],
|
|||
|
punycode_uint *output_length,
|
|||
|
char output[] )
|
|||
|
{
|
|||
|
punycode_uint n, delta, h, b, out, max_out, bias, j, m, q, k, t;
|
|||
|
|
|||
|
/* Initialize the state: */
|
|||
|
|
|||
|
n = initial_n;
|
|||
|
delta = out = 0;
|
|||
|
max_out = *output_length;
|
|||
|
bias = initial_bias;
|
|||
|
|
|||
|
/* Handle the basic code points: */
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 26]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
for (j = 0; j < input_length; ++j) {
|
|||
|
if (basic(input[j])) {
|
|||
|
if (max_out - out < 2) return punycode_big_output;
|
|||
|
output[out++] =
|
|||
|
case_flags ? encode_basic(input[j], case_flags[j]) : input[j];
|
|||
|
}
|
|||
|
/* else if (input[j] < n) return punycode_bad_input; */
|
|||
|
/* (not needed for Punycode with unsigned code points) */
|
|||
|
}
|
|||
|
|
|||
|
h = b = out;
|
|||
|
|
|||
|
/* h is the number of code points that have been handled, b is the */
|
|||
|
/* number of basic code points, and out is the number of characters */
|
|||
|
/* that have been output. */
|
|||
|
|
|||
|
if (b > 0) output[out++] = delimiter;
|
|||
|
|
|||
|
/* Main encoding loop: */
|
|||
|
|
|||
|
while (h < input_length) {
|
|||
|
/* All non-basic code points < n have been */
|
|||
|
/* handled already. Find the next larger one: */
|
|||
|
|
|||
|
for (m = maxint, j = 0; j < input_length; ++j) {
|
|||
|
/* if (basic(input[j])) continue; */
|
|||
|
/* (not needed for Punycode) */
|
|||
|
if (input[j] >= n && input[j] < m) m = input[j];
|
|||
|
}
|
|||
|
|
|||
|
/* Increase delta enough to advance the decoder's */
|
|||
|
/* <n,i> state to <m,0>, but guard against overflow: */
|
|||
|
|
|||
|
if (m - n > (maxint - delta) / (h + 1)) return punycode_overflow;
|
|||
|
delta += (m - n) * (h + 1);
|
|||
|
n = m;
|
|||
|
|
|||
|
for (j = 0; j < input_length; ++j) {
|
|||
|
/* Punycode does not need to check whether input[j] is basic: */
|
|||
|
if (input[j] < n /* || basic(input[j]) */ ) {
|
|||
|
if (++delta == 0) return punycode_overflow;
|
|||
|
}
|
|||
|
|
|||
|
if (input[j] == n) {
|
|||
|
/* Represent delta as a generalized variable-length integer: */
|
|||
|
|
|||
|
for (q = delta, k = base; ; k += base) {
|
|||
|
if (out >= max_out) return punycode_big_output;
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 27]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
|
|||
|
k >= bias + tmax ? tmax : k - bias;
|
|||
|
if (q < t) break;
|
|||
|
output[out++] = encode_digit(t + (q - t) % (base - t), 0);
|
|||
|
q = (q - t) / (base - t);
|
|||
|
}
|
|||
|
|
|||
|
output[out++] = encode_digit(q, case_flags && case_flags[j]);
|
|||
|
bias = adapt(delta, h + 1, h == b);
|
|||
|
delta = 0;
|
|||
|
++h;
|
|||
|
}
|
|||
|
}
|
|||
|
|
|||
|
++delta, ++n;
|
|||
|
}
|
|||
|
|
|||
|
*output_length = out;
|
|||
|
return punycode_success;
|
|||
|
}
|
|||
|
|
|||
|
/*** Main decode function ***/
|
|||
|
|
|||
|
enum punycode_status punycode_decode(
|
|||
|
punycode_uint input_length,
|
|||
|
const char input[],
|
|||
|
punycode_uint *output_length,
|
|||
|
punycode_uint output[],
|
|||
|
unsigned char case_flags[] )
|
|||
|
{
|
|||
|
punycode_uint n, out, i, max_out, bias,
|
|||
|
b, j, in, oldi, w, k, digit, t;
|
|||
|
|
|||
|
/* Initialize the state: */
|
|||
|
|
|||
|
n = initial_n;
|
|||
|
out = i = 0;
|
|||
|
max_out = *output_length;
|
|||
|
bias = initial_bias;
|
|||
|
|
|||
|
/* Handle the basic code points: Let b be the number of input code */
|
|||
|
/* points before the last delimiter, or 0 if there is none, then */
|
|||
|
/* copy the first b code points to the output. */
|
|||
|
|
|||
|
for (b = j = 0; j < input_length; ++j) if (delim(input[j])) b = j;
|
|||
|
if (b > max_out) return punycode_big_output;
|
|||
|
|
|||
|
for (j = 0; j < b; ++j) {
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 28]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
if (case_flags) case_flags[out] = flagged(input[j]);
|
|||
|
if (!basic(input[j])) return punycode_bad_input;
|
|||
|
output[out++] = input[j];
|
|||
|
}
|
|||
|
|
|||
|
/* Main decoding loop: Start just after the last delimiter if any */
|
|||
|
/* basic code points were copied; start at the beginning otherwise. */
|
|||
|
|
|||
|
for (in = b > 0 ? b + 1 : 0; in < input_length; ++out) {
|
|||
|
|
|||
|
/* in is the index of the next character to be consumed, and */
|
|||
|
/* out is the number of code points in the output array. */
|
|||
|
|
|||
|
/* Decode a generalized variable-length integer into delta, */
|
|||
|
/* which gets added to i. The overflow checking is easier */
|
|||
|
/* if we increase i as we go, then subtract off its starting */
|
|||
|
/* value at the end to obtain delta. */
|
|||
|
|
|||
|
for (oldi = i, w = 1, k = base; ; k += base) {
|
|||
|
if (in >= input_length) return punycode_bad_input;
|
|||
|
digit = decode_digit(input[in++]);
|
|||
|
if (digit >= base) return punycode_bad_input;
|
|||
|
if (digit > (maxint - i) / w) return punycode_overflow;
|
|||
|
i += digit * w;
|
|||
|
t = k <= bias /* + tmin */ ? tmin : /* +tmin not needed */
|
|||
|
k >= bias + tmax ? tmax : k - bias;
|
|||
|
if (digit < t) break;
|
|||
|
if (w > maxint / (base - t)) return punycode_overflow;
|
|||
|
w *= (base - t);
|
|||
|
}
|
|||
|
|
|||
|
bias = adapt(i - oldi, out + 1, oldi == 0);
|
|||
|
|
|||
|
/* i was supposed to wrap around from out+1 to 0, */
|
|||
|
/* incrementing n each time, so we'll fix that now: */
|
|||
|
|
|||
|
if (i / (out + 1) > maxint - n) return punycode_overflow;
|
|||
|
n += i / (out + 1);
|
|||
|
i %= (out + 1);
|
|||
|
|
|||
|
/* Insert n at position i of the output: */
|
|||
|
|
|||
|
/* not needed for Punycode: */
|
|||
|
/* if (decode_digit(n) <= base) return punycode_invalid_input; */
|
|||
|
if (out >= max_out) return punycode_big_output;
|
|||
|
|
|||
|
if (case_flags) {
|
|||
|
memmove(case_flags + i + 1, case_flags + i, out - i);
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 29]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
/* Case of last character determines uppercase flag: */
|
|||
|
case_flags[i] = flagged(input[in - 1]);
|
|||
|
}
|
|||
|
|
|||
|
memmove(output + i + 1, output + i, (out - i) * sizeof *output);
|
|||
|
output[i++] = n;
|
|||
|
}
|
|||
|
|
|||
|
*output_length = out;
|
|||
|
return punycode_success;
|
|||
|
}
|
|||
|
|
|||
|
/******************************************************************/
|
|||
|
/* Wrapper for testing (would normally go in a separate .c file): */
|
|||
|
|
|||
|
#include <assert.h>
|
|||
|
#include <stdio.h>
|
|||
|
#include <stdlib.h>
|
|||
|
#include <string.h>
|
|||
|
|
|||
|
/* For testing, we'll just set some compile-time limits rather than */
|
|||
|
/* use malloc(), and set a compile-time option rather than using a */
|
|||
|
/* command-line option. */
|
|||
|
|
|||
|
enum {
|
|||
|
unicode_max_length = 256,
|
|||
|
ace_max_length = 256
|
|||
|
};
|
|||
|
|
|||
|
static void usage(char **argv)
|
|||
|
{
|
|||
|
fprintf(stderr,
|
|||
|
"\n"
|
|||
|
"%s -e reads code points and writes a Punycode string.\n"
|
|||
|
"%s -d reads a Punycode string and writes code points.\n"
|
|||
|
"\n"
|
|||
|
"Input and output are plain text in the native character set.\n"
|
|||
|
"Code points are in the form u+hex separated by whitespace.\n"
|
|||
|
"Although the specification allows Punycode strings to contain\n"
|
|||
|
"any characters from the ASCII repertoire, this test code\n"
|
|||
|
"supports only the printable characters, and needs the Punycode\n"
|
|||
|
"string to be followed by a newline.\n"
|
|||
|
"The case of the u in u+hex is the force-to-uppercase flag.\n"
|
|||
|
, argv[0], argv[0]);
|
|||
|
exit(EXIT_FAILURE);
|
|||
|
}
|
|||
|
|
|||
|
static void fail(const char *msg)
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 30]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
{
|
|||
|
fputs(msg,stderr);
|
|||
|
exit(EXIT_FAILURE);
|
|||
|
}
|
|||
|
|
|||
|
static const char too_big[] =
|
|||
|
"input or output is too large, recompile with larger limits\n";
|
|||
|
static const char invalid_input[] = "invalid input\n";
|
|||
|
static const char overflow[] = "arithmetic overflow\n";
|
|||
|
static const char io_error[] = "I/O error\n";
|
|||
|
|
|||
|
/* The following string is used to convert printable */
|
|||
|
/* characters between ASCII and the native charset: */
|
|||
|
|
|||
|
static const char print_ascii[] =
|
|||
|
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
|
|||
|
"\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n"
|
|||
|
" !\"#$%&'()*+,-./"
|
|||
|
"0123456789:;<=>?"
|
|||
|
"@ABCDEFGHIJKLMNO"
|
|||
|
"PQRSTUVWXYZ[\\]^_"
|
|||
|
"`abcdefghijklmno"
|
|||
|
"pqrstuvwxyz{|}~\n";
|
|||
|
|
|||
|
int main(int argc, char **argv)
|
|||
|
{
|
|||
|
enum punycode_status status;
|
|||
|
int r;
|
|||
|
unsigned int input_length, output_length, j;
|
|||
|
unsigned char case_flags[unicode_max_length];
|
|||
|
|
|||
|
if (argc != 2) usage(argv);
|
|||
|
if (argv[1][0] != '-') usage(argv);
|
|||
|
if (argv[1][2] != 0) usage(argv);
|
|||
|
|
|||
|
if (argv[1][1] == 'e') {
|
|||
|
punycode_uint input[unicode_max_length];
|
|||
|
unsigned long codept;
|
|||
|
char output[ace_max_length+1], uplus[3];
|
|||
|
int c;
|
|||
|
|
|||
|
/* Read the input code points: */
|
|||
|
|
|||
|
input_length = 0;
|
|||
|
|
|||
|
for (;;) {
|
|||
|
r = scanf("%2s%lx", uplus, &codept);
|
|||
|
if (ferror(stdin)) fail(io_error);
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 31]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
if (r == EOF || r == 0) break;
|
|||
|
|
|||
|
if (r != 2 || uplus[1] != '+' || codept > (punycode_uint)-1) {
|
|||
|
fail(invalid_input);
|
|||
|
}
|
|||
|
|
|||
|
if (input_length == unicode_max_length) fail(too_big);
|
|||
|
|
|||
|
if (uplus[0] == 'u') case_flags[input_length] = 0;
|
|||
|
else if (uplus[0] == 'U') case_flags[input_length] = 1;
|
|||
|
else fail(invalid_input);
|
|||
|
|
|||
|
input[input_length++] = codept;
|
|||
|
}
|
|||
|
|
|||
|
/* Encode: */
|
|||
|
|
|||
|
output_length = ace_max_length;
|
|||
|
status = punycode_encode(input_length, input, case_flags,
|
|||
|
&output_length, output);
|
|||
|
if (status == punycode_bad_input) fail(invalid_input);
|
|||
|
if (status == punycode_big_output) fail(too_big);
|
|||
|
if (status == punycode_overflow) fail(overflow);
|
|||
|
assert(status == punycode_success);
|
|||
|
|
|||
|
/* Convert to native charset and output: */
|
|||
|
|
|||
|
for (j = 0; j < output_length; ++j) {
|
|||
|
c = output[j];
|
|||
|
assert(c >= 0 && c <= 127);
|
|||
|
if (print_ascii[c] == 0) fail(invalid_input);
|
|||
|
output[j] = print_ascii[c];
|
|||
|
}
|
|||
|
|
|||
|
output[j] = 0;
|
|||
|
r = puts(output);
|
|||
|
if (r == EOF) fail(io_error);
|
|||
|
return EXIT_SUCCESS;
|
|||
|
}
|
|||
|
|
|||
|
if (argv[1][1] == 'd') {
|
|||
|
char input[ace_max_length+2], *p, *pp;
|
|||
|
punycode_uint output[unicode_max_length];
|
|||
|
|
|||
|
/* Read the Punycode input string and convert to ASCII: */
|
|||
|
|
|||
|
fgets(input, ace_max_length+2, stdin);
|
|||
|
if (ferror(stdin)) fail(io_error);
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 32]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
if (feof(stdin)) fail(invalid_input);
|
|||
|
input_length = strlen(input) - 1;
|
|||
|
if (input[input_length] != '\n') fail(too_big);
|
|||
|
input[input_length] = 0;
|
|||
|
|
|||
|
for (p = input; *p != 0; ++p) {
|
|||
|
pp = strchr(print_ascii, *p);
|
|||
|
if (pp == 0) fail(invalid_input);
|
|||
|
*p = pp - print_ascii;
|
|||
|
}
|
|||
|
|
|||
|
/* Decode: */
|
|||
|
|
|||
|
output_length = unicode_max_length;
|
|||
|
status = punycode_decode(input_length, input, &output_length,
|
|||
|
output, case_flags);
|
|||
|
if (status == punycode_bad_input) fail(invalid_input);
|
|||
|
if (status == punycode_big_output) fail(too_big);
|
|||
|
if (status == punycode_overflow) fail(overflow);
|
|||
|
assert(status == punycode_success);
|
|||
|
|
|||
|
/* Output the result: */
|
|||
|
|
|||
|
for (j = 0; j < output_length; ++j) {
|
|||
|
r = printf("%s+%04lX\n",
|
|||
|
case_flags[j] ? "U" : "u",
|
|||
|
(unsigned long) output[j] );
|
|||
|
if (r < 0) fail(io_error);
|
|||
|
}
|
|||
|
|
|||
|
return EXIT_SUCCESS;
|
|||
|
}
|
|||
|
|
|||
|
usage(argv);
|
|||
|
return EXIT_SUCCESS; /* not reached, but quiets compiler warning */
|
|||
|
}
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 33]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Author's Address
|
|||
|
|
|||
|
Adam M. Costello
|
|||
|
University of California, Berkeley
|
|||
|
http://www.nicemice.net/amc/
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 34]
|
|||
|
|
|||
|
RFC 3492 IDNA Punycode March 2003
|
|||
|
|
|||
|
|
|||
|
Full Copyright Statement
|
|||
|
|
|||
|
Copyright (C) The Internet Society (2003). All Rights Reserved.
|
|||
|
|
|||
|
This document and translations of it may be copied and furnished to
|
|||
|
others, and derivative works that comment on or otherwise explain it
|
|||
|
or assist in its implementation may be prepared, copied, published
|
|||
|
and distributed, in whole or in part, without restriction of any
|
|||
|
kind, provided that the above copyright notice and this paragraph are
|
|||
|
included on all such copies and derivative works. However, this
|
|||
|
document itself may not be modified in any way, such as by removing
|
|||
|
the copyright notice or references to the Internet Society or other
|
|||
|
Internet organizations, except as needed for the purpose of
|
|||
|
developing Internet standards in which case the procedures for
|
|||
|
copyrights defined in the Internet Standards process must be
|
|||
|
followed, or as required to translate it into languages other than
|
|||
|
English.
|
|||
|
|
|||
|
The limited permissions granted above are perpetual and will not be
|
|||
|
revoked by the Internet Society or its successors or assigns.
|
|||
|
|
|||
|
This document and the information contained herein is provided on an
|
|||
|
"AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
|
|||
|
TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
|
|||
|
BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
|
|||
|
HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
|
|||
|
MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
|
|||
|
|
|||
|
Acknowledgement
|
|||
|
|
|||
|
Funding for the RFC Editor function is currently provided by the
|
|||
|
Internet Society.
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
|
|||
|
Costello Standards Track [Page 35]
|
|||
|
|