ISO/IEC 10646 Universal Coded Character Set (UCS)
ISO/IEC International
Standard
ISO/IEC 10646
Final Committee Draft
Information technology –
Universal Coded
Character Set (UCS)
Technologie de l’information – Jeu
universel de caractères codés (JUC)
© ISO/IEC 2010
All rights reserved. Unless otherwise specified, no part of this publication may be reproduced or utilized in any form or by any
means, electronic or mechanical, including photocopying and microfilm, without permission in writing from either ISO at the
address below or ISO's member body in the country of the requester.
Printed in Switzerland
CONTENTS
Foreword
Introduction
1 Scope
2 Conformance
2.1 General
2.2 Conformance of information interchange
2.3 Conformance of devices
3 Normative references
4 Terms and definitions
5 General structure of the UCS
6 Basic structure and nomenclature
6.1 Structure
6.2 Coding of characters
6.3 Types of code points
6.4 Naming of characters
6.5 Short identifiers for code points (UIDs)
6.6 UCS Sequence Identifiers
6.7 Octet sequence identifiers
7 Revision and updating of the UCS
8 Subsets
8.1 Limited subset
8.2 Selected subset
9 UCS encoding forms
9.1 UTF-8
9.2 UTF-16
9.3 UTF-32 (UCS-4)
10 UCS Encoding schemes
10.1 UTF-8
10.2 UTF-16BE
10.3 UTF-16LE
10.4 UTF-16
10.5 UTF-32BE
10.6 UTF-32LE
10.7 UTF-32
11 Use of control functions with the UCS
12 Declaration of identification of features
12.1 Purpose and context of identification
12.2 Identification of a UCS encoding form
12.3 Identification of subsets of graphic characters
12.4 Identification of control function set
12.5 Identification of the coding system of ISO/IEC 2022
13 Structure of the code charts and lists
27 Structure of the Supplementary Multilingual Plane for scripts and symbols (SMP)
28 Structure of the Supplementary Ideographic Plane (SIP)
29 Structure of the Supplementary Special-purpose Plane (SSP)
30 Code charts and lists of character names
30.1 Code chart
30.2 Character names list
30.3 Pointers to code charts and lists of character names
Annex A (normative) Collections of graphic characters for subsets
A.1 Collections of coded graphic characters
A.2 Blocks lists
A.3 Fixed collections of the whole UCS (except Unicode collections)
A.4 CJK collections
A.5 Other collections
A.6 Unicode collections
Annex B (normative) List of combining characters
Annex C (normative) Transformation format for planes 1 to 10 of the UCS (UTF-16)
Annex D (normative) UCS Transformation Format 8 (UTF-8)
Annex E (normative) Mirrored characters in bidirectional context
Annex F (informative) Format characters
F.1 General format characters
F.2 Script-specific format characters
F.3 Interlinear annotation characters
F.4 Subtending format characters
F.5 Contiguity operators
F.6 Western musical symbols
F.7 Language tagging using Tag characters
Annex G (informative) Alphabetically sorted list of character names
Annex H (informative) The use of “signatures” to identify UCS
Annex I (informative) Ideographic description characters
Annex J (informative) Recommendation for combined receiving/originating devices with internal storage
Annex K (informative) Notations of octet value representations
Annex L (informative) Character naming guidelines
Annex M (informative) Sources of characters
Annex N (informative) External references to character repertoires
N.1 Methods of reference to character repertoires and their coding
N.2 Identification of ASN.1 character abstract syntaxes
N.3 Identification of ASN.1 character transfer syntaxes
Annex P (informative) Additional information on CJK Unified Ideographs
Annex Q (informative) Code mapping table for Hangul syllables
Annex R (informative) Names of Hangul syllables
Annex S (informative) Procedure for the unification and arrangement of CJK Ideographs
S.1 Unification procedure
Foreword
ISO (the International Organization for Standardization) and IEC (the International Electrotechnical
Commission) form the specialized system for worldwide standardization. National bodies that are members of
ISO or IEC participate in the development of International Standards through technical committees
established by the respective organization to deal with particular fields of technical activity. ISO and IEC
technical committees collaborate in fields of mutual interest. Other international organizations, governmental
and non-governmental, in liaison with ISO and IEC, also take part in the work. In the field of information
technology, ISO and IEC have established a joint technical committee, ISO/IEC JTC1.
International Standards are drafted in accordance with the rules given in the ISO/IEC Directives, Part 2.
The main task of the joint technical committee is to prepare International Standards. Draft International
Standards adopted by the joint technical committee are circulated to national bodies for voting. Publication
as an International Standard requires approval by at least 75% of the national bodies casting a vote.
Attention is drawn to the possibility that some of the elements of ISO/IEC 10646 may be the subject of
patent rights. ISO and IEC shall not be held responsible for identifying any or all such patent rights.
International Standard ISO/IEC 10646 was prepared by Joint Technical Committee ISO/IEC JTC1,
Information technology, Subcommittee SC 2, Coded Character sets.
This second edition of ISO/IEC 10646 cancels and replaces ISO/IEC 10646:2003. It also incorporates
ISO/IEC 10646:2003 Amd.1:2005, Amd.2:2006, Amd.3:2007, Amd.4:2008, Amd.5:2009, Amd.6:2010,
Amd.7:2011, and Amd.8:2011.
NOTE – Amendments 7 and 8 are still in progress. The text in this document is synchronized with their contents and will be
updated accordingly.
Introduction
ISO/IEC 10646 specifies the Universal Coded Character Set (UCS). It is applicable to the representation,
transmission, interchange, processing, storage, input and presentation of the written form of the
languages of the world as well as additional symbols.
By defining a consistent way of encoding multilingual text it enables the exchange of data internationally.
The information technology industry gains data stability, greater global interoperability and data
interchange. ISO/IEC 10646 has been widely adopted in new Internet protocols and implemented in modern
operating systems and computer languages. This edition covers over 109 000 characters from the world’s
scripts.
ISO/IEC 10646 contains material which may only be available to users who obtain their copy in a machine
readable format. That material consists of the following printable files:
• EmojiSrc.txt
• CJKU_SR.txt
• CJKC_SR.txt
• NUSI.txt
• IICORE.txt
• JIEx.txt
• Allnames.txt
• HangulSy.txt.
1 Scope
ISO/IEC 10646 specifies the Universal Coded Character Set (UCS). It is applicable to the representation,
transmission, interchange, processing, storage, input, and presentation of the written form of the
languages of the world as well as of additional symbols.
This document
A graphic character will be assigned only one code point in the standard, located either in the BMP or in
one of the supplementary planes.
NOTE – The Unicode Standard, Version 6.0 includes a set of characters, names, and coded representations that are identical
with those in this International Standard. It additionally provides details of character properties, processing algorithms, and
definitions that are useful to implementers.
2 Conformance
2.1 General
Whenever private use characters are used as specified in ISO/IEC 10646, the characters themselves
shall not be covered by these conformance requirements.
a) all the coded representations of graphic characters within that CC-data-element conform to clause 6,
to an identified encoding form chosen from clause 9, and to an identified encoding scheme chosen
from clause 10;
b) all the graphic characters represented within that CC-data-element are taken from those within an
identified subset (see 8);
c) all the coded representations of control functions within that CC-data-element conform to clause 11.
A claim of conformance shall identify the adopted encoding form, the adopted encoding scheme, and the
adopted subset by means of a list of collections and/or characters.
A claim of conformance shall identify the document that contains the description specified in a) below, and
shall identify the adopted encoding form(s), the adopted encoding scheme(s), and the adopted subset (by
means of a list of collections and/or characters), and the selection of control functions adopted in
accordance with clause 11.
a) Device description: A device that conforms to ISO/IEC 10646 shall be the subject of a description
that identifies the means by which the user may supply characters to the device and/or may recognize
them when they are made available to the user, as specified respectively, in subclauses b) and c)
below.
b) Originating device: An originating device shall allow its user to supply any characters from an
adopted subset, and be capable of transmitting their coded representations within a CC-data-element
in accordance with the adopted encoding form and adopted encoding scheme. As such, the
originating device shall not emit ill-formed CC-data-elements.
c) Receiving device: A receiving device shall be capable of receiving and interpreting any coded
representation of characters that are within a CC-data-element in accordance with the adopted encoding
form and the adopted encoding scheme, and shall make any corresponding characters from the
adopted subset available to the user in such a way that the user can identify them. The receiving
device shall treat ill-formed CC-data-elements as an error condition and shall not interpret such data as
character sequences.
Any corresponding characters that are not within the adopted subset shall be indicated to the user. The
way used for indicating them need not distinguish them from each other.
NOTE 1 – The manner in which a user is notified of either an error condition or characters not within the adopted subset is not
specified by this standard.
NOTE 2 – See also Annex J for receiving devices with retransmission capability.
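The receiving-device requirement above (ill-formed CC-data-elements are an error condition and are not interpreted as character sequences) can be illustrated with a minimal sketch. It assumes Python's built-in strict decoders as a stand-in for a conforming receiver; the function name `receive` and the choice of `ValueError` are hypothetical and not part of the standard.

```python
def receive(octets: bytes, scheme: str = "utf-8") -> str:
    """Decode octets under the adopted encoding scheme (hypothetical helper).

    Ill-formed input is reported as an error instead of being
    interpreted as characters.
    """
    try:
        # errors="strict" makes the decoder reject ill-formed sequences.
        return octets.decode(scheme, errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError(f"ill-formed CC-data-element: {exc.reason}") from exc
```

How the error is reported to the user is left open here, in line with NOTE 1.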
3 Normative references
The following normative documents contain provisions which, through reference in this text, constitute
provisions of ISO/IEC 10646. For dated references, subsequent amendments to, or revisions of, any of
these publications do not apply. However, parties to agreements based on ISO/IEC 10646 are
encouraged to investigate the possibility of applying the most recent editions of the normative documents
indicated below. For undated references, the latest edition of the normative document referred to applies.
Members of ISO and IEC maintain registers of currently valid International Standards.
ISO/IEC 2022:1994 Information technology — Character code structure and extension techniques.
ISO/IEC 6429:1992 Information technology — Control functions for coded character sets.
Unicode Standard Annex, UAX#9, The Unicode Bidirectional Algorithm:
http://www.unicode.org/reports/tr9/tr9-21.html.
Unicode Standard Annex, UAX#15, Unicode Normalization Forms:
http://www.unicode.org/reports/tr15/tr15-32.html.
Editor’s Note: The versions for the Unicode Standard Annexes mentioned above will be updated as
appropriate in future phases of this standard process.
4.1
Base character
A graphic character which is not a combining character
NOTE 1 – Most graphic characters are base characters. This sense of graphic combination does not preclude the presentation
of base characters from adopting different contextual forms or from participating in ligatures.
NOTE 2 – A base character typically does not graphically combine with preceding characters. There may be exceptions for
some complex writing systems.
4.2
Basic Multilingual Plane
BMP
Plane 00 of the UCS codespace
4.3
Block
A contiguous range of code points to which a set of characters that share common characteristics, such
as a script, are allocated; a block does not overlap another block; one or more of the code points within a
block may have no character allocated to them
4.4
Canonical form
The form in which characters of this coded character set are specified using a single code point within the
UCS codespace
NOTE – The canonical form should not be confused with an encoding form which describes the relationship between UCS
code points and one or several code units (see 4.23).
4.5
CC-data-element
coded-character-data-element
An element of interchanged information that is specified to consist of a sequence of code units, in
accordance with one or more identified standards for coded character sets; such sequence may contain code
units associated with any type of code points
NOTE – Unlike previous editions of the standard, this version no longer uses implementation levels. Its definition of
CC-data-element content corresponds to the former unrestricted implementation level 3. Other definitions of CC-data-element
content, previously known as levels 1 and 2, are deprecated. To maintain compatibility with these previous editions, in the
context of identification of coded representation in standards such as ISO/IEC 8824 and ISO/IEC 8825, the concept of
implementation level may still be referenced as ‘Implementation level 3’. See Annex N.
4.6
Character
A member of a set of elements used for the organization, control, or representation of textual data; a
character may be represented by a sequence of one or several coded characters
4.7
Character boundary
Within a CC-data-element the demarcation between the last code unit of a coded character and the first
code unit of the next coded character
4.8
Code chart
Code table
A rectangular array showing the representation of coded characters allocated within a range of the UCS
codespace
4.9
Coded character
An association between a character and a code point
4.10
Coded character set
A set of coded characters
4.11
Code point
Code position
Any value in the UCS codespace; the term code point is preferred
4.12
Code unit
The minimal bit combination that can represent a unit of encoded text for processing or interchange
NOTE – Examples of code units are octets (8-bit code units) used in the UTF-8 encoding form, 16-bit code units in the UTF-16
encoding form, and 32-bit code units in the UTF-32 encoding form.
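As an illustration of the note above, the following sketch lists the code units that one character occupies in each encoding form. It relies on Python's standard codecs; the helper name `code_units` is an assumption made for this example only.

```python
def code_units(ch: str) -> dict:
    """Return the code units of one character in each UCS encoding form."""
    # UTF-8: each octet is one 8-bit code unit.
    utf8 = list(ch.encode("utf-8"))
    # UTF-16: group the serialized octets into 16-bit code units.
    raw16 = ch.encode("utf-16-be")
    utf16 = [int.from_bytes(raw16[i:i + 2], "big") for i in range(0, len(raw16), 2)]
    # UTF-32: group the serialized octets into 32-bit code units.
    raw32 = ch.encode("utf-32-be")
    utf32 = [int.from_bytes(raw32[i:i + 4], "big") for i in range(0, len(raw32), 4)]
    return {"UTF-8": utf8, "UTF-16": utf16, "UTF-32": utf32}
```

For the code point 10000, for example, this yields four 8-bit code units, two 16-bit code units, or one 32-bit code unit.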
4.13
Collection
A numbered and named set of entities; for a non extended collection, these entities consist only of those
coded characters whose code points lie within one or more identified ranges (see also 4.25 for extended
collection)
NOTE – If any of the identified ranges include code points to which no character is allocated, the repertoire of the collection will
change if an additional character is assigned to any of those code points at a future amendment of this International Standard.
However, it is intended that the collection number and name will remain unchanged in future editions of this International
Standard.
4.14
Combining character
Characters which have General Category values of Spacing Combining Mark (Mc), Non Spacing Mark
(Mn), and Enclosing Mark (Me)
NOTE – These characters are intended for combination with the preceding non-combining graphic character, or with a
sequence of combining characters preceded by a non-combining character (see also 4.14 and 4.17).
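The General Category test in this definition can be sketched as follows. The sketch assumes Python's `unicodedata` module, whose data follows the Unicode version bundled with the interpreter and may therefore differ from the version this standard is synchronized with; `is_combining` is a hypothetical name.

```python
import unicodedata

# General Category values that make a character a combining character.
COMBINING_CATEGORIES = {"Mc", "Mn", "Me"}

def is_combining(ch: str) -> bool:
    """True if the character's General Category is Mc, Mn, or Me."""
    return unicodedata.category(ch) in COMBINING_CATEGORIES
```

Under this definition a base character (4.1) is a graphic character for which the function returns False.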
4.15
Combining class
Value associated with each combining character determining its typographical interaction and its
canonical ordering within a sequence of combining characters
4.16
Compatibility character
A graphic character included as a coded character of ISO/IEC 10646 primarily for compatibility with
existing coded character sets
4.17
Composite sequence
A sequence of graphic characters consisting of a base character followed by one or more combining
characters, ZERO WIDTH JOINER, or ZERO WIDTH NON-JOINER (see also 4.14)
NOTE 1 – A graphic symbol for a composite sequence generally consists of the combination of the graphic symbols of each
character in the sequence.
NOTE 2 – A composite sequence may be used to represent characters not encoded in the repertoire of ISO/IEC 10646.
4.18
Control character
A control function the coded representation of which consists of a single code point
NOTE – Although control characters are often ‘named’ using terms such as DELETE, FORM FEED, ESC, these qualifiers do
not correspond to formal character names. See 11 for a list of the long names used by ISO/IEC 6429 in association with the
control characters.
4.19
Control function
An action that affects the recording, processing, transmission, or interpretation of data, and that is
represented by a CC-data-element
4.20
Decomposition mapping
A mapping from a character to a sequence of one or more characters that is a canonical or compatibility
equivalent
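A decomposition mapping can be inspected with the sketch below. It assumes the string format returned by Python's `unicodedata.decomposition` (hexadecimal code points, preceded by a tag in angle brackets for compatibility mappings); the function name and the returned tuple shape are illustrative choices, and the underlying data is that of the interpreter's Unicode version.

```python
import unicodedata

def decomposition_mapping(ch: str):
    """Return (kind, code points) for a character's decomposition mapping.

    kind is "canonical", "compatibility", or None when no mapping exists.
    """
    raw = unicodedata.decomposition(ch)
    if not raw:
        return None, []
    parts = raw.split()
    kind = "canonical"
    if parts[0].startswith("<"):
        # A leading tag such as <compat> marks a compatibility mapping.
        kind = "compatibility"
        parts = parts[1:]
    return kind, [int(p, 16) for p in parts]
```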
4.21
Default state
The state that is assumed when no state has been explicitly specified (see F.2.1 and F.2.3)
4.22
Device
A component of information processing equipment which can transmit and/or receive coded information
within CC-data-elements (It may be an input/output device in the conventional sense, or a process such
as an application program or gateway function.)
4.23
Encoding form
An encoding form determines how each UCS code point for a UCS character is to be expressed as one or
more code units used by the encoding form. ISO/IEC 10646 specifies UTF-8, UTF-16, and UTF-32
4.24
Encoding scheme
An encoding scheme specifies the serialization of the code units from the encoding form into octets
NOTE – Some of the UCS encoding schemes have the same labels as the UCS encoding forms. However, they are used in
different contexts. UCS encoding forms refer to in-memory and application interface representation of textual data. UCS encoding
schemes refer to octet-serialized textual data.
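The distinction between encoding form and encoding scheme can be made concrete with a short sketch: the same sequence of 16-bit code units (encoding form) is serialized into octets in two different orders (encoding schemes). The function `serialize_utf16` is a hypothetical helper, and the big-endian and little-endian behaviour shown is assumed from the names UTF-16BE and UTF-16LE in clause 10.

```python
import struct

def serialize_utf16(units, big_endian=True):
    """Serialize 16-bit code units into octets (UTF-16BE or UTF-16LE)."""
    # ">" selects big-endian and "<" little-endian; "H" is an unsigned 16-bit value.
    fmt = (">" if big_endian else "<") + "H" * len(units)
    return struct.pack(fmt, *units)
```

For the code units 0041 D800 DC00, the big-endian scheme produces the octets 00 41 D8 00 DC 00 and the little-endian scheme produces 41 00 00 D8 00 DC.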
4.25
Extended collection
A collection for which the entities can also consist of sequences of code points that are in normalization
form NFC (see 21); the sequences of code points are referenced by Named UCS Sequence Identifiers
(NUSI) (see 25).
NOTE – Some collections such as 3 LATIN EXTENDED-A, 4 LATIN EXTENDED-B, 15 ARABIC EXTENDED, and many more,
have the term ‘extended’ in their name. This does not make them extended collections.
4.26
Fixed collection
A collection in which every code point within the identified range(s) has a character allocated to it, and
which is intended to remain unchanged in future editions of this International Standard
4.27
Format character
A character whose primary function is to affect the layout or processing of characters around it; it
generally does not have a visible representation of its own
4.28
General Category
GC
Value assigned to each UCS code point which determines its major class, such as letter, punctuation, and
symbol; each value is defined as General Category property using a two-letter abbreviation in the Unicode
Standard (see reference to the Unicode Standard General Category in 3)
NOTE – When referred to as a group containing all GC values sharing the same first letter, the group may be described using the
first letter only. For example, ‘L’ stands for all letters ‘Lu’, ‘Ll’, ‘Lt’, ‘Lm’, and ‘Lo’.
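The first-letter grouping described in the note can be sketched as a simple prefix test. This assumes Python's `unicodedata.category`, which returns the two-letter General Category of a character according to the interpreter's bundled Unicode data; `in_gc_group` is a hypothetical name.

```python
import unicodedata

def in_gc_group(ch: str, group: str) -> bool:
    """Test a character against a GC value ("Lu") or a first-letter group ("L")."""
    gc = unicodedata.category(ch)
    if len(group) == 2:
        return gc == group
    # A single letter stands for every GC value that starts with it.
    return gc.startswith(group)
```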
4.29
Graphic character
A character, other than a control function or a format character, that has a visual representation normally
handwritten, printed, or displayed
4.30
Graphic symbol
The visual representation of a graphic character or of a composite sequence
4.31
High-surrogate code point
A code point in the range D800 to DBFF reserved for the use of UTF-16
4.32
High-surrogate code unit
A 16-bit code unit in the range D800 to DBFF used in UTF-16 as the leading code unit of a surrogate pair
(see 9.2)
4.33
ill-formed CC-data-element
A UCS CC-data-element that purports to be in a UCS encoding form which does not conform to the
specification of that encoding form (for example, an unpaired surrogate code unit is an ill-formed
CC-data-element)
4.34
ill-formed CC-data-element subset
A non-empty subset of a CC-data-element X which does not contain any code unit which also belongs to
any minimal well-formed CC-data-element subset of X
NOTE – An ill-formed CC-data-element subset cannot overlap with a minimal well-formed CC-data-element.
4.35
Interchange
The transfer of character coded data from one user to another, using telecommunication means or
interchangeable media; interchange implies data serialization and the usage of a UCS encoding scheme
4.36
Interworking
The process of permitting two or more systems, each employing different coded character sets, to
meaningfully interchange character coded data; conversion between the two codes may be involved
4.37
ISO/IEC 10646-1
A former subdivision of the standard. It is also referred to as Part 1 of ISO/IEC 10646 and contained the
specification of the overall architecture and the Basic Multilingual Plane (BMP). There are a First and a
Second Edition of ISO/IEC 10646-1.
4.38
ISO/IEC 10646-2
A former subdivision of the standard. It is also referred to as Part 2 of ISO/IEC 10646 and contained the
specification of the Supplementary Multilingual Plane (SMP), the Supplementary Ideographic Plane (SIP)
and the Supplementary Special-purpose Plane (SSP). There is only a First Edition of ISO/IEC 10646-2.
4.39
Low-surrogate code point
A code point in the range DC00 to DFFF reserved for the use of UTF-16
4.40
Low-surrogate code unit
A 16-bit code unit in the range DC00 to DFFF used in UTF-16 as the trailing code unit of a surrogate pair
(see 9.2)
4.41
Minimal well-formed CC-data-element
A well-formed CC-data-element that maps to a single UCS scalar value
4.42
Mirrored character
A character whose image is mirrored horizontally in text that is laid out from right to left
4.43
Octet
An 8-bit code unit; the value is expressed in hexadecimal notation from 00 to FF in ISO/IEC 10646 (see
Annex K)
4.44
Plane
A subdivision of the UCS codespace consisting of 65536 code points. The UCS codespace contains 17
planes
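Because each plane holds 65536 (hexadecimal 10000) code points, the plane of a code point follows from integer division. The sketch below assumes the codespace bounds given in 4.57 (0 to 10FFFF); the function name is illustrative.

```python
def plane_of(code_point: int) -> int:
    """Return the plane number (0 to 16) that contains a UCS code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("value is outside the UCS codespace")
    # Each plane spans 0x10000 code points, so the plane is the value above bit 16.
    return code_point >> 16
```

Plane 00 is the BMP, and the last plane, 10 in hexadecimal, is plane number 16 in decimal.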
4.45
Presentation
to present
The process of writing, printing, or displaying a graphic symbol
4.46
Presentation form
In the presentation of some scripts, a form of a graphic symbol representing a character that depends on
the position of the character relative to other characters
4.47
Private use plane
A plane within this coded character set, the contents of which are not specified in ISO/IEC 10646. Planes
0F and 10 are private use planes
4.48
Repertoire
A specified set of characters that are represented in a coded character set
4.49
Row
A subdivision of a plane; a multiple of 256 code points
4.50
Script
A set of graphic characters used for the written form of one or more languages
4.51
Supplementary plane
A plane other than Plane 00 of the UCS codespace; a plane that accommodates characters which have
not been allocated to the Basic Multilingual Plane
4.52
Supplementary Multilingual Plane for scripts and symbols
SMP
Plane 01 of the UCS codespace
4.53
Supplementary Ideographic Plane
SIP
Plane 02 of the UCS codespace
4.54
Supplementary Special-purpose Plane
SSP
Plane 0E of the UCS codespace
4.55
Surrogate pair
A representation for a single character that consists of a sequence of two 16-bit code units, where the first
value of the pair is a high-surrogate code unit and the second value is a low-surrogate code unit
4.56
Tertiary Ideographic Plane
TIP
Plane 03 of the UCS codespace
4.57
UCS codespace
The UCS codespace consists of the integers from 0 to 10FFFF (hexadecimal) available for assigning the
repertoire of the UCS characters
4.58
UCS scalar value
Any UCS code point except high-surrogate and low-surrogate code points
4.59
Unpaired surrogate code unit
A surrogate code unit in a CC-data-element that is either a high-surrogate code unit that is not
immediately followed by a low-surrogate code unit, or a low-surrogate code unit that is not immediately
preceded by a high-surrogate code unit
4.60
User
A person or other entity that invokes the service provided by a device (This entity may be a process such
as an application program if the “device” is a code converter or a gateway function, for example.)
4.61
Well-formed CC-data-element
A UCS CC-data-element that purports to be in a UCS encoding form which conforms to the specification
of that encoding form and contains no ill-formed CC-data-element subset
The canonical form of this coded character set – the way in which it is to be conceived – uses the UCS
codespace which consists of the integers from 0 to 10FFFF.
Subsets of the coding space may be used in order to give a sub-repertoire of graphic characters.
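The relationship between a code point, its plane, and its row can be sketched in a few lines of Python. This is a minimal illustration, not part of the standard: the function names are hypothetical, and the surrogate range D800 to DFFF is taken from the definitions of surrogate code points.

```python
# Illustrative helpers (hypothetical names) for the UCS codespace 0..10FFFF.
UCS_MAX = 0x10FFFF

def is_code_point(value: int) -> bool:
    """True if value lies in the UCS codespace."""
    return 0 <= value <= UCS_MAX

def is_scalar_value(value: int) -> bool:
    """True for any code point except surrogate code points (D800..DFFF)."""
    return is_code_point(value) and not (0xD800 <= value <= 0xDFFF)

def plane_of(code_point: int) -> int:
    """Each plane holds 65536 code points, so the plane is the value above bit 15."""
    return code_point >> 16

def row_of(code_point: int) -> int:
    """Each row holds 256 code points within its plane."""
    return (code_point >> 8) & 0xFF

def is_private_use_plane(code_point: int) -> bool:
    """Planes 0F and 10 are the private use planes."""
    return plane_of(code_point) in (0x0F, 0x10)
```

For instance, code point 01F600 falls in plane 01 (the SMP), and FB01 falls in row FB of plane 00.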
[Figure: planes of the UCS codespace. Recoverable labels: SMP (plane 1), SIP (plane 2), TIP (plane 3), reserved planes 04 to 0D; BMP range markers D7FF and F900...FFFF]
When a single character is to be identified in terms of its code point, it is represented by a six-digit form of
the integer such as
When referring to characters within plane 00, the leading two digits may be omitted; for characters within
planes 01 to 0F, the leading digit may be omitted, such as
Basic Type     Brief Description                                          General Category    Character status            Code point status
Graphic        Letter, mark, number, punctuation, symbols, and spaces     L, M, N, P, S, Zs   Assigned to character       Assigned code point
Format         Invisible, but affects neighbouring characters             Cf, Zl, Zp          Assigned to character       Assigned code point
Control        Control functions consisting of a single code point        Cc                  Assigned to character       Assigned code point
Private use    Usage defined by private agreement outside this standard   Co                  Assigned to character       Assigned code point
Surrogate      Permanently reserved for UTF-16                            Cs                  Not assigned to character   Assigned code point
Noncharacter   Permanently reserved for internal usage                    Cn                  Not assigned to character   Assigned code point
Reserved       Reserved for future assignment                             Cn                  Not assigned to character   Unassigned code point
Surrogate, noncharacter, and reserved code points are not assigned to characters and are subject to re-
striction in interchange. For example, surrogate code points do not have well-formed representations in
any UCS encoding form.
Private use characters are not constrained in any way by ISO/IEC 10646. Private use characters can be
used to provide user-defined characters. For example, this is a common requirement for users of ideo-
graphic scripts.
NOTE – For meaningful interchange of private use characters, an agreement, independent of ISO/IEC 10646, is necessary be-
tween sender and recipient.
The list of character names, except for CJK unified ideographs and Hangul syllables, is provided in 30.
NOTE – The list of character names is also part of the Unicode Character Database in:
http://www.unicode.org/Public/UNIDATA/NamesList.txt with the syntax described in:
http://www.unicode.org/Public/UNIDATA/NamesList.html.
The following alternative forms of notation of a short identifier are defined here.
a) The six-digit form of short identifier consists of the sequence of six hexadecimal digits that represents
the code point of the character (see 6.2).
b) The four-to-five-digit form of short identifier shall consist of the last four to five digits of the six-digit
form. Leading zeroes beyond four digits are suppressed.
c) The character “+” (PLUS SIGN) may, as an option, precede the digit form of short identifier.
d) The prefix letter “U” (LATIN CAPITAL LETTER U) may, as an option, precede any of the three forms
of short identifier defined in a) to c) above.
The capital letters A to F, and U that appear within short identifiers may be replaced by the corresponding
small letters.
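The notation rules a) to d) can be sketched as a small parser and formatter. This is an illustration under stated assumptions rather than a normative implementation: the names `parse_short_identifier` and `format_short_identifier` are hypothetical, and the regular expression is one possible reading of the rules (six digits, or four digits, or five digits without a leading zero, optionally preceded by "+" and by "U").

```python
import re

# Optional "U"/"u", optional "+", then the six-digit form, the five-digit
# form (no leading zero beyond four digits), or the four-digit form.
_SHORT_ID = re.compile(
    r"[Uu]?\+?([0-9A-Fa-f]{6}|[1-9A-Fa-f][0-9A-Fa-f]{4}|[0-9A-Fa-f]{4})"
)

def parse_short_identifier(text: str) -> int:
    """Return the code point named by a short identifier such as 'U+0041'."""
    match = _SHORT_ID.fullmatch(text)
    if match is None:
        raise ValueError(f"not a short identifier: {text!r}")
    code_point = int(match.group(1), 16)
    if code_point > 0x10FFFF:
        raise ValueError(f"outside the UCS codespace: {text!r}")
    return code_point

def format_short_identifier(code_point: int, six_digit: bool = False,
                            prefix: str = "U+") -> str:
    """Format a code point; at least four digits unless six_digit is set."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the UCS codespace")
    digits = f"{code_point:06X}" if six_digit else f"{code_point:04X}"
    return prefix + digits
```

Small letters are accepted on input because the capital letters A to F and U may be replaced by the corresponding small letters; output always uses capitals as a stylistic choice of this sketch.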
where UID1, UID2, etc. represent the short identifiers of the corresponding code points, in the same order
as those code points appear in the sequence. If each of the code points in such a sequence has a charac-
ter allocated to it, the USI can be used to identify the sequence of characters allocated at those code
points. The syntax for UID1, UID2, etc. is specified in 6.5. A COMMA character (optionally followed by a
SPACE character) separates the UIDs. The UCS Sequence Identifier includes at least two UIDs; it begins
with a LESS-THAN SIGN and is terminated by a GREATER-THAN SIGN.
NOTE – UCS Sequence Identifiers cannot be used for specification of subset content. They may be used outside this
standard to identify: composite sequences for mapping purposes, font repertoire, etc.
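The USI syntax described above (a LESS-THAN SIGN, at least two UIDs separated by a COMMA with an optional SPACE, and a GREATER-THAN SIGN) can be illustrated as follows. The function names are hypothetical, and the UID check is simplified to four to six hexadecimal digits for brevity.

```python
import re

_UID = re.compile(r"[0-9A-Fa-f]{4,6}")

def format_usi(code_points) -> str:
    """Build a UCS Sequence Identifier such as '<2229, FE00>'."""
    values = list(code_points)
    if len(values) < 2:
        raise ValueError("a USI includes at least two UIDs")
    return "<" + ", ".join(f"{cp:04X}" for cp in values) + ">"

def parse_usi(text: str) -> list:
    """Return the list of code points named by a USI."""
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("a USI is delimited by '<' and '>'")
    parts = text[1:-1].split(",")
    if len(parts) < 2:
        raise ValueError("a USI includes at least two UIDs")
    code_points = []
    for index, part in enumerate(parts):
        # A single SPACE may follow each COMMA.
        if index > 0 and part.startswith(" "):
            part = part[1:]
        if _UID.fullmatch(part) is None:
            raise ValueError(f"not a UID: {part!r}")
        code_points.append(int(part, 16))
    return code_points
```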
where xx1, xx2, and xxn represent the first, second, and nth octets using two hexadecimal digits for each
octet.
The names and code points allocation of all characters in this coded character set shall remain un-
changed in all future editions and amendments of this standard. This also includes character name ali-
ases.
NOTE – Character name aliases are created to denote errors in the character names which cannot be fixed after publication of
the standard.
8 Subsets
ISO/IEC 10646 provides the specification of subsets of coded graphic characters for use in interchange,
by originating devices, and by receiving devices.
There are two alternatives for the specification of subsets: limited subset and selected subset. An adopted
subset may comprise either of them, or a combination of the two.
A claim of conformance referring to a limited subset shall list the graphic characters in the subset by the
names of graphic characters or code points as defined in ISO/IEC 10646.
A claim of conformance referring to a selected subset shall list the collections chosen as defined in
ISO/IEC 10646.
9.1 UTF-8
UTF-8 is the UCS encoding form that assigns each UCS scalar value to an octet sequence of one to four
octets, as specified in table 2.
• UCS characters from the BASIC LATIN collection are represented in UTF-8 in accordance with
ISO/IEC 4873, i.e. single octets with values ranging from 20 to 7E.
• Control functions in code points from 0000 to 001F, and the control character in code point 007F, are
represented without the padding octets specified in clause 11, i.e. as single octets with values ranging
from 00 to 1F, and 7F respectively in accordance with ISO/IEC 4873 and with the 8-bit structure of
ISO/IEC 2022.
• Octet values 00 to 7F do not otherwise occur in the UTF-8 coded representation of any character.
This provides compatibility with existing file-handling systems and communications sub-systems
which parse CC-sequences for these octet values.
• The first octet in the UTF-8 coded representation of any character can be directly identified when a
CC-data-element is examined, one octet at a time, starting from an arbitrary location. It indicates the
number of continuing octets (if any) in the multi-octet sequence that constitutes the code unit repre-
sentation of that character.
Table 2 specifies the bit distribution for the UTF-8 encoding form, showing the ranges of UCS scalar val-
ues corresponding to one, two, three, and four octet sequences.
Because surrogate code points are not UCS scalar values, any UTF-8 sequence that would otherwise
map to code points D800-DFFF is ill-formed.
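The bit distribution can be illustrated with a small encoder. Table 2 itself is not reproduced in this text, so the sketch below assumes the widely documented UTF-8 layout (lead octets 0xxxxxxx, 110xxxxx, 1110xxxx, 11110xxx followed by continuation octets 10xxxxxx); the function name is hypothetical.

```python
def utf8_encode_scalar(value: int) -> bytes:
    """Encode one UCS scalar value as one to four UTF-8 octets."""
    if not 0 <= value <= 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a UCS scalar value")
    if value <= 0x7F:  # one octet: 0xxxxxxx
        return bytes([value])
    if value <= 0x7FF:  # two octets: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (value >> 6), 0x80 | (value & 0x3F)])
    if value <= 0xFFFF:  # three octets: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (value >> 12),
                      0x80 | ((value >> 6) & 0x3F),
                      0x80 | (value & 0x3F)])
    # four octets: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (value >> 18),
                  0x80 | ((value >> 12) & 0x3F),
                  0x80 | ((value >> 6) & 0x3F),
                  0x80 | (value & 0x3F)])
```

Surrogate code points are rejected up front because they are not UCS scalar values.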
Table 3 lists all the ranges (inclusive) of the octet sequences that are well-formed in UTF-8. Any UTF-8
sequence that does not match the patterns listed in table 3 is ill-formed.
As a consequence of the well-formedness conditions specified in table 3, the following octet values are
disallowed in UTF-8: C0-C1, F5-FF.
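A range-based well-formedness check might look like the sketch below. The table of ranges is not reproduced in this text, so the patterns are an assumption taken from the commonly published list of well-formed UTF-8 octet sequences; the names are hypothetical.

```python
# Each pattern is a list of inclusive (low, high) ranges, one per octet.
_WELL_FORMED_PATTERNS = [
    [(0x00, 0x7F)],
    [(0xC2, 0xDF), (0x80, 0xBF)],
    [(0xE0, 0xE0), (0xA0, 0xBF), (0x80, 0xBF)],
    [(0xE1, 0xEC), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xED, 0xED), (0x80, 0x9F), (0x80, 0xBF)],  # excludes D800..DFFF
    [(0xEE, 0xEF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF0, 0xF0), (0x90, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF1, 0xF3), (0x80, 0xBF), (0x80, 0xBF), (0x80, 0xBF)],
    [(0xF4, 0xF4), (0x80, 0x8F), (0x80, 0xBF), (0x80, 0xBF)],  # up to 10FFFF
]

def is_well_formed_utf8(data: bytes) -> bool:
    """True if data is a concatenation of well-formed UTF-8 sequences."""
    position = 0
    while position < len(data):
        for pattern in _WELL_FORMED_PATTERNS:
            end = position + len(pattern)
            candidate = data[position:end]
            if len(candidate) == len(pattern) and all(
                low <= octet <= high
                for octet, (low, high) in zip(candidate, pattern)
            ):
                position = end
                break
        else:
            return False  # no pattern matched at this position
    return True
```

Because no pattern starts with C0, C1, or F5 and above, those octets can never appear in accepted data.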
9.2 UTF-16
UTF-16 is the UCS encoding form that assigns each UCS scalar value to a sequence of one to two un-
signed 16-bit code units, as specified in table 4.
In the UTF-16 encoding form, code points in the range 0000-D7FF and E000-FFFF are represented as a
single 16-bit code unit; code points in the range 10000-10FFFF are represented as pairs of 16-bit code
units. These pairs of special code units are known as surrogate pairs.
The values of the code units used for surrogate pairs are disjoint from the code units used for the single
code unit representation, thus maintaining non-overlap for all code point representations in UTF-16.
UTF-16 optimizes the representation of characters in the BMP which contains the vast majority of com-
mon use characters.
Because surrogate code points are not UCS scalar values, unpaired surrogate code units are ill-formed.
Table 4 specifies the bit distribution for the UTF-16 encoding form. Calculation of the surrogate pair values
involves subtraction of 10000 hexadecimal to account for the starting offset to the scalar value (expressed
as ‘wwww = uuuuu-1’ in the table).
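The surrogate pair calculation can be sketched as follows. The function names are hypothetical; the high-surrogate range D800 to DBFF is assumed from the overall surrogate range D800 to DFFF and the low-surrogate range DC00 to DFFF.

```python
def to_utf16_code_units(value: int) -> list:
    """Return the one or two 16-bit code units for a UCS scalar value."""
    if not 0 <= value <= 0x10FFFF or 0xD800 <= value <= 0xDFFF:
        raise ValueError("not a UCS scalar value")
    if value < 0x10000:
        return [value]  # BMP: a single code unit
    offset = value - 0x10000  # 20 significant bits remain
    high = 0xD800 | (offset >> 10)  # upper 10 bits
    low = 0xDC00 | (offset & 0x3FF)  # lower 10 bits
    return [high, low]

def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high-surrogate and a low-surrogate code unit."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

For example, code point 01F600 maps to the pair D83D DE00, and the pair maps back to 01F600.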
NOTE – Former editions of this standard included references to a two-octet BMP form called UCS-2 which would be a subset
of the UTF-16 encoding form restricted to the BMP UCS scalar values. The UCS-2 form is deprecated.
Because surrogate code points are not UCS scalar values, UTF-32 code units in the range 0000 D800-
0000 DFFF are ill-formed.
ISO/IEC 10646 specifies seven encoding schemes: UTF-8, UTF-16BE, UTF-16LE, UTF-16, UTF-32BE,
UTF-32LE, and UTF-32.
10.1 UTF-8
The UTF-8 encoding scheme serializes a UTF-8 code unit sequence in exactly the same order as the
code unit sequence itself.
When represented in UTF-8, the signature turns into the octet sequence <EF BB BF>. Its usage at the
beginning of a UTF-8 data stream is neither required nor recommended but does not affect conformance.
10.2 UTF-16BE
The UTF-16BE encoding scheme serializes a UTF-16 CC-data-element by ordering octets in a way that
the more significant octet precedes the less significant octet (also known as big-endian ordering).
In UTF-16BE, an initial octet sequence of <FE FF> is interpreted as FEFF ZERO WIDTH NO-BREAK
SPACE and does not convey a signature meaning.
10.3 UTF-16LE
The UTF-16LE encoding scheme serializes a UTF-16 CC-data-element by ordering octets in a way that
the less significant octet precedes the more significant octet (also known as little-endian ordering).
In UTF-16LE, an initial octet sequence of <FF FE> is interpreted as FEFF ZERO WIDTH NO-BREAK
SPACE and does not convey a signature meaning.
10.4 UTF-16
The UTF-16 encoding scheme serializes a UTF-16 CC-data-element by ordering octets so that the less
significant octet either precedes or follows the more significant octet.
In the UTF-16 encoding scheme, the initial signature read as <FE FF> indicates that the more significant
octet precedes the less significant octet, and <FF FE> the reverse. The signature is not part of the textual
data.
In the absence of signature, the octet order of the UTF-16 encoding scheme is that the more significant
octet precedes the less significant octet.
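The signature handling of the UTF-16 encoding scheme can be sketched with Python's built-in `utf-16-be` and `utf-16-le` codecs. The function name is hypothetical, and the sketch ignores error handling for truncated input.

```python
def decode_utf16_scheme(data: bytes) -> str:
    """Decode octets in the UTF-16 encoding scheme."""
    if data[:2] == b"\xFE\xFF":
        # Signature: more significant octet first; not part of the text.
        return data[2:].decode("utf-16-be")
    if data[:2] == b"\xFF\xFE":
        # Signature: less significant octet first; not part of the text.
        return data[2:].decode("utf-16-le")
    # No signature: the more significant octet precedes the less significant.
    return data.decode("utf-16-be")
```

By contrast, decoding the same octets as UTF-16BE or UTF-16LE keeps an initial FEFF as ZERO WIDTH NO-BREAK SPACE, which is how the `utf-16-be` and `utf-16-le` codecs behave when called directly.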
10.5 UTF-32BE
The UTF-32BE encoding scheme serializes a UTF-32 CC-data-element by ordering octets in a way that
the more significant octets precede the less significant octets (also known as big-endian ordering).
In UTF-32BE, an initial octet sequence of <00 00 FE FF> is interpreted as FEFF ZERO WIDTH NO-
BREAK SPACE and does not convey a signature meaning.
10.6 UTF-32LE
The UTF-32LE encoding scheme serializes a UTF-32 CC-data-element by ordering octets in a way that
the less significant octets precede the more significant octets (also known as little-endian ordering).
In UTF-32LE, an initial octet sequence of <FF FE 00 00> is interpreted as FEFF ZERO WIDTH NO-
BREAK SPACE and does not convey a signature meaning.
10.7 UTF-32
The UTF-32 encoding scheme serializes a UTF-32 code unit sequence by ordering octets so that the less
significant octets either precede or follow the more significant octets.
In the absence of signature, the octet order of the UTF-32 encoding scheme is that the more significant
octets precede the less significant octets.
When a control character of ISO/IEC 6429 is used with this coded character set, its coded representation
as specified in ISO/IEC 6429 shall be padded to correspond with the number of octets in the code unit of the
adopted encoding form (see 9). Thus, the least significant octet shall be the bit combination specified in
ISO/IEC 6429, and the more significant octet(s) shall be zeros.
For example, the control character FORM FEED is represented by “000C” in the UTF-16 encoding form,
and “0000 000C” in the UTF-32 encoding form.
For escape sequences, control sequences, and control strings (see ISO/IEC 6429) consisting of a coded
control character followed by additional bit combinations in the range 20 to 7F, each bit combination shall
be padded by octet(s) with value 00.
For example, the escape sequence “ESC 02/00 04/00” is represented by “1B 20 40” in the UTF-8 encod-
ing form, by “001B 0020 0040” in the UTF-16 encoding form, and “0000001B 00000020 00000040” in the
UTF-32 encoding form.
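The padding rule can be reproduced with a short sketch. The function names are hypothetical; the column/row notation (for example 02/00 for hexadecimal 20) is read as sixteen times the column plus the row, and code units are shown most significant octet first purely for display.

```python
def bit_combination(notation: str) -> int:
    """Convert column/row notation such as '02/00' to an octet value."""
    column, row = notation.split("/")
    return int(column) * 16 + int(row)

def pad_sequence(octets, octets_per_code_unit: int) -> bytes:
    """Pad each bit combination with leading 00 octets to fill a code unit."""
    return b"".join(
        octet.to_bytes(octets_per_code_unit, "big") for octet in octets
    )

# ESC 02/00 04/00, where ESC is bit combination 01/11 (hexadecimal 1B).
sequence = [bit_combination(n) for n in ("01/11", "02/00", "04/00")]
utf8_form = pad_sequence(sequence, 1)   # no padding needed
utf16_form = pad_sequence(sequence, 2)  # one 00 octet per bit combination
utf32_form = pad_sequence(sequence, 4)  # three 00 octets per bit combination
```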
NOTE 1 – The term “character” appears in the definition of many of the control functions specified in ISO/IEC 6429, to identify
the elements on which the control functions will act. When such control functions are applied to coded characters according to
ISO/IEC 10646 the action of those control functions will depend on the type of element from ISO/IEC 10646 that has been
chosen, by the application, to be the element (or character) on which the control functions act. These elements may be chosen
to be characters (non-combining characters and/or combining characters) or may be chosen in other ways (such as composite
sequences) when applicable.
Code extension control functions for the ISO/IEC 2022 code extension techniques (such as designation
escape sequences, single shift, and locking shift) shall not be used with this coded character set.
NOTE 2 – The following list provides the long names from ISO/IEC 6429 used in association with the control characters.
0000 NULL
0001 START OF HEADING
0002 START OF TEXT
0003 END OF TEXT
However, some standards for interchange of coded information may permit, or require, that the coded
representation of the identification applicable to the CC-data-element forms a part of the interchanged
information. This clause specifies a coded representation for the identification of UCS and a subset of
ISO/IEC 10646, and also of a C0 and a C1 set of control functions from ISO/IEC 6429 for use in conjunc-
tion with ISO/IEC 10646. Such coded representations provide all or part of an identification data element,
which may be included in information interchange in accordance with the relevant standard.
In the context of these identifications, because the more significant octets shall precede the less signifi-
cant octets when serialized, the only encoding schemes that can be selected are UTF-8, UTF-16BE, and
UTF-32BE according to the relevant encoding forms (UTF-8, UTF-16, and UTF-32 respectively).
If two or more of the identifications are present, the order of those identifications shall follow the order as
specified in this clause.
NOTE – An alternative method of identification is described in Annex N.
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall con-
sist only of the sequences of bit combinations as shown above.
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be
padded in accordance with clause 11.
Ps... means that there can be any number of selective parameters. The parameters are to be taken from
the subset collection numbers as shown in Annex A of ISO/IEC 10646. When there is more than one pa-
rameter, each parameter value is separated by an octet with value 03/11.
Parameter values are represented by digits where octet values 03/00 to 03/09 represent digits 0 to 9.
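As an illustration of the parameter encoding only (the enclosing sequence is not reproduced in this text), collection numbers can be rendered as digit octets 03/00 to 03/09 separated by octet 03/11, which is hexadecimal 3B. The function name is hypothetical.

```python
def encode_collection_parameters(collection_numbers) -> bytes:
    """Encode collection numbers as decimal digits separated by 03/11."""
    numbers = list(collection_numbers)
    if any(not isinstance(n, int) or n < 0 for n in numbers):
        raise ValueError("collection numbers are non-negative integers")
    # Digits 0..9 map to octets 30..39; the separator 03/11 is octet 3B.
    return b"\x3B".join(str(n).encode("ascii") for n in numbers)
```

If the result is carried in a CC-data-element conforming to ISO/IEC 10646, each octet would still need padding to the code unit size.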
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall con-
sist only of the sequences of bit combinations as shown above.
If such a control sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be
padded in accordance with clause 11.
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 10646, it shall be
padded in accordance with clause 11.
If such an escape sequence appears within a CC-data-element conforming to ISO/IEC 2022, it shall con-
sist only of the sequence of bit combinations as shown above.
NOTE – Escape sequence ESC 02/05 04/00 is normally used for return to the restored state of ISO/IEC 2022. The escape se-
quence ESC 02/05 04/00 specified here is sometimes not exactly as specified in ISO/IEC 2022 due to the presence of padding
octets. For this reason the escape sequences in clause 12.2 for the identification of UCS include the octet 02/15 to indicate
that the return does not always conform to that standard.
The graphic symbols are to be regarded as typical visual representations of the characters. ISO/IEC
10646 does not attempt to prescribe the exact shape of each character. The shape is affected by the de-
sign of the font employed, which is outside the scope of ISO/IEC 10646.
Graphic characters specified in ISO/IEC 10646 are uniquely identified by their names. This does not imply
that the graphic symbols by which they are commonly imaged are always different. Examples of graphic
characters with similar graphic symbols are LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER
ALPHA and CYRILLIC CAPITAL LETTER A.
The meaning attributed to any character is not specified by ISO/IEC 10646; it may differ from country to
country, or from one application to another.
For the alphabetic scripts, the general principle has been to arrange the characters within any row in ap-
proximate alphabetic sequence; where the script has capital and small letters, these are arranged in pairs.
However, this general principle has been overridden in some cases. For example, for those scripts for
which a relevant standard exists, the characters are allocated according to that standard. This arrange-
ment within the code charts will aid conversion between the existing standards and this coded character
set. In general, however, it is anticipated that conversion between this coded character set and any other
coded character set will use a table lookup technique.
It is not intended, nor will it often be the case, that the characters needed by any one user will be found all
grouped together in one part of the code charts.
Furthermore, the user of any script will find that needed characters may have been coded elsewhere in
this coded character set. This especially applies to the digits, to the symbols, and to the use of Latin let-
ters in dual-script applications.
Therefore, in using this coded character set, the reader is advised to refer first to the block names list in
Annex A.2 or an overview of the Planes in figures 5 to 9, and then to turn to the specific code chart for the
relevant script and for symbols and digits. In addition, Annex G contains an alphabetically sorted list of
character names.
Rules to be used for constructing the names of blocks are given in 24.4.1.
Rules to be used for constructing the names of collections are given in 24.4.2.
This character mirroring is not limited to paired characters and shall be applied to all characters belonging
to that class.
EXAMPLE
In a right-to-left text segment, the GREATER-THAN SIGN (rendered as ">" in left-to-right text) may be rendered as the "<"
graphic symbol.
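The example can be approximated with Python's `unicodedata.mirrored`, which reports the Bidi_Mirrored property of a character. This is only a sketch: the function name and the small pair table are hypothetical, strings stand in for graphic symbols, and mirrored characters without a paired counterpart (which need a mirrored glyph from the font) are not modelled.

```python
import unicodedata

# Hypothetical stand-in for glyph selection: a few paired characters only.
_PAIRED_SYMBOLS = {
    "<": ">", ">": "<",
    "(": ")", ")": "(",
    "[": "]", "]": "[",
    "{": "}", "}": "{",
}

def graphic_symbol_for(character: str, right_to_left: bool) -> str:
    """Pick the symbol to image for a character in the given direction."""
    if right_to_left and unicodedata.mirrored(character):
        # Fall back to the character itself when no pair is known.
        return _PAIRED_SYMBOLS.get(character, character)
    return character
```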
NOTE 2 – Many ancient scripts and some scripts in modern use can be written either right-to-left or left-to-right. It is often cus-
tomary for one of these scripts to use the appropriately mirrored graphical symbol for any character represented by a graphic
symbol that is not symmetric around the vertical axis. In such cases, it is up to the rendering system to display the graphic im-
age appropriate for the writing direction employed. The directionality of the representative graphic symbol shown in the charac-
ter code charts matches the default writing direction for the script. Characters belonging to these scripts have the
‘Bidi_Mirrored’ property set to ‘N’ in the Unicode Standard (see reference to the Unicode Standard Bidi Mirrored property in 3).
Examples of such scripts include, but are not limited to, Old Italic, an ancient script for which the default writing direction in this
standard is left-to-right, and Cypriot, an ancient script for which the default writing direction in this standard is right-to-left.
16 Special characters
There are some characters that do not have printable graphic symbols or are otherwise special in some
ways.
symbol for the letter A or E and the preceding consonant. When rendered in visible form it is generally shown as a narrow
space between the letters, but it may sometimes be shown as a distinct graphic symbol to assist the user.
NOTE 2 – The character 202F NARROW NO-BREAK SPACE is a non-breaking space. It is similar to 00A0 NO-BREAK
SPACE, except that it is rendered with a narrower width. When used with the Mongolian script this character is usually
rendered at one-third of the width of a normal space, and it separates a suffix from the Mongolian word-stem. This allows for the
normal rules of Mongolian character shaping to apply, while indicating that there is no word boundary at that position.
Only the variation sequences defined or referenced in this clause indicate a specific variant form of
graphic symbol; all other such sequences are undefined. Furthermore, variation selectors following other
base characters and any non-base characters have no effect on the selection of the graphic symbol for
that character.
The following list provides a description of the variant appearances corresponding to the use of appropri-
ate variation selectors with all allowed base mathematical symbols.
NOTE 3 – The VARIATION SELECTOR-1 is the only variation selector used with mathematical symbols.
Sequence (UID notation)   Description of variant appearance
<2229, FE00> INTERSECTION with serifs
<222A, FE00> UNION with serifs
<2268, FE00> LESS-THAN BUT NOT EQUAL TO with vertical stroke
<2269, FE00> GREATER-THAN BUT NOT EQUAL TO with vertical stroke
<2272, FE00> LESS-THAN OR EQUIVALENT TO following the slant of the lower leg
<2273, FE00> GREATER-THAN OR EQUIVALENT TO following the slant of the lower leg
<228A, FE00> SUBSET OF WITH NOT EQUAL TO with stroke through bottom members
<228B, FE00> SUPERSET OF WITH NOT EQUAL TO with stroke through bottom members
<2293, FE00> SQUARE CAP with serifs
<2294, FE00> SQUARE CUP with serifs
<2295, FE00> CIRCLED PLUS with white rim
<2297, FE00> CIRCLED TIMES with white rim
<229C, FE00> CIRCLED EQUALS equal sign touching the circle
<22DA, FE00> LESS-THAN EQUAL TO OR GREATER-THAN with slanted equal
<22DB, FE00> GREATER-THAN EQUAL TO OR LESS-THAN with slanted equal
<2A3C, FE00> INTERIOR PRODUCT tall variant with narrow foot
<2A3D, FE00> RIGHTHAND INTERIOR PRODUCT tall variant with narrow foot
<2A9D, FE00> SIMILAR OR LESS-THAN with similar following the slant of the upper leg
<2A9E, FE00> SIMILAR OR GREATER-THAN with similar following the slant of the upper leg
<2AAC, FE00> SMALLER THAN OR EQUAL TO with slanted equal
<2AAD, FE00> LARGER THAN OR EQUAL TO with slanted equal
<2ACB, FE00> SUBSET OF ABOVE NOT EQUAL TO with stroke through bottom members
<2ACC, FE00> SUPERSET OF ABOVE NOT EQUAL TO with stroke through bottom members
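A lookup over the defined sequences might be sketched as follows. Only the first four entries of the list above are copied into the table to keep the example short; the names are hypothetical, and `None` stands for "no effect on the selection of the graphic symbol".

```python
VARIATION_SELECTOR_1 = 0xFE00

# Subset of the defined mathematical variation sequences (first four rows).
MATH_VARIATION_SEQUENCES = {
    (0x2229, VARIATION_SELECTOR_1): "INTERSECTION with serifs",
    (0x222A, VARIATION_SELECTOR_1): "UNION with serifs",
    (0x2268, VARIATION_SELECTOR_1):
        "LESS-THAN BUT NOT EQUAL TO with vertical stroke",
    (0x2269, VARIATION_SELECTOR_1):
        "GREATER-THAN BUT NOT EQUAL TO with vertical stroke",
}

def describe_variation(base: int, selector: int):
    """Return the variant description, or None for an undefined sequence."""
    return MATH_VARIATION_SEQUENCES.get((base, selector))
```

A selector following any other base character, such as 0041, yields `None`, reflecting that undefined sequences select no specific variant.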
The following list provides a description of the variant appearances corresponding to the use of appropri-
ate variation selectors with all allowed base Mongolian characters. Only some presentation forms of the
base Mongolian characters used with the Mongolian free variation selectors produce variant appearances.
NOTE 4 – The Mongolian characters have various presentation forms depending on their position in a CC-data-element. These
presentation forms are called isolate, initial, medial and final.
Sequence (UID notation)   Position   Description of variant appearance
<1820, 180B> isolate, medial, final MONGOLIAN LETTER A second form
<1820, 180C> medial MONGOLIAN LETTER A third form
<1821, 180B> initial, final MONGOLIAN LETTER E second form
<1822, 180B> medial MONGOLIAN LETTER I second form
<1823, 180B> medial, final MONGOLIAN LETTER O second form
<1824, 180B> medial MONGOLIAN LETTER U second form
<1825, 180B> medial, final MONGOLIAN LETTER OE second form
<1825, 180C> medial MONGOLIAN LETTER OE third form
<1826, 180B> isolate, medial, final MONGOLIAN LETTER UE second form
<1826, 180C> medial MONGOLIAN LETTER UE third form
<1828, 180B> initial, medial MONGOLIAN LETTER NA second form
<1828, 180C> medial MONGOLIAN LETTER NA third form
<1828, 180D> medial MONGOLIAN LETTER NA separate form
<182A, 180B> final MONGOLIAN LETTER BA alternative form
<182C, 180B> initial, medial MONGOLIAN LETTER QA second form
<182C, 180B> isolate MONGOLIAN LETTER QA feminine second form
<182C, 180C> medial MONGOLIAN LETTER QA third form
<182C, 180D> medial MONGOLIAN LETTER QA fourth form
<182D, 180B> initial, medial MONGOLIAN LETTER GA second form
<182D, 180B> final MONGOLIAN LETTER GA feminine form
<182D, 180C> medial MONGOLIAN LETTER GA third form
<182D, 180D> medial MONGOLIAN LETTER GA feminine form
<1830, 180B> final MONGOLIAN LETTER SA second form
<1830, 180C> final MONGOLIAN LETTER SA third form
<1832, 180B> medial MONGOLIAN LETTER TA second form
<1833, 180B> initial, medial, final MONGOLIAN LETTER DA second form
<1835, 180B> final MONGOLIAN LETTER JA second form
<1836, 180B> initial, medial MONGOLIAN LETTER YA second form
<1836, 180C> medial MONGOLIAN LETTER YA third form
<1838, 180B> final MONGOLIAN LETTER WA second form
<1844, 180B> medial MONGOLIAN LETTER TODO E second form
<1845, 180B> medial MONGOLIAN LETTER TODO I second form
<1846, 180B> medial MONGOLIAN LETTER TODO O second form
<1847, 180B> isolate, medial, final MONGOLIAN LETTER TODO U second form
<1847, 180C> medial MONGOLIAN LETTER TODO U third form
<1848, 180B> medial MONGOLIAN LETTER TODO OE second form
<1849, 180B> isolate, medial MONGOLIAN LETTER TODO UE second form
<184D, 180B> initial, medial MONGOLIAN LETTER TODO QA feminine form
<184E, 180B> medial MONGOLIAN LETTER TODO GA second form
<185D, 180B> medial, final MONGOLIAN LETTER SIBE E second form
<185E, 180B> medial, final MONGOLIAN LETTER SIBE I second form
The following list provides a description of the variant appearances corresponding to the use of appropri-
ate variation selectors with all allowed base Phags-pa characters. These variation selector sequences do
not select fixed visual representation; rather, they select a representation that is reversed from the normal
form predicted by the preceding character.
The rules for the superimposition, choice of differently shaped characters, or combination into ligatures, or
conjuncts, which are often of extreme complexity, are not specified in ISO/IEC 10646.
In general, presentation forms are not intended to be used as a substitute for the nominal forms of the
graphic characters specified elsewhere within this coded character set. However, specific applications
may encode these presentation forms instead of the nominal forms for specific reasons among which is
compatibility with existing devices. The rules for searching, sorting, and other processing operations on
presentation forms are outside the scope of ISO/IEC 10646.
Within the BMP these characters are mostly allocated to code points within rows from FB to FF.
18 Compatibility characters
Compatibility characters are included in ISO/IEC 10646 primarily for compatibility with existing coded
character sets to allow two-way code conversion without loss of information.
Within the BMP many of these characters are allocated to code points within rows F9, FA, FE, and FF,
and within rows 31 and 33. Some compatibility characters are also allocated within other rows.
NOTE 1 – There are twelve code points in the row FA of the BMP which are allocated to CJK Unified Ideographs.
Within the Supplementary Ideographic Plane (SIP) these characters are allocated to code points within
rows F8 to FA.
The CJK compatibility ideographs are ideographs that should have been unified with one of the CJK uni-
fied ideographs, per the unification rule described in Annex S. However, they are included in this Interna-
tional Standard as separate characters, because, based on various national, cultural, or historical reasons
for some specific country and region, some national and regional standards assign separate code points
for them.
NOTE 2 – For this reason, compatibility ideographs should only be used for maintaining and guaranteeing a round trip conver-
sion with the specific national, regional, or other standard. Other usage is strongly discouraged.
19 Order of characters
Usually, coded characters appear in a CC-data-element in logical order (logical or backing store order
corresponds approximately to the order in which characters are entered from the keyboard, after correc-
tions such as insertions, deletions, and overtyping have taken place). This applies even when characters
of different dominant direction are mixed: left-to-right (Greek, Latin, Thai) with right-to-left (Arabic, He-
brew), or with vertical (Mongolian) script.
Some characters may not appear linearly in final rendered text. For example, the medial form of
DEVANAGARI VOWEL SIGN I is displayed before the character that it logically follows in the CC-data-
element.
20 Combining characters
This clause specifies the use of combining characters (see 4.14).
If a combining character is to be regarded as a composite sequence in its own right, it shall be coded as a
composite sequence by association with the character 00A0 NO-BREAK SPACE. For example, grave
accent can be composed as 00A0 NO-BREAK SPACE followed by 0300 COMBINING GRAVE ACCENT.
NOTE – Indic combining marks for vowels form a special category of combining characters, since the presentation can depend
on more than one of the surrounding characters. Thus it might not be desirable to associate these Indic combining marks with
the character NO-BREAK SPACE.
a) If the combining characters can interact in presentation (for example, COMBINING MACRON and
COMBINING DIAERESIS), then the position of the combining characters in the resulting graphic dis-
play is determined by the order of the coded representation of the combining characters. The presen-
tations of combining characters are to be positioned from the base character outward. For example,
combining characters placed above a base character are stacked vertically, starting with the first en-
countered in the sequence of coded representations and continuing for as many marks above as are
required by the coded combining characters following the coded base character. For combining char-
acters placed below a base character, the situation is inverted, with the combining characters starting
from the base character and stacking downward.
An example of multiple combining characters above the base character is found in Thai, where a con-
sonant letter can have above it one of the vowels 0E34 to 0E37 and, above that, one of four tone
marks 0E48 to 0E4B. The order of the coded representation is: base consonant, followed by a vowel,
followed by a tone mark.
b) Some specific combining characters override the default stacking behaviour by being positioned hori-
zontally rather than stacking, or by forming a ligature with an adjacent combining character. When po-
sitioned horizontally, the order of coded representations is reflected by positioning in the dominant or-
der of the script with which they are used. For example, horizontal accents in a left-to-right script are
coded left-to-right.
Prominent characters that show such override behaviour are associated with specific scripts or alpha-
bets. For example, the COMBINING GREEK KORONIS (0343) requires that, together with a following
acute or grave accent, they be rendered side-by-side above a letter, rather than the accent marks be-
ing stacked above the COMBINING GREEK KORONIS. The order of the coded representations is: the
letter itself, followed by that of the breathing mark, followed by that of the accent marks. Two Vietnam-
ese tone marks, which have the same graphic appearance as the Latin acute and grave accent
marks, do not stack above the three Vietnamese vowel letters which already contain the circumflex
diacritic (â, ê, ô). Instead, they form ligatures with the circumflex component of the vowel letters.
c) If the combining characters do not interact in presentation (for example, when one combining charac-
ter is above a graphic character and another is below), the resultant graphic symbol from the base
character and combining characters in different orders may appear the same. For example, the coded
representations of LATIN SMALL LETTER A, followed by COMBINING CARON, followed by
COMBINING OGONEK may result in the same graphic symbol as the coded representations of LATIN
SMALL LETTER A, followed by COMBINING OGONEK, followed by COMBINING CARON.
Combining characters in Hebrew or Arabic scripts do not normally interact. Therefore, the sequence of
their coded representations in a composite sequence does not affect its graphic symbol. The rules for
forming the combined graphic symbol are beyond the scope of ISO/IEC 10646.
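The effect of ordering described in a) and c) can be observed with a short sketch. It assumes Python's standard `unicodedata` module and uses canonical equivalence as defined by the Unicode Standard as a stand-in for "may result in the same graphic symbol"; that substitution is an assumption of this example, not a rule stated in this clause.

```python
import unicodedata

MACRON, DIAERESIS = "\u0304", "\u0308"  # both are placed above the base character
CARON, OGONEK = "\u030C", "\u0328"      # one above, one attached below


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two sequences after canonical decomposition and reordering (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Case c): marks that do not interact; either order is treated as equivalent.
non_interacting = canonically_equivalent("a" + CARON + OGONEK, "a" + OGONEK + CARON)

# Case a): marks that stack on the same side; the coded order is significant.
interacting = canonically_equivalent("a" + MACRON + DIAERESIS, "a" + DIAERESIS + MACRON)

# Thai order from the text: base consonant, then vowel (0E34), then tone mark (0E48).
thai_sequence = "\u0E01" + "\u0E34" + "\u0E48"
```

Here `non_interacting` is true and `interacting` is false, which mirrors the distinction drawn between the two cases.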
Other collections of characters listed in Annex A comprise only combining characters, for example collec-
tion 7 (COMBINING DIACRITICAL MARKS).
21 Normalization forms
Normalization forms are the mechanisms that allow the selection of a unique coded representation among
alternative but equivalent coded representations of the same text. Normalization forms for use with
ISO/IEC 10646 are specified in the Unicode Standard UAX#15 (see 3). There are four normalization
forms: NFD, NFC, NFKD, and NFKC.
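As an illustration, the four forms defined in UAX#15 (NFD, NFC, NFKD, NFKC) can be applied with Python's standard `unicodedata.normalize`. This is a sketch; the sample string is chosen for this example and does not come from the standard.

```python
import unicodedata

# U+FB01 LATIN SMALL LIGATURE FI, followed by "e" and U+0301 COMBINING ACUTE ACCENT.
sample = "\uFB01" + "e\u0301"

forms = {name: unicodedata.normalize(name, sample)
         for name in ("NFD", "NFC", "NFKD", "NFKC")}

# NFD  keeps the ligature and leaves the accent decomposed:  "\uFB01e\u0301"
# NFC  keeps the ligature and composes the accented letter:  "\uFB01\u00E9"
# NFKD replaces the ligature by "fi", accent decomposed:     "fie\u0301"
# NFKC replaces the ligature by "fi", accent composed:       "fi\u00E9"
```

All four results represent the same text; only the choice of coded representation differs.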
An incomplete syllable is a string of one or more characters which does not constitute a complete syllable
(for example, a Choseong alone, a Jungseong alone, a Jongseong alone, or a Jungseong followed by a
Jongseong). An incomplete syllable which starts with a Jungseong shall be preceded by a CHOSEONG
FILLER (115F). An incomplete syllable composed of a Jongseong alone shall be preceded by a
CHOSEONG FILLER (115F) and JUNGSEONG FILLER (1160). An incomplete syllable composed of a
Choseong alone shall be followed by a JUNGSEONG FILLER (1160).
NOTE 1 – Hangul Jamo are not combining characters.
NOTE 2 – When a combining character such as HANGUL SINGLE DOT TONE MARK (302E) is intended to apply to a se-
quence of Hangul Jamo it should be placed at the end of the sequence, after the Hangul Jamo character which completes the
syllable block.
NOTE 3 – Hangul text can be represented in several different ways in this standard. Korean Standard KS X 1026-1: In-
formation Technology - Universal Multiple-Octet Coded Character set (UCS) - Hangul - Part 1, Hangul processing guide for in-
formation interchange, provides guidelines on how to ensure interoperability in information interchange.
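The filler rules for incomplete syllables stated above can be sketched as follows. The function names are hypothetical, and the code point ranges used to classify Jamo are an assumption taken from the Hangul Jamo code charts rather than from this clause. Sequences not covered by the three rules are returned unchanged.

```python
CHOSEONG_FILLER = "\u115F"
JUNGSEONG_FILLER = "\u1160"


def jamo_kind(ch: str) -> str:
    """Classify a Hangul Jamo as 'L' (Choseong), 'V' (Jungseong) or 'T' (Jongseong).

    The ranges below are assumed from the code charts, not quoted from the text.
    """
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F or 0xA960 <= cp <= 0xA97C:
        return "L"
    if 0x1160 <= cp <= 0x11A7 or 0xD7B0 <= cp <= 0xD7C6:
        return "V"
    if 0x11A8 <= cp <= 0x11FF or 0xD7CB <= cp <= 0xD7FB:
        return "T"
    raise ValueError(f"U+{cp:04X} is not a Hangul Jamo")


def pad_incomplete_syllable(fragment: str) -> str:
    """Add the fillers required for a single incomplete-syllable fragment."""
    kinds = [jamo_kind(ch) for ch in fragment]
    if not kinds:
        raise ValueError("empty fragment")
    if all(k == "T" for k in kinds):       # Jongseong alone
        return CHOSEONG_FILLER + JUNGSEONG_FILLER + fragment
    if kinds[0] == "V":                    # starts with a Jungseong
        return CHOSEONG_FILLER + fragment
    if all(k == "L" for k in kinds):       # Choseong alone
        return fragment + JUNGSEONG_FILLER
    return fragment                        # not covered by the three rules
```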
22.2 Features of scripts used in India and some other South Asian countries
In the code charts for Rows 09 to 0D and 0F, and for the MYANMAR block in Row 10, of the BMP (see
30) the graphic symbols shown for some characters appear to be formed as compounds of the graphic
symbols for two other characters in the same table.
EXAMPLE 1 Row 09 Devanagari
The graphic symbol for 0906 DEVANAGARI LETTER AA appears as if it is constructed from the graphic symbols for 0905
DEVANAGARI LETTER A and 093E DEVANAGARI VOWEL SIGN AA
EXAMPLE 2 Row 0D Malayalam
The graphic symbol for 0D08 MALAYALAM LETTER II appears as if it is constructed from the graphic symbols for 0D07
MALAYALAM LETTER I and 0D57 MALAYALAM AU LENGTH MARK
In such cases a single coded character may appear to the user to be equivalent to the sequence of two
coded characters whose graphic symbols, when combined, are visually similar to the graphic symbol of
that single character, as in a composite sequence (see 4.17).
A “unique-spelling” rule is defined as follows. According to this rule, no coded character from a table for
Rows 09 to 0D or 0F, or for the MYANMAR block in Row 10, with the list of exceptions mentioned below,
shall be regarded as equivalent to a sequence of two or more other coded characters taken from the same
table.
• 1st field: UCS code point or sequence (hhhh | hhhhh ) (<space> (hhhh | hhhhh))*
• 2nd field: DoCoMo Shift-JIS code (hhhh)
• 3rd field: KDDI Shift-JIS code (hhhh)
• 4th field: SoftBank Shift-JIS code (hhhh)
The format definition uses ‘h’ as a hexadecimal unit and <space> as the SPACE character.
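A line in this four-field format could be parsed as in the sketch below. Three things are assumptions of this example and are not stated in the text above: the fields are delimited by ‘;’, a carrier field may be empty, and hexadecimal digits are upper case. The type and function names are hypothetical, and the sample values in the usage note are made up for illustration.

```python
import re
from typing import NamedTuple, Optional, Tuple

# 1st field: one or more code points of four or five hexadecimal digits.
UCS_FIELD = re.compile(r"^[0-9A-F]{4,5}( [0-9A-F]{4,5})*$")
# 2nd to 4th fields: a four-digit hexadecimal Shift-JIS code.
SJIS_FIELD = re.compile(r"^[0-9A-F]{4}$")


class EmojiSource(NamedTuple):
    code_points: Tuple[int, ...]
    docomo: Optional[int]
    kddi: Optional[int]
    softbank: Optional[int]


def _sjis(field: str) -> Optional[int]:
    if field == "":
        return None  # assumption: an empty field means "no mapping"
    if not SJIS_FIELD.match(field):
        raise ValueError(f"bad Shift-JIS field: {field!r}")
    return int(field, 16)


def parse_emoji_source_line(line: str) -> EmojiSource:
    fields = line.rstrip("\n").split(";")  # assumption: ';' delimiter
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}")
    ucs, docomo, kddi, softbank = fields
    if not UCS_FIELD.match(ucs):
        raise ValueError(f"bad UCS field: {ucs!r}")
    code_points = tuple(int(h, 16) for h in ucs.split(" "))
    return EmojiSource(code_points, _sjis(docomo), _sjis(kddi), _sjis(softbank))
```

For example, `parse_emoji_source_line("0023 20E3;F985;F489;F7B0")` yields a two-element code point sequence and three Shift-JIS codes.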
The source reference information establishes the character identity for CJK Ideographs. A source refer-
ence is established by associating a CJK Ideograph code point with one or several values in the source
standards listed in 0 and 23.4. Such a source standard originates from the following categories:
• Hanzi G sources,
• Hanzi H sources,
• Hanzi M sources,
• Hanzi T sources,
• Kanji J sources,
• Hanja K sources,
• Hanja KP sources,
• ChuNom V sources, and
• Unicode U sources
For a given code point, only one source reference can be created for each source standard category
(G, H, M, T, J, K, KP, V, and U). To provide comprehensive coverage for a source standard category,
when a source standard is referenced, all its unique associations with existing CJK Ideographs
are documented.
The following list identifies all sources referenced by the CJK Ideographs in both the BMP and the SIP.
G0 GB2312-80
G1 GB12345-90 with 58 Hong Kong and 92 Korean “Idu” characters
G3 GB7589-87 unsimplified forms
G5 GB7590-87 unsimplified forms
G7 General Purpose Hanzi List for Modern Chinese Language, and General List of Simplified
Hanzi
GS Singapore Characters
G8 GB8565-88
G9 GB18030-2000
GE GB16500-95
G_4K Siku Quanshu ﹙四庫全書﹚
G_BK Chinese Encyclopedia ﹙中國大百科全書﹚
G_CH Ci Hai ﹙辞海﹚
G_CY Ci Yuan ﹙辭源﹚
G_CYY Chinese Academy of Surveying and Mapping Ideographs ﹙中国测绘科学院用字﹚
G_FZ Founder Press System ﹙方正排版系统﹚
G_GFHZB ZhongHua ZiHai (中华字海), XianDai HanYu CiDian (现代汉语词典), or Ci-Hai (辞海)
G_GH Gudai Hanyu Cidian ﹙古代汉语词典﹚
G_GJZ Commercial Press Ideographs ﹙商务印书馆用字﹚
G_HC Hanyu Dacidian ﹙漢語大詞典﹚
G_HZ Hanyu Dazidian ideographs ﹙漢語大字典﹚
G_IDC ID system of the Ministry of Public Security of China, 2009
G_KX Kangxi Dictionary ideographs﹙康熙字典﹚9th edition (1958) including the addendum
﹙康熙字典﹚補遺
G_XC Xiandai Hanyu Cidian ﹙现代汉语词典﹚
G_ZFY Hanyu Fangyan Dacidian ﹙汉语方言大辞典﹚
G_ZJW Yinzhou Jinwen Jicheng Yinde ﹙殷周金文集成引得﹚
The Hanzi H source is
J0 JIS X 0208-1990
J1 JIS X 0212-1990
J3 JIS X 0213:2000 level-3
J3A JIS X 0213:2004 level-3
J4 JIS X 0213:2000 level-4
JA Unified Japanese IT Vendors Contemporary Ideographs, 1993
JH Hanyo-Denshi Program (汎用電子情報交換環境整備プログラム), 2009
JK Japanese KOKUJI Collection
J_ARIB Association of Radio Industries and Businesses (ARIB) ARIB STD-B24 Version 5.1,
March 14 2007
The Hanja K sources are
V0 TCVN 5773:1993
V1 TCVN 6056:1995
V2 VHN 01:1998
V3 VHN 02: 1998
V4 Dictionary on Nom 2006, Dictionary on Nom of Tay ethnic 2006, Lookup Table for Nom in
the South 1994
The Unicode U source is
UTC The Unicode Technical Report #45, U-source Ideographs, June 2008
NOTE 3 – Even if source references get updated, the source reference information is not updated. The updated source refer-
ences may only identify characters not previously covered by the older version.
The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line
mark, that specifies, after a 13-line header, as many lines as there are CJK Unified Ideographs in the
two planes combined; each line contains the following information, organized in fields delimited by ‘;’
(empty fields contain no characters):
The graphic representation for the radical is shown immediately below the code point, along with the radi-
cal number and the stroke count. That stroke count does not include the radical itself.
The code chart for the CJK UNIFIED IDEOGRAPHS block (4E00-9FFF) uses a fixed column format (i.e.
source references from a given source always appear in the same column) while the code charts for the
other CJK Unified blocks show graphic symbols per the following order of appearance: G, T, J, K, KP, V,
H, M, and U.
The following figure shows an example for characters 4E00-4E09 and 4E12-4E1A.
HEX C J K V HEX C J K V
4E00 一 一 一 一 一 4E12 丒 丒 丒
⼀ 1.0 G0-523B T1-4421 J0-306C K0-6C69 V1-4A21 ⼀ 1.3 GE-2123 T4-2139 J1-3025
4E01 丁 丁 丁 丁 丁 4E13 专
⼀ 1.1 G0-3621 T1-4421 J0-437A K0-6F4B V1-4A22 ⼀ 1.3 G0-5728
4E02 丂 丂 丂 4E14 且 且 且 且 且
⼀ 1.1 G5-3021 T4-2126 J1-3021 ⼀ 1.4 G0-4752 T1-4562 J0-336E K0-7326 V1-4A2D
4E03 七 七 七 七 七 4E15 丕 丕 丕 丕 丕
⼀ 1.1 G0-465F T1-4424 J0-3C37 K0-7652 V1-4A23 ⼀ 1.4 G0-5287 T1-4561 J0-5023 K0-5D60 V1-4A2E
4E04 丄 丄 丄 4E16 世 世 世 世 世
⼀ 1.1 G0-523B T1-4421 J0-306C ⼀ 1.4 G0-4A40 T1-4560 J0-4024 K0-6126 V1-4A2F
丄 4E17 丗 丗 丗
H-9E93 ⼀ 1.3 GE-2124 T4-2155 J0-5242
4E05 丅 丅 丅 4E17 丗 丗 丗
⼀ 1.1 GE-2122 T3-2125 J1-3023 ⼀ 1.3 GE-2124 T4-2155 J0-5242
4E06 丆 丆 4E18 丘 丘 丘 丘 丘
⼀ 1.1 G1-7D3D K2-2121 ⼀ 1.4 G0-4770 T1-4563 J0-3556 K0-4E78 V1-4A30
4E07 万 万 万 万 万 4E19 丙 丙 丙 丙 丙
⼀ 1.2 G0-4D72 T2-2126 J0-4B7C K0-5832 V1-4A24 ⼀ 1.4 G0-317B T1-455F J0-4A3A K0-5C30 V1-4A31
4E08 丈 丈 丈 丈 丈 4E1A 业
⼀ 1.2 G0-5549 T1-4437 J0-3E66 K0-6D5B V1-4A25 ⼀ 1.4 G0-5235
4E09 三 三 三 三 三 业
⼀ 1.2 G0-487D T1-4435 J0-3B30 K0-5F32 V1-4A26 H-9EB2
The following figure shows an example for characters 41CB-41CC, 41DC, and 41EE.
41CC 䇌 䇌 41DD 䇝 䇝 䇮
立 117.7 GKX0871.17 T3-3D6F 竹 118.4 G3-634A T4-2E72 V2-7F50
23.3.3 Source reference presentation for CJK UNIFIED IDEOGRAPHS EXTENSION B, C, and D
The following figure shows the presentation for the CJK UNIFIED IDEOGRAPHS EXTENSION B, C, and D
blocks. Up to two sources per character are represented in a single row. If more than two sources exist,
an additional row is used.
The following figure shows an example for characters 2000F-20010, 20021, 20032-20033, and 20043-20044.
The names given by this standard to these entities shall follow the rules for name formation and name
uniqueness specified in this clause. This specification applies to the entity names in the English language
version of this standard.
NOTE 1 – In a version of such a standard in another language a) these rules may be amended to permit names to be gener-
ated using words and syntax that are considered appropriate within that language; b) the entity names from this version of the
standard may be replaced by equivalent unique names constructed according to the rules amended as in a) above.
NOTE 2 – Additional guidelines for constructing entity names are given in Annex L for information.
An entity name shall not contain two or more consecutive SPACE characters or consecutive HYPHEN-
MINUS characters. A collection name shall not contain two or more consecutive FULL STOP characters.
FULL STOP may appear only in between two alpha-numeric characters (LATIN CAPITAL LETTER A
through LATIN CAPITAL LETTER Z, DIGIT ZERO through DIGIT NINE) in a collection name.
EXAMPLE 2 The following collection name contains FULL STOP in between two Digits, DIGIT FOUR and DIGIT ONE:
UNICODE 4.1
EXAMPLE 3 The following collection name contains FULL STOP in between one Latin letter, LATIN CAPITAL LETTER D, and
a Digit, DIGIT SEVEN:
BMP-AMD.7
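The constraints on SPACE, HYPHEN-MINUS, and FULL STOP stated above can be checked mechanically. The sketch below covers only those constraints; it does not validate the permitted character repertoire, and the function name is hypothetical.

```python
import re
from typing import List

# A FULL STOP that sits between two alphanumeric characters (A-Z, 0-9).
_FULL_STOP_BETWEEN_ALNUM = re.compile(r"(?<=[A-Z0-9])\.(?=[A-Z0-9])")


def check_collection_name(name: str) -> List[str]:
    """Return a list of rule violations; an empty list means none were found."""
    problems = []
    if "  " in name:
        problems.append("consecutive SPACE characters")
    if "--" in name:
        problems.append("consecutive HYPHEN-MINUS characters")
    if ".." in name:
        problems.append("consecutive FULL STOP characters")
    # Every FULL STOP must have an alphanumeric character on both sides.
    if name.count(".") != len(_FULL_STOP_BETWEEN_ALNUM.findall(name)):
        problems.append("FULL STOP not between two alphanumeric characters")
    return problems
```

Both example names from the text, `UNICODE 4.1` and `BMP-AMD.7`, produce an empty list.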
24.5.3 Character names, character name aliases, and named UCS sequence identifiers
Character names, character name aliases, and named UCS sequence identifiers, taken together, constitute
a name space. Each character name, character name alias, or named UCS sequence identifier
shall be unique and distinct from all other character names, character name aliases, and named UCS
sequence identifiers.
For character names and named UCS sequence identifiers, two names shall be considered unique and
distinct if they are different even when SPACE and medial HYPHEN-MINUS characters are ignored and
even when the words "LETTER", "CHARACTER", and "DIGIT" are ignored in comparison of the names.
EXAMPLE 3 The following hypothetical character names would not be unique and distinct:
MANICHAEAN CHARACTER A
MANICHAEAN LETTER A
EXAMPLE 4: The following two actual character names are unique and distinct, because they differ by a HYPHEN-MINUS that
is not a medial HYPHEN-MINUS:
TIBETAN LETTER A
TIBETAN LETTER -A
The following two character names shall be considered unique and distinct:
HANGUL JUNGSEONG OE
HANGUL JUNGSEONG O-E
NOTE 2 – These two character names are explicitly handled as an exception, because they were defined in an earlier version
of this International Standard before the introduction of the name uniqueness requirement. This pair is, has been, and will be
the only exception to the uniqueness rule in this International Standard.
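One way to test the uniqueness rule is to reduce each name to a comparison key and compare keys. In the sketch below, a "medial" HYPHEN-MINUS is taken to be one with a letter or digit on both sides; that reading is an assumption of this example, and the function name is hypothetical.

```python
import re

IGNORED_WORDS = {"LETTER", "CHARACTER", "DIGIT"}
# Assumption: a medial HYPHEN-MINUS has a letter or digit immediately on each side.
_MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")


def uniqueness_key(name: str) -> str:
    """Reduce a name to the form in which it is compared for uniqueness."""
    without_medial_hyphens = _MEDIAL_HYPHEN.sub("", name)
    words = [w for w in without_medial_hyphens.split(" ") if w not in IGNORED_WORDS]
    return "".join(words)  # joining without separators ignores SPACE
```

With this key, the two hypothetical MANICHAEAN names collide, the two TIBETAN names stay distinct because the hyphen in `-A` follows a SPACE, and the HANGUL JUNGSEONG pair collides, which is why the text handles that pair as an explicit exception.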
For CJK Ideographs within the BMP, the coded representation is their two-octet value expressed as four
hexadecimal digits. For example, the first CJK Ideograph character in the BMP has the name “CJK
UNIFIED IDEOGRAPH-3400”.
For CJK Ideographs within the SIP, the coded representation is their five hexadecimal digit value. For
example, the first CJK Ideograph character in the SIP has the name “CJK UNIFIED IDEOGRAPH-20000”.
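The naming pattern for both planes reduces to one formatting rule: append the code point as hexadecimal digits, with at least four digits. The sketch below uses a hypothetical function name and only checks the plane; it does not verify that the code point is actually assigned to a CJK Ideograph.

```python
def cjk_unified_ideograph_name(code_point: int) -> str:
    """Build the name of a CJK Unified Ideograph from its code point."""
    plane = code_point >> 16
    if plane not in (0x00, 0x02):  # BMP or SIP only
        raise ValueError("expected a code point in the BMP or the SIP")
    # Four hexadecimal digits in the BMP, five in the SIP.
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"
```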
1) Obtain the code point of the Hangul syllable character. It is of the form h1h2h3h4 where h1, h2, h3, and h4
are hexadecimal digits representing the number h1h2h3h4 lying within the range AC00 to D7A3.
2) Derive the decimal numbers d1, d2, d3, d4 that are numerically equal to the hexadecimal digits h1, h2, h3,
h4 respectively.
3) Calculate the character index C from the formula
C = 4096 × (d1 - 10) + 256 × (d2 - 12) + 16 × d3 + d4
4) Calculate the syllable component indices I, P, F from the following formulae
I = C / 588 (Note: 0 ≤ I ≤ 18)
P = (C % 588) / 28 (Note: 0 ≤ P ≤ 20)
F = C % 28 (Note: 0 ≤ F ≤ 27)
where “/” indicates integer division (i.e. x / y is the integer quotient of the division), and “%” indicates
the modulo operation (i.e. x % y is the remainder after the integer division x / y).
5) Obtain the Latin character strings that correspond to the three indices I, P, F from columns 2, 3, and 4
respectively of table 1 below (for I = 11 and for F = 0 the corresponding strings are null). Concatenate
these three strings in left-to-right order to make a single string, the syllable-name.
6) The character name for the character code point h1h2h3h4 is then
HANGUL SYLLABLE s-n
where “s-n” indicates the syllable-name string derived in step 5.
EXAMPLE
For the character with code point D4DE:
d1 = 13, d2 = 4, d3 = 13, d4 = 14.
C = 10462
I = 17, P = 16, F = 18.
The corresponding Latin character strings are P, WI, BS. The syllable-name is PWIBS, and the character name is HANGUL
SYLLABLE PWIBS
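Steps 1) to 6) can be followed literally in a few lines of code. Table 1 is not reproduced in this section, so the three string lists below are an assumption: they use the Jamo short names published with the Unicode Standard, which agree with the worked example above. The function name is hypothetical.

```python
import unicodedata  # used only to cross-check the result

# Assumed contents of columns 2, 3 and 4 of table 1 (null strings at I = 11 and F = 0).
INITIAL = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
           "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
MEDIAL = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA", "WAE",
          "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
FINAL = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM", "LB",
         "LS", "LT", "LP", "LH", "M", "B", "BS", "S", "SS", "NG", "J", "C",
         "K", "T", "P", "H"]


def hangul_syllable_name(code_point: int) -> str:
    # Step 1: the code point must lie within AC00 to D7A3.
    if not 0xAC00 <= code_point <= 0xD7A3:
        raise ValueError("not a Hangul syllable code point")
    # Step 2: decimal values of the four hexadecimal digits.
    d1, d2, d3, d4 = (int(h, 16) for h in f"{code_point:04X}")
    # Step 3: character index C.
    c = 4096 * (d1 - 10) + 256 * (d2 - 12) + 16 * d3 + d4
    # Step 4: syllable component indices I, P, F.
    i, p, f = c // 588, (c % 588) // 28, c % 28
    # Steps 5 and 6: concatenate the strings and prefix the fixed part.
    return "HANGUL SYLLABLE " + INITIAL[i] + MEDIAL[p] + FINAL[f]


# The worked example from the text, cross-checked against Python's character database.
assert hangul_syllable_name(0xD4DE) == "HANGUL SYLLABLE PWIBS"
assert hangul_syllable_name(0xD4DE) == unicodedata.name("\uD4DE")
```

The formula in step 3 is equivalent to subtracting AC00 from the code point; the digit-wise form is kept here to match the procedure as written.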
For each Hangul syllable character, short additional information is defined and available in Annex R
along with each character name. This additional information consists of an alternative transliteration of
the Hangul syllable into Latin characters. It is also derived from the code point value by a similar
numerical procedure, described below.
The USI value corresponding to each NUSI is written using the coded representation determined by the
normalization form NFC (see 21). Each named UCS sequence has a unique code representation. All al-
lowed named UCS sequence identifiers for use with ISO/IEC 10646 are specified in the content linked
below. All other such named sequences are undefined.
The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line
mark, that specifies, after a 5-line header, Named UCS Sequence Identifiers; each line contains the
following information, organized in fields delimited by ‘;’:
NOTE 2 – The content is also available as a separate viewable file in the same file directory as this document. The file is
named “NUSI.txt”.
NOTE 3 – All the allowed Named UCS Sequence Identifiers for use with ISO/IEC 10646 are also specified in the Unicode
Standard UAX#34 (Unicode Named Character Sequences: http://www.unicode.org/reports/tr34/).
Row
00
.. Rows 00 to 33
..
..
(see figure 6)
33
34
.. CJK Unified Ideographs Extension A
..
Row
00 Controls Basic Latin Controls Latin-1 Supplement
01 Latin Extended-A Latin Extended-B
02 Latin Extended-B IPA (Intl. Phonetic Alphabet) Extensions Spacing Modifier Letters
03 Combining Diacritical Marks Greek and Coptic
04 Cyrillic
05 Cyrillic Supplement Armenian Hebrew
06 Arabic
07 Syriac Arabic Sup. Thaana Nko
08 Samaritan Mandaic
09 Devanagari Bengali
0A Gurmukhi Gujarati
0B Oriya Tamil
0C Telugu Kannada
0D Malayalam Sinhala
0E Thai Lao
0F Tibetan
10 Myanmar Georgian
11 Hangul Jamo
12 Ethiopic
13 Ethiopic Sup. Cherokee
14.. Unified Canadian Aboriginal Syllabics
16 Ogham Runic
17 Tagalog Hanunoo Buhid Tagbanwa Khmer
18 Mongolian UCAS Extended
19 Limbu Tai Le New Tai Lue * Khmer Symb.
1A Buginese Thai Tham
1B Balinese Sundanese Batak
1C Lepcha Ol Chiki Vedic Extensions
1D Phonetic Extensions Phonetic Extensions Sup. Combining Diacritical M Sup.
1E Latin Extended Additional
1F Greek Extended
20 General Punctuation Super-/Subscripts Currency Symbols Comb. Mks. Symb.
21 Letterlike Symbols Number Forms Arrows
22 Mathematical Operators
23 Miscellaneous Technical
24 Control Pictures O.C.R. Enclosed Alphanumerics
25 Box Drawing Block Elements Geometric Shapes
26 Miscellaneous Symbols
27 Dingbats Misc. Math. Symbols-A SAA
28 Braille Patterns
29 Supplemental Arrows-B Miscellaneous Mathematical Symbols-B
2A Supplemental Mathematical Operators
2B Miscellaneous Symbols and Arrows
2C Glagolitic Latin Ext-C Coptic
2D Georgian Sup. Tifinagh Ethiopic Extended Cyrillic Ext-A
2E Supplemental Punctuation CJK Radicals Supplement
2F Kangxi Radicals Ideog. Descr.
30 CJK Symbols And Punctuation Hiragana Katakana
31 Bopomofo Hangul Compatibility Jamo Kanbun Bopomofo E. CJK Strokes KPE
32 Enclosed CJK Letters And Months
33 CJK Compatibility
27 Structure of the Supplementary Multilingual Plane for scripts and symbols (SMP)
Because another supplementary plane is reserved for additional CJK Ideographs, the SMP (plane 1) is not
used to date for encoding CJK Ideographs. Instead, the SMP is used for encoding graphic characters
used in other scripts of the world that are not encoded in the BMP. Most, but not all, of the scripts encoded
to date in the SMP are not in use as living scripts by modern user communities.
NOTE 1 – The following subdivision of the SMP has been proposed:
Alphabetic scripts,
Hieroglyphic, ideographic and syllabaries,
Non CJK ideographic scripts,
Newly invented scripts,
Symbol sets
An overview of the Supplementary Multilingual Plane for scripts and symbols is shown in figure 7.
Row
00 Linear B Syllabary Linear B Ideograms
01 Aegean Numbers Ancient Greek Numbers Ancient Symbols Phaistos Disc
02 Lycian Carian
03 Old Italic Gothic Ugaritic Old Persian
04 Deseret Shavian Osmanya
…
08 Cypriot Syllabary Imp Aram
09 Phoenician Lydian
0A Kharoshthi Old South Arabian
0B Avestan I Parthian Inscript. Pahlavi
0C Old Turkic
0D
0E Rumi Numeral S.
0F
10 Brahmi Kaithi
…
20
… Cuneiform
23
24 Cuneiform Numbers and Punctuation
…
30
Egyptian Hieroglyphs
…
34
…
68
Bamum Supplement
…
6A
…
B0
B1
…
D0 Byzantine Musical Symbols
D1 Western Musical Symbols
D2 Ancient Greek Musical Not.
D3 Tai Xuan Jing Symbols Counting Rod Num
D4
… Mathematical Alphanumeric Symbols
D7
…
F0 Mahjong Tiles Domino Tiles Playing Cards
F1 Enclosed Alphanumeric Supplement
F2 Enclosed Ideographic Supplement
F3
… Miscellaneous Pictographic Symbols
F5
F6 Emoticons Transport and Map Symbols
F7 Alchemical Symbols
F8
…
FF
Figure 7 – Overview of the Supplementary Multilingual Plane for scripts and symbols
The SIP is also used for compatibility CJK ideographs. These ideographs are compatibility characters as
specified in 18.
Row
00
.. CJK Unified Ideographs Extension B
..
A6
A7
.. CJK Unified Ideographs Extension C
B7
B7 CJK Unified Ideographs Ext D
B8 …
B9
F8
CJK Compatibility Ideographs Supplement
..
FA
FB
..
FF
Row-octet
00 Tags
01 Variation Selectors Supplement
02
..
FF
Each code chart is followed by a corresponding character names list, except the CJK UNIFIED
IDEOGRAPHS blocks and the HANGUL SYLLABLES block.
NOTE – A block code chart and name list may be arranged in a single page if their contents allow it.
The following example describes various fragments of name lists including these informative items.
EXAMPLE
Annex A
(normative)
Collections of graphic characters for subsets
A.1 Collections of coded graphic characters
The collections listed below are ordered by collection number. An * in the “code points” column indicates
that the collection is a fixed collection.
1021 CUNEIFORM NUMBERS AND PUNCTUATION 12400-1247F
1022 COUNTING ROD NUMERALS 1D360-1D37F
1023 PHAISTOS DISC 101D0-101FF
1047 TRANSPORT AND MAP SYMBOLS 1F680-1F6FF
1048 ALCHEMICAL SYMBOLS 1F700-1F77F
2001 CJK UNIFIED IDEOGRAPHS EXTENSION B 20000-2A6DF
The following collections specify characters used for alternate formats and script-specific formats. See
Annex F for more information.
The following specify collections that represented the whole UCS when they were created.
NOTE 1 – The UNICODE collection incorporates all characters currently encoded in the standard.
The following specify collections which are the union of particular collections defined above.
NOTE – The parenthetical annotation located in some block names is not part of these names.
301 BMP-AMD.7 is specified by the following ranges of code points as indicated for each row or contigu-
ous series of rows.
Plane 00
Row Values within row
00 20-7E A0-FF
01 00-F5 FA-FF
02 00-17 50-A8 B0-DE E0-E9
03 00-45 60-61 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D0-D6 DA DC DE E0 E2-F3
04 01-0C 0E-4F 51-5C 5E-86 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
05 31-56 59-5F 61-87 89 91-A1 A3-B9 BB-C4 D0-EA F0-F4
06 0C 1B 1F 21-3A 40-52 60-6D 70-B7 BA-BE C0-CE D0-ED F0-F9
09 01-03 05-39 3C-4D 50-54 58-70 81-83 85-8C 8F-90 93-A8 AA-B0 B2 B6-B9 BC BE-C4 C7-C8 CB-CD D7 DC-DD DF-E3 E6-FA
0A 02 05-0A 0F-10 13-28 2A-30 32-33 35-36 38-39 3C 3E-42 47-48 4B-4D 59-5C 5E 66-74 81-83 85-8B 8D 8F-91 93-A8 AA-B0 B2-B3 B5-B9 BC-C5 C7-C9 CB-CD D0 E0 E6-EF
0B 01-03 05-0C 0F-10 13-28 2A-30 32-33 36-39 3C-43 47-48 4B-4D 56-57 5C-5D 5F-61 66-70 82-83 85-8A 8E-90 92-95 99-9A 9C 9E-9F A3-A4 A8-AA AE-B5 B7-B9 BE-C2 C6-C8 CA-CD D7 E7-F2
302 BMP SECOND EDITION is specified by the following ranges of code points as indicated for each row
or contiguous series of rows.
Plane 00
Row Values within row
00 20-7E A0-FF
01 00-FF
02 00-1F 22-33 50-AD B0-EE
03 00-4E 60-62 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D0-D7 DA-F3
04 00-86 88-89 8C-C4 C7-C8 CB-CC D0-F5 F8-F9
05 31-56 59-5F 61-87 89-8A 91-A1 A3-B9 BB-C4 D0-EA F0-F4
06 0C 1B 1F 21-3A 40-55 60-6D 70-ED F0-FE
07 00-0D 0F-2C 30-4A 80-B0
09 01-03 05-39 3C-4D 50-54 58-70 81-83 85-8C 8F-90 93-A8 AA-B0 B2 B6-B9 BC BE-C4 C7-C8 CB-CD D7 DC-DD DF-E3 E6-FA
0A 02 05-0A 0F-10 13-28 2A-30 32-33 35-36 38-39 3C 3E-42 47-48 4B-4D 59-5C 5E 66-74 81-83 85-8B 8D 8F-91 93-A8 AA-B0 B2-B3 B5-B9 BC-C5 C7-C9 CB-CD D0 E0 E6-EF
0B 01-03 05-0C 0F-10 13-28 2A-30 32-33 36-39 3C-43 47-48 4B-4D 56-57 5C-5D 5F-61 66-70 82-83 85-8A 8E-90 92-95 99-9A 9C 9E-9F A3-A4 A8-AA AE-B5 B7-B9 BE-C2 C6-C8 CA-CD D7 E7-F2
0C 01-03 05-0C 0E-10 12-28 2A-33 35-39 3E-44 46-48 4A-4D 55-56 60-61 66-6F 82-83 85-8C 8E-90 92-A8 AA-B3 B5-B9 BE-C4 C6-C8 CA-CD D5-D6 DE E0-E1 E6-EF
0D 02-03 05-0C 0E-10 12-28 2A-39 3E-43 46-48 4A-4D 57 60-61 66-6F 82-83 85-96 9A-B1 B3-BB BD C0-C6 CA CF-D4 D6 D8-DF F2-F4
0E 01-3A 3F-5B 81-82 84 87-88 8A 8D 94-97 99-9F A1-A3 A5 A7 AA-AB AD-B9 BB-BD C0-C4 C6 C8-CD D0-D9 DC-DD
0F 00-47 49-6A 71-8B 90-97 99-BC BE-CC CF
10 00-21 23-27 29-2A 2C-32 36-39 40-59 A0-C5 D0-F6 FB
11 00-59 5F-A2 A8-F9
12 00-06 08-46 48 4A-4D 50-56 58 5A-5D 60-86 88 8A-8D 90-AE B0 B2-B5 B8-BE C0 C2-C5 C8-CE D0-D6 D8-EE F0-FF
13 00-0E 10 12-15 18-1E 20-46 48-5A 61-7C A0-F4
14-15 1401-15FF
16 00-76 80-9C A0-F0
17 80-DC E0-E9
18 00-0E 10-19 20-77 80-A9
1E 00-9B A0-F9
1F 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20 00-46 48-4D 6A-70 74-8E A0-AF D0-E3
21 00-3A 53-83 90-F3
22 00-F1
23 00-7B 7D-9A
24 00-26 40-4A 60-EA
25 00-95 A0-F7
26 00-13 19-71
27 01-04 06-09 0C-27 29-4B 4D 4F-52 56 58-5E 61-67 76-94 98-AF B1-BE
28 00-FF
2E 80-99 9B-F3
2F 00-D5 F0-FB
30 00-3A 3E-3F 41-94 99-9E A1-FE
31 05-2C 31-8E 90-B7
32 00-1C 20-43 60-7B 7F-B0 C0-CB D0-FE
33 00-76 7B-DD E0-FE
34-4D 3400-4DB5
4E-9F 4E00-9FA5
A0-A3 A000-A3FF
A4 00-8C 90-A1 A4-B3 B5-C0 C2-C4 C6
AC-D7 AC00-D7A3
E0-F8 E000-F8FF
F9-FA F900-FA2D
FB 00-06 13-17 1D-36 38-3C 3E 40-41 43-44 46-B1 D3-FF
FC 00-FF
FD 00-3F 50-8F 92-C7 F0-FB
FE 20-23 30-44 49-52 54-66 68-6B 70-72 74 76-FC FF
FF 01-5E 61-BE C2-C7 CA-CF D2-D7 DA-DC E0-E6 E8-EE F9-FD
Plane 00
Collection number and name
302 BMP SECOND EDITION
98 SUPPLEMENTAL ARROWS-A
99 SUPPLEMENTAL ARROWS-B
100 MISCELLANEOUS MATHEMATICAL SYMBOLS-B
101 SUPPLEMENTAL MATHEMATICAL OPERATORS
102 KATAKANA PHONETIC EXTENSIONS
103 VARIATION SELECTORS
108 KHMER SYMBOLS
111 YIJING HEXAGRAM SYMBOLS
Plane 01
Collection number and name
1003 DESERET
1011 SHAVIAN
Plane 02
Row Values within row
00-A6 0000-A6D6
F8-FA F800-FA1D
Plane 0E
Collection number and name
3003 VARIATION SELECTORS SUPPLEMENT
Plane 0F
Row Values within row
00-FF 0000-FFFD
Plane 10
Row Values within row
00-FF 0000-FFFD
The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line
mark, that specifies, after an 11-line header, as many lines as there are IICORE characters; each line
contains the following information in fixed-length fields.
The content linked to is a plain text file, using ISO/IEC 646-IRV characters with LINE FEED as end of line
mark, that specifies, after a 3-line header, as many lines as there are characters in the collection; each
line contains the following information in fixed-length fields:
The code points of this collection are identified by the J1 Kanji J sources in the Source Reference file for
CJK Unified Ideographs (CJKU_SR.txt). See 0 for further details.
Plane 00
Row Values within row
00 20-7E A0-FF
01 00-13 16-2B 2E-4D 50-7E
02 C7 D8-DB DD
20 15 18-19 1C-1D AC
21 22 26 5B-5E 90-93
26 6A
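The "Row / Values within row" notation used in these tables can be expanded into explicit code point ranges. The sketch below assumes that two-digit values are offsets within the single row given in the first column, and that four-digit values (used with row ranges such as 14-15) are complete code points within the plane. The function name and the `plane` parameter are hypothetical.

```python
from typing import List, Tuple


def expand_row(row: str, values: str, plane: int = 0) -> List[Tuple[int, int]]:
    """Expand one table line into inclusive (first, last) code point ranges."""
    base_plane = plane << 16
    ranges = []
    for token in values.split():
        first, _, last = token.partition("-")
        last = last or first  # a single value is a range of length one
        if len(first) == 2:
            if "-" in row:
                raise ValueError("two-digit values need a single row")
            base = base_plane + (int(row, 16) << 8)
            ranges.append((base + int(first, 16), base + int(last, 16)))
        else:
            ranges.append((base_plane + int(first, 16), base_plane + int(last, 16)))
    return ranges


def contains(ranges: List[Tuple[int, int]], code_point: int) -> bool:
    return any(first <= code_point <= last for first, last in ranges)
```

For instance, `expand_row("20", "15 18-19 1C-1D AC")` gives four ranges that include 20AC, and `expand_row("00-A6", "0000-A6D6", plane=2)` gives the single range 20000 to 2A6D6.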
Plane 00
Row Values within row
00 20-7E A0-FF
01 00-7F 8F 92 B7 DE-EF FA-FF
02 18-1B 1E-1F 59 7C 92 BB-BD C6-C7 C9 D8-DD EE
03 74-75 7A 7E 84-8A 8C 8E-A1 A3-CE D7 DA-E1
04 00-5F 90-C4 C7-C8 CB-CC D0-EB EE-F5 F8-F9
1E 02-03 0A-0B 1E-1F 40-41 56-57 60-61 6A-6B 80-85 9B F2-F3
1F 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 F6-FE
20 13-15 17-1E 20-22 26 30 32-33 39-3A 3C 3E 44 4A 7F 82 A3-A4 A7 AC AF
21 05 16 22 26 5B-5E 90-95 A8
22 00 02-03 06 08-09 0F 11-12 19-1A 1E-1F 27-2B 48 59 60-61 64-65 82-83 95 97
23 02 10 20-21 29-2A
25 00 02 0C 10 14 18 1C 24 2C 34 3C 50-6C 80 84 88 8C 90-93 A0 AC B2 BA BC C4 CA-CB D8-D9
26 3A-3C 40 42 60 63 65-66 6A-6B
FB 01-02
FF FD
Plane 00
Row Values within row
00 41-50 52-56 59-5A 61-70 72-76 79-7A C0-C1 C3 C8-C9 CC-CD D1-D3 D5 D9-DA DD E0-E1 E3 E8-E9 F1-
F3 F5 F9-FA FD
01 04-05 0C-0D 16-19 28 2E-2F 60-61 68-6B 72-73 7D-7E
1E BC-BD F8-F9
• All J0 Kanji J sources in the Source Reference file for CJK Unified Ideographs (CJKU_SR.txt). See 0
for further details.
• Ranges of code points arranged by planes:
Plane 00
Row Values within row
00 20-7E A2 A3 A5 A7-A8 AC B0-B1 B4 B6 D7 F7
03 91-A1 A3-A9 B1-C1 C3-C9
04 01 10-4F 51
20 10 14 16 18-19 1C-1D 20-21 25-26 30 32-33 3B 3E
21 03 2B 90-93 D2 D4
22 00 02-03 07-08 0B 12 1A 1D-1E 20 27-2C 34-35 3D 52 60-61 66-67 6A-6B 82-83 86-87 A5
23 12
25 00-03 0C 0F-10 13-14 17-18 1B-1D 20 23-25 28 2B-2C 2F-30 33-34 37-38 3B-3C 3F 42 4B A0-A1 B2-B3 BC-BD C6-C7 CB CE-CF EF
26 05-06 40 42 6A 6D 6F
30 00-03 05-15 1C 41-93 9B-9E A1-F6 FB-FE
Plane 00
Row Values within row
00 A0-A1 A4 A6 A9-AB AD-AF B2-B3 B7-D6 D8-F6 F8-FF
01 00-09 0C-0F 11-13 18-1D 24-25 27 2A-2B 34-35 39-3A 3D-3E 41-44 47-48 4B-4D 50-55 58-65 6A-71 79-7E 93 C2 CD-CE D0-D2 D4 D6 D8 DA DC F8-F9 FD
02 50-5A 5C 5E-61 64-68 6C-73 75 79-7B 7D-7E 81-84 88-8E 90-92 94-95 98 9D A1-A2 C7-C8 CC D0-D1 D8-D9 DB DD-DE E5-E9
03 00-04 06 08 0B-0C 0F 18-1A 1C-20 24-25 29-2A 2C 2F-30 34 39-3D 61 C2
1E 3E-3F
1F 70-73
20 13 22 3C 3F 42 47-49 51 AC
21 0F 13 16 21 27 35 53-55 60-6B 70-7B 94 96-99 C4 E6-E9
22 05 09 13 1F 25-26 2E 43 45 48 62 76-77 84-85 8A-8B 95-97 BF DA-DB
23 05-06 18 BE-CC CE
24 23 60-73 D0-E9 EB-FE
25 B1 B6-B7 C0-C1 C9 D0-D3 E6
26 00-03 0E 16-17 1E 60-69 6B-6C 6E
27 13 56 76-7F
29 34-35 BF FA-FB
30 16-19 1D 1F-20 33-35 3B-3D 94-96 9A 9F-A0 F7-FA FF
31 F0-FF
32 31-32 39 51-5F A4-A8 B1-BF D0-E3 E5 E9 EC-ED FA
33 03 0D 14 18 22-23 26-27 2B 36 3B 49-4A 4D 51 57 7B-7E 8E-8F 9C-9E A1 C4 CB CD
FE 45-46
FF 5F-60
Planes 00-10
Collection number and name
285 BASIC JAPANESE
Plane 00
Row Values within row
20 15
21 16 21 60-69 70-79
22 11 1F 25 2E BF
24 60-73
30 1D 1F
32 31-32 39 A4-A8
33 03 0D 14 18 22-23 26-27 2B 36 3B 49-4A 4D 51 57 7B-7E 8E-8F 9C-9E A1 C4 CD
4E 28 E1 FC
4F 00 03 39 56 8A 92 94 9A C9 CD FF
50 1E 22 40 42 46 70 94 D8 F4
51 4A 64 9D BE EC
52 15 9C A6 AF C0 DB
53 00 07 24 72 93 B2 DD
54 8A 9C A9 FF
55 86
57 59 65 AC C7-C8
58 9E B2
59 0B 53 5B 5D 63 A4 BA
5B 56 C0 D8 EC
5C 1E A6 BA F5
5D 27 42 53 6D B8-B9 D0
5F 21 34 45 67 B7 DE
60 5D 85 8A D5 DE F2
61 11 20 30 37 98
62 13 A6
63 F5
64 60 9D CE
65 4E
66 00 09 15 1E 24 2E 31 3B 57 59 65 73 99 A0 B2 BF FA-FB
67 0E 66 BB C0
68 01 44 52 C8 CF
69 68 98 E2
6A 30 46 6B 73 7E E2 E4
6B D6
6C 3F 5C 6F 86 DA
6D 04 6F 87 96 AC CF F2 F8 FC
6E 27 39 3C 5C BF
6F 88 B5 F5
70 05 07 28 85 AB BB
71 04 0F 46-47 5C C1 FE
72 B1 BE
73 24 77 BD C9 D2 D6 E3 F5
74 07 26 29-2A 2E 62 89 9F
75 01 2F 6F
76 82 9B-9C 9E A6
77 46
78 21 4E 64 7A
79 30 94 9B
7A D1 E7 EB
7B 9E
7D 48 5C A0 B7 D6
7E 52 8A
7F 47 A1
83 01 62 7F C7 F6
84 48 B4 DC
85 53 59 6B B0
88 07 F5
89 1C
8A 12 37 79 A7 BE DF F6
8B 53 7F
8C F0 F4
8D 12 76
8E CF
90 67 DE
91 15 27 D7 DA DE E4-E5 ED-EE
92 06 0A 10 39-3A 3C 40 4E 51 59 67 77-78 88 A7 D0 D3 D5 D7 D9 E0 E7 F9 FB FF
93 02 1D-1E 21 25 48 57 70 A4 C6 DE F8
94 31 45 48
95 92
96 9D AF
97 33 3B 43 4D 4F 51 55
98 57 65
99 27 9E
9A 4E D9 DC
9B 72 75 8F B1 BB
9C 00
9D 6B 70
9E 19 D1
F9 29 DC
FA 0E-2D
FF 01-5E 61-9F E0-E5
Plane 00
Row Values within row
00 20-7E A0-FF
01 00-80 8F 97 9A-9B 9D-A1 AF-B0 B5-B7 CD-DC DE-F0 F4-F5 F8-FF
02 00-1B 1E-20 22-23 26-33 3A-3E 41-44 46-49 4C-4F 59 68 72 75 7C 89 92 94 B7 B9-BC BE-C1 C7-C8 CC-
CD D8-DB DD
03 00-04 06-11 13 15 1B 23-29 2D-2E 31-32 35 38 44 47-48 5C-61
1D 7D CD
1E 00-19 1C-2B 2E-73 76-99 9B 9E A0-F9
20 0C 11 13-15 18-1A 1C-1E 26 2F 32-33 39-3A 4A A5 AC
21 22 26 4D 5B-5E 90-93 9A-9B
22 12 15 60 64-65 6E-71
23 00
26 6A
2C 63 65-66
A7 88 8B-8C
Plane 00
Collection number and name
302 BMP SECOND EDITION
Plane 01
Row Values within row
03 00-1E 20-23 30-4A
04 00-25 28-4D
D0 00-F5
D1 00-26 2A-DD
D4 00-54 56-9C 9E-9F A2 A5-A6 A9-AC AE-B9 BB BD-C0 C2-C3 C5-FF
D5 00-05 07-0A 0D-14 16-1C 1E-39 3B-3E 40-44 46 4A-50 52-FF
D6 00-A3 A8-FF
D7 00-C9 CE-FF
Plane 02
Row Values within row
00-A6 0000-A6D6
F8-FA F800-FA1D
Plane 0E
Row Values within row
00 01 20-7F
Plane 0F
Row Values within row
00-FF 0000-FFFD
Plane 10
Row Values within row
00-FF 0000-FFFD
Planes 00-10
Collection number and name
303 UNICODE 3.1
Plane 00
Collection number and name
98 SUPPLEMENTAL ARROWS-A
99 SUPPLEMENTAL ARROWS-B
100 MISCELLANEOUS MATHEMATICAL SYMBOLS-B
101 SUPPLEMENTAL MATHEMATICAL OPERATORS
102 KATAKANA PHONETIC EXTENSIONS
103 VARIATION SELECTORS
FA 30-6A FF 5F-60
FE 45-46 73
Plane 00-10
Collection number and name
305 UNICODE 4.0
Plane 00
Row Values within row
02 37-41
03 58-5C FC-FF
04 F6-F7
05 A2 C5-C7
06 0B 1E 59-5E
07 50-6D
09 7D CE
0B B6 E6
0F D0-D1
10 F9-FA FC
12 07 47 87 AF CF EF
13 0F 1F 47 5F-60 80-99
19 80-A9 B0-C9 D0-D9 DE-DF
1A 00-1B 1E-1F
1D 6C-C3
20 55-56 58-5E 90-94 B2-B5 EB
21 3C 4C
23 D1-DB
26 18 7E-7F 92-9C A2-B1
27 C0-C6
2B 0E-13
2C 00-2E 30-5E 80-EA F9-FF
2D 00-25 30-65 6F 80-96 A0-A6 A8-AE B0-B6 B8-BE C0-C6 C8-CE D0-D6 D8-DE
2E 00-17 1C-1D
31 C0-CF
32 7E
9F A6-BB
A7 00-16
A8 00-2B
FA 70-D9
FE 10-19
Plane 01
Row Values within row
01 40-8A
03 A0-C3 C8-D5
0A 00-03 05-06 0C-13 15-17 19-33 38-3A 3F-47 50-58
D2 00-45
D6 A4-A5
A.6.5 307 UNICODE 5.0
The fixed collection 307 UNICODE 5.0 consists of a fixed collection from A.1 and several ranges of code
points. The collection list is arranged by planes as follows.
Plane 00-10
Collection number and name
306 UNICODE 4.1
Plane 00
Row Values within row 20 EC-EF
02 42-4F 21 4D-4E 84
03 7B-7D 23 DC-E7
04 CF FA-FF 26 B2
05 10-13 BA 27 C7-CA
07 C0-FA 2B 14-1A 20-23
09 7B-7C 7E-7F 2C 60-6C 74-77
0C E2-E3 F1-F2 A7 17-1A 20-21
1B 00-4B 50-7C A8 40-77
1D C4-CA FE-FF
Plane 01
Row Values within row 24 00-62 70-73
09 00-19 1F D3 60-71
20-22 2000-22FF D7 CA-CB
23 00-6E
Plane 00
Row Values within row 1E 00-FF
00 20-7E A0-FF 1F 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F-
01-02 0100-02FF 7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4
03 00-77 7A-7E 84-8A 8C 8E-A1 A3-FF F6-FE
04 00-FF 20 00-64 6A-71 74-8E 90-94 A0-B5 D0-F0
05 00-23 31-56 59-5F 61-87 89-8A 91-C7 D0-EA 21 00-4F 53-88 90-FF
F0-F4 22 00-FF
06 00-03 06-1B 1E-1F 21-5E 60-FF 23 00-E7
07 00-0D 0F-4A 4D-B1 C0-FA 24 00-26 40-4A 60-FF
09 01-39 3C-4D 50-54 58-72 7B-7F 81-83 85-8C 25 00-FF
8F-90 93-A8 AA-B0 B2 B6-B9 BC-C4 C7-C8 26 00-9D A0-BC C0-C3
CB-CE D7 DC-DD DF-E3 E6-FA 27 01-04 06-09 0C-27 29-4B 4D 4F-52 56 58-5E
0A 01-03 05-0A 0F-10 13-28 2A-30 32-33 35-36 61-94 98-AF B1-BE C0-CA CC D0-FF
38-39 3C 3E-42 47-48 4B-4D 51 59-5C 5E 66- 28-2A 2800-2AFF
75 81-83 85-8D 8F-91 93-A8 AA-B0 B2-B3 B5- 2B 00-4C 50-54
B9 BC-C5 C7-C9 CB-CD D0 E0-E3 E6-EF F1 2C 00-2E 30-5E 60-6F 71-7D 80-EA F9-FF
0B 01-03 05-0C 0F-10 13-28 2A-30 32-33 35-39 2D 00-25 30-65 6F 80-96 A0-A6 A8-AE B0-B6 B8-
3C-44 47-48 4B-4D 56-57 5C-5D 5F-63 66-71 BE C0-C6 C8-CE D0-D6 D8-DE E0-FF
82-83 85-8A 8E-90 92-95 99-9A 9C 9E-9F A3- 2E 00-30 80-99 9B-F3
A4 A8-AA AE-B9 BE-C2 C6-C8 CA-CD D0 D7 2F 00-D5 F0-FB
E6-FA 30 00-3F 41-96 99-FF
0C 01-03 05-0C 0E-10 12-28 2A-33 35-39 3D-44 31 05-2D 31-8E 90-B7 C0-E3 F0-FF
46-48 4A-4D 55-56 58-59 60-63 66-6F 78-7F 32 00-1E 20-43 50-FE
82-83 85-8C 8E-90 92-A8 AA-B3 B5-B9 BC-C4 33 00-FF
C6-C8 CA-CD D5-D6 DE E0-E3 E6-EF F1-F2 34-4C 3400-4CFF
0D 02-03 05-0C 0E-10 12-28 2A-39 3D-44 46-48 4D 00-B5 C0-FF
4A-4D 57 60-63 66-75 79-7F 82-83 85-96 9A- 4E-9F 4E00-9FC3
B1 B3-BB BD C0-C6 CA CF-D4 D6 D8-DF F2-F4 A0-A3 A000-A3FF
0E 01-3A 3F-5B 81-82 84 87-88 8A 8D 94-97 99- A4 00-8C 90-C6
9F A1-A3 A5 A7 AA-AB AD-B9 BB-BD C0-C4 C6 A5 00-FF
C8-CD D0-D9 DC-DD A6 00-2B 40-5F 62-73 7C-97
0F 00-47 49-6C 71-8B 90-97 99-BC BE-CC CE-D4 A7 00-8C FB-FF
10 00-99 9E-C5 D0-FC A8 00-2B 40-77 80-C4 CE-D9
11 00-59 5F-A2 A8-F9 A9 00-53 5F
12 00-48 4A-4D 50-56 58 5A-5D 60-88 8A-8D AA 00-36 40-4D 50-59 5C-5F
90-B0 B2-B5 B8-BE C0 C2-C5 C8-D6 D8-FF AC-D7 AC00-D7A3
13 00-10 12-15 18-5A 5F-7C 80-99 A0-F4 E0-F8 E000-F8FF
14-15 1401-15FF F9 00-FF
16 00-76 80-9C A0-F0 FA 00-2D 30-6A 70-D9
17 00-0C 0E-14 20-36 40-53 60-6C 6E-70 72-73 FB 00-06 13-17 1D-36 38-3C 3E 40-41 43-44 46-
80-DD E0-E9 F0-F9 B1 D3-FF
18 00-0E 10-19 20-77 80-AA FC 00-FF
19 00-1C 20-2B 30-3B 40 44-6D 70-74 80-A9 B0- FD 00-3F 50-8F 92-C7 F0-FD
C9 D0-D9 DE-FF FE 00-19 20-26 30-52 54-66 68-6B 70-74 76-FC
1A 00-1B 1E-1F FF
1B 00-4B 50-7C 80-AA AE-B9 FF 01-BE C2-C7 CA-CF D2-D7 DA-DC E0-E6 E8-
1C 00-37 3B-49 4D-7F EE F9-FD
1D 00-E6 FE-FF
Plane 01
Row Values within row 23 00-6E
00 00-0B 0D-26 28-3A 3C-3D 3F-4D 50-5D 80-FA 24 00-62 70-73
01 00-02 07-33 37-8A 90-9B D0-FD D0 00-F5
02 80-9C A0-D0 D1 00-26 29-DD
03 00-1E 20-23 30-4A 80-9D 9F-C3 C8-D5 D2 00-45
04 00-9D A0-A9 D3 00-56 60-71
08 00-05 08 0A-35 37-38 3C 3F D4 00-54 56-9C 9E-9F A2 A5-A6 A9-AC AE-B9 BB
09 00-19 1F-39 3F BD-C3 C5-FF
0A 00-03 05-06 0C-13 15-17 19-33 38-3A 3F-47 D5 00-05 07-0A 0D-14 16-1C 1E-39 3B-3E 40-44
50-58 46 4A-50 52-FF
20-22 2000-22FF D6 00-A5 A8-FF
Plane 02
Row Values within row
00-A6 0000-A6D6
F8-FA F800-FA1D
Plane 0E
Row Values within row
00 01 20-7F
01 00-EF
Plane 0F
Row Values within row
00-FF 0000-FFFD
Plane 10
Row Values within row
00-FF 0000-FFFD
NOTE – The collection 308 UNICODE 5.1 can also be determined by using another fixed collection from A.1 and several
ranges of code points.
Plane 00-10
Collection number and name
307 UNICODE 5.0
Plane 00
Row Values within row 20 64 F0
03 70-73 76-77 CF 21 4F 85-88
04 87 26 9D B3-BC C0-C3
05 14-23 27 CC EC-EF
06 06-0A 16-1A 3B-3F 2B 1B-1F 24-4C 50-54
07 6E-7F 2C 6D-6F 71-73 78-7D
09 71-72 2D E0-FF
0A 51 75 2E 18-1B 1E-30
0B 44 62-63 D0 31 2D D0-E3
0C 3D 58-59 62-63 78-7F 9F BC-C3
0D 3D 44 62-63 70-75 79-7F A5 00-FF
0F 6B-6C CE D2-D4 A6 00-2B 40-5F 62-73 7C-97
10 22 28 2B 33-35 3A-3F 5A-8A A7 1B-1F 22-8C FB-FF
18 AA A8 80-C4 CE-D9
1B 80-AA AE-B9 A9 00-53 5F
1C 00-37 3B-49 4D-7F AA 00-36 40-4D 50-59 5C-5F
1D CB-E6 FE 24-26
1E 9C-9F FA-FF
Plane 01
Row Values within row D1 29
01 90-9B D0-FD F0 00-2B 30-93
02 80-9C A0-D0
09 20-39 3F
Plane 00
Row Values within row 05 00-25 31-56 59-5F 61-87 89-8A 91-C7 D0-EA
00 20-7E A0-FF F0-F4
01-02 0100-02FF 06 00-03 06-1B 1E-1F 21-5E 60-FF
03 00-77 7A-7E 84-8A 8C 8E-A1 A3-FF 07 00-0D 0F-4A 4D-B1 C0-FA
04 00-FF 08 00-2D 30-3E
Plane 01
Row Values within row 24 00-62 70-73
00 00-0B 0D-26 28-3A 3C-3D 3F-4D 50-5D 80-FA 30-34 3000-342E
01 00-02 07-33 37-8A 90-9B D0-FD D0 00-F5
02 80-9C A0-D0 D1 00-26 29-DD
03 00-1E 20-23 30-4A 80-9D 9F-C3 C8-D5 D2 00-45
04 00-9D A0-A9 D3 00-56 60-71
08 00-05 08 0A-35 37-38 3C 3F-55 57-5F D4 00-54 56-9C 9E-9F A2 A5-A6 A9-AC AE-B9 BB
09 00-1B 1F-39 3F BD-C3 C5-FF
0A 00-03 05-06 0C-13 15-17 19-33 38-3A 3F-47 D5 00-05 07-0A 0D-14 16-1C 1E-39 3B-3E 40-44
50-58 60-7F 46 4A-50 52-FF
0B 00-35 39-55 58-72 78-7F D6 00-A5 A8-FF
0C 00-48 D7 00-CB CE-FF
0E 60-7E F0 00-2B 30-93
10 80-C1 F1 00-0A 10-2E 31 3D 3F 42 46 4A-4E 57 5F 79
20-22 2000-22FF 7B-7C 7F 8A-8D 90
23 00-6E F2 00 10-31 40-48
Plane 02
Row Values within row
00-A6 0000-A6D6
A7-B7 A700-B734
F8-FA F800-FA1D
Plane 0E
Row Values within row
00 01 20-7F
01 00-EF
Plane 0F
Row Values within row
00-FF 0000-FFFD
Plane 10
Row Values within row
00-FF 0000-FFFD
NOTE – The collection 309 UNICODE 5.2 can also be determined by using another fixed collection from A.1 and several
ranges of code points.
Plane 00-10
Collection number and name
308 UNICODE 5.1
Plane 00
Row Values within row 26 9E-9F BD-BF C4-CD CF-E1 E3 E8-FF
05 24-25 27 57
08 00-2D 30-3E 2B 55-59
09 00 4E 55 79-7A FB 2C 70 7E-7F EB-F1
0F D5-D8 2D E0-FF
10 9A-9D 2E 31
11 5A-5E A3-A7 FA-FF 32 44-4F
14 00 9F C4-C6
16 77-7F A4 D0-FF
18 B0-F5 A6 A0-F7
19 AA-AB DA A8 30-39 E0-FB
1A 20-5E 60-7C 7F-89 90-99 A0-AD A9 60-7C 80-CD CF-D9 DE-DF
1C D0-F2 AA 60-7B 80-C2 DB-DF
1D FD AB C0-ED F0-F9
20 B6-B8 D7 B0-C6 CB-FB
21 50-52 89 FA 6B-6D
23 E8
Plane 01
Row Values within row 10 80-C1
08 40-55 57-5F 30-34 3000-342E
09 1A-1B F1 00-0A 10-2E 31 3D 3F 42 46 4A-4E 57 5F
0A 60-7F 79 7B-7C 7F 8A-8D 90
0B 00-35 39-55 58-72 78-7F F2 00 10-31 40-48
0C 00-48
0E 60-7E
Plane 02
Row Values within row
A7-B7 A700-B734
Plane 00
Row Values within row 04 00-FF
00 20-7E A0-FF 05 00-27 31-56 59-5F 61-87 89-8A 91-C7 D0-EA
01-02 0100-02FF F0-F4
03 00-77 7A-7E 84-8A 8C 8E-A1 A3-FF 06 00-03 06-1B 1E-FF
07 00-0D 0F-4A 4D-B1 C0-FA 20 00-64 6A-71 74-8E 90-9C A0-B8 D0-F0
08 00-2D 30-3E 40-5B 5E 21 00-89 90-FF
09 00-77 79-7F 81-83 85-8C 8F-90 93-A8 AA-B0 22 00-FF
B2 B6-B9 BC-C4 C7-C8 CB-CE D7 DC-DD DF- 23 00-F3
E3 E6-FB 24 00-26 40-4A 60-FF
0A 01-03 05-0A 0F-10 13-28 2A-30 32-33 35-36 25 00-FF
38-39 3C 3E-42 47-48 4B-4D 51 59-5C 5E 66- 26 00-FF
75 81-83 85-8D 8F-91 93-A8 AA-B0 B2-B3 B5- 27 01-CA CC CE-FF
B9 BC-C5 C7-C9 CB-CD D0 E0-E3 E6-EF F1 28-2A 2800-2AFF
0B 01-03 05-0C 0F-10 13-28 2A-30 32-33 35-39 2B 00-4C 50-59
3C-44 47-48 4B-4D 56-57 5C-5D 5F-63 66-77 2C 00-2E 30-5E 60-7F 80-F1 F9-FF
82-83 85-8A 8E-90 92-95 99-9A 9C 9E-9F A3- 2D 00-25 30-65 6F-70 7F-96 A0-A6 A8-AE B0-B6
A4 A8-AA AE-B9 BE-C2 C6-C8 CA-CD D0 D7 B8-BE C0-C6 C8-CE D0-D6 D8-DE E0-FF
E6-FA 2E 00-31 80-99 9B-F3
0C 01-03 05-0C 0E-10 12-28 2A-33 35-39 3D-44 2F 00-D5 F0-FB
46-48 4A-4D 55-56 58-59 60-63 66-6F 78-7F 30 00-3F 41-96 99-FF
82-83 85-8C 8E-90 92-A8 AA-B3 B5-B9 BC-C4 31 05-2D 31-8E 90-BA C0-E3 F0-FF
C6-C8 CA-CD D5-D6 DE E0-E3 E6-EF F1-F2 32 00-1E 20-FE
0D 02-03 05-0C 0E-10 12-3A 3D-44 46-48 4A-4E 33 00-FF
57 60-63 66-75 79-7F 82-83 85-96 9A-B1 B3- 34-4C 3400-4CFF
BB BD C0-C6 CA CF-D4 D6 D8-DF F2-F4 4D 00-B5 C0-FF
0E 01-3A 3F-5B 81-82 84 87-88 8A 8D 94-97 99- 4E-9F 4E00-9FC6
9F A1-A3 A5 A7 AA-AB AD-B9 BB-BD C0-C4 C6 A0-A3 A000-A3FF
C8-CD D0-D9 DC-DD A4 00-8C 90-C6 D0-FF
0F 00-47 49-6C 71-97 99-BC BE-CC CE-DA A5 00-FF
10 00-C5 D0-FC A6 00-2B 40-73 7C-97 A0-F7
11 00-FF A7 00-91 A0-A9 FA-FF
12 00-48 4A-4D 50-56 58 5A-5D 60-88 8A-8D A8 00-2B 30-39 40-77 80-C4 CE-D9 E0-FB
90-B0 B2-B5 B8-BE C0 C2-C5 C8-D6 D8-FF A9 00-53 5F-7C 80-CD CF-D9 DE-DF
13 00-10 12-15 18-5A 5D-7C 80-99 A0-F4 AA 00-36 40-4D 50-59 5C-7B 80-C2 DB-DF
14-15 1400-15FF AB 01-06 09-0E 11-16 20-26 28-2E C0-ED F0-F9
16 00-9C A0-F0 AC-D6 AC00-D6FF
17 00-0C 0E-14 20-36 40-53 60-6C 6E-70 72-73 D7 00-A3 B0-C6 CB-FB
80-DD E0-E9 F0-F9 E0-F8 E000-F8FF
18 00-0E 10-19 20-77 80-AA B0-F5 F9 00-FF
19 00-1C 20-2B 30-3B 40 44-6D 70-74 80-AB B0- FA 00-2D 30-6D 70-D9
C9 D0-DA DE-FF FB 00-06 13-17 1D-36 38-3C 3E 40-41 43-44 46-
1A 00-1B 1E-5E 60-7C 7F-89 90-99 A0-AD C1 D3-FF
1B 00-4B 50-7C 80-AA AE-B9 C0-F3 FA-FF FC 00-FF
1C 00-37 3B-49 4D-7F D0-F2 FD 00-3F 50-8F 92-C7 F0-FD
1D 00-E6 FC-FF FE 00-19 20-26 30-52 54-66 68-6B 70-74 76-FC
1E 00-FF FF
1F 00-15 18-1D 20-45 48-4D 50-57 59 5B 5D 5F- FF 01-BE C2-C7 CA-CF D2-D7 DA-DC E0-E6 E8-
7D 80-B4 B6-C4 C6-D3 D6-DB DD-EF F2-F4 EE F9-FD
F6-FE
Plane 01
Row Values within row B0 00-01
00 00-0B 0D-26 28-3A 3C-3D 3F-4D 50-5D 80-FA D0 00-F5
01 00-02 07-33 37-8A 90-9B D0-FD D1 00-26 29-DD
02 80-9C A0-D0 D2 00-45
03 00-1E 20-23 30-4A 80-9D 9F-C3 C8-D5 D3 00-56 60-71
04 00-9D A0-A9 D4 00-54 56-9C 9E-9F A2 A5-A6 A9-AC AE-B9 BB
08 00-05 08 0A-35 37-38 3C 3F-55 57-5F BD-C3 C5-FF
09 00-1B 1F-39 3F D5 00-05 07-0A 0D-14 16-1C 1E-39 3B-3E 40-44
0A 00-03 05-06 0C-13 15-17 19-33 38-3A 3F-47 46 4A-50 52-FF
50-58 60-7F D6 00-A5 A8-FF
0B 00-35 39-55 58-72 78-7F D7 00-CB CE-FF
0C 00-48 F0 00-2B 30-93 A0-AE B1-BE C1-CF D1-DF
0E 60-7E F1 00-0A 10-2E 30-69 70-8E 90-9A E6-FF
10 00-4D 52-6F 80-C1 F2 00-02 10-3A 40-48 50-51
20-22 2000-22FF F3 00-20 30-35 37-7C 80-93 A0-C4 C6-CA E0-F0
23 00-6E F4 00-3E 40 42-F7 F9-FC
24 00-62 70-73 F5 00-3D 50-67 FB-FF
30-34 3000-342E F6 00-29 2B-3E 80-C5
68-6A 6800-6A38 F7 00-73
Plane 02
Row Values within row
00-A6 0000-A6D6
A7-B6 A700-B600
B7 30-34 40-FF
B8 00-1D
F8-FA F800-FA1D
Plane 0E
Row Values within row
00 01 20-7F
01 00-EF
Plane 0F
Row Values within row
00-FF 0000-FFFD
Plane 10
Row Values within row
00-FF 0000-FFFD
NOTE – The collection 310 UNICODE 6.0 can also be determined by using another fixed collection from A.1 and several
ranges of code points.
Plane 00-10
Collection number and name
309 UNICODE 5.2
Plane 00
Row Values within row 26 CE E2 E4-E7
05 26-27 27 05 0A-0B 28 4C 4E 53-55 5F-60 95-97 B0
08 40-5B 5E BF CE CF
09 3A-3B 4F 56-57 73-77 2D 70 7F
0B 72-77 30 97
0D 29 3A 4E 31 B8-BA
0F 8C-8F D9-DA A6 60-61
13 5D-5E A7 8D-91 A0-A9 FA
1B C0-F3 FA-FF AB 01-06 09-0E 11-16 20-26 28-2E
1D FC FB B2-C1
23 E9-F3
Plane 01
Row Values within row F3 00-20 30-35 37-7C 80-93 A0-C4 C6-CA
10 00-4D 52-6F E0-F0
68-6A 6800-6A38 F4 00-3E 40 42-F7 F9-FC
B0 00-01 F5 00-3D 50-67 FB-FF
F0 A0-AE B1-BE C1-CF D1-DF F6 00-29 2B-3E 80-C5
F1 30 32-3C 3E 40-41 43-45 47-49 4F-56 58- F7 00-73
5E 60-69 70-78 7A 7D-7E 80-89 8E-8F
91-9A E6-FF
F2 01-02 32-3A 50-51
Plane 02
Row Values within row B8 00-1D
B7 40-FF
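The "Row / Values within row" notation used in the collection lists above can be expanded mechanically into code point ranges. The following is a minimal sketch, not part of the standard; the function names `expand_row` and `contains` are hypothetical. It assumes that two-digit values are offsets within the named row, that four-digit values (used for multi-row entries such as "20-22 2000-22FF") are offsets within the plane, and that ranges wrapped across lines have been rejoined beforehand.

```python
def expand_row(plane: int, row: int, values: str) -> list[tuple[int, int]]:
    """Expand one table entry into inclusive (first, last) code point ranges."""
    plane_base = plane << 16
    ranges = []
    for token in values.split():
        first, _, last = token.partition("-")
        last = last or first  # a single value is a range of length one
        # Assumption: two hex digits are relative to the row,
        # four hex digits are relative to the plane (row is then ignored).
        row_base = (row << 8) if len(first) == 2 else 0
        ranges.append((plane_base + row_base + int(first, 16),
                       plane_base + row_base + int(last, 16)))
    return ranges


def contains(ranges: list[tuple[int, int]], code_point: int) -> bool:
    """Return True if code_point falls inside any of the ranges."""
    return any(lo <= code_point <= hi for lo, hi in ranges)


# Plane 00, row 1E, values "9C-9F FA-FF" (an entry from the lists above).
row_1e = expand_row(0x00, 0x1E, "9C-9F FA-FF")
print([(f"{lo:04X}", f"{hi:04X}") for lo, hi in row_1e])
```

A complete membership test for a collection would take the union of the ranges of every entry, plus the ranges of the fixed collection it builds on.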
Annex B
(normative)
List of combining characters
NOTE – Replaced by formal character class definition, see 4.14
Annex C
(normative)
Transformation format for planes 1 to 10 of the UCS (UTF-16)
NOTE – Incorporated in main body text, see UCS UTF-16 encoding form in 9 and UCS UTF-16 based encoding schemes in
10.
Annex D
(normative)
UCS Transformation Format 8 (UTF-8)
NOTE – Incorporated in main body text, see UCS UTF-8 encoding form in 9 and UCS UTF-8 encoding schemes in 10.
Annex E
(normative)
Mirrored characters in bidirectional context
NOTE – Replaced by formal character class definition for mirrored character, see 15.1.
Annex F
(informative)
Format characters
There is a special class of characters, called Format characters, the primary purpose of which is to affect
the layout or processing of characters around them. With few exceptions, these characters do not have
printable graphic symbols and, like the space characters, are represented in the character code charts by
dotted boxes.
The function of most of these characters is to indicate the correct presentation of a CC-data element. For
any text processing other than presentation (such as sorting and searching), the format characters, except
for ZWJ and ZWNJ described in F.1.1, can be ignored by filtering them out. The format characters are not
intended to be used in conjunction with bidirectional control functions from ISO/IEC 6429.
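As an illustration of filtering for non-presentation processing, the sketch below removes format characters before searching or sorting while keeping ZWJ and ZWNJ. It is an approximation under a stated assumption: the Unicode General Category "Cf" reported by Python's `unicodedata` module is used as a stand-in for the class of format characters described in this annex, and the two sets are not guaranteed to coincide. The function name is hypothetical.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER, kept
ZWJ = "\u200d"   # ZERO WIDTH JOINER, kept


def strip_format_characters(text: str) -> str:
    """Drop characters of category Cf, except ZWJ and ZWNJ."""
    return "".join(
        ch for ch in text
        if ch in (ZWNJ, ZWJ) or unicodedata.category(ch) != "Cf"
    )


# SOFT HYPHEN (00AD) is removed, so the search key matches the plain word.
print(strip_format_characters("tug\u00adgumi") == "tuggumi")
```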
SOFT HYPHEN (00AD): SOFT HYPHEN (SHY) is a format character that indicates a preferred intra-word
line-break opportunity. If the line is broken at that point, then whatever mechanism is appropriate for intra-
word line-breaks should be invoked, just as if the line break had been triggered by another mechanism,
such as a dictionary lookup. Depending on the language and the word, that may produce different visible
results, such as:
• inserting a graphic symbol indicating the hyphenation and breaking the line after it,
• inserting a graphic symbol indicating the hyphenation, breaking the line after the symbol and changing
spelling in the divided word parts,
• not showing any visible change and simply breaking the line at that point.
The inserted graphic symbol, if any, can take a wide variety of shapes, such as HYPHEN (2010),
ARMENIAN HYPHEN (058A), MONGOLIAN TODO SOFT HYPHEN (1806), as appropriate for the
situation.
When encoding text that includes explicit line breaking opportunities, including actual hyphenations,
characters such as HYPHEN, ARMENIAN HYPHEN, and MONGOLIAN TODO SOFT HYPHEN may be used,
depending on the language.
When a SOFT HYPHEN is inserted into a CC-data-element to encode a possible hyphenation point (for
example: "tug{00AD}gumi"), the character representation remains otherwise unchanged. When encoding
a CC-data-element that includes characters encoding hard line breaks, including actual hyphenations, the
character representation of the text sequence must reflect any changes due to hyphenation (for example:
"tugg{2010}" / "gumi", where / represents the line break).
NOTE 2 – The notations {00AD} and {2010} indicate the inclusion of the corresponding code points: 00AD and 2010 into the
CC-data-elements. The curly brackets “{}” are not part of the CC-data elements.
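The simplest of the visible results listed above (inserting a hyphen symbol and breaking the line after it) can be sketched as follows. This is an illustration only: the function name is hypothetical, HYPHEN (2010) is assumed as the inserted symbol, and language-specific spelling changes such as the "tugg{2010}" / "gumi" case are deliberately not handled.

```python
SHY = "\u00ad"     # SOFT HYPHEN
HYPHEN = "\u2010"  # HYPHEN, assumed here as the inserted graphic symbol


def break_candidates(word: str) -> list[tuple[str, str]]:
    """Return (end of line, start of next line) for each SOFT HYPHEN in word."""
    parts = word.split(SHY)
    candidates = []
    for i in range(1, len(parts)):
        head = SHY.join(parts[:i]) + HYPHEN  # visible hyphen ends the line
        tail = SHY.join(parts[i:])           # unused break points stay encoded
        candidates.append((head, tail))
    return candidates


print(break_candidates("tug\u00adgumi"))
```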
ZERO WIDTH SPACE (200B): This character behaves like a SPACE in that it indicates a word boundary,
but unlike SPACE it has no presentational width. For example, this character could be used to indicate
word boundaries in Thai, which does not use visible gaps to separate words.
WORD JOINER (2060) and ZERO WIDTH NO-BREAK SPACE (FEFF): These characters behave like a
NO-BREAK SPACE in that they indicate the absence of word boundaries, but unlike NO-BREAK SPACE
they have no presentational width. For example, these characters could be inserted after the fourth
character in the text "base+delta" to indicate that there is to be no word break between the "e" and the "+".
NOTE 3 – For additional usages of the ZERO WIDTH NO-BREAK SPACE for "signature", see Annex H.
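The two behaviours described above can be sketched with plain string operations. The helper names are hypothetical, and the Thai sample text is only an illustration of invisible word boundaries.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE: invisible word boundary
WJ = "\u2060"    # WORD JOINER: invisible "no boundary here"


def split_words(text: str) -> list[str]:
    """Split text at ZERO WIDTH SPACE word boundaries."""
    return [word for word in text.split(ZWSP) if word]


def forbid_break_after(text: str, count: int) -> str:
    """Insert a WORD JOINER after the first `count` characters."""
    return text[:count] + WJ + text[count:]


print(split_words("ภาษา\u200bไทย"))
print(forbid_break_after("base+delta", 4).encode("unicode_escape"))
```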
The following characters are used to indicate whether or not the adjacent characters are joined together in
rendering (cursive joiners).
ZERO WIDTH NON-JOINER (200C): This character indicates that the adjacent characters are not joined
together in cursive connection even when they would normally join together as cursive letter forms. For
example, ZERO WIDTH NON-JOINER between ARABIC LETTER NOON and ARABIC LETTER MEEM
indicates that the characters are not rendered with the normal cursive connection.
ZERO WIDTH JOINER (200D): This character indicates that the adjacent characters are represented with
joining forms in cursive connection even when they would not normally join together as cursive letter
forms. For example, in the sequence SPACE followed by ARABIC LETTER BEH followed by SPACE,
ZERO WIDTH JOINER can be inserted between the first two characters to display the final form of the
ARABIC LETTER BEH.
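The two examples above correspond to the following character sequences. The sketch only builds the sequences; whether the joining behaviour is honoured depends on the rendering system. The variable names are illustrative.

```python
import unicodedata

ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER
ZWJ = "\u200d"   # ZERO WIDTH JOINER
NOON = unicodedata.lookup("ARABIC LETTER NOON")
MEEM = unicodedata.lookup("ARABIC LETTER MEEM")
BEH = unicodedata.lookup("ARABIC LETTER BEH")

# NOON and MEEM would normally join; ZWNJ between them suppresses the connection.
unjoined = NOON + ZWNJ + MEEM

# An isolated BEH between spaces; ZWJ between the first two characters
# (SPACE and BEH) requests a joining form.
joined_beh = " " + ZWJ + BEH + " "

print([f"{ord(ch):04X}" for ch in unjoined])
print([f"{ord(ch):04X}" for ch in joined_beh])
```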
LINE SEPARATOR (2028): This character indicates where a new line starts; although the text continues
to the next line, it does not start a new paragraph; e.g. no inter-paragraph indentation might be applied.
PARAGRAPH SEPARATOR (2029): This character indicates where a new paragraph starts; e.g. the text
continues on the next line and inter-paragraph line spacing or paragraph indentation might be applied.
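A minimal sketch of how a receiving process might recover structure from these two separators: text is split into paragraphs at PARAGRAPH SEPARATOR and each paragraph into lines at LINE SEPARATOR. The function name is hypothetical.

```python
LS = "\u2028"  # LINE SEPARATOR
PS = "\u2029"  # PARAGRAPH SEPARATOR


def paragraphs(text: str) -> list[list[str]]:
    """Return a list of paragraphs, each a list of its lines."""
    return [paragraph.split(LS) for paragraph in text.split(PS)]


print(paragraphs("first line\u2028second line\u2029new paragraph"))
```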
An implicit algorithm uses the directional character properties to determine the correct display order of
characters on a horizontal line of text.
The following characters are format characters that act exactly like right-to-left or left-to-right characters in
terms of affecting ordering (Bidirectional format marks). They have no visible graphic symbols, and they
do not have any other semantic effect.
Their use can be more convenient than the explicit embeddings or overrides, since their scope is more
local.
LEFT-TO-RIGHT MARK (200E): In bidirectional formatting, this character acts like a left-to-right character
(such as LATIN SMALL LETTER A).
RIGHT-TO-LEFT MARK (200F): In bidirectional formatting, this character acts like a right-to-left character
(such as ARABIC LETTER NOON).
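The statement that the marks behave like ordinary directional characters can be checked against the bidirectional class exposed by Python's `unicodedata` module. The mapping of class values to directions below is a simplification assumed for illustration (strong classes only).

```python
import unicodedata

LRM = "\u200e"  # LEFT-TO-RIGHT MARK
RLM = "\u200f"  # RIGHT-TO-LEFT MARK


def direction(ch: str) -> str:
    """Classify a character by its strong bidirectional class, if any."""
    bidi_class = unicodedata.bidirectional(ch)
    if bidi_class == "L":
        return "left-to-right"
    if bidi_class in ("R", "AL"):
        return "right-to-left"
    return "other"


for ch in (LRM, "a", RLM, unicodedata.lookup("ARABIC LETTER NOON")):
    print(f"{ord(ch):04X}", direction(ch))
```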
The following format characters indicate that a piece of text is to be treated as embedded, and is to have a
particular ordering attached to it (Bidirectional format embeddings). For example, an English quotation in
the middle of an Arabic sentence can be marked as being an embedded left-to-right string. These format
characters nest in blocks, with the embedding and override characters initiating (pushing) a block, and the
pop character terminating (popping) a block.
The functions of the embedding and override characters are very similar; the main difference is that the
embedding characters specify the implicit direction of the text, while the override characters specify the
explicit direction of the text. When text has an explicit direction, the normal directional character properties
are ignored, and all of the text is assumed to have the ordering direction determined by the override
character.
LEFT-TO-RIGHT EMBEDDING (202A): This character is used to indicate the start of a left-to-right implicit
embedding.
RIGHT-TO-LEFT EMBEDDING (202B): This character is used to indicate the start of a right-to-left implicit
embedding.
LEFT-TO-RIGHT OVERRIDE (202D): This character is used to indicate the start of a left-to-right explicit
embedding.
RIGHT-TO-LEFT OVERRIDE (202E): This character is used to indicate the start of a right-to-left explicit
embedding.
POP DIRECTIONAL FORMATTING (202C): This character is used to indicate the termination of an
implicit or explicit directional embedding initiated by one of the four characters above.
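The push/pop nesting of embeddings and overrides can be illustrated with a simple depth counter. This is a sketch of the block structure only, not of the bidirectional algorithm itself; treating an unmatched pop or an unterminated block as an error is a strictness assumed here for illustration, and the function name is hypothetical.

```python
LRE, RLE = "\u202a", "\u202b"  # embeddings
LRO, RLO = "\u202d", "\u202e"  # overrides
PDF = "\u202c"                 # POP DIRECTIONAL FORMATTING
PUSH = {LRE, RLE, LRO, RLO}


def check_nesting(text: str) -> int:
    """Return the maximum nesting depth; raise ValueError if unbalanced."""
    depth = max_depth = 0
    for ch in text:
        if ch in PUSH:
            depth += 1
            max_depth = max(max_depth, depth)
        elif ch == PDF:
            if depth == 0:
                raise ValueError("pop without a matching embedding or override")
            depth -= 1
    if depth:
        raise ValueError("unterminated embedding or override")
    return max_depth


# A right-to-left embedding containing a nested left-to-right embedding.
print(check_nesting(RLE + "\u0646 " + LRE + "English" + PDF + PDF))
```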
The default state of interpretation may be set by a higher level protocol or standard, such as ISO/IEC
6429. In the absence of such a protocol, the default state is as established by ACTIVATE SYMMETRIC
SWAPPING.
INHIBIT SYMMETRIC SWAPPING (206A): Between this character and the following ACTIVATE
SYMMETRIC SWAPPING format character (if any), the mirrored characters described in clause 15 are
interpreted and rendered as LEFT and RIGHT, and the processing specified in that clause is not
performed.
ACTIVATE SYMMETRIC SWAPPING (206B): Between this character and the following INHIBIT
SYMMETRIC SWAPPING format character (if any), the mirrored characters described in clause 15 are
interpreted and rendered as OPENING and CLOSING characters as specified in that clause.
INHIBIT ARABIC FORM SHAPING (206C): Between this character and the following ACTIVATE ARABIC
FORM SHAPING format character (if any), the character shaping determination process is inhibited. The
stored Arabic presentation forms are presented without shape modification. This is the default state.
ACTIVATE ARABIC FORM SHAPING (206D): Between this character and the following INHIBIT ARABIC
FORM SHAPING format character (if any), the stored Arabic presentation forms are presented with shape
modification by means of the character shaping determination process.
NOTE – These characters have no effect on characters that are not presentation forms: in particular, Arabic nominal
characters as from 0600 to 06FF are always subject to character shaping, and are unaffected by these formatting characters.
NATIONAL DIGIT SHAPES (206E): Between this character and the following NOMINAL DIGIT SHAPES
format character (if any), digits from 0030 to 0039 are rendered with the appropriate national digit shapes
as specified by means of appropriate agreements. For example, they could be displayed with shapes such
as the ARABIC-INDIC digits from 0660 to 0669.
NOMINAL DIGIT SHAPES (206F): Between this character and the following NATIONAL DIGIT SHAPES
format character (if any), the digits from 0030 to 0039 are rendered with the shapes as those shown in the
code charts for those digits. This is the default state.
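The toggling behaviour of the digit shape characters can be sketched as a small state machine. The sketch assumes that the "national" shapes agreed upon are the ARABIC-INDIC digits from 0660 to 0669, as in the example above, and it substitutes code points rather than glyphs, which a real renderer would not do. The function name is hypothetical.

```python
NATIONAL = "\u206e"  # NATIONAL DIGIT SHAPES
NOMINAL = "\u206f"   # NOMINAL DIGIT SHAPES

# Assumed agreement: digits 0030-0039 map to ARABIC-INDIC digits 0660-0669.
ARABIC_INDIC = {0x30 + i: 0x660 + i for i in range(10)}


def render_digits(text: str) -> str:
    """Return the text as it might be displayed, format characters removed."""
    shown = []
    national = False  # nominal digit shapes are the default state
    for ch in text:
        if ch == NATIONAL:
            national = True
        elif ch == NOMINAL:
            national = False
        elif national:
            shown.append(ch.translate(ARABIC_INDIC))
        else:
            shown.append(ch)
    return "".join(shown)


print(render_digits("12\u206e34\u206f56"))
```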
INTERLINEAR ANNOTATION ANCHOR (FFF9): This character indicates the beginning of the base
string.
INTERLINEAR ANNOTATION SEPARATOR (FFFA): This character indicates the end of the base string
and the beginning of the annotation string.
INTERLINEAR ANNOTATION TERMINATOR (FFFB): This character indicates the end of the annotation
string.
The relationship between the annotation string and the base string is defined by agreement between the
user of the originating device and the user of the receiving device. For example, if the base string is
rendered in a visible form the annotation string may be rendered on a different line from the base string, in a
position close to the base string.
If the interlinear annotation characters are filtered out during processing, then all characters between the
Interlinear Annotation Separator and the Interlinear Annotation Terminator should also be filtered out.
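The filtering rule above can be sketched with a regular expression: the annotation string, together with its separator and terminator, is removed, and the anchor is dropped so that only the base string remains. The sketch assumes annotations are not nested, and the function name is hypothetical.

```python
import re

ANCHOR = "\ufff9"      # INTERLINEAR ANNOTATION ANCHOR
SEPARATOR = "\ufffa"   # INTERLINEAR ANNOTATION SEPARATOR
TERMINATOR = "\ufffb"  # INTERLINEAR ANNOTATION TERMINATOR

# Separator, then any characters up to and including the next terminator.
_ANNOTATION = re.compile("\ufffa[^\ufffb]*\ufffb")


def strip_annotations(text: str) -> str:
    """Remove interlinear annotations, keeping only the base strings."""
    without_annotation = _ANNOTATION.sub("", text)
    return without_annotation.replace(ANCHOR, "")


print(strip_annotations("\ufff9漢字\ufffaかんじ\ufffb!"))
```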
The scope of these characters is the subsequent sequence of digits (plus certain other characters), with
the exact specification as defined in the Unicode Standard, Version 6.0 (see Annex M for referencing
information), for ARABIC END OF AYAH.
INVISIBLE SEPARATOR (2063): This character indicates that adjacent mathematical symbols form a list,
e.g. when no visible COMMA is used between multiple indices.
Extended beams are used frequently in music notation between groups of notes having short values. The
format characters 1D173 MUSICAL SYMBOL BEGIN BEAM and 1D174 MUSICAL SYMBOL END BEAM
can be used to indicate the extents of beam groupings. In some exceptional cases, beams are unclosed
on one end. This can be indicated with a "null note" (MUSICAL SYMBOL NULL NOTEHEAD) character if
no stem is to appear at the end of the beam.
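Similarly, other format characters have been provided for other connecting structures. The characters
1D175 MUSICAL SYMBOL BEGIN TIE, 1D176 MUSICAL SYMBOL END TIE, 1D177 MUSICAL SYMBOL BEGIN
SLUR, 1D178 MUSICAL SYMBOL END SLUR, 1D179 MUSICAL SYMBOL BEGIN PHRASE and 1D17A MUSICAL
SYMBOL END PHRASE indicate the extent of these features.
These pairs of characters modify the layout and grouping of notes and phrases in full music notation.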
When musical examples are written or rendered in plain text without special software, the start/end control
characters may be rendered as brackets or left un-interpreted. More sophisticated in-line processes may
interpret them, to the extent possible, in their actual control capacity, rendering ties, slurs, beams, and
phrases as appropriate.
For maximum flexibility, the character set includes both pre-composed note values as well as primitives
from which complete notes are constructed. Due to their ubiquity, the pre-composed versions are provided
mainly for convenience.
Coding convenience notwithstanding, notes built up from alternative noteheads, stems and flags, and
articulation symbols are necessary for complete implementations and complex scores. Examples of their
use include American shape-note and modern percussion notations. For example,
Augmentation dots and articulation symbols may be appended to either the pre-composed or built-up
notes.
In addition, augmentation dots and articulation symbols may be repeated as necessary to build a
complete note symbol. For example,
MUSICAL SYMBOL EIGHTH NOTE + MUSICAL SYMBOL COMBINING AUGMENTATION DOT + MUSICAL SYMBOL
COMBINING AUGMENTATION DOT + MUSICAL SYMBOL COMBINING ACCENT
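The sequence above can be assembled by character name. The sketch relies on Python's `unicodedata.lookup`, which resolves the names as long as the interpreter's Unicode database includes the musical symbols; the variable names are illustrative.

```python
import unicodedata

NAMES = [
    "MUSICAL SYMBOL EIGHTH NOTE",
    "MUSICAL SYMBOL COMBINING AUGMENTATION DOT",
    "MUSICAL SYMBOL COMBINING AUGMENTATION DOT",
    "MUSICAL SYMBOL COMBINING ACCENT",
]

# A pre-composed note followed by two augmentation dots and an accent.
note = "".join(unicodedata.lookup(name) for name in NAMES)

print([f"{ord(ch):05X}" for ch in note])
```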
These tag characters can be used to spell out a character string in any ASCII-based tagging scheme that
needs to be embedded into plain text. These characters can be easily identified by their code value and
there is no overloading of usage for these tag characters. They can only express tag values and never
textual content itself.
When characters are used within the context of a protocol or syntax containing explicit markup providing
the same association, the Tag characters may be filtered out and ignored by these protocols.
For example, in SGML/XML context, an explicit language markup is specified. Therefore, the LANGUAGE
TAG (E0001) and other tag characters should not be used to mark a language in that context. The
Unicode Consortium and the W3C have co-written a technical report: Unicode in XML and other Markup
Languages (UTR#20), available from the Unicode web site (http://www.unicode.org/reports/), which describes
these issues in detail.
The TAGS block contains 97 dedicated tag characters consisting of a clone of the BASIC LATIN graphic
characters (names formed by prefixing these BASIC LATIN names with the word ‘TAG’, code points from
E0020 to E007E), as well as a language tag identification character: LANGUAGE TAG (E0001) and a
cancel tag character: CANCEL TAG (E007F).
The tag identification character is used as a mechanism for identifying tags of different types. This enables
multiple types of tags to coexist amicably embedded in plain text and solves the problem of delimitation if
a tag is concatenated directly onto another tag. Although only one type of tag is currently specified,
namely the language tag, the encoding of other tag identification characters in the future would allow for
distinct types to be used.
No termination character is required for a tag. A tag terminates either when the first non Special Purpose
Plane character is encountered, or when the next tag identification character is encountered.
Tag arguments can only be encoded using tag characters. No other characters are valid for expressing
the tag arguments.
The usage of the CANCEL TAG character without a prefixed tag identification character cancels any tag
value that may be defined.
The main function of the character is to make possible such operations as blind concatenation of strings in
a tagged context without the propagation of inappropriate tag values across the string boundaries.
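Because each tag character is a clone of a BASIC LATIN character offset into the range E0020 to E007E, spelling out a tag and filtering tags out again are both simple arithmetic on code points. The following is a minimal sketch; the function names are hypothetical and the language value "fr" is only an example.

```python
LANGUAGE_TAG = "\U000E0001"  # tag identification character
CANCEL_TAG = "\U000E007F"
TAG_OFFSET = 0xE0000         # TAG X = BASIC LATIN X + E0000


def encode_language_tag(language: str) -> str:
    """Spell out a language tag using tag characters only."""
    if not all(0x20 <= ord(ch) <= 0x7E for ch in language):
        raise ValueError("tag arguments must be BASIC LATIN graphic characters")
    return LANGUAGE_TAG + "".join(chr(TAG_OFFSET + ord(ch)) for ch in language)


def strip_tags(text: str) -> str:
    """Filter out every character of the TAGS block (E0000 to E007F)."""
    return "".join(ch for ch in text if not 0xE0000 <= ord(ch) <= 0xE007F)


tagged = encode_language_tag("fr") + "bonjour"
print([f"{ord(ch):05X}" for ch in tagged[:3]])
print(strip_tags(tagged))
```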
Annex G
(informative)
Alphabetically sorted list of character names
The alphabetically sorted list of character names is provided in machine-readable format that is accessible
as a link to this document. The content linked to is a plain text file, using ISO/IEC 646-IRV characters with
LINE FEED as end of line mark, that specifies, after a 4-lines header, all the character names from
ISO/IEC 10646 except Hangul syllables and CJK ideographs (these are characters from blocks:
HANGUL SYLLABLES,
CJK UNIFIED IDEOGRAPHS,
CJK UNIFIED IDEOGRAPHS EXTENSION A,
CJK UNIFIED IDEOGRAPHS EXTENSION B,
CJK UNIFIED IDEOGRAPHS EXTENSION C,
CJK UNIFIED IDEOGRAPHS EXTENSION D,
CJK COMPATIBILITY IDEOGRAPHS, and
CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT).
The format of the file, after the header, is as follows:
Annex H
(informative)
The use of “signatures” to identify UCS
NOTE – Integrated in main body text, see 10.
Annex I
(informative)
Ideographic description characters
An Ideographic Description Character (IDC) is a graphic character, which is used with a sequence of other
graphic characters to form an Ideographic Description Sequence (IDS). Such a sequence may be used to
describe an ideographic character which is not specified within this International Standard.
The IDS describes the ideograph in the abstract form. It is not interpreted as a composed character and
does not imply any specific form of rendering.
NOTE – An IDS is not a character and therefore is not a member of the repertoire of ISO/IEC 10646.
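An IDS consists of an IDC followed by a fixed number of Description Components (DCs). A DC may be any
one of the following:
• a coded ideograph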
• a coded radical
• another IDS
NOTE 1 – The above description implies that any IDS may be nested within another IDS.
Each IDC has four properties:
• the number of DCs used in the IDS that commences with that IDC,
• the definition of its acronym,
• the syntax of the corresponding IDS,
• the relative positions of the DCs in the visual representation of the ideograph that is being described in
its abstract form.
The syntax of the IDS introduced by each IDC is indicated in the “IDS Acronym and Syntax” column of the
table by the abbreviated name of the IDC (e.g. IDC-LTR) followed by the corresponding number of DCs,
i.e. (D1 D2) or (D1 D2 D3).
NOTE 2 – An IDS is restricted to no more than 16 characters in length. Also no more than six ideographs and/or radicals may
occur between any two instances of an IDC character within an IDS.
IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO BELOW (2FF1): The IDS introduced by this
character describes the abstract form of the ideograph with D1 above D2.
IDEOGRAPHIC DESCRIPTION CHARACTER LEFT TO MIDDLE AND RIGHT (2FF2): The IDS
introduced by this character describes the abstract form of the ideograph with D1 on the left of D2, and D2 on
the left of D3.
IDEOGRAPHIC DESCRIPTION CHARACTER ABOVE TO MIDDLE AND BELOW (2FF3): The IDS
introduced by this character describes the abstract form of the ideograph with D1 above D2, and D2 above
D3.
IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND (2FF4): The IDS introduced by this
character describes the abstract form of the ideograph with D1 surrounding D2.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE (2FF5): The IDS introduced
by this character describes the abstract form of the ideograph with D1 above D2, and surrounding D2 on
both sides.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW (2FF6): The IDS introduced
by this character describes the abstract form of the ideograph with D1 below D2, and surrounding D2 on
both sides.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT (2FF7): The IDS introduced by
this character describes the abstract form of the ideograph with D1 on the left of D2, and surrounding D2
above and below.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT (2FF8): The IDS
introduced by this character describes the abstract form of the ideograph with D1 at the top left corner of D2,
and partly surrounding D2 above and to the left.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT (2FF9): The IDS
introduced by this character describes the abstract form of the ideograph with D1 at the top right corner of
D2, and partly surrounding D2 above and to the right.
IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT (2FFA): The IDS
introduced by this character describes the abstract form of the ideograph with D1 at the bottom left corner of
D2, and partly surrounding D2 below and to the left.
IDEOGRAPHIC DESCRIPTION CHARACTER OVERLAID (2FFB): The IDS introduced by this character
describes the abstract form of the ideograph with D1 and D2 overlaying each other.
⿰ D1 D2 ⿰ 亻母 㑄
ABOVE TO BELOW 2 IDC-ATB D1 D2
⿱ D1
D2 ⿱八天 关
LEFT TO MIDDLE AND RIGHT 3
IDC-LMR D1 D2
D3
⿲ D1 D2D3 ⿲彳言亍 𧗳𧗳
⿳
D1
⿳从从日 㫺
IDC-AMB D1 D2
ABOVE TO MIDDLE AND BELOW 3 D2
D3
D3
⿴
D1
FULL SURROUND 2 IDC-FSD D1 D2 D2 ⿴囗莫 𡈗𡈗
⿵
D1
SURROUND FROM ABOVE 2 IDC-SAV D1 D2
D2 ⿵門卞 𨳲𨳲
SURROUND FROM BELOW 2 IDC-SBL D1 D2
⿶ D2
D1
⿶凵士 凷
SURROUND FROM LEFT 2 IDC-SLT D1 D2
⿷ D1 D2 ⿷匸虎 𠥶𠥶
⿸
D1
SURROUND FROM UPPER LEFT 2 IDC-SUL D1 D2
D2 ⿸广舞 𢋑𢋑
⿹
D1
SURROUND FROM UPPER RIGHT 2 IDC-SUR D1 D2
D2 ⿹勹去 𠣗𠣗
SURROUND FROM LOWER LEFT 2 IDC-SLL D1 D2
⿺ D1
D2
⿺辶交 䢒
⿻
D1
OVERLAID 2 IDC-OVL D1 D2
D2
*
⿻从工 巫
* NOTE – D1 and D2 overlap each other. This diagram does not imply that D1 is on the top left corner and D2 is on the bottom
right corner.
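As an informal illustration, which is not part of this standard, the prefix structure of an IDS can be checked by reading each IDC together with the number of DCs listed for it in the table above. The Python sketch below assumes that every DC is either a single character or a nested IDS, and that ⿰ (2FF0) takes two DCs like the other two-component IDCs. The names IDC_ARITY, parse_ids and is_well_formed_ids are hypothetical.

```python
# Number of description components (DCs) taken by each IDC, 2FF0 to 2FFB.
IDC_ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
IDC_ARITY["\u2FF2"] = 3  # LEFT TO MIDDLE AND RIGHT
IDC_ARITY["\u2FF3"] = 3  # ABOVE TO MIDDLE AND BELOW


def parse_ids(text, pos=0):
    """Parse one DC starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("IDS ends before all DCs are supplied")
    ch = text[pos]
    arity = IDC_ARITY.get(ch)
    if arity is None:
        # Not an IDC: treat the character as a plain component.
        return ch, pos + 1
    parts = []
    pos += 1
    for _ in range(arity):
        part, pos = parse_ids(text, pos)
        parts.append(part)
    return (ch, *parts), pos


def is_well_formed_ids(text):
    """True if text is exactly one sequence introduced by an IDC."""
    if not text or text[0] not in IDC_ARITY:
        return False
    try:
        _, end = parse_ids(text)
    except ValueError:
        return False
    return end == len(text)
```

With these definitions, is_well_formed_ids("⿲彳言亍") holds, whereas is_well_formed_ids("⿲彳言") does not, because the third DC is missing. The sketch checks only the component count; it does not verify that the components are meaningful ideographs or radicals.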
Annex J
(informative)
Recommendation for combined receiving/originating devices with internal storage
This annex is applicable to a widely-used class of devices that can store received CC-data-elements for
subsequent retransmission.
This recommendation is intended to ensure that loss of information is minimized between the receipt of a
CC-data-element and its retransmission.
A device of this class includes a receiving device component and an originating device component as in
2.3, and can also store received CC-data-elements for retransmission, with or without modification by the
actions of the user on the corresponding characters represented within it. Within this class of device, two
distinct types are identified here, as follows.
1) Receiving device with full retransmission capability. The originating device component will re-
transmit the coded representations of any received characters, including those that are outside the
identified subset of the receiving device component, without change to their coded representation,
unless modified by the user.
2) Receiving device with subset retransmission capability. The originating device component can re-
transmit only the coded representations of the characters of the subset adopted by the receiving de-
vice component.
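The difference between the two device types can be sketched informally in Python. This is an illustration under stated assumptions, not part of the recommendation: the function name retransmit and its parameters are hypothetical, user modification is left out, and characters outside the subset are simply omitted for a type 2 device, which is a choice this annex does not specify.

```python
def retransmit(received, supported_subset, full_capability):
    """Return the code points the originating device component would send.

    received          -- iterable of stored code points (integers)
    supported_subset  -- set of code points in the receiving component's subset
    full_capability   -- True for a type 1 device, False for a type 2 device
    """
    if full_capability:
        # Type 1: every received character is passed on unchanged,
        # including characters outside the identified subset.
        return list(received)
    # Type 2: only characters of the adopted subset can be retransmitted.
    # Dropping the others is an assumption made for this sketch.
    return [cp for cp in received if cp in supported_subset]
```

For a stored sequence 0041, 4E2D, 0042 and a subset containing only 0041 and 0042, a type 1 device retransmits all three code points, while a type 2 device in this sketch retransmits 0041 and 0042 only.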
Annex K
(informative)
Notations of octet value representations
The representation of octet values in ISO/IEC 10646, except in clause 12, differs from that used in other
character coding standards such as ISO/IEC 2022, ISO/IEC 6429 and ISO 8859. This annex clarifies the
relationship between the two notations.
In ISO/IEC 10646, the notation used to express an octet value is z, where z is a hexadecimal number in
the range 00 to FF. For example, the character ESCAPE (ESC) of ISO/IEC 2022 is represented in
ISO/IEC 10646 by 1B.
In other character coding standards, the notation used to express an octet value is x/y, where x and y are
two decimal numbers in the range 00 to 15. The correspondence between the notations of the form x/y
and the octet value is as follows.
• x is the number represented by bit 8, bit 7, bit 6 and bit 5 where these bits are given the weights 8, 4,
2 and 1 respectively;
• y is the number represented by bit 4, bit 3, bit 2 and bit 1 where these bits are given the weights 8, 4,
2 and 1 respectively.
For example, the character ESC of ISO/IEC 2022 is represented by 01/11.
Thus the octet value notation of ISO/IEC 2022 (and other character coding standards) can be converted to
ISO/IEC 10646 octet value notation by converting the values of x and y to hexadecimal notation. For
example, 04/15 is equivalent to 4F.
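The conversion between the two notations can be sketched in Python as follows. This is an informal illustration only; the function names xy_to_hex and hex_to_xy are hypothetical.

```python
def xy_to_hex(notation):
    """Convert x/y notation (for example '01/11') to two hexadecimal digits."""
    x_text, y_text = notation.split("/")
    x, y = int(x_text), int(y_text)
    if not (0 <= x <= 15 and 0 <= y <= 15):
        raise ValueError("x and y must be in the range 00 to 15")
    # x carries bits 8 to 5 and y carries bits 4 to 1 of the octet.
    return f"{x:X}{y:X}"


def hex_to_xy(octet):
    """Convert two hexadecimal digits (for example '1B') to x/y notation."""
    value = int(octet, 16)
    if not 0 <= value <= 0xFF:
        raise ValueError("octet value must be in the range 00 to FF")
    return f"{value >> 4:02d}/{value & 0x0F:02d}"
```

Applied to the examples in this annex, xy_to_hex("01/11") gives 1B for ESC and xy_to_hex("04/15") gives 4F, while hex_to_xy("1B") gives 01/11.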
Annex L
(informative)
Character naming guidelines
Clause 24 of this standard specifies rules for name formation and name uniqueness. These rules are
used in other information technology coded character set standards such as ISO/IEC 646, ISO/IEC 6937,
ISO/IEC 8859, and ISO/IEC 10367. This annex provides additional guidelines for the creation of
entity names.
These guidelines do not apply to the names of CJK Ideographs and Hangul syllables, which are formed
using the rules specified in clauses 24.5 and 24.6 respectively.
Guideline 1
The name of an entity wherever possible denotes its customary meaning (for example, the character
name: PLUS SIGN or the block name: BENGALI).
Some entities, such as characters, may have a name that describes shape rather than usage (for example,
the character name: UPWARDS ARROW).
The name of an entity is not intended to identify its properties or attributes, or to provide information on its
linguistic characteristics, except as defined in guideline 4 below.
Guideline 2
An acronym consists of Latin capital letters A to Z and digits and is associated with a name.
Acronyms may be used in entity names where usage already exists and clarity requires it. For example,
the names of control functions are coupled with an acronym.
EXAMPLES
Name: Acronym
LOCKING-SHIFT TWO RIGHT LS2R
SOFT HYPHEN SHY
INTERNATIONAL PHONETIC ALPHABET IPA
NOTE – In ISO/IEC 6429, the names of the modes have also been presented in the same way as control functions.
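The character repertoire allowed in an acronym under this guideline can be expressed as a simple pattern. The Python sketch below is an informal illustration; the names ACRONYM_RE and is_acronym are hypothetical, and the sketch checks only the repertoire (Latin capital letters A to Z and digits), not whether the acronym is associated with a name.

```python
import re

# Latin capital letters A to Z and digits 0 to 9, at least one character.
ACRONYM_RE = re.compile(r"[A-Z0-9]+")


def is_acronym(text):
    """True if text uses only the characters permitted in an acronym."""
    return ACRONYM_RE.fullmatch(text) is not None
```

The acronyms in the examples above (LS2R, SHY, IPA) all satisfy this check, whereas a string containing a hyphen or a small letter does not.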
Guideline 3
Character names and named UCS Sequence Identifiers only include digits 0 to 9 if spelling out the name
of the corresponding digit(s) would be inappropriate.
NOTE – As an example the name of the character at the code point value 201A is SINGLE LOW-9 QUOTATION MARK; the
symbol for the digit 9 is included in this name to illustrate the shape of the character, and has no numerical significance.
Guideline 4
Character names and named UCS Sequence Identifiers are constructed from an appropriate set of the
applicable terms of the following grid and ordered in the sequence of this grid. Exceptions are specified in
guidelines 9 to 11. The words WITH and AND may be included for additional clarity when needed.
1 Script 5 Attribute
2 Case 6 Designation
3 Type 7 Mark(s)
4 Language 8 Qualifier
For character names, where a character comprises a base letter with multiple marks, the sequence of
those in the name is the order in which the marks are positioned relative to the base letter. The sequence
may start with the marks above the letters taken in upwards sequence, and follow with the marks below
the letters taken in downwards sequence, or the reverse (below/above).
For named UCS Sequence Identifiers, where the sequence comprises a base letter with multiple marks,
the name describes the individual characters in the sequence in which they are encoded in the sequence.
EXAMPLES
Ộ LATIN CAPITAL LETTER O WITH CIRCUMFLEX AND DOT BELOW
Ḉ LATIN CAPITAL LETTER C WITH CEDILLA AND ACUTE
Ų́ LATIN CAPITAL LETTER U WITH OGONEK AND ACUTE
Guideline 5
The letters of the Latin script are represented within their name by their basic graphic symbols (A, B, C,
etc.). The letters of all other scripts are represented by their transcription in the language of the first pub-
lished International Standard.
EXAMPLES
K LATIN CAPITAL LETTER K
Ю CYRILLIC CAPITAL LETTER YU
Guideline 6
In principle when a character of a given script is used in more than one language, no language name is
specified. Exceptions are tolerated where an ambiguity would otherwise result.
EXAMPLES
И CYRILLIC CAPITAL LETTER I
I CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I
Guideline 7
Letters that are elements of more than one script are considered different even if their shape is the same;
they have different names.
EXAMPLES
A LATIN CAPITAL LETTER A
Α GREEK CAPITAL LETTER ALPHA
А CYRILLIC CAPITAL LETTER A
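The distinction can be observed with any tool that reports character names. As an informal illustration, the Python sketch below uses the standard library module unicodedata, whose names come from the Unicode Character Database and are assumed here to match the names in this standard for these three characters.

```python
import unicodedata

# Three letters of similar shape from three different scripts.
look_alikes = ["\u0041", "\u0391", "\u0410"]

# Each code point has its own name, even though the glyphs look alike.
names = [unicodedata.name(ch) for ch in look_alikes]

for ch, name in zip(look_alikes, names):
    print(f"{ord(ch):04X} {name}")
```

The three code points 0041, 0391 and 0410 are reported with the Latin, Greek and Cyrillic names shown in the examples above.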
Guideline 8
Where possible, named UCS Sequence Identifiers are constructed by appending the names of the con-
stituent elements together while eliding duplicate elements. Should this process result in a name that al-
ready exists, the name is modified suitably to guarantee uniqueness among character names and named
UCS Sequence Identifiers. The words WITH and AND may be included for additional clarity when needed.
Guideline 9
A character of one script used in isolation in another script, for example as a graphic symbol in relation
with physical units of dimension, is considered as a character different from the character of its native
script.
EXAMPLE
µ MICRO SIGN
Guideline 10
A number of characters have a traditional name consisting of one or two words. It is not intended to
change this usage.
EXAMPLES
' APOSTROPHE
: COLON
@ COMMERCIAL AT
_ LOW LINE
~ TILDE
Guideline 11
In some cases, characters of a given script, often punctuation marks, are used in another script for a dif-
ferent usage. In these cases the customary name reflecting the most general use is given to the charac-
ter. The customary name may be followed in the list of characters of a particular standard by the name in
parentheses which this character has in the script specified by this particular standard.
EXAMPLE
‿ UNDERTIE (Enotikon)
Annex M
(informative)
Sources of characters
Several sources and contributions were used for constructing this coded character set. National and inter-
national standards are listed first for each category, followed by relevant publications references.
General
ISO international register of character sets to be used with escape sequences (registration procedure ISO
2375:1985).
ISO 8879:1986, Information processing - Text and office systems - Standard Generalized Markup Lang-
uage (SGML).
ISO/IEC TR 15285:1998, Information technology - An operational model for characters and glyphs.
Allworth, Edward. Nationalities of the Soviet East: Publications and Writing Systems. New York, London,
Columbia University Press, 1971. ISBN 0-231-03274-9.
Barry, Randall K. 1997. ALA-LC romanization tables: transliteration schemes for non-Roman scripts.
Washington, DC: Library of Congress Cataloging Distribution Service. ISBN 0-8444-0940-5
Daniels, Peter T., and William Bright, eds. 1996. The world's writing systems. New York; Oxford: Oxford
University Press. ISBN 0-19-507993-0
Diringer, David. 1996. The alphabet: a key to the history of mankind. New Delhi: Munshiram Manoharlal.
ISBN 81-215-0780-0
Faulmann, Carl. 1990 (1880). Das Buch der Schrift. Frankfurt am Main: Eichborn. ISBN 3-8218-1720-8
Haarmann, Harald. 1990. Universalgeschichte der Schrift. Frankfurt/Main; New York: Campus. ISBN 3-
593-34346-0
Imprimerie Nationale. 1990. Les caractères de l'Imprimerie nationale. Paris: Imprimerie nationale Éditions.
ISBN 2-11-081085-8
Jensen, Hans. 1969. Die Schrift in Vergangenheit und Gegenwart. 3., neubearbeitete und erweiterte Au-
flage. Berlin: VEB Deutscher Verlag der Wissenschaften.
Knuth, Donald E. The TeXbook. – 19th. printing, rev, – Reading, MA : Addison-Wesley, 1990.
Nakanishi, Akira. 1990. Writing systems of the world: alphabets, syllabaries, pictograms. Rutland, VT:
Charles E. Tuttle. ISBN 0-8048-1654-9
Shepherd, Walter. Shepherd's glossary of graphic signs and symbols. Compiled and classified for ready
reference. – New York : Dover Publications, [1971].
The Unicode Consortium. The Unicode Standard. Worldwide Character Encoding Version 1.0, Volume
One. Reading, MA: Addison-Wesley, 1991.
The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, MA: Addison-Wesley, 1996. ISBN
0-201-48345-9
The Unicode Consortium. The Unicode Standard, Version 3.0. Reading, MA: Addison-Wesley Developer's
Press, 2000. ISBN 0-201-61633-5
The Unicode Consortium. The Unicode Standard, Version 4.0. Reading, MA: Addison-Wesley Developer's
Press, 2003. ISBN 0-321-18578-1
The Unicode Consortium. The Unicode Standard, Version 5.0. Reading, MA: Addison-Wesley Developer's
Press, 2007. ISBN 0-321-48091-0
Arabic
ISO/IEC 8859, Information technology - 8-bit single-byte coded graphic character sets
Part 6: Latin/Arabic alphabet (1999)
ISO 9036:1987, Information processing - Arabic 7-bit coded character set for information interchange.
ASMO 449-1982 Arab Organization for Standardization and Metrology. Data processing - 7-bit coded
character set for information interchange.
Balinese
Medra, Nengah. Pedoman Pasang Aksara Bali. Denpasar: Dinas Kebudayaan Propinsi Bali, 2003.
Menaka, Made. Kamus Kawi Bali / olih, made Menaka. Singaraja: Yayasan Kawi Sastra Mandala, 1990.
Braille
ISO 11548-1:2001. Communication aids for blind persons – identifiers, names and assignation to coded
character sets for 8-dot Braille characters – Part 1: General guidelines for Braille identifiers and shift
marks.
Canadian Aboriginal Syllabics
Canadian Aboriginal Syllabic Encoding Committee. Repertoire of Unified Canadian Aboriginal Syllabics
Proposed for Inclusion into ISO/IEC 10646: International Standard Universal Multi-Octet Coded Character
Set. [Canada]: CASEC [1994]
Cherokee
Holmes, Ruth Bradley, and Betty Sharp Smith. 1976. Beginning Cherokee: Talisgo galiquogi dideliquas-
dodi Tsalagi digoweli . Norman: University of Oklahoma Press.
CJK Ideographs
GB2312-80 Code of Chinese Graphic Character Set for Information Interchange: Jishu Biaozhun Chuban-
she (Technical Standards Publishing).
GBK (Guo Biao Kuo) Han character internal code extension specification: Jishu Biaozhun Chubanshe
(Technical Standards Publishing, Beijing)
JIS X 0201-1976 Japanese Standards Association. Jouhou koukan you fugou (Code for Information Inter-
change).
JIS X 0208-1990 Japanese Standards Association. Jouhou koukan you kanji fugoukei (Code of the Japa-
nese Graphic Character Set for Information Interchange).
JIS X 0212-1990 Japanese Standards Association. Jouhou koukan you kanji fugou-hojo kanji (Code of the
supplementary Japanese graphic character set for information interchange).
JIS X 0213:2000, Japanese Standards Association. 7-bit and 8-bit double byte coded extended KANJI
sets for information interchange, 2000-01-20.
KS X 1001:2004 (formerly KSC 5601-1992) Korean Industrial Standards Association. Jeongbo gyohwan-
yong buho (Code for Information Interchange(Hangeul and Hanja)).
ANSI Z39.64-1989 American National Standards Institute. East Asian character code for bibliographic
use.
Mandarin Promotion Council, Ministry of Education, Taiwan. Shiangtu yuyan biauyin fuhau shoutse (The
Handbook of Taiwan Languages Phonetic Alphabet). 1999.
Shinmura, Izuru. Kojien – Dai 4-han. – Tokyo : Iwanami Shoten, Heisei 3 [1991].
NOTE – For additional sources of the CJK unified ideographs in ISO/IEC 10646 refer to clause 22.4.
Coptic
Browne, Gerald M. Old Nubian Grammar. München: Lincom Europa, 2002. (Languages of the world: Ma-
terials, 330). ISBN 3-89586-893-0 (pbk.).
Kasser, Rodolphe. “La ‘Genève 1986’: une nouvelle série de caractères typographiques coptes, proto-
coptes et vieux-coptes créée à Genève.” Bulletin de la Société d’égyptologie de Genève, 12 (1988): 59–
60. ISSN 0255-6286.
Kasser, Rodolphe. “A standard system of sigla for referring to the dialects of Coptic.” Journal of Coptic
Studies, 1 (1990): 141–151. ISSN 1016-5584.
Cypriot
Cyrillic
ISO/IEC 8859, Information technology - 8-bit single-byte coded graphic character sets
Part 5: Latin/Cyrillic alphabet (1999)
ISO 5427:1984, Extension of the Cyrillic alphabet coded character set for bibliographic information inter-
change.
ISO 10754:1984, Information and documentation – Extension of the Cyrillic alphabet coded character set
for non-Slavic languages for bibliographic information interchange.
Deseret
Encyclopedia of Mormonism, entry for “Deseret Alphabet.” New York: Macmillan, 1992. ISBN 0-02-
904040-X.
Ivins, Stanley S. “The Deseret Alphabet” Utah Humanities Review 1 (1947): 223-39.
Monson, Samuel C. Representative American Phonetic Alphabets. New York: 1954. Ph.D. dissertation—
Columbia University.
Ethiopic
Armbruster, Carl Hubert. Initia Amharica: an Introduction to Spoken Amharic. Cambridge, Cambridge Uni-
versity Press, 1908-20.
Launhardt, Johannes. Guide to Learning the Oromo (Galla) Language. Addis Ababa, Launhardt [1973?]
Leslau, Wolf. Amharic Textbook. Wiesbaden, Harrassowitz; Berkeley, University of California Press, 1968.
Glagolitic
ISO 6861, Information and documentation - Glagolitic coded character set for bibliographic information
interchange.
Glagolitica: zum Ursprung der slavischen Schriftkultur, herausgegeben von Heinz Miklas, unter der Mitar-
beit von Sylvia Richter und Velizar Sadovski. Wien: Verlag der Österreichischen Akademie der Wissen-
schaften, 2000. (Schriften der Balkan-Kommission, Philologische Abteilung, 41). ISBN 3-7001-2895-9.
Khaburgaev, Georgii Aleksandrovich. Staroslavianskii iazyk. Izd. 2-e, perer. i dop. Moskva: Prosveshche-
nie, 1986.
Žubrinic, Darko. Hrvatska glagoljica: biti pismen—biti svoj. Zagreb: Hrvatsko književno društvo sv.
Jeronima (sv. Cirila i Metoda): Element, 1996. ISBN 953-6111-35-7.
Gothic
Ebbinghaus, Ernst. “The Gothic Alphabet.” In The World’s Writing Systems, edited by Peter T. Daniels
and William Bright. New York: Oxford University Press, 1996. ISBN 0-19-507993-0.
Fairbanks, Sydney, and F. P. Magoun Jr. 1940. ‘On writing and printing Gothic’, in Speculum 15:313-16.
Greek
ISO 5428:1984, Greek alphabet coded character set for bibliographic information interchange.
ISO/IEC 8859, Information technology - 8-bit single-byte coded graphic character sets
Part 7: Latin/Greek alphabet (1999)
Austin, Colin. Comicorum Graecorum Fragmenta in Papyris Reperta, ed. Colinus Austin. Berolini [Berlin],
Novi Eboraci [New York]: de Gruyter, 1973, p. 29. ISBN 3110024012.
Homer. Iliad. Homeri Ilias, edidit Thomas W. Allen. 3 vols. Oxonii [Oxford]: e typographeo Clarendoniano
[Clarendon Press], 1931, vol. 2: pp. 39, 234.
The Oxyrhynchus Papyri, Part XV, edited with translations and notes by Bernard P. Grenfell and Arthur S.
Hunt. London: Egypt Exploration Society, 1921, p. 56. (Egypt Exploration Society, Graeco-Roman Mem-
oirs, 18).
Hebrew
ISO/IEC 8859, Information technology - 8-bit single-byte coded graphic character sets
Part 8: Latin/Hebrew alphabet (1999)
ISO 8957:1996, Information and documentation - Hebrew alphabet coded character sets for bibliographic
information interchange.
SI 1311.1 – 1996: Standards Institution of Israel. Information technology. ISO 8 bit coded character set
with Hebrew points.
SI 1311.2 – 1996, The Standards Institution of Israel. Information Technology. ISO 8-bit coded character
set for information interchange with Hebrew points and cantillation marks.
Indian scripts
IS 13194:1991 Bureau of Indian Standards Indian script code for information interchange - ISCII
International Phonetic Alphabet
Esling, John. Computer coding of the IPA: supplementary report. Journal of the International Phonetic
Association, 20:1 (1990), p. 22-26.
International Phonetic Association. The IPA 1989 Kiel Convention Workgroup 9 report: Computer Coding
of IPA Symbols and Computer Representation of Individual Languages. Journal of the International Phon.
Assoc., 19:2 (1989), p. 81-82.
Pullum, Geoffrey K. Phonetic symbol guide. Geoffrey K. Pullum and William A. Ladusaw. – Chicago : Uni-
versity of Chicago Press, 1986.
Pullum, Geoffrey K. Remarks on the 1989 revision of the International Phonetic Alphabet. Journal of the
International Phonetic Association, 20:1 (1990), p. 33-40.
Kharoshthi
Glass, Andrew. A Preliminary Study of Kharosthi Manuscript Paleography. 2000. Thesis (M.A.), University
of Washington, 2000.
Glass, Andrew. “Kharoṣṭhī Manuscripts: A Window on Gandhāran Buddhism.” Nagoya Studies in Indian
Culture and Buddhism, 24 (2004): 129–152. ISSN 0285-7154.
Salomon, Richard. Ancient Buddhist Scrolls from Gandhāra: The British Library Kharosthi Fragments.
Seattle: University of Washington Press; London: British Library, 1999. ISBN 029597768X; 0295977698
(pbk).
Latin
ISO/IEC 646:1991, Information technology - ISO 7-bit coded character set for information interchange.
ISO 5426:1983, Extension of the Latin alphabet coded character set for bibliographic information inter-
change.
ISO 6438:1983, Documentation - African coded character set for bibliographic information interchange.
ISO 6937:1994, Information technology - Coded graphic character sets for text communication - Latin
alphabet.
ISO/IEC 8859, Information technology - 8-bit single-byte coded graphic character sets
Part 1: Latin alphabet No. 1 (1998).
Part 2: Latin alphabet No. 2 (1999).
Part 3: Latin alphabet No. 3 (1999).
Part 4: Latin alphabet No. 4 (1998).
Part 9: Latin alphabet No. 5 (1999)
Part 10: Latin alphabet No. 6 (1998).
ISO/IEC 10367:1991, Information technology - Standardized coded graphic character sets for use in 8-bit
codes.
ANSI X3.4-1986 American National Standards Institute. Coded character set - 7-bit American national
standard code.
ANSI Z39.47-1985 American National Standards Institute. Extended Latin alphabet coded character set
for bibliographic use.
LVS 18-92 Latvian National Centre for Standardization and Metrology Libiesu kodu tabula ar 191 simbolu.
Limbu
Driem, George van. A Grammar of Limbu. Berlin, New York: Mouton de Gruyter, 1987. (Mouton grammar
library, 4.) ISBN 0-89925-345-8. Appendix: Anthology of Kiranti scripts, pp. 550–558.
Sprigg, R. K. “Limbu Books in the Kiranti Script.” In International Congress of Orientalists (24th: 1957:
Munich). Akten des Vierundzwanzigsten Internationalen Orientalisten-Kongresses München 28. August
bis 4. September 1957, hrsg. von Herbert Franke. Wiesbaden: Deutsche Morgenländische Gesellschaft,
in Kommission bei Franz Steiner Verlag, 1959.
Sprigg, R. K. [Review of van Driem (1987)]. Bulletin of the School of Oriental and African Studies, Univer-
sity of London, 52 (1989):1.163–165.
Subba, B. B. Limbu, Nepali, English Dictionary. Gangtok: Text Book Unit, Directorate of Education, Govt.
of Sikkim, 1979 [i.e. 1980]. Cover title: Yakthun-Pene-Mikphula-panchekva.
Subba, B. B. Yakthuŋ huɁsiŋlam (“Limbu self-teaching method”) = Limbu akṣar gāiḍ (“Limbu letter guide”).
Gangtok: Kwality Stores, 1991?
Yoṅhāṅ, Khel Rāj. Limbū Nepālī śabdakoś. [Lalitpur]: 2052 B.S. [i.e. 1995].
Bennett, Emmett L. “Aegean Scripts.” In The World’s Writing Systems, edited by Peter T. Daniels and
William Bright. New York: Oxford University Press, 1996. ISBN 0-19-507993-0.
Chadwick, John. The Decipherment of Linear B. 2nd ed. London: Cambridge University Press., 1967 [i.e.
1968].
Chadwick, John. Linear B and Related Scripts. Berkeley: University of California Press; [London]: British
Museum, 1987. (Reading the Past, v. 1.) ISBN 0-520-06019-9.
Hooker, J. T. Linear B: An Introduction. Bristol: Bristol Classical Press, 1980. ISBN 0-906515-69-6. Cor-
rected printing published 1983. ISBN 0-906515-69-6; 0-906515-62-9 (pbk.).
International Colloquium on Mycenaean Studies (3rd: 1961: Racine, WI). Mycenaean Studies: Proceed-
ings of the Third International Colloquium for Mycenaean Studies held at “Wingspread,” 4–8 September
1961, edited by Emmett L. Bennett, Jr. Madison: University of Wisconsin Press, 1964.
Masson, Olivier. Les Inscriptions chypriotes syllabiques: recueil critique et commenté. Réimpr. augm.
Paris: E. de Boccard, 1983.
Sampson, Geoffrey. Writing Systems: A Linguistic Introduction. Stanford, CA: Stanford University Press,
1985. ISBN 0-8047-1254-9. Also published: London, Hutchinson. ISBN 0-09-156980-X; 0-09-173051-1
(pbk.).
Ventris, Michael. Documents in Mycenaean Greek. 1st ed. by Michael Ventris and John Chadwick with a
foreword by Alan J. B. Wace. 2nd ed. by John Chadwick. Cambridge: Cambridge University Press, 1973.
ISBN 0-521-08558-6.
Mathematical Symbols
ISO 6862, Information and documentation - Mathematical coded character set for bibliographic inform-
ation interchange.
ANSI Y10.20-1988 American National Standards Institute. Mathematic signs and symbols for use in
physical sciences and technology.
Mathematical Markup Language (MathML) Version 2.0. (W3C Recommendation 21 February 2001).
Editors: David Carlisle, Patrick Ion, Robert Miner, [and] Nico Poppelier. Latest version:
http://www.w3.org/TR/MathML2/
Selby, Samuel M. Standard mathematical tables. – 16th ed. – Cleveland, OH : Chemical Rubber Co.,
1968. Shepherd, Walter.
Swanson, Ellen. Mathematics into Type. Updated ed. by Arlene O’Sean and Antoinette Schleyer. Provi-
dence, RI: American Mathematical Society, 1999. ISBN 0-8218-1961-5.
Musical Symbols
ELOT 1373. The Greek Byzantine Musical Notation System. Athens, 1997 (ΣΕΠ ΕΛΟΤ 1373: 1997).
Catholic Church. Graduale Sacrosanctae Romanae Ecclesiae de Tempore et de Sanctis SS. D. N. Pii X.
Pontificis Maximi. Parisiis: Desclée, 1961. (Graduale Romanum, no. 696.)
Gazimihal, Mahmut R. Anadolu türküleri ve mûsikî istikbâlimiz [by] Mahmut Ragip. [Istanbul]: Mârifet Mat-
baasi, 1928.
Heussenstamm, George. Norton Manual of Music Notation. New York: W.W. Norton, 1987. ISBN 0-393-
95526-5 (pbk.).
Kennedy, Michael. Oxford Dictionary of Music. Oxford, New York: Oxford University Press, 1985. ISBN 0-
19-311333-3. Second ed. published 1994. ISBN 0-19-869162-9.
The New Harvard Dictionary of Music, edited by Don Michael Randel. Cambridge, MA: Belknap Press of
Harvard University Press, 1986. ISBN 0-674-61525-5.
Ottman, Robert W. Elementary Harmony: Theory and Practice. 2nd ed. Englewood Cliffs, NJ: Prentice-
Hall, 1970. ISBN 0-13-257451-9. Fifth ed. published 1998. ISBN 0-13-281610-5.
Rastall, Richard. The Notation of Western Music: An Introduction. London: Dent, 1983. ISBN 0-460-
04205-X. Also published: New York: St. Martin’s Press, 1982. ISBN 0-312-57963-2.
Read, Gardner. Music Notation: A Manual of Modern Practice. Boston: Allyn and Bacon, 1964.
Stone, Kurt. Music Notation in the Twentieth Century: A Practical Guidebook. New York: W.W. Norton,
1980. ISBN 0-393-95053-0.
Understanding Music with AI: Perspectives on Music Cognition, edited by Mira Balaban, Kemal Ebcioglu,
and Otto Laske. Cambridge, MA: MIT Press; Menlo Park, CA: AAAI Press, 1992. ISBN 0-262-52170-9.
Myanmar
Mranmā cālui:poṅg:satpui kyam: nhaṅ. khwaithā:. [Rankun]: 1996. Translated title: Myanmar orthography
treatise.
Okell, John. 1971. A guide to the romanization of Burmese. (James G. Forlong Fund; 27) London: Royal
Asiatic Society of Great Britain and Ireland.
Roop, D. Haigh. An Introduction to the Burmese Writing System. [Honolulu]: Center for Southeast Asian
Studies, University of Hawaii at Manoa, 1997. (Southeast Asia Paper, 11). Originally published: New Ha-
ven: Yale University Press, 1972. (Yale linguistic series). ISBN 0-300-01528-3.
N’Ko
Kanté, Souleymane. Méthode pratique d’écriture n’ko, 1961. Kankan, Guinea: Association de tradithera-
peutes et pharmacologues, 1995.
Ogham
I. S. 434:1999, Information Technology - 8-bit single-byte graphic coded character set for Ogham = Teic-
neolaíocht Eolais - Tacar carachtar grafach Oghaim códaithe go haonbheartach le 8 ngiotán. National
Standards Authority of Ireland.
McManus, Damian. A Guide to Ogam. Maynooth: An Sagart, 1991. (Maynooth monographs, 4). ISBN 1-
87068-417-6.
Old Italic
Bonfante, Larissa. “The Scripts of Italy.” In The World’s Writing Systems, edited by Peter T. Daniels and
William Bright. New York: Oxford University Press, 1996. ISBN 0-19-507993-0.
Cristofani, Mauro. “L’alfabeto etrusco.” In Lingue e dialetti dell’Italia antica, a cura di Aldo Larosdocimi.
Roma: Biblioteca di storia patria, a cura dell’ Ente per la diffusione e l’educazione storia, 1978. (Popoli e
civiltà dell’Italia antica, VI.)
Gordon, Arthur E. Illustrated Introduction to Latin Epigraphy. Berkeley: University of California Press,
1983. ISBN 0-520-03898-3.
Marinetti, Anna. Le iscrizione sudpicene. I. Testi. Firenze: Olschki, 1985. ISBN 88-222-3331-X (v. 1).
Parlangèli, Oronzo. Studi Messapici. Milano: Istituto lombardo di scienze e lettere, 1960.
Old Persian
Schmitt, Rüdiger. The Bisitun Inscriptions of Darius the Great, Old Persian Text. London, School of Orien-
tal and African Studies, 1991 (Corpus Inscriptionum Iranicarum, Part I: Inscriptions of ancient Iran, v.1,
Text 1). ISBN 0-7286-0181-8.
Osmanya
Afkeenna iyo fartiisa: buug koowaad. Xamar: Goosanka afka iyo suugaanta Soomaalida, 1971. Translated
title: Our language and its handwriting: book one.
Cerulli, Enrico. “Tentativo indigeno di formare un alfabeta somalo.” Oriente moderno, 12 (1932): 212–213.
ISSN 0030-5472.
Gaur, Albertine. A History of Writing. London: British Library, 1992. ISBN 0-7123-0270-0. Also published:
Rev. ed. New York: Cross River Press, 1992. ISBN 1-558-59358-6.
Gregersen, Edgar A. Language in Africa: An Introductory Survey. New York: Gordon and Breach, 1977.
(Library of Anthropology). ISBN: 0-677-04380-5; 0-677-04385-6 (pbk.).
Maino, Mario. “L’alfabeta ‘Osmania’ in Somalia.” Rassegna di studi etiopici, 10 (1951): 108–121. ISSN
0390-3699.
Nakanishi, Akira. Writing Systems of the World: Alphabets, Syllabaries, Pictograms. Rutland, VT: Tuttle,
1980. ISBN 0-8048-1293-4; 0-8048-1654-9 (pbk.). Revised translation of Sekai no moji.
Phags-pa
Luo, Changpei. Basibazi yu Yuandai Hanyu [ziliao huibian] / Luo Changpei, Cai Meibiao bian zhu. Beijing:
Kexue chubanshe, 1959.
Poppe, Nikolai Nikolaevich. The Mongolian Monuments in hP’ags-pa Script. Translated and edited by
John R. Krueger. 2nd ed. Wiesbaden: Harrassowitz, 1957. (Göttinger asiatische Forschungen, 8).
Zhaonasitu. Menggu ziyun jiaoben / Zhaonasitu, Yang Naisi bian zhu. [Beijing]: Min zu chu ban she, 1987.
Author Zhaonasitu also known as Jagunasutu or Junast.
Philippines Scripts
Doctrina Christiana: The First Book Printed in the Philippines, Manila 1593. A facsimile of the copy in the
Lessing J. Rosenwald Collection, with an introductory essay by Edwin Wolf II. Washington, DC: Library of
Congress, 1947.
Kuipers, Joel C., and Ray McDermott. “Insular Southeast Asian Scripts.” In The World’s Writing Systems.
Edited by Peter T. Daniels and William Bright. New York: Oxford University Press, 1996. ISBN 0-19-
507993-0.
Santos, Hector. The Living Scripts. Los Angeles: Sushi Dog Graphics, 1995. (Ancient Philippine scripts
series, 2). User’s guide accompanying Computer Fonts, Living Scripts software.
Santos, Hector. Our Living Scripts. January 31, 1997. http://www.bibingka.com/dahon/living/living.htm Part
of his A Philippine Leaf.
Santos, Hector. The Tagalog Script. Los Angeles: Sushi Dog Graphics, 1994. (Ancient Philippine scripts
series, 1). User’s guide accompanying Tagalog Script Fonts software.
Phoenician
Branden, Albertus van den. Grammaire phénicienne. Beyrouth: Librairie du Liban, 1969. (Bibliothèque de
l’Université Saint-Esprit, 2).
McCarter, P. Kyle. The Antiquity of the Greek Alphabet and the Early Phoenician Scripts. Missoula, MT:
Published by Scholars Press for Harvard Semitic Museum, 1975. (Harvard Semitic Monographs; 9.) ISBN
0-89130-066-X.
Nöldeke, Theodor. Beiträge zur semitischen Sprachwissenschaft. Strassburg: Karl J. Trübner, 1904.
Reprinted as: vol. 1 of Beiträge und Neue Beiträge zur semitischen Sprachwissenschaft: achtzehn Aufsätze
und Studien. Amsterdam: APA-Philo Press, [1982]. Also published on microfiche by the American Theo-
logical Library Association.
Powell, Barry B. Homer and the Origin of the Greek Alphabet. Cambridge, New York: Cambridge Univer-
sity Press, 1991. ISBN 0-521-37157-0. Reprinted, 1996. ISBN 0-521-58907-X (pbk).
Runic
Benneth, Solbritt, Jonas Ferenius, Helmer Gustavson, & Marit Åhlén. 1994. Runmärkt: från brev till klotter.
Runorna under medeltiden. [Stockholm]: Carlsson Bokförlag. ISBN 91-7798-877-9
Derolez, René. 1954. Runica manuscripta: the English tradition. (Rijksuniversiteit te Gent: Werken uit-
gegeven door de Faculteit van de Wijsbegeerte en Letteren; 118e aflevering) Brugge: De Tempel.
Friesen, Otto von. Runorna. Stockholm, A. Bonnier [1933]. (Nordisk kultur, 6).
Haugen, Einar Ingvald. The Scandinavian Languages: An Introduction to Their History. London: Faber,
1976. ISBN 0-571-10423-1. Also published: Cambridge, MA: Harvard University Press, 1976. ISBN 0-674-
79002-2.
Page, Raymond Ian. Runes. Berkeley: University of California Press; [London]: British Museum, 1987.
(Reading the Past). ISBN 0-520-06114-4. British Museum Publications edition has ISBN 0-7141-8065-3.
Shavian
ConScript Unicode Registry [by] John Cowan and Michael Everson. “E700–E72F Shavian.” Included in
the ConScript Registry (http://www.evertype.com/standards/csur/index.html) in 1997. Shavian was with-
drawn from the ConScript Registry in 2001, because of its addition to the Unicode Standard and ISO/IEC
10646.
Crystal, David. The Cambridge Encyclopedia of Language. Cambridge, New York: Cambridge University
Press, 1987. ISBN 0-521-26438-3. 2nd ed. Cambridge, New York: Cambridge University Press, 1997.
ISBN 0-521-55050-5; 0-521-55967-7.
Shaw, George Bernard. Androcles and the Lion: An Old Fable Renovated, by Bernard Shaw, with a Paral-
lel Text in Shaw’s Alphabet to Be Read in Conjunction Showing Its Economies in Writing and Reading.
Harmondsworth: Penguin Books, 1962.
Sinhala
SLS 1134:1996 Sri Lanka Standards Institution Sinhala character code for information interchange.
Gunasekara, Abraham Mendis. A comprehensive grammar of the Sinhalese language. New Delhi: Asian
Educational Services, 1986 (Reprint of 1891 edition).
Symbols (Miscellaneous)
ISO 2033:1983, Information processing - Coding of machine readable characters (MICR and OCR).
ISO 2047:1975, Information processing - Graphical representations for the control characters of the 7-bit
coded character set.
ISO/IEC 9995-7:1994, Information technology – Keyboard layouts for text and office systems – Part 7:
Symbols used to represent functions.
ANSI X3.32-1973 American National Standards Institute. American national standard graphic representa-
tion of the control characters of American national standard code for information interchange.
ANSI Y14.5M-1982 American National Standard. Engineering drawings and related document practices,
dimensioning and tolerances.
Syriac
Nöldeke, Theodor. Compendious Syriac Grammar. With a table of characters by Julius Euting. Translated
from the 2nd and improved German ed., by James A. Crichton. London: Williams & Norgate, 1904. Re-
printed: Tel Aviv: Zion Pub. Co. [1970].
Robinson, Theodore Henry. Paradigms and Exercises in Syriac Grammar. 4th ed. Rev. by L. H. Brocking-
ton. Oxford: Clarendon Press; New York: Oxford University Press, 1962. ISBN 0-19-815416-X, 0-19-
815458-5 (pbk.).
Tai Le
Coulmas, Florian. The Blackwell Encyclopedia of Writing Systems. Oxford, Cambridge: Blackwell, 1996.
ISBN 0-631-19446-0. Dehong writing, pp. 118–119.
Tsa va4 má3 hó va3: la ta6 mé2 sá ai3 seh va2 xo ŋa3. Yina5lána5 mina5su4 su4pána2se3 (Yunnan minzu
chubanshe). 1997. ISBN 7-5367-1455-6.
Thaana
Geiger, Wilhelm. Maldivian Linguistic Studies. New Delhi: Asian Educational Services, 1996. ISBN 81-
206-1201-9. Originally published: Colombo: H. C. Cottle, Govt. Printer, 1919.
Maniku, Hassan Ahmed. Say It in Maldivian (Dhivehi), [by] H. A. Maniku [and] J. B. Disanayaka. Colombo:
Lake House Investments, 1990.
Tibetan
Beyer, Stephen V. The classical Tibetan language. State University of New York. ISBN 0-7914-1099-4
Ugaritic
O’Connor, M. “Epigraphic Semitic Scripts.” In The World’s Writing Systems, edited by Peter T. Daniels and
William Bright. New York: Oxford University Press, 1996. ISBN 0-19-507993-0.
Walker, C. B. F. Cuneiform. London: British Museum Press, 1987. (Reading the Past, v. 3.) ISBN 0-7141-
8059-9. University of California Press edition has ISBN 0-520-06115-2 (pbk.).
Thai
TIS 620-2533 Thai Industrial Standard for Thai Character Code for Computer. (1990)
Yi
GB13134: Xinxi jiaohuanyong yiwen bianma zifuji (Yi coded character set for information interchange),
[prepared by] Sichuansheng minzushiwu weiyuanhui. Beijing, Jishu Biaozhun Chubanshe (Technical
Standards Press), 1991. (GB 13134-1991).
Nuo-su bbur-ma shep jie zzit. = Yi wen jian zi ben. Chengdu: Sichuan minzu chubanshe, 1984.
Nip huo bbur-ma ssix jie. = Yi Han zidian. Chengdu: Sichuan minzu chubanshe, 1990. ISBN 7-5409-0128-
4.
Annex N
(informative)
External references to character repertoires
N.1 Methods of reference to character repertoires and their coding
Within programming languages and other methods for defining the syntax of data objects there is com-
monly a need to declare a specific character repertoire from among those that are specified in ISO/IEC
10646. There may also be a need to declare the corresponding coded representations applicable to that
repertoire.
For any character repertoire that is in accordance with ISO/IEC 10646 a precise declaration of that reper-
toire should include the following parameters:
ISO/IEC 8824-1 Annex B specifies the form of object identifier values for objects that are specified in an
ISO standard. In such an object identifier the features and options of ISO/IEC 10646 are identified by
means of numbers (arcs) which follow the arcs “10646” and “0” which identify the whole ISO/IEC 10646.
NOTE 1 – The arc (0) is required to complement the arcs (1) and (2) which represent respectively ISO/IEC 10646-1 and
ISO/IEC 10646-2. These two arcs should not be used.
The first such arc following a 10646 arc identifies the CC-data-element content definition, and is referred
to as ‘level-3 (3)’.
NOTE 2 – This version of the standard specifies a single definition for CC-data-element content. That definition was
known as implementation level 3 in previous editions of this standard.
The second such arc identifies the repertoire subset, and is either
• all (0), or
• collections (1).
Arc (0) identifies the entire collection of characters specified in ISO/IEC 10646. No further arc follows this
arc.
NOTE 3 – This collection includes private planes, and is therefore not fully-defined. Its use without additional prior agreement
is deprecated.
Arc (1) is followed by one or a sequence of further arcs, each of which is a collection number from An-
nex A, in ascending numerical order. This sequence identifies the subset consisting of the collections
whose numbers appear in the sequence.
NOTE 4 – As an example, the object identifier for the subset comprising the collections BASIC LATIN, LATIN-1
SUPPLEMENT, and MATHEMATICAL OPERATORS is:
{iso standard 10646 (0) level-3 (3) collections (1) 1 2 39}
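The arc sequence described above can be assembled mechanically. The following is a minimal sketch in Python, not part of this standard; the function name `repertoire_oid` is hypothetical, and the textual form produced for the "all (0)" case is an assumption made by analogy with the example in NOTE 4.

```python
def repertoire_oid(collections=None):
    """Build the object identifier text for a repertoire subset.

    collections: iterable of collection numbers from Annex A, or None to
    select the entire repertoire (arc "all (0)", after which no arc follows).
    """
    prefix = "iso standard 10646 (0) level-3 (3)"
    if collections is None:
        # Assumed rendering of the unrestricted case; see the lead-in.
        return "{" + prefix + " all (0)}"
    # Collection numbers must appear in ascending numerical order.
    numbers = sorted(set(collections))
    if not numbers:
        raise ValueError("at least one collection number is required")
    if numbers[0] < 1:
        raise ValueError("collection numbers are positive integers")
    arcs = " ".join(str(n) for n in numbers)
    return "{" + prefix + " collections (1) " + arcs + "}"


# BASIC LATIN (1), LATIN-1 SUPPLEMENT (2), MATHEMATICAL OPERATORS (39)
print(repertoire_oid([39, 1, 2]))
```

Sorting inside the function means a caller cannot produce a sequence that violates the ascending-order requirement.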
ISO/IEC 8824 also specifies object descriptors corresponding to object identifier values. For an unre-
stricted repertoire, the corresponding object descriptor is as follows:
In an object identifier in accordance with ISO/IEC 8824-1 Annex B, the coded representation form speci-
fied in ISO/IEC 10646 is identified by means of numbers (arcs) which follow the arcs "10646" and "0"
which identify the whole ISO/IEC 10646.
transfer-syntaxes (0).
The second such arc identifies the encoding form and is either
Annex P
(informative)
Additional information on CJK Unified Ideographs
This annex contains additional information on CJK Unified Ideographs.
NOTE – The first edition of this standard (ISO/IEC 10646:2003 and amendments 1 to 5) used this annex to provide additional
information on all characters. This edition of the standard includes most of that information in the code charts. Because the
code charts for CJK unified ideographs do not include any name list, the information about these characters is still included in
this annex.
Each entry in this annex consists of the name of a character preceded by its code point, followed by the
related additional information. Entries are arranged in ascending sequence of code point.
9FB9 CJK UNIFIED IDEOGRAPH-9FB9
9FBA CJK UNIFIED IDEOGRAPH-9FBA
9FBB CJK UNIFIED IDEOGRAPH-9FBB
These three characters are intended to represent a component at a specific position of a full ideograph. The
ideographs representing the same structure without a positional preference are encoded at 20509,
2099D, and 470C respectively.
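As an illustration only, the entry layout used in this annex (code point, then character name, in ascending code point order) and the correspondence stated above can be sketched in Python. The names `COMPONENT_TO_FULL`, `ideograph_name` and `annex_p_entries` are hypothetical, and the name pattern is valid only for code points that are CJK unified ideographs.

```python
# Positional component -> ideograph with the same structure and no
# positional preference, as listed in the entry above.
COMPONENT_TO_FULL = {0x9FB9: 0x20509, 0x9FBA: 0x2099D, 0x9FBB: 0x470C}


def ideograph_name(code_point):
    """Return the character name of a CJK unified ideograph code point."""
    return f"CJK UNIFIED IDEOGRAPH-{code_point:04X}"


def annex_p_entries(code_points):
    """Format entries as '<code point> <name>' in ascending code point order."""
    return [f"{cp:04X} {ideograph_name(cp)}" for cp in sorted(code_points)]


for line in annex_p_entries(COMPONENT_TO_FULL):
    print(line)
```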
Annex Q
(informative)
Code mapping table for Hangul syllables
NOTE – The information concerning mapping between Hangul syllables (and code points) that were specified in the first edi-
tion of ISO/IEC 10646-1 and their amended code points is available in previous editions of this standard.
Annex R
(informative)
Names of Hangul syllables
This annex provides the full name and additional information of Hangul syllables through a linked file:
The linked content is a plain text file that uses ISO/IEC 646-IRV characters with LINE FEED as the end-of-line
mark. After a 5-line header, the file lists all the Hangul syllables, with each line specified as follows:
Annex S
(informative)
Procedure for the unification and arrangement of CJK Ideographs
The graphic character collections of CJK unified ideographs in ISO/IEC 10646 are specified in clause 30.
They are derived from many more ideographs which are found in various different national and regional
standards for coded character sets (the "sources").
This annex describes how the ideographs in this standard are derived from the sources by applying a set
of unification procedures. It also describes how the ideographs in this standard are arranged in the se-
quence of consecutive code points to which they are assigned.
The source references for CJK unified ideographs are specified in clause 23.
Within the context of ISO/IEC 10646 a unification process is applied to the ideographic characters taken
from the codes in the source groups. In this process, single ideographs from two or more of the source
groups are associated together, and a single code point is assigned to them in this standard. The associa-
tions are made according to a set of procedures that are described below. Ideographs that are thus asso-
ciated are described here as “unified”.
NOTE – The unification process does not apply to the following collections of ideographic characters:
CJK RADICALS SUPPLEMENT (2E80 - 2EFF)
KANGXI RADICALS (2F00 - 2FDF)
CJK COMPATIBILITY IDEOGRAPHS (F900 - FAFF with the exception of FA0E, FA0F, FA11, FA13, FA14, FA1F, FA21,
FA23, FA24, FA27, FA28 and FA29)
CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT (2F800-2FA1F).
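The collections listed in the NOTE above can be tested for by code point range. The following Python sketch is illustrative and not part of this standard; the function name `in_non_unified_collection` is hypothetical. It reports only whether a code point falls in one of the four listed collections; a False result does not by itself mean the code point is a unified ideograph.

```python
# Collections to which the unification process does not apply.
_NON_UNIFIED_RANGES = (
    (0x2E80, 0x2EFF),    # CJK RADICALS SUPPLEMENT
    (0x2F00, 0x2FDF),    # KANGXI RADICALS
    (0xF900, 0xFAFF),    # CJK COMPATIBILITY IDEOGRAPHS
    (0x2F800, 0x2FA1F),  # CJK COMPATIBILITY IDEOGRAPHS SUPPLEMENT
)

# Code points in F900-FAFF that the NOTE explicitly excepts.
_EXCEPTIONS = frozenset({
    0xFA0E, 0xFA0F, 0xFA11, 0xFA13, 0xFA14, 0xFA1F,
    0xFA21, 0xFA23, 0xFA24, 0xFA27, 0xFA28, 0xFA29,
})


def in_non_unified_collection(code_point):
    """Return True if code_point lies in a collection listed in the NOTE."""
    if code_point in _EXCEPTIONS:
        return False
    return any(low <= code_point <= high for low, high in _NON_UNIFIED_RANGES)
```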
士,土
NOTE – The difference of shape between the two ideographs in the above example is in the length of the lower horizontal line.
This is considered an actual difference of shape. Furthermore these ideographs have different meanings. The meaning of the
first is "Soldier" and of the second is "Soil or Earth".
An association between ideographs from different sources is made here if their shapes are sufficiently
similar, according to the following system of classification.
S.1.3 Procedure
A unification procedure is used to determine whether two ideographs have the same abstract shape or
different ones. The unification procedure has two stages, applied in the following order:
If all of the features a) to c) above are the same between the ideographs, the ideographs are considered
to have the same abstract shape and are therefore unified.
峰•峯, 荊•荆
When the ideographs consist of two horizontally aligned components, a difference of the last stroke of
the left-hand component going beneath the right-hand component should not warrant separate encoding,
as in the case of the source glyphs for U+34F3:
G T
㓳•㓳
S.1.4.3 Different structure of a corresponding component
The examples below illustrate rule c). The structure of one (or more) corresponding components within the
two ideographs in each pair is different.
[Example ideograph pairs; glyph images not available.]
The differences are further classified according to the following examples.
巨•巨
e) Differences in bent strokes
册•册
f) Differences in folding back at the stroke termination
佘•佘
g) Differences in accent at the stroke initiation
父•父, 丈•丈
h) Differences in "rooftop" modification
八•八, 穴•穴
i) Addition or omission of a minor stroke
刃•刃•刃
k) Miscellaneous
However, some ideographs encoded in two standards belonging to the same source group (e.g. GB2312-
80 and GB12345-90) have been unified during the process of collecting ideographs from the source
group.
The source separation rule described in this clause only applies to the CJK UNIFIED IDEOGRAPHS block
specified in the Basic Multilingual Plane.
NOTE – CJK Compatibility Ideographs are created following a rule very similar to the source separation rule. However, the end
result is the combination of a single CJK Unified Ideograph and one or several CJK Compatibility Ideographs. When the
source separation rule is applied, all ‘similar’ source CJK Ideographs result in separate CJK Unified Ideographs.
S.2.2 Procedure
S.2.2.1 Ideographs found in the dictionaries
a) If an ideograph is found in the Kangxi Dictionary, it is positioned in the code chart in accordance with
the Kangxi Dictionary order.
b) If an ideograph is not found in the Kangxi Dictionary but is found in the Daikanwa Jiten, it is given a
position at the end of the radical-stroke group under which is indexed the nearest preceding Daikanwa
Jiten character that also appears in the Kangxi Dictionary.
c) If an ideograph is found in neither the Kangxi nor the Daikanwa, the Hanyu Dazidian and the Dae-
jaweon dictionaries are referred to with a similar procedure.
S.2.2.2 Ideographs not found in the dictionaries
If an ideograph is not found in any of the four dictionaries, it is given a position at the end of the radical-
stroke group (after the characters that are present in the dictionaries) and it is indexed under the same
radical-stroke count.
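The order in which the dictionaries are consulted can be sketched as follows. This Python fragment is a hedged illustration, not a definitive statement of the procedure: the names `DICTIONARY_PRIORITY`, `governing_dictionary` and `placement_rule` are hypothetical, the dictionary lookup itself is left to the caller, and consulting the Hanyu Dazidian before the Daejaweon is an assumption based on the order in which item c) names them.

```python
# Dictionaries in the order they are consulted (last two: assumed order).
DICTIONARY_PRIORITY = ("Kangxi", "Daikanwa", "Hanyu Dazidian", "Daejaweon")


def governing_dictionary(found_in):
    """Return the first dictionary, in priority order, listing the ideograph.

    found_in: set of dictionary names in which the ideograph appears.
    Returns None when the ideograph is in none of the four dictionaries.
    """
    for name in DICTIONARY_PRIORITY:
        if name in found_in:
            return name
    return None


def placement_rule(found_in):
    """Describe which positioning rule of S.2.2 applies."""
    source = governing_dictionary(found_in)
    if source == "Kangxi":
        return "S.2.2.1 a): Kangxi Dictionary order"
    if source is not None:
        return f"S.2.2.1 b)/c): end of radical-stroke group, located via {source}"
    return "S.2.2.2: end of radical-stroke group, after dictionary characters"


print(placement_rule({"Daikanwa", "Daejaweon"}))
```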
丟丢 T 仞仭 J 俁俣 TJK 値值 T
么幺 GT 併倂 T 俞兪 T 偷偸 T
争爭 GTJ 侣侶 T 俱倶 T 偽僞 TJ
兌兑 T 剝剥 T 唧喞 T 壮壯 GTJ
兎兔 TJ 劒劔 J 喩喻 T 壽夀 T
兖兗 T 勻匀 T 嘘噓 T 夐敻 T
冊册 TJ 单単 T 嚏嚔 GTJ 夲本 GTJ
净凈 G 即卽 TK 囯国 T 奥奧 J
凢凣 T 卷巻 TJ 圈圏 TJ 奨奬獎 TJ
刃刄 TJ 叁参 GT 圎圓 T 妆妝 GT
刊刋 TJ 參叄 T 圖圗 T 妍姸 T
删刪 T 吕呂 T 坙巠 T 姍姗 T
別别 T 吞呑 T 埒埓 J 姫姬 GT
券劵 TJ 吳吴呉 TJ 塈墍 T 娛娯娱 T
刹剎 T 吶呐 T 塡填 TJ 婕媫 T
剏剙 T 吿告 T 増增 T 婾媮 T
媪媼 TK 尪尫 T 彔录 T 戩戬 GT
媯嬀 T 尶尷 T 彙彚 T 戯戱 T
嬎嬔 T 屏屛 T 彛彜 J 戶户戸 T
嬤嬷 GT 峥崢 GT 彝彞 T 戻戾 T
孳孶 T 巓巔 T 彥彦 T 抛拋 T
宫宮 T 帡帲 T 徳德 T 抜拔 TJ
寛寬 T 帯帶 TJ 徴徵 T 挩捝 T
寜寧 T 并幷 T 恵惠 TJ 挿插揷 TJ
寝寢 GTJ 廄廏 T 悅悦 T 捏揑 TJ
専專 J 弑弒 T 悞悮 T 捜搜 TJ
将將 GTJ 強强 T 悳惪 T 掲揭 T
尓尔 T 弹弾 T 愠慍 T 揺搖摇 TJ
尙尚 T 彐彑 TJ 愼慎 TJ 揾搵 T
撃擊 TJ 概槪 T 汚污 T 潛濳 GTJK
敎教 T 榅榲 T 沒没 TJ 瀨瀬 T
敓敚 T 榝樧 T 浄淨 TJ 為爲 GTJ
既旣 T 槇槙 J 涉渉 T 焭煢 GTJK
昂昻 T 様樣 TJ 涗涚 T 煕熙 J
晚晩 T 横橫 T 涙淚 T 煴熅 T
暨曁 T 步歩 T 淥渌 T 状狀 GT
曽曾 J 歲歳 T 淸清 T 瑤瑶 TJ
枴柺 T 歿殁 T 渇渴 T 瓶甁 T
查査 T 殻殼 GTJ 温溫 T 產産 T
柵栅 T 毀毁 T 溈潙 T 痩瘦 J
梲棁 T 毎每 T 溉漑 T 皡皥 T
楡榆 T 氲氳 T 滚滾 T 眞真 TJ
眾衆 TJK 緣縁 T 蒀蒕 T 諌諫 TJ
研硏 T 緼縕 T 蒋蔣 GJ 謠謡 J
祿禄 TJ 繈繦 T 蒍蔿 T 豜豣 T
禿秃 T 羮羹 TJ 蕰薀 T 走赱 TJ
稅税 T 翶翺 T 薫薰 T 軿輧 T
穂穗 TJ 胼腁 T 藴蘊 T 輜輺 J
筝箏 GJ 脫脱 T 虚虛 T 輼轀 T
箳簈 T 腽膃 T 蛻蜕 T 达迖 T
篡簒 T 舃舄 GT 衛衞 TJK 迸逬 TJ
粤粵 T 舍舎 TJ 衮袞 TK 遙遥 J
絕絶 T 舖舗 J 装裝 GJK 邢郉 T
綠緑 T 荘莊 TJ 訮詽 T 郎郞 T
緒緖 T 菑葘 TJ 說説 T 郷鄉鄕 T
醖醞 T 陧隉 G 餅餠 TJ 鳯鳳 T
醤醬 J 靑青 T 馱駄 TJK 鶇鶫 J
鈃銒 T 静靜 GTJ 駢騈 TK 鷆鷏 J
銳鋭 T 靭靱 J 骩骫 T 麪麫 T
錄録 T 頹頽 T 高髙 T 麼麽 T
錬鍊 TK 顏顔 TJ 髪髮 TJ 黃黄 T
鎭鎮 TJ 顚顛 J 鬬鬭 T 黑黒 T
閱閲 T 飮飲 J 鰛鰮 TJ
冑胄 non cognate
垛垜 S.1.4.3 懐懷 S.1.4.1 朐胊 non cognate
决決 S.1.4.3
寳寶 S.1.4.3 朌肦 non cognate 朘脧 non cognate
Annex T
(informative)
Language tagging using Tag Characters
NOTE – Moved to F.7.
Annex U
(informative)
Characters in identifiers
A common task facing an implementer of UCS is the provision of a parsing and/or lexing engine for identi-
fiers. Each programming language standard has its own identifier syntax; different programming lan-
guages have different conventions for the use of certain characters from the ASCII (ISO 646-IRV) range
($, @, #, _) in identifiers. Several questions arise when defining the set of permitted characters for a given
identification purpose: which characters are reserved for syntactic purposes and which are allowed in
identifiers, whether case-pairing should be included, and whether normalization should be performed.
The Unicode Consortium publishes a document "UAX 31 – Identifier and Pattern Syntax" to assist in the
standard treatment of identifiers in UCS character-based parsers. Those specifications are recommended
for determining the list of UCS characters suitable for use in identifiers. The document is available at
http://www.unicode.org/reports/tr31/.
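As one illustration of an identifier check built on UAX 31 properties, the following Python sketch relies on the language's own identifier definition, which is based on the XID_Start and XID_Continue properties and treats identifiers as NFKC-normalized. This reflects the choices of one programming language and is not a recommendation of this standard; the function name `is_identifier` is hypothetical.

```python
import unicodedata


def is_identifier(text):
    """Return True if text, after NFKC normalization, is a valid identifier.

    str.isidentifier() applies Python's identifier syntax; the normalization
    step mirrors how the Python parser treats identifiers in source code.
    """
    normalized = unicodedata.normalize("NFKC", text)
    return normalized.isidentifier()


for candidate in ("total", "café", "变量", "_x", "1abc", "a$b"):
    print(candidate, is_identifier(candidate))
```

Here "$" is rejected, whereas some other languages permit it; this is the kind of language-specific convention described above.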