B Unicode Character Code Assignments

This appendix introduces Unicode character code assignments. It contains the following topics:

Unicode Code Ranges
UTF-16 Encoding
UTF-8 Encoding

Unicode Code Ranges

Table B-1 contains code ranges that have been allocated in Unicode for UTF-16 character codes.

Table B-1 Unicode Character Code Ranges for UTF-16 Character Codes

Types of Characters	First 16 Bits	Second 16 Bits
ASCII	0000 - 007F	-
European (except ASCII), Arabic, Hebrew	0080 - 07FF	-
Indic, Thai, certain symbols (such as the euro symbol), Chinese, Japanese, Korean	0800 - 0FFF, 1000 - CFFF, D000 - D7FF, F900 - FFFF	-
Private Use Area #1	E000 - EFFF, F000 - F8FF	-
Supplementary characters: Additional Chinese, Japanese, and Korean characters; historic characters; musical symbols; mathematical symbols	D800 - D8BF, D8C0 - DABF, DAC0 - DB7F	DC00 - DFFF, DC00 - DFFF, DC00 - DFFF
Private Use Area #2	DB80 - DBBF, DBC0 - DBFF	DC00 - DFFF, DC00 - DFFF
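
The ranges in Table B-1 can be turned into a simple lookup. The following is a minimal sketch in Python, not part of any Oracle API. The function name classify_utf16_unit and the category labels are assumptions made for illustration; "high surrogate" and "low surrogate" are the usual Unicode names for the first and second 16-bit values of a two-unit character.

```python
def classify_utf16_unit(unit):
    """Return a rough category for one 16-bit code unit, following Table B-1."""
    if not 0x0000 <= unit <= 0xFFFF:
        raise ValueError("a UTF-16 code unit must fit in 16 bits")
    if 0xD800 <= unit <= 0xDBFF:
        return "high surrogate"        # first 16 bits of a two-unit character
    if 0xDC00 <= unit <= 0xDFFF:
        return "low surrogate"         # second 16 bits of a two-unit character
    if 0xE000 <= unit <= 0xF8FF:
        return "private use area #1"
    if unit <= 0x007F:
        return "ASCII"
    return "other single-unit character"
```

A caller could use such a check to decide whether the next 16-bit value must be read before a character is complete.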

Table B-2 contains code ranges that have been allocated in Unicode for UTF-8 character codes.

Table B-2 Unicode Character Code Ranges for UTF-8 Character Codes

Types of Characters	First Byte	Second Byte	Third Byte	Fourth Byte
ASCII	00 - 7F	-	-	-
European (except ASCII), Arabic, Hebrew	C2 - DF	80 - BF	-	-
Indic, Thai, certain symbols (such as the euro symbol), Chinese, Japanese, Korean	E0, E1 - EC, ED, EF	A0 - BF, 80 - BF, 80 - 9F, A4 - BF	80 - BF, 80 - BF, 80 - BF, 80 - BF	-
Private Use Area #1	EE, EF	80 - BF, 80 - A3	80 - BF, 80 - BF	-
Supplementary characters: Additional Chinese, Japanese, and Korean characters; historic characters; musical symbols; mathematical symbols	F0, F1 - F2, F3	90 - BF, 80 - BF, 80 - AF	80 - BF, 80 - BF, 80 - BF	80 - BF, 80 - BF, 80 - BF
Private Use Area #2	F3, F4	B0 - BF, 80 - 8F	80 - BF, 80 - BF	80 - BF, 80 - BF

Note:

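A hyphen (-) marks a position that does not apply to that type of character. Where a row lists several ranges in a column, the ranges in the same position in each column of that row belong together. Character codes are shown in hexadecimal representation.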
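
The byte ranges in Table B-2 can also be read as a validity check for a single encoded character. The following Python sketch is an illustration only: the names UTF8_LEAD_RANGES and is_well_formed are hypothetical, and rows of the table that share a first byte (EF and F3) are merged because their second-byte ranges are adjacent.

```python
# (first byte low, first byte high, total bytes, second byte low, second byte high)
UTF8_LEAD_RANGES = [
    (0x00, 0x7F, 1, None, None),
    (0xC2, 0xDF, 2, 0x80, 0xBF),
    (0xE0, 0xE0, 3, 0xA0, 0xBF),
    (0xE1, 0xEC, 3, 0x80, 0xBF),
    (0xED, 0xED, 3, 0x80, 0x9F),
    (0xEE, 0xEF, 3, 0x80, 0xBF),  # EF 80 - A3 and EF A4 - BF merged
    (0xF0, 0xF0, 4, 0x90, 0xBF),
    (0xF1, 0xF3, 4, 0x80, 0xBF),  # F3 80 - AF and F3 B0 - BF merged
    (0xF4, 0xF4, 4, 0x80, 0x8F),
]


def is_well_formed(seq):
    """Check the bytes of one encoded character against Table B-2."""
    if not seq:
        return False
    for lo, hi, length, second_lo, second_hi in UTF8_LEAD_RANGES:
        if lo <= seq[0] <= hi:
            if len(seq) != length:
                return False
            if length == 1:
                return True
            if not second_lo <= seq[1] <= second_hi:
                return False
            # Third and fourth bytes are always 80 - BF in Table B-2.
            return all(0x80 <= b <= 0xBF for b in seq[2:])
    return False  # first byte not listed in the table (80 - C1, F5 - FF)
```

For example, the euro symbol is encoded as E2 82 AC, which matches the E1 - EC row, while ED A0 80 is rejected because ED allows only 80 - 9F as the second byte.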

UTF-16 Encoding

As shown in Table B-1, UTF-16 represents some characters (for example, additional Chinese, Japanese, and Korean characters, and the characters in Private Use Area #2) as two 16-bit values. These are supplementary characters. The first 16-bit value is in the range from 0xD800 to 0xDBFF, and the second 16-bit value is in the range from 0xDC00 to 0xDFFF. The 1,024 possible first values combined with the 1,024 possible second values give 1,048,576 supplementary code assignments, so with supplementary characters UTF-16 can represent more than one million characters. Without supplementary characters, only 65,536 characters can be represented. Oracle's AL16UTF16 character set supports supplementary characters.
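
The mapping between a supplementary character and its two 16-bit values follows the standard UTF-16 arithmetic. The following Python sketch shows it; the function names to_surrogate_pair and from_surrogate_pair are chosen for illustration and are not part of any Oracle interface.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000 to U+10FFFF) into two 16-bit values."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary character")
    offset = code_point - 0x10000        # 20-bit value
    first = 0xD800 + (offset >> 10)      # upper 10 bits -> 0xD800 to 0xDBFF
    second = 0xDC00 + (offset & 0x3FF)   # lower 10 bits -> 0xDC00 to 0xDFFF
    return first, second


def from_surrogate_pair(first, second):
    """Combine two 16-bit values back into the supplementary code point."""
    if not (0xD800 <= first <= 0xDBFF and 0xDC00 <= second <= 0xDFFF):
        raise ValueError("values are outside the ranges in Table B-1")
    return 0x10000 + ((first - 0xD800) << 10) + (second - 0xDC00)
```

For example, to_surrogate_pair(0x1F600) gives 0xD83D and 0xDE00, and the first character of Private Use Area #2 (U+F0000) gives 0xDB80 and 0xDC00, which matches the start of that row in Table B-1.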

UTF-8 Encoding

The UTF-8 character codes in Table B-2 show that the following conditions are true:

ASCII characters use 1 byte
European (except ASCII), Arabic, and Hebrew characters require 2 bytes
Indic, Thai, Chinese, Japanese, and Korean characters, as well as certain symbols such as the euro symbol, require 3 bytes
Characters in the Private Use Area #1 require 3 bytes
Supplementary characters require 4 bytes
Characters in the Private Use Area #2 require 4 bytes
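
The conditions listed above amount to a length rule based on the character's Unicode code point. A minimal Python sketch, assuming the standard UTF-8 boundaries (U+007F, U+07FF, U+FFFF); utf8_length is a hypothetical helper name:

```python
def utf8_length(code_point):
    """Return the number of bytes UTF-8 uses for one Unicode code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("outside the Unicode code space")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values are not characters")
    if code_point <= 0x7F:
        return 1   # ASCII
    if code_point <= 0x7FF:
        return 2   # European (except ASCII), Arabic, Hebrew
    if code_point <= 0xFFFF:
        return 3   # Indic, Thai, CJK, certain symbols, Private Use Area #1
    return 4       # supplementary characters, Private Use Area #2


# Cross-check against Python's own UTF-8 encoder.
for ch in ("A", "\u00e9", "\u20ac", "\ue000", "\U0001F600", "\U000F0000"):
    assert utf8_length(ord(ch)) == len(ch.encode("utf-8"))
```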

Oracle's AL32UTF8 character set supports 1-byte, 2-byte, 3-byte, and 4-byte values. Oracle's UTF8 character set supports 1-byte, 2-byte, and 3-byte values, but not 4-byte values.
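
Because the two character sets differ in their support for 4-byte values, it can help to know whether a string contains any characters that need them. The following Python sketch uses a hypothetical helper, four_byte_characters, that only inspects a string; it does not call any Oracle API and makes no assumption about how a database stores or converts such characters.

```python
def four_byte_characters(text):
    """Return (index, character) pairs for characters that need 4 bytes in UTF-8."""
    # Characters above U+FFFF are the supplementary characters of Table B-2.
    return [(i, ch) for i, ch in enumerate(text) if ord(ch) > 0xFFFF]
```

An empty result means every character in the string fits in 1, 2, or 3 bytes.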