Character Encodings

by Jeff Hunter, Sr. Database Administrator

ISO Encodings

The following character encodings are supported by JDK 1.1.7a and JDK 1.2. All of the encoding names are case-sensitive. The JDK can convert between each of these encodings and Unicode in both directions, except where the table notes otherwise. This does not, however, mean that the JDK ships with fonts capable of displaying these character sets.

ISO Encodings
Encoding Name Character Encoding
ASCII U.S. ASCII (ISO 646, ANSI X3.4)
ISO8859_1 Latin 1 (Western Europe)
ISO8859_2 Latin 2 (Eastern Europe)
ISO8859_3 Latin 3 (Southern Europe)
ISO8859_4 Latin 4 (Northern Europe)
ISO8859_5 Cyrillic
ISO8859_6 Arabic
ISO8859_7 Greek
ISO8859_8 Hebrew
ISO8859_9 Latin 5 (Turkish)
ISO8859_15_FDIS Updated Latin 1 with Euro
Big5 Traditional Chinese
EUC_CN Simplified Chinese
EUC_JP Japanese
EUC_TW Traditional Chinese
GBK Simplified Chinese
ISO2022CN Chinese
ISO2022CN_CNS Traditional Chinese
ISO2022CN_GB Simplified Chinese
ISO2022JP Japanese
ISO2022KR Korean
JIS0201 Japanese
JIS0208 Japanese
JIS0212 Japanese
JISAutoDetect JIS autodetect (bytes-to-chars only)
SJIS Shift-JIS Japanese
Unicode Marked big-endian Unicode
UnicodeBig Marked big-endian Unicode
UnicodeBigUnmarked Unmarked big-endian Unicode
UnicodeLittle Marked little-endian Unicode
UnicodeLittleUnmarked Unmarked little-endian Unicode
UTF8 Unicode Transformation Format-8
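
As a rough illustration of how these names are used, the sketch below decodes bytes from one named encoding into a Unicode String and re-encodes them in another. The String(byte[], String) constructor and String.getBytes(String) are standard JDK calls; the class name EncodingNameDemo and the method transcode are hypothetical names chosen for this example, and the sketch assumes the two encoding names shown (ISO8859_1 and UTF8) are available in the JDK being used.

```java
import java.io.UnsupportedEncodingException;

class EncodingNameDemo {
    // Hypothetical helper: decode bytes from one named encoding to Unicode,
    // then encode the resulting String in another named encoding.
    static byte[] transcode(byte[] input, String fromEncoding, String toEncoding) {
        try {
            String text = new String(input, fromEncoding); // bytes -> Unicode
            return text.getBytes(toEncoding);              // Unicode -> bytes
        } catch (UnsupportedEncodingException e) {
            // Thrown when the JDK does not recognize an encoding name.
            throw new IllegalArgumentException("Unsupported encoding: " + e.getMessage());
        }
    }

    public static void main(String[] args) {
        byte[] latin1 = { (byte) 0xE9 }; // e-acute (U+00E9) as a single Latin 1 byte
        byte[] utf8 = transcode(latin1, "ISO8859_1", "UTF8");
        for (byte b : utf8) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: C3 A9
    }
}
```

An unrecognized name is reported here as an IllegalArgumentException rather than the checked UnsupportedEncodingException; that is only a convenience for the example.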

Unicode 2.0 Block Allocations

Unicode is a 16-bit international character encoding standard that supports the alphabets of many different languages in addition to a variety of mathematical and geometric shapes. Groups of characters from different alphabets and different origins are assigned contiguous blocks of the character set; this table lists the Unicode 2.0 block allocations.

Unicode 2.0 Block Allocations
Start Code End Code Block Name
\u0000 \u007F Basic Latin
\u0080 \u00FF Latin-1 Supplement
\u0100 \u017F Latin Extended-A
\u0180 \u024F Latin Extended-B
\u0250 \u02AF IPA Extensions
\u02B0 \u02FF Spacing Modifier Letters
\u0300 \u036F Combining Diacritical Marks
\u0370 \u03FF Greek
\u0400 \u04FF Cyrillic
\u0530 \u058F Armenian
\u0590 \u05FF Hebrew
\u0600 \u06FF Arabic
\u0900 \u097F Devanagari
\u0980 \u09FF Bengali
\u0A00 \u0A7F Gurmukhi
\u0A80 \u0AFF Gujarati
\u0B00 \u0B7F Oriya
\u0B80 \u0BFF Tamil
\u0C00 \u0C7F Telugu
\u0C80 \u0CFF Kannada
\u0D00 \u0D7F Malayalam
\u0E00 \u0E7F Thai
\u0E80 \u0EFF Lao
\u0F00 \u0FBF Tibetan
\u10A0 \u10FF Georgian
\u1100 \u11FF Hangul Jamo
\u1E00 \u1EFF Latin Extended Additional
\u1F00 \u1FFF Greek Extended
\u2000 \u206F General Punctuation
\u2070 \u209F Superscripts and Subscripts
\u20A0 \u20CF Currency Symbols
\u20D0 \u20FF Combining Marks for Symbols
\u2100 \u214F Letterlike Symbols
\u2150 \u218F Number Forms
\u2190 \u21FF Arrows
\u2200 \u22FF Mathematical Operators
\u2300 \u23FF Miscellaneous Technical
\u2400 \u243F Control Pictures
\u2440 \u245F Optical Character Recognition
\u2460 \u24FF Enclosed Alphanumerics
\u2500 \u257F Box Drawing
\u2580 \u259F Block Elements
\u25A0 \u25FF Geometric Shapes
\u2600 \u26FF Miscellaneous Symbols
\u2700 \u27BF Dingbats
\u3000 \u303F CJK Symbols and Punctuation
\u3040 \u309F Hiragana
\u30A0 \u30FF Katakana
\u3100 \u312F Bopomofo
\u3130 \u318F Hangul Compatibility Jamo
\u3190 \u319F Kanbun
\u3200 \u32FF Enclosed CJK Letters and Months
\u3300 \u33FF CJK Compatibility
\u4E00 \u9FFF CJK Unified Ideographs
\uAC00 \uD7A3 Hangul Syllables
\uD800 \uDB7F High Surrogates
\uDB80 \uDBFF High Private Use Surrogates
\uDC00 \uDFFF Low Surrogates
\uE000 \uF8FF Private Use
\uF900 \uFAFF CJK Compatibility Ideographs
\uFB00 \uFB4F Alphabetic Presentation Forms
\uFB50 \uFDFF Arabic Presentation Forms-A
\uFE20 \uFE2F Combining Half Marks
\uFE30 \uFE4F CJK Compatibility Forms
\uFE50 \uFE6F Small Form Variants
\uFE70 \uFEFE Arabic Presentation Forms-B
\uFF00 \uFFEF Halfwidth and Fullwidth Forms
\uFEFF \uFEFF Specials
\uFFF0 \uFFFF Specials
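
The block lookup described by the table above can be sketched with the JDK's Character.UnicodeBlock class (available since JDK 1.2). The class name UnicodeBlockDemo and the method blockOf are hypothetical names for this example. Note that a current JDK follows a later Unicode version than 2.0, so the blocks it reports may not match this table exactly.

```java
class UnicodeBlockDemo {
    // Hypothetical helper: returns the block containing c,
    // or null if the character belongs to no block known to this JDK.
    static Character.UnicodeBlock blockOf(char c) {
        return Character.UnicodeBlock.of(c);
    }

    public static void main(String[] args) {
        // Sample characters: Latin A, Greek Omega (U+03A9),
        // Hiragana A (U+3042), and a CJK ideograph (U+4E2D).
        char[] samples = { 'A', '\u03A9', '\u3042', '\u4E2D' };
        for (char c : samples) {
            System.out.println("U+" + Integer.toHexString(c).toUpperCase()
                    + " -> " + blockOf(c));
        }
    }
}
```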

Modified UTF-8 Encoding

UTF-8 is an efficient encoding of Unicode character strings that recognizes the fact that the majority of text-based communications are in ASCII, and therefore optimizes the encoding of these characters.

Each string is encoded as a 2-byte length followed by the encoded characters. The length is written in network byte order (high byte first) and gives the number of bytes in the encoded form, which may be larger than the number of characters in the string.


[lenHI][lenLO] {encoded characters}
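
The length prefix can be observed with java.io.DataOutputStream.writeUTF, which writes strings in this layout. The following is a minimal sketch; the class name LengthPrefixDemo and the helper methods writeUtf and encodedLength are hypothetical names introduced for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

class LengthPrefixDemo {
    // Hypothetical helper: returns the bytes that writeUTF produces for s.
    static byte[] writeUtf(String s) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buffer);
            out.writeUTF(s);
            out.flush();
            return buffer.toByteArray();
        } catch (IOException e) {
            // writeUTF fails if the encoded form is longer than 65535 bytes.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: reads the 2-byte length, high byte first.
    static int encodedLength(byte[] record) {
        return ((record[0] & 0xFF) << 8) | (record[1] & 0xFF);
    }

    public static void main(String[] args) {
        // Three characters that need 1, 2, and 3 encoded bytes respectively.
        String s = "A\u03A9\u4E2D";
        byte[] record = writeUtf(s);
        System.out.println("characters:     " + s.length());           // 3
        System.out.println("encoded length: " + encodedLength(record)); // 6
        System.out.println("total bytes:    " + record.length);         // 8
    }
}
```

The example string has three characters but six encoded bytes, which is why the prefix counts bytes rather than characters.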

The individual characters are encoded according to the following table. ASCII characters are encoded as a single byte; characters in the range \u0080 to \u07FF, which includes Greek, Hebrew, and Arabic, are encoded as two bytes; and all other characters are encoded as three bytes. The variant of UTF-8 used by Java has one modification: the character \u0000 is encoded in two bytes rather than one, so that the encoded string never contains a zero byte.

Modified UTF-8 Encoding
Character Encoding
\u0000 [11000000][10000000] (Java)
\u0001 - \u007F [0][bits 0-6]
\u0080 - \u07FF [110][bits 6-10] [10][bits 0-5]
\u0800 - \uFFFF [1110][bits 12-15] [10][bits 6-11] [10][bits 0-5]
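
The bit layouts in the table above can be sketched as a small hand-written encoder. This is an illustration under stated assumptions, not the JDK's own implementation; the class name ModifiedUtf8Demo and the method encode are hypothetical. The method returns the 2-byte length followed by the encoded characters.

```java
import java.io.ByteArrayOutputStream;

class ModifiedUtf8Demo {
    // Hypothetical encoder following the table: 1, 2, or 3 bytes per char,
    // with U+0000 written as the two bytes C0 80.
    static byte[] encode(String s) {
        ByteArrayOutputStream body = new ByteArrayOutputStream();
        for (int i = 0; i < s.length(); i++) {
            int c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) {
                body.write(c);                          // [0][bits 0-6]
            } else if (c <= 0x07FF) {                   // also covers U+0000
                body.write(0xC0 | (c >> 6));            // [110][bits 6-10]
                body.write(0x80 | (c & 0x3F));          // [10][bits 0-5]
            } else {
                body.write(0xE0 | (c >> 12));           // [1110][bits 12-15]
                body.write(0x80 | ((c >> 6) & 0x3F));   // [10][bits 6-11]
                body.write(0x80 | (c & 0x3F));          // [10][bits 0-5]
            }
        }
        byte[] encoded = body.toByteArray();
        if (encoded.length > 0xFFFF) {
            throw new IllegalArgumentException("encoded form exceeds 65535 bytes");
        }
        byte[] result = new byte[encoded.length + 2];
        result[0] = (byte) (encoded.length >> 8);       // lenHI
        result[1] = (byte) encoded.length;              // lenLO
        System.arraycopy(encoded, 0, result, 2, encoded.length);
        return result;
    }

    public static void main(String[] args) {
        // A string holding only the NUL character.
        for (byte b : encode("\u0000")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println(); // prints: 00 02 C0 80
    }
}
```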

About the Author

Jeffrey Hunter is an Oracle Certified Professional, Java Development Certified Professional, Author, and an Oracle ACE. Jeff currently works as a Senior Database Administrator for The DBA Zone, Inc. located in Pittsburgh, Pennsylvania. His work includes advanced performance tuning, Java and PL/SQL programming, developing high availability solutions, capacity planning, database security, and physical / logical database design in a UNIX / Linux server environment. Jeff's other interests include mathematical encryption theory, tutoring advanced mathematics, programming language processors (compilers and interpreters) in Java and C, LDAP, writing web-based database administration tools, and of course Linux. He has been a Sr. Database Administrator and Software Engineer for over 20 years and maintains his own website at http://www.iDevelopment.info. Jeff graduated from Stanislaus State University in Turlock, California, with a Bachelor's degree in Computer Science and Mathematics.



Copyright (c) 1998-2017 Jeffrey M. Hunter. All rights reserved.

All articles, scripts and material located at the Internet address of http://www.idevelopment.info is the copyright of Jeffrey M. Hunter and is protected under copyright laws of the United States. This document may not be hosted on any other site without my express, prior, written permission. Application to host any of the material elsewhere can be made by contacting me at jhunter@idevelopment.info.

I have made every effort and taken great care in making sure that the material included on my web site is technically accurate, but I disclaim any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on it. I will in no case be liable for any monetary damages arising from such loss, damage or destruction.

Last modified on Wednesday, 28-Dec-2011 14:09:10 EST