DBpedia 2014

Matches in DBpedia 2014 for the triple pattern { ?s ?p o }, where o is the abstract literal of the CESU-8 resource:

CESU-8 abstract "The Compatibility Encoding Scheme for UTF-16: 8-Bit (CESU-8) is a variant of UTF-8 that is described in Unicode Technical Report #26 [1]. A Unicode code point from the Basic Multilingual Plane (BMP), i.e. a code point in the range U+0000 to U+FFFF, is encoded in the same way as in UTF-8. A Unicode supplementary character, i.e. a code point in the range U+10000 to U+10FFFF, is first represented as a surrogate pair, as in UTF-16, and then each surrogate code point is encoded in UTF-8. Therefore, CESU-8 needs six bytes (three bytes per surrogate) for each Unicode supplementary character, while UTF-8 needs only four. Each CESU-8 character code (1, 2, or 3 bytes) can be converted to exactly one UTF-16 code unit (2 bytes).

The encoding of Unicode supplementary characters works out to 11101101 1010yyyy 10xxxxxx 11101101 1011xxxx 10xxxxxx, where yyyy represents the top five bits of the code point minus one (U+10**** becomes 1111, U+01**** becomes 0000) and x represents the remaining bits of the code point.

CESU-8 is not an official part of the Unicode Standard, because Unicode Technical Reports are informative documents only. It should be used exclusively for internal processing and never for external data exchange.

CESU-8 is similar to Java's Modified UTF-8 but does not have the special encoding of the NUL character (U+0000).

The Oracle and MySQL databases both have character sets called "UTF-8" which are actually CESU-8. Standard UTF-8 can be obtained using the character sets "AL32UTF8" (since Oracle version 9.0) or "utf8mb4" (in MySQL), so named because a maximum of 4 bytes is used for each character.".
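
The encoding rule in the abstract can be illustrated with a short Python sketch. It is not taken from DBpedia or from Unicode Technical Report #26: the function names `cesu8_encode` and `cesu8_decode` are hypothetical, and the decoder leans on the assumption that Python's `surrogatepass` error handler is available to read and write surrogate code points.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8 (illustrative sketch, hypothetical helper)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp < 0x10000:
            # BMP code points are encoded exactly as in UTF-8.
            out += ch.encode("utf-8", "surrogatepass")
        else:
            # Supplementary code point: split into a UTF-16 surrogate pair.
            v = cp - 0x10000
            high = 0xD800 | (v >> 10)    # top 10 bits
            low = 0xDC00 | (v & 0x3FF)   # bottom 10 bits
            for s in (high, low):
                # Each surrogate becomes a three-byte UTF-8 style sequence.
                out += bytes([
                    0xE0 | (s >> 12),
                    0x80 | ((s >> 6) & 0x3F),
                    0x80 | (s & 0x3F),
                ])
    return bytes(out)


def cesu8_decode(data: bytes) -> str:
    """Decode CESU-8 by reading surrogates, then pairing them via UTF-16."""
    with_surrogates = data.decode("utf-8", "surrogatepass")
    utf16 = with_surrogates.encode("utf-16-le", "surrogatepass")
    return utf16.decode("utf-16-le")


encoded = cesu8_encode("\U00010400")
print(encoded.hex(" "))                     # ed a0 81 ed b0 80
print("\U00010400".encode("utf-8").hex(" "))  # f0 90 90 80
```

For U+10400 the sketch yields six bytes, against four in standard UTF-8, which matches the byte counts stated in the abstract. The decoder accepts any input that the `surrogatepass` handler tolerates, so it is more lenient than a strict CESU-8 validator would be.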