Unicode vs ASCII
ASCII is a 128-character code from the 1960s that covers English. Unicode covers every script in active use (Devanagari, Tamil, Chinese, and many more) plus symbols and emoji, with over 150,000 characters assigned.
ASCII: 128 characters, 1963
ASCII (American Standard Code for Information Interchange) is a 7-bit code table that assigns the numbers 0–127 to the English alphabet, digits, basic punctuation, and 33 control codes such as newline and tab. It fits in a single byte with one bit to spare. For early American computing that was enough; for the rest of the planet, obviously not. A whole zoo of single-byte "code pages" tried to bolt other scripts onto the spare top half (ISO-8859-1 for Western European languages, Windows-1251 for Cyrillic, and so on), and the result was a permanent mess of mojibake, garbled text that appears whenever a file is read with a different code page from the one it was written in.
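To make the 7-bit range and the mojibake problem concrete, here is a minimal JavaScript sketch. The helper names `isAscii` and `misreadAsLatin1` are illustrative, not part of any library. The second helper relies on the fact that ISO-8859-1 maps each byte to the code point with the same number, so `String.fromCharCode` can stand in for a real legacy decoder.

```javascript
// Every ASCII character has a code in the range 0..127, so it fits in 7 bits.
function isAscii(text) {
  for (const ch of text) {
    if (ch.codePointAt(0) > 0x7f) return false;
  }
  return true;
}

// Simulate mojibake: encode text as UTF-8, then read the bytes back as if
// they were ISO-8859-1, where each byte is one character with the same number.
function misreadAsLatin1(text) {
  const utf8Bytes = new TextEncoder().encode(text);
  return String.fromCharCode(...utf8Bytes);
}

console.log("A".charCodeAt(0));              // 65
console.log("\n".charCodeAt(0));             // 10 (the newline control code)
console.log(isAscii("Hello, world!"));       // true
console.log(isAscii("caf\u00E9"));           // false (é is outside ASCII)
console.log(misreadAsLatin1("caf\u00E9"));   // "cafÃ©"
```

The last line shows the classic symptom: the two UTF-8 bytes of é come back as two unrelated Latin-1 characters.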
Unicode: one number per character, for every script
Unicode, first published in 1991, gives every character in every writing system its own number: a code point, written like U+0041 for "A" or U+0928 for Devanagari न. As of 2026 the standard assigns more than 150,000 characters, covering more than 160 scripts as well as symbols, mathematical notation, and emoji. The code point space runs from U+0000 to U+10FFFF (1,114,112 values), leaving plenty of room for future additions.
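The U+XXXX notation is just the code point in hexadecimal, padded to at least four digits. A short JavaScript sketch, where `toCodePointLabel` is a hypothetical helper name chosen for illustration:

```javascript
// Format a single character's code point in the standard U+XXXX notation.
function toCodePointLabel(ch) {
  const hex = ch.codePointAt(0).toString(16).toUpperCase();
  return "U+" + hex.padStart(4, "0");
}

console.log(toCodePointLabel("A"));         // U+0041
console.log(toCodePointLabel("\u0928"));    // U+0928 (Devanagari न)
console.log(String.fromCodePoint(0x0928));  // न
console.log(0x10ffff + 1);                  // 1114112 values in the code point space
```

`codePointAt` and `String.fromCodePoint` are the built-ins that work with full code points; the older `charCodeAt` and `String.fromCharCode` only see 16-bit units.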
Code points vs encodings
A code point is an abstract number. To actually store or transmit it you need an encoding — a rule for turning numbers into bytes. The two that matter today:
- UTF-8 — variable-width, 1 to 4 bytes per code point. ASCII characters are exactly one byte and bit-identical to old ASCII, so any pure-ASCII file is also a valid UTF-8 file. Accented Latin letters and scripts such as Greek, Cyrillic, Hebrew, and Arabic take 2 bytes; most Indic and CJK characters take 3; emoji and rarer characters take 4. UTF-8 dominates the web (over 98% of pages) for exactly this reason: backwards compatibility plus efficiency for English-heavy text.
- UTF-16 — 2 or 4 bytes per code point. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) fit in 2 bytes; anything above uses a surrogate pair of two 16-bit units. Used internally by JavaScript strings, Java, Windows APIs, and .NET.
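The size differences between the two encodings can be checked directly. The sketch below assumes a runtime with the standard `TextEncoder` (browsers and recent Node.js); `utf8ByteLength` and `utf16ByteLength` are illustrative names.

```javascript
// Bytes needed in UTF-8 (TextEncoder always produces UTF-8).
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

// Bytes needed in UTF-16: JavaScript strings are sequences of 16-bit code
// units, so the byte count is twice the .length.
function utf16ByteLength(text) {
  return text.length * 2;
}

const sizeSamples = ["A", "\u00E9", "\u0928", "\u{1F600}"]; // A, é, न, 😀
for (const ch of sizeSamples) {
  console.log(ch, utf8ByteLength(ch), utf16ByteLength(ch));
}
// A  1 2
// é  2 2
// न  3 2
// 😀 4 4

// ASCII text produces the same bytes in UTF-8 as in ASCII itself.
console.log(Array.from(new TextEncoder().encode("Hi"))); // [ 72, 105 ]
```

English text is half the size in UTF-8, while Devanagari or CJK text is somewhat smaller in UTF-16; that trade-off is the main practical difference between the two.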
Planes, BMP, and surrogates
The code-point space is divided into 17 planes of 65,536 points each. Plane 0 is the Basic Multilingual Plane (BMP) and holds the bulk of everyday characters. The supplementary planes hold historic scripts, CJK extensions, and — relevant for everyone — emoji, which mostly live in plane 1 starting around U+1F300. UTF-16's surrogate-pair mechanism exists specifically because the BMP is not big enough.
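The plane of a code point is simply its value divided by 65,536, and a surrogate pair splits the 20 bits left after subtracting 0x10000 into two 10-bit halves. A minimal sketch of that arithmetic (`planeOf` and `toSurrogatePair` are hypothetical helper names; JavaScript engines do this for you internally):

```javascript
// Plane number: each plane holds 0x10000 (65,536) code points.
function planeOf(codePoint) {
  return codePoint >> 16;
}

// Split a supplementary code point (above U+FFFF) into a UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError("not a supplementary code point");
  }
  const offset = codePoint - 0x10000;        // 20 bits remain
  const high = 0xd800 + (offset >> 10);      // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);     // bottom 10 bits
  return [high, low];
}

const grinningFace = 0x1f600; // 😀
console.log(planeOf(0x0041));        // 0 (the BMP)
console.log(planeOf(grinningFace));  // 1
console.log(planeOf(0x10ffff));      // 16 (the last of the 17 planes)

const [highUnit, lowUnit] = toSurrogatePair(grinningFace);
console.log(highUnit.toString(16), lowUnit.toString(16)); // d83d de00
console.log(String.fromCharCode(highUnit, lowUnit) === String.fromCodePoint(grinningFace)); // true
```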
Normalisation
Some characters can be written more than one way. é can be a single code point U+00E9, or the letter e followed by a combining accent U+0301. The two look identical and mean the same thing, but compare as different strings. Unicode defines normalisation forms — NFC (composed) and NFD (decomposed) are the common ones — to convert between them so string comparisons behave.
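In JavaScript the built-in `String.prototype.normalize` performs this conversion. The sketch below shows the two spellings of é failing a direct comparison and matching once both are normalised; `sameText` is an illustrative helper, not a built-in.

```javascript
const composed = "\u00E9";     // é as one code point
const decomposed = "e\u0301";  // e followed by a combining acute accent

console.log(composed === decomposed);                                    // false
console.log(composed.length, decomposed.length);                         // 1 2
console.log(composed.normalize("NFC") === decomposed.normalize("NFC"));  // true
console.log(decomposed.normalize("NFC").length);                         // 1
console.log(composed.normalize("NFD").length);                           // 2

// Illustrative helper: compare two strings after normalising both sides.
function sameText(a, b) {
  return a.normalize("NFC") === b.normalize("NFC");
}

console.log(sameText(composed, decomposed)); // true
```

Which form to use matters less than using the same one on both sides of every comparison, for example by normalising input once when it enters the system.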
The emoji-length trap
In JavaScript, "😀".length returns 2, not 1. The string length property counts UTF-16 code units, and that emoji is a surrogate pair. Things get stranger with modern emoji: a person with a skin tone is a base emoji plus a Fitzpatrick modifier, joined family emoji use zero-width joiners (U+200D), and a flag is two regional-indicator letters. A single visible glyph can be six or more code points. If you need a real grapheme count, use Intl.Segmenter rather than .length.
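The three ways of counting can be compared side by side. This sketch assumes a runtime that ships `Intl.Segmenter` (current browsers and recent Node.js builds with full ICU data); `graphemeCount` and `codePointCount` are illustrative names, and the emoji are written as escapes so the code points are visible.

```javascript
// Count user-perceived characters (grapheme clusters) with Intl.Segmenter.
function graphemeCount(text) {
  const segmenter = new Intl.Segmenter(undefined, { granularity: "grapheme" });
  return [...segmenter.segment(text)].length;
}

// Count code points: string iteration walks code points, not code units.
function codePointCount(text) {
  return [...text].length;
}

const emojiSamples = {
  grinning: "\u{1F600}",
  thumbsUpWithSkinTone: "\u{1F44D}\u{1F3FD}",
  familyOfFour: "\u{1F468}\u200D\u{1F469}\u200D\u{1F467}\u200D\u{1F466}",
  flagOfIndia: "\u{1F1EE}\u{1F1F3}",
};

// Columns: name, UTF-16 code units (.length), code points, graphemes
for (const [name, s] of Object.entries(emojiSamples)) {
  console.log(name, s.length, codePointCount(s), graphemeCount(s));
}
// grinning             2  1 1
// thumbsUpWithSkinTone 4  2 1
// familyOfFour         11 7 1
// flagOfIndia          4  2 1
```

Only the last column matches what a reader sees, which is why length limits and cursor movement in user-facing text are usually better based on graphemes.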
Toolkiya's fancy text generator uses Unicode's mathematical alphanumeric blocks and other styled ranges (the "styled" letters are real Unicode code points, which is why they paste anywhere), and the Base64 tool handles UTF-8 input correctly so non-ASCII text survives the round trip.
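As an illustration of what a UTF-8-safe Base64 round trip involves, here is one common approach in JavaScript. This is a sketch, not Toolkiya's implementation: the function names are hypothetical, and it assumes the global `btoa`, `atob`, `TextEncoder`, and `TextDecoder` are available (browsers and recent Node.js). The key step is converting text to UTF-8 bytes first, because `btoa` only accepts characters in the range 0 to 255.

```javascript
// Encode: text -> UTF-8 bytes -> one character per byte -> Base64.
function utf8ToBase64(text) {
  const bytes = new TextEncoder().encode(text);
  let binary = "";
  for (const b of bytes) binary += String.fromCharCode(b);
  return btoa(binary);
}

// Decode: Base64 -> one character per byte -> bytes -> UTF-8 text.
function base64ToUtf8(b64) {
  const binary = atob(b64);
  const bytes = Uint8Array.from(binary, (c) => c.charCodeAt(0));
  return new TextDecoder("utf-8").decode(bytes);
}

console.log(utf8ToBase64("\u00E9"));                      // w6k=
const greeting = "\u0928\u092E\u0938\u094D\u0924\u0947";  // नमस्ते
console.log(base64ToUtf8(utf8ToBase64(greeting)) === greeting); // true
```

Calling `btoa` directly on the Devanagari string would throw, since its characters fall outside the single-byte range.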