Unicode vs ASCII

ASCII is a 1960s code of 128 characters: English letters, digits, punctuation, and control codes. Unicode covers virtually every script in active use (Devanagari, Tamil, Chinese, and many more), plus emoji, with over 150,000 assigned characters.

ASCII: 128 characters, 1963

ASCII (American Standard Code for Information Interchange) is a 7-bit code table that assigns numbers 0–127 to the English alphabet, digits, basic punctuation, and a handful of control codes like newline and tab. It fits in a single byte with one bit to spare. For early American computing that was enough; for the rest of the planet, obviously not. A whole zoo of single-byte "code pages" tried to bolt other scripts onto the spare top half (ISO-8859-1 for Western European, Windows-1251 for Cyrillic, and so on), and the result was a permanent mess of mojibake whenever a file crossed a language boundary.
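As a rough illustration of the code-page problem described above, the sketch below encodes a word under ISO-8859-1 and then decodes the same bytes under Windows-1251. The sample word and variable names are made up for the example; the codec names (`latin-1`, `cp1251`) are the ones Python's standard library uses for those two code pages.

```python
# One byte value, two meanings: 0xE9 depends on which code page you assume.
text = "caf\u00e9"                      # "cafe" with an acute accent on the e
raw = text.encode("latin-1")            # ISO-8859-1 stores the accented e as 0xE9

as_latin1 = raw.decode("latin-1")       # right table: the original word comes back
as_cp1251 = raw.decode("cp1251")        # wrong table: 0xE9 is Cyrillic U+0439 here

print(raw)                              # b'caf\xe9'
print(as_latin1 == text)                # True
print(as_cp1251 == text)                # False, the last letter has changed

# Pure ASCII only ever uses values 0-127, so the top bit of each byte is 0.
ascii_bytes = "Hello, world!\n".encode("ascii")
print(all(b < 128 for b in ascii_bytes))  # True
```

Nothing in the bytes themselves records which table was used, which is why files drifted into mojibake once they left the machine that wrote them.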

Unicode: one number per character, for every script

Unicode, first published in 1991, gives every character in every writing system its own number: a code point, written like U+0041 for "A" or U+0928 for the Devanagari letter न. As of 2026 the standard defines more than 150,000 assigned characters across more than 160 scripts, plus symbols, mathematical notation, and emoji. The code point space goes up to U+10FFFF, leaving plenty of room for future additions.
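To make the "one number per character" idea concrete, here is a small sketch using Python's built-in `ord`, `chr`, and the standard `unicodedata` module. The helper name `u_plus` is invented for this example.

```python
import unicodedata

def u_plus(ch: str) -> str:
    """Format a single character's code point in U+XXXX notation."""
    return f"U+{ord(ch):04X}"

print(u_plus("A"))                     # U+0041
print(u_plus("\u0928"))                # U+0928
print(unicodedata.name("\u0928"))      # DEVANAGARI LETTER NA
print(unicodedata.name(chr(0x1F600)))  # GRINNING FACE

# U+10FFFF is the last valid code point; one past it is rejected.
last = chr(0x10FFFF)
try:
    chr(0x110000)
    beyond_range_ok = True
except ValueError:
    beyond_range_ok = False
print(beyond_range_ok)                 # False
```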

Code points vs encodings

A code point is an abstract number. To actually store or transmit it you need an encoding — a rule for turning numbers into bytes. The two that matter today:

  • UTF-8: variable-width, 1 to 4 bytes per code point. ASCII characters are exactly one byte and bit-identical to old ASCII, so any pure-ASCII file is also a valid UTF-8 file. Most other scripts take 2 or 3 bytes (accented Latin, Greek, and Cyrillic take 2; Devanagari, Tamil, and most CJK take 3); emoji and rarer characters take 4. UTF-8 dominates the web (well over 98% of pages) for exactly this reason: backwards compatibility plus efficiency for English-heavy text.
  • UTF-16: 2 or 4 bytes per code point. Characters in the Basic Multilingual Plane (U+0000–U+FFFF) fit in 2 bytes; anything above uses a surrogate pair of two 16-bit units. Used internally by JavaScript strings, Java, Windows APIs, and .NET.
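The byte counts in the list above can be checked directly. This is a minimal sketch; `byte_lengths` and the sample labels are names chosen for the example, and `utf-16-le` is used so that Python does not prepend a byte-order mark to the UTF-16 output.

```python
def byte_lengths(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for a string, no BOM."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

samples = {
    "A": "A",               # U+0041, plain ASCII
    "e-acute": "\u00e9",    # U+00E9, accented Latin
    "na": "\u0928",         # U+0928, Devanagari
    "grin": "\U0001F600",   # U+1F600, emoji outside the BMP
}

for label, ch in samples.items():
    utf8, utf16 = byte_lengths(ch)
    print(f"{label:8} U+{ord(ch):04X}  utf-8={utf8}  utf-16={utf16}")

# A pure-ASCII string produces identical bytes under both encodings' names.
ascii_is_utf8 = "plain text".encode("ascii") == "plain text".encode("utf-8")
print(ascii_is_utf8)        # True
```

The UTF-8 column grows 1, 2, 3, 4 while the UTF-16 column stays at 2 until the code point leaves the BMP, where it jumps to 4.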

Planes, BMP, and surrogates

The code-point space is divided into 17 planes of 65,536 points each. Plane 0 is the Basic Multilingual Plane (BMP) and holds the bulk of everyday characters. The supplementary planes hold historic scripts, CJK extensions, and — relevant for everyone — emoji, which mostly live in plane 1 starting around U+1F300. UTF-16's surrogate-pair mechanism exists specifically because the BMP is not big enough.
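The plane number and the surrogate pair both fall out of simple bit arithmetic. The sketch below follows the standard UTF-16 rule (subtract 0x10000, split the remaining 20 bits into two 10-bit halves); the function names `plane` and `to_surrogates` are invented for the example.

```python
def plane(cp: int) -> int:
    """Each plane holds 65,536 (2**16) code points, so the plane is cp // 65536."""
    return cp >> 16

def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into a UTF-16 (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = cp - 0x10000              # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

cp = 0x1F600                           # an emoji in plane 1
high, low = to_surrogates(cp)
print(plane(0x0928), plane(cp))        # 0 1
print(f"{high:04X} {low:04X}")         # D83D DE00

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
expected = high.to_bytes(2, "big") + low.to_bytes(2, "big")
print(chr(cp).encode("utf-16-be") == expected)  # True
```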

Normalisation

Some characters can be written more than one way. é can be a single code point U+00E9, or the letter e followed by a combining accent U+0301. The two look identical and mean the same thing, but compare as different strings. Unicode defines normalisation forms — NFC (composed) and NFD (decomposed) are the common ones — to convert between them so string comparisons behave.
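A short sketch of that comparison problem, using `unicodedata.normalize` from Python's standard library. The helper `same_text` is a made-up name, and choosing NFC for the comparison is an assumption; NFD would work equally well as long as both sides use the same form.

```python
import unicodedata

composed = "\u00e9"       # single code point U+00E9
decomposed = "e\u0301"    # letter e followed by combining acute accent U+0301

print(composed == decomposed)           # False
print(len(composed), len(decomposed))   # 1 2

def same_text(a: str, b: str) -> bool:
    """Compare two strings after converting both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(same_text(composed, decomposed))  # True

# NFC composes, NFD decomposes; each maps one spelling onto the other.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True
```

Normalising once at the point where text enters a system (form input, file import) tends to be simpler than normalising at every comparison.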

The emoji-length trap

In JavaScript, "😀".length returns 2, not 1. The string length property counts UTF-16 code units, and that emoji is a surrogate pair. Things get stranger with modern emoji: a person with a skin tone is a base emoji plus a Fitzpatrick modifier, joined family emoji use zero-width joiners (U+200D), and a flag is two regional-indicator letters. A single visible glyph can be six or more code points. If you need a real grapheme count, use Intl.Segmenter rather than .length.
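The same counts can be reproduced outside JavaScript. In the Python sketch below, `len()` counts code points, and `utf16_units` (a helper named for this example) counts UTF-16 code units, which is the number JavaScript's .length would report. The standard library has no grapheme segmenter as far as I know, so the "one visible glyph" count is stated in comments rather than computed.

```python
def utf16_units(s: str) -> int:
    """Number of 16-bit code units in the UTF-16 encoding of s (no BOM)."""
    return len(s.encode("utf-16-le")) // 2

# Each sample is normally rendered as ONE visible glyph.
samples = {
    "grinning face": "\U0001F600",
    "thumbs up + skin tone": "\U0001F44D\U0001F3FD",
    "flag (two regional indicators)": "\U0001F1EE\U0001F1F3",
    "family (four people, three joiners)":
        "\U0001F468\u200D\U0001F469\u200D\U0001F467\u200D\U0001F466",
}

counts = {label: (len(s), utf16_units(s)) for label, s in samples.items()}

for label, (points, units) in counts.items():
    print(f"{label:36} code points={points}  utf-16 units={units}")
```

The family emoji comes out as 7 code points and 11 UTF-16 units, so neither number matches what a reader sees on screen.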

Toolkiya's fancy text generator uses Unicode's mathematical alphanumeric blocks and other styled ranges. The "styled" letters are real Unicode code points, which is why they paste anywhere. The Base64 tool handles UTF-8 input correctly, so non-ASCII text survives the round trip.
