What Is UTF-8? Unicode Encoding Explained
UTF-8 is a variable-width character encoding for Unicode that uses 1–4 bytes per character. It is the dominant encoding on the web (roughly 98% of pages) and the default choice for files, APIs, and databases. UTF-8 is backward-compatible with ASCII — the first 128 characters are encoded identically — making it safe for legacy systems while supporting every language and emoji.
How UTF-8 Encoding Works
ASCII characters (U+0000–U+007F) use 1 byte. Latin and other European characters (U+0080–U+07FF) use 2 bytes. Most other characters, including CJK (U+0800–U+FFFF, excluding the surrogate range U+D800–U+DFFF), use 3 bytes. Supplementary characters, including most emoji (U+10000–U+10FFFF), use 4 bytes. The leading bits of the first byte indicate how many bytes the character uses: 0xxxxxxx (1 byte), 110xxxxx (2 bytes), 1110xxxx (3 bytes), 11110xxx (4 bytes). Every continuation byte has the form 10xxxxxx, which lets a decoder resynchronize mid-stream.
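The byte-length rules above can be checked directly. This is a minimal sketch; the sample characters are illustrative, one from each encoding range:

```python
# Inspect how many bytes UTF-8 uses per character, and the
# leading-bit pattern of the first byte in each sequence.
for ch in ["A", "é", "中", "😀"]:
    encoded = ch.encode("utf-8")
    # Show the code point, byte count, and first byte in binary
    print(f"U+{ord(ch):04X}: {len(encoded)} byte(s), "
          f"first byte = {encoded[0]:08b}")
```

Running this shows 1, 2, 3, and 4 bytes respectively, with first bytes matching the 0xxxxxxx, 110xxxxx, 1110xxxx, and 11110xxx patterns.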
UTF-8 vs UTF-16 vs Latin-1
UTF-16 uses 2 bytes for characters in the Basic Multilingual Plane and 4 bytes (a surrogate pair) for supplementary characters. It is used internally by JavaScript, Java, and Windows. It is less efficient than UTF-8 for ASCII-heavy text but more compact for CJK text, which needs 3 bytes per character in UTF-8 and only 2 in UTF-16. Latin-1 (ISO 8859-1) covers only 256 characters and cannot represent most of the world's scripts. Always use UTF-8 for files, APIs, and databases unless there is a specific reason for another encoding.
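The size trade-off is easy to measure. A minimal sketch, using illustrative sample strings (the `utf-16-le` codec is used so no BOM is added to the count):

```python
# Compare encoded sizes for ASCII-heavy vs CJK text.
ascii_text = "Hello, world"    # 12 ASCII characters
cjk_text = "こんにちは世界"     # 7 Japanese characters

for label, text in [("ASCII", ascii_text), ("CJK", cjk_text)]:
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # -le: no BOM prepended
    print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")

# Latin-1 simply cannot encode CJK text:
try:
    cjk_text.encode("latin-1")
except UnicodeEncodeError:
    print("Latin-1 cannot represent CJK characters")
```

The ASCII string is half the size in UTF-8 (12 vs 24 bytes), while the CJK string is smaller in UTF-16 (14 vs 21 bytes).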
BOM (Byte Order Mark)
Some editors prepend a BOM (U+FEFF, encoded as EF BB BF in UTF-8) to mark the file as UTF-8. While harmless in most contexts, a BOM can break PHP scripts, shell scripts, and parsers that don't expect leading bytes. Prefer UTF-8 without a BOM for all web and server files. In UTF-16, by contrast, the BOM serves a real purpose: indicating byte order (little-endian vs big-endian).
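When consuming files from editors that may have added a BOM, it is safest to strip it explicitly. A minimal sketch (the helper name `strip_utf8_bom` is ours, not a standard API):

```python
# Detect and remove a leading UTF-8 BOM from raw bytes.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM, if one is present."""
    if data.startswith(UTF8_BOM):
        return data[len(UTF8_BOM):]
    return data

raw = b"\xef\xbb\xbfhello"    # bytes of a file saved with a BOM
print(strip_utf8_bom(raw))    # b'hello'
```

Alternatively, decoding with Python's built-in `utf-8-sig` codec (`data.decode("utf-8-sig")`) accepts input with or without a BOM and strips it automatically.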