Text Encoding

Understanding text encoding including UTF-8, ASCII, and character set conversions for internationalization

Text encoding defines how characters are represented as bytes in computer systems. In Linux, understanding encoding is crucial for proper file handling, internationalization, and avoiding data corruption when working with different languages and character sets.

Key Concepts

  • Character Set: Collection of characters (ASCII, Unicode)
  • Encoding: Method to represent characters as bytes (UTF-8, ISO-8859-1)
  • Locale: System setting defining language, region, and encoding
  • BOM: Byte Order Mark, an optional marker (code point U+FEFF) at the start of a file
  • Code Point: Numeric value assigned to each character (for example, U+00E9 for é)
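
The difference between a code point and its encoded bytes is easiest to see by dumping the same character in two encodings. This is a minimal sketch, assuming iconv and od are installed; the helper name bytes_of is hypothetical.

```shell
#!/bin/sh
# Print the bytes read from stdin as space-separated hex (hypothetical helper).
bytes_of() {
    od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# 'é' (U+00E9) is written as explicit UTF-8 bytes via octal escapes,
# so the script does not depend on the encoding of the script file itself.
utf8=$(printf '\303\251' | bytes_of)
latin1=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | bytes_of)

echo "U+00E9 in UTF-8:      $utf8"
echo "U+00E9 in ISO-8859-1: $latin1"
```

The code point is the same in both cases; only the byte representation differs (two bytes in UTF-8, one byte in ISO-8859-1).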

Command Syntax

file [options] filename - Detect file encoding
iconv [options] -f from-encoding -t to-encoding file - Convert a file between encodings
locale [options] - Display locale information

Common Options

-i - Show MIME encoding (with file command)
-f - Source encoding (iconv)
-t - Target encoding (iconv)
-o - Output file (iconv)
-c - Skip invalid characters (iconv)
-l - List available encodings (iconv)

Practical Examples

Example 1: Check file encoding

file -i document.txt
document.txt: text/plain; charset=utf-8

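Shows that the file contains UTF-8 encoded text. The result is a guess based on the file's content, so a file containing only ASCII characters is reported as us-ascii.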
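
In scripts it is often more convenient to get only the charset name. This is a minimal sketch, assuming the installed file command supports -b and --mime-encoding, as the common Linux implementation does. The function name detect_encoding and the sample file are hypothetical.

```shell
#!/bin/sh
# Print only the charset that file(1) reports for the given path.
detect_encoding() {
    file -b --mime-encoding -- "$1"
}

# Create a small sample containing a non-ASCII character (é as UTF-8 bytes).
sample=$(mktemp)
printf 'caf\303\251\n' > "$sample"

enc=$(detect_encoding "$sample")
echo "detected: $enc"

# Branch on the result, for example to decide whether conversion is needed.
case "$enc" in
    utf-8|us-ascii) echo "no conversion needed" ;;
    *)              echo "consider converting with iconv" ;;
esac
rm -f "$sample"
```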

Example 2: Convert encoding

iconv -f ISO-8859-1 -t UTF-8 old.txt -o new.txt

Converts old.txt from Latin-1 (ISO-8859-1) to UTF-8 and writes the result to new.txt
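
A failed conversion can leave a partial output file. One cautious pattern is to convert into a temporary file and move it into place only if iconv succeeds. The sketch below assumes that approach; convert_file and the file names are hypothetical. It uses shell redirection instead of -o so that it also works with iconv builds that lack that option.

```shell
#!/bin/sh
# Convert file $1 from encoding $2 to encoding $3, writing the result to $4.
# The destination is only written if iconv exits successfully.
convert_file() {
    src=$1 from=$2 to=$3 dst=$4
    tmp=$(mktemp) || return 1
    if iconv -f "$from" -t "$to" "$src" > "$tmp"; then
        mv "$tmp" "$dst"
    else
        echo "conversion of $src failed; $dst not written" >&2
        rm -f "$tmp"
        return 1
    fi
}

# Demo: a Latin-1 file in which 'é' is the single byte 0xE9.
workdir=$(mktemp -d)
printf 'caf\351\n' > "$workdir/old.txt"
convert_file "$workdir/old.txt" ISO-8859-1 UTF-8 "$workdir/new.txt"

# Dump the converted bytes for inspection.
od -An -tx1 "$workdir/new.txt"
```

Because the source file is never modified, it also serves as the backup if the result turns out to be wrong.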

Example 3: List available encodings

iconv -l | head -5
ANSI_X3.4-1968
ANSI_X3.4-1986
ASCII
BIG5
BIG5-HKSCS

Shows supported character encodings

Example 4: Check system locale

locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"

Displays current locale settings

Example 5: Set locale temporarily

export LC_ALL=en_US.UTF-8

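Overrides all locale categories for the current shell session and the programs started from it. The setting is lost when the shell exits.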
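
To change the locale for a single command rather than the whole session, the assignment can be placed directly in front of that command. This sketch uses the C locale, which every POSIX system provides; the sample words are made up.

```shell
#!/bin/sh
# Sort sample words under the C locale for this one command only.
# The C locale compares by byte value, so uppercase ASCII letters
# sort before lowercase ones.
sorted=$(printf 'apple\nBanana\ncherry\n' | LC_ALL=C sort)
echo "$sorted"

# The prefix assignment did not change the shell's own environment:
# this prints whatever LC_ALL was before, or <unset> if it was empty.
echo "LC_ALL in shell: ${LC_ALL:-<unset>}"
```

Under a locale such as en_US.UTF-8 the same input usually sorts with apple first, which is why scripts that need a stable order often pin LC_ALL=C for the sort call.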

Use Cases

  • Converting files between different character sets
  • Fixing corrupted text from wrong encoding
  • Preparing files for international distribution
  • Debugging character display issues
  • Processing legacy files with old encodings

Related Commands

hexdump - View raw bytes in files
od - Octal/hex dump of file contents
chardet - Python tool for encoding detection
uconv - ICU encoding converter (alternative)
localectl - Control system locale settings
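
As an example of inspecting raw bytes, od can show whether a file begins with a UTF-8 BOM, which is the three bytes EF BB BF. This is a minimal sketch, assuming head -c is available (it is on GNU, BSD and BusyBox systems); has_bom and the demo file names are hypothetical.

```shell
#!/bin/sh
# Succeeds if the file starts with the UTF-8 BOM bytes EF BB BF.
has_bom() {
    [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# Demo file: a BOM followed by the text "hi".
workdir=$(mktemp -d)
printf '\357\273\277hi\n' > "$workdir/bom.txt"

# Show the first bytes the way od prints them.
head -c 5 "$workdir/bom.txt" | od -An -tx1

if has_bom "$workdir/bom.txt"; then
    # tail -c +4 starts output at the fourth byte, skipping the 3-byte BOM.
    tail -c +4 "$workdir/bom.txt" > "$workdir/clean.txt"
    echo "BOM found and removed"
fi
```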

Tips & Troubleshooting

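  • UTF-8 is the default encoding on most modern Linux distributions
  • Always back up files before converting their encoding
  • The -c flag makes iconv silently drop characters it cannot convert, so use it only when that loss is acceptable
  • Check the $LANG and $LC_* environment variables when text displays incorrectly
  • Some terminals or fonts may not display all characters
  • A BOM at the start of a UTF-8 file can confuse tools that do not expect it, for example the #! line of a script
  • Test converted files thoroughly before deployment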
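
The effect of -c can be seen by converting a character that has no ASCII equivalent. The sketch below assumes GNU (glibc) iconv; other implementations may report errors differently or substitute a placeholder character instead.

```shell
#!/bin/sh
# "café" as UTF-8 bytes; é (C3 A9) has no ASCII equivalent.
input='caf\303\251'

# Without -c, iconv is expected to exit with an error
# when it reaches the unconvertible character.
if printf "$input" | iconv -f UTF-8 -t ASCII >/dev/null 2>&1; then
    echo "converted cleanly"
else
    echo "strict conversion failed"
fi

# With -c, the unconvertible character is dropped from the output.
lossy=$(printf "$input" | iconv -c -f UTF-8 -t ASCII)
echo "lossy result: $lossy"
```

The lossy result no longer contains the é at all, which is why a backup of the original file matters before running a conversion with -c.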

Common Encoding Issues

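  • Mojibake: Garbled text caused by decoding bytes with the wrong encoding (for example, é displayed as Ã©)
  • Question marks: Characters that do not exist in the target character set were replaced during conversion
  • Missing characters: Incomplete conversion, for example characters dropped by iconv -c
  • BOM problems: UTF-8 files saved by Windows tools often start with a BOM that Linux tools do not expect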
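
Mojibake can be reproduced on purpose, which makes it easier to recognise and to undo. This is a minimal sketch, assuming iconv and od are installed. It misreads the UTF-8 bytes of é as ISO-8859-1 and then reverses the mistake.

```shell
#!/bin/sh
# 'é' in UTF-8 is the two bytes C3 A9.
# Declaring those bytes to be ISO-8859-1 treats each byte as its own
# character (Ã and ©); re-encoding them as UTF-8 yields the mojibake "Ã©".
garbled=$(printf '\303\251' | iconv -f ISO-8859-1 -t UTF-8)
garbled_hex=$(printf '%s' "$garbled" | od -An -tx1 | tr -d ' \n')
echo "mojibake: $garbled ($garbled_hex)"

# The damage is reversible as long as no bytes were lost:
# converting back to ISO-8859-1 recovers the original two bytes.
restored_hex=$(printf '%s' "$garbled" | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')
echo "restored bytes: $restored_hex"
```

The same reverse conversion is a common first attempt at repairing text that was converted twice by mistake.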