Text encoding defines how characters are represented as bytes in computer systems. In Linux, understanding encoding is crucial for proper file handling, internationalization, and avoiding data corruption when working with different languages and character sets.
Key Concepts
- Character Set: Collection of characters (ASCII, Unicode)
- Encoding: Method to represent characters as bytes (UTF-8, ISO-8859-1)
- Locale: System setting defining language, region, and encoding
- BOM: Byte Order Mark - optional marker at file start
- Code Point: Numeric value assigned to each character
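The difference between a code point and its encoded bytes can be seen with a small sketch. It assumes `iconv` and `od` are installed, and writes the UTF-8 bytes of é (U+00E9) as octal escapes so the script itself stays pure ASCII; the variable names are illustrative only.

```shell
# One code point, U+00E9 (é), has different byte sequences in different encodings.
# \303\251 are the two UTF-8 bytes for é, written as octal escapes.
utf8_bytes=$(printf '\303\251' | od -An -tx1 | tr -d ' \n')
latin1_bytes=$(printf '\303\251' | iconv -f UTF-8 -t ISO-8859-1 | od -An -tx1 | tr -d ' \n')

echo "UTF-8:      $utf8_bytes"     # two bytes: c3a9
echo "ISO-8859-1: $latin1_bytes"   # one byte:  e9
```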
Command Syntax
file [options] filename - Detect file encoding
iconv [options] -f from-encoding -t to-encoding file - Convert file between encodings
locale [options] - Display locale information
Common Options
-i - Show MIME type and encoding (file)
-f - Source encoding (iconv)
-t - Target encoding (iconv)
-o - Output file (iconv)
-c - Omit characters that cannot be converted from the output (iconv)
-l - List available encodings (iconv)
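As a rough illustration of the -c option listed above, the following sketch converts a string containing é and € to ASCII. It assumes glibc `iconv`; the sample text and the variable name are made up for the demo.

```shell
# é (\303\251) and € (\342\202\254) have no ASCII equivalent, so -c drops them.
# Note: glibc iconv may still return a non-zero exit status after discarding characters.
ascii_only=$(printf 'caf\303\251 \342\202\254 5\n' | iconv -c -f UTF-8 -t ASCII)
echo "$ascii_only"
```

Because -c discards data silently, it is probably best reserved for cases where losing the unconvertible characters is acceptable.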
Practical Examples
Example 1: Check file encoding
Shows the file contains UTF-8 encoded text
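A minimal sketch of such a check, assuming the `file` utility is installed. `sample.txt` is a hypothetical file created only for this demo.

```shell
# Create a small UTF-8 file; \303\251 are the UTF-8 bytes for é
printf 'caf\303\251\n' > sample.txt

# -i prints the MIME type together with the detected charset
file -i sample.txt

# -b --mime-encoding prints only the charset, which is handy in scripts
enc=$(file -b --mime-encoding sample.txt)
echo "$enc"
```

Keep in mind that `file` guesses from the file's bytes, so a short or pure-ASCII file may be reported as `us-ascii` even if it is meant to be UTF-8.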
Example 2: Convert encoding
Converts file from Latin-1 to UTF-8 encoding
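One way this conversion might look, using hypothetical file names `latin1.txt` and `utf8.txt`. The sketch uses shell redirection for the output; with glibc `iconv`, passing `-o utf8.txt` should have the same effect.

```shell
# Create a small Latin-1 input; \351 is é in ISO-8859-1
printf 'caf\351\n' > latin1.txt

# Convert to UTF-8, writing to a new file so the original stays untouched
iconv -f ISO-8859-1 -t UTF-8 latin1.txt > utf8.txt

# Inspect the result: é should now be the two bytes c3 a9
od -An -tx1 utf8.txt
```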
Example 3: List available encodings
Shows supported character encodings
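A short sketch of listing encodings. The full list is long, so the example filters it; the exact names and their formatting depend on the installed `iconv` implementation.

```shell
# Show only the first few entries of the (long) list
iconv -l | head -n 5

# Count and show entries that mention UTF, ignoring case
utf_count=$(iconv -l | grep -ci 'utf')
iconv -l | grep -i 'utf'
echo "entries mentioning UTF: $utf_count"
```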
Example 4: Check system locale
Displays current locale settings
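The commands below sketch a few ways to inspect the locale. The output differs from system to system, so none is shown here.

```shell
# All locale categories (LANG, LC_CTYPE, LC_TIME, ...) currently in effect
locale

# Only the character encoding of the current locale
charmap=$(locale charmap)
echo "$charmap"

# Locales installed on this system (first few entries)
locale -a | head -n 5
```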
Example 5: Set locale temporarily
Changes locale for current session
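A sketch of a temporary locale change. It assumes the `C.UTF-8` locale is available, which is common on current glibc-based distributions but not guaranteed; substitute a name reported by `locale -a` if it is missing.

```shell
# Affects only this shell session and the processes started from it
export LANG=C.UTF-8
echo "$LANG"

# Alternatively, scope the change to a single command
LC_ALL=C date
```

Nothing is written to system configuration here, so the setting disappears when the shell exits.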
Use Cases
- Converting files between different character sets
- Fixing corrupted text from wrong encoding
- Preparing files for international distribution
- Debugging character display issues
- Processing legacy files with old encodings
Related Commands
hexdump - View raw bytes in files
od - Octal/hex dump of file contents
chardet - Python library for encoding detection (its command-line tool is chardetect)
uconv - ICU encoding converter (alternative)
localectl - Control system locale settings
Tips & Troubleshooting
- UTF-8 is the default encoding on most modern Linux distributions
- Always backup files before encoding conversion
- Use the -c flag with iconv to handle invalid characters
- Check the $LANG and $LC_* environment variables
- Some terminals may not display all characters
- BOM can cause issues with UTF-8 files in Linux
- Test converted files thoroughly before deployment
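The BOM tip can be illustrated with a small sketch that detects a UTF-8 BOM (the bytes EF BB BF) and writes a copy without it. The file names `bom.txt` and `nobom.txt` are hypothetical, and `head -c` is assumed to be available, as it is in GNU coreutils.

```shell
# Create a file that starts with a UTF-8 BOM (\357\273\277 = EF BB BF)
printf '\357\273\277hello\n' > bom.txt

# Read the first three bytes as hex
first3=$(head -c 3 bom.txt | od -An -tx1 | tr -d ' \n')

if [ "$first3" = "efbbbf" ]; then
    # Copy everything from byte 4 onward, leaving the original as a backup
    tail -c +4 bom.txt > nobom.txt
    echo "BOM found and stripped"
fi
```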
Common Encoding Issues
- Mojibake: Garbled text from wrong encoding
- Question marks: Characters not in target set
- Missing characters: Incomplete conversion
- BOM problems: Windows UTF-8 files in Linux
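Mojibake can be reproduced, and in simple cases reversed, with `iconv` alone. The sketch below uses hypothetical file names and assumes the damage came from a single wrong conversion (UTF-8 bytes treated as ISO-8859-1); real-world corruption may involve several layers or lost bytes.

```shell
# Correct UTF-8 bytes for "café"
printf 'caf\303\251\n' > good.txt

# Wrong assumption: treat the UTF-8 bytes as ISO-8859-1 and re-encode to UTF-8
iconv -f ISO-8859-1 -t UTF-8 good.txt > garbled.txt
cat garbled.txt    # displays as "cafÃ©" on a UTF-8 terminal

# Undo the mistake by applying the inverse conversion
iconv -f UTF-8 -t ISO-8859-1 garbled.txt > repaired.txt
cmp good.txt repaired.txt && echo "round trip restored the original bytes"
```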