C Character Set
Backstory
A character is a unit of information, one byte in C, that represents a letter, a digit, or a special character such as ! or @. It seems simple, but it has a long history of competing standards such as EBCDIC and ASCII. Read on...
In the early days, there was an encoding system called Extended Binary-Coded Decimal Interchange Code (EBCDIC), developed by IBM. EBCDIC can represent 256 different characters. A few important features of EBCDIC are:
- Each character fits in 8 bits.
- Characters of the same type are not always contiguous; for example, the letters are split across several non-adjacent ranges.
- Different versions of EBCDIC are not compatible.
Later, in 1963, the ASCII encoding was published by the American Standards Association (ASA). ASCII was simpler and accommodated fewer characters than EBCDIC: it defines 128 characters and needs 7 bits to represent a single character.
Another Conflict
Most computers used 8-bit bytes, while ASCII requires only 7 bits (2^7 = 128 characters), which leaves one bit to spare. Soon, a few organizations developed their own conventions for the character codes [128, 255]. IBM developed the OEM character set, which included peculiar characters like |, Ã, Æ etc. IBM then varied the codes in [128, 255] from country to country. For example, character code 130 displays é in Europe, but it shows the Hebrew letter gimel (ג) in Israel. If this seems like a small issue, wait until Asian languages, with their thousands of characters, come into the picture! In these difficult times, a standard was slowly making its way...
Unicode Era
Instead of mapping a character directly to a bit pattern, Unicode assigns every character an abstract number called a code point, and leaves the byte-level representation to separate encodings. This lets Unicode accommodate far more characters than any 8-bit scheme: its code space runs from U+0000 to U+10FFFF, i.e., 1,114,112 code points. This article doesn't discuss the implementations of Unicode, but here are the key points to note:
- Unicode is just a standard. UTF-8, UTF-16, etc. are the actual encodings.
- Popular myth: UTF-8 requires 2 bytes (16 bits) to store a character, so at most 2^16 (65,536) characters can be represented. This is false. Some characters are stored in 1 byte, some in 2 or 3 bytes, and some require 4 bytes. (The original UTF-8 design allowed sequences of up to 6 bytes; RFC 3629 limits them to 4.)
- Representing a character is not as simple as converting its code into binary; each encoding defines its own byte layout.
- UTF-8 is a superset of ASCII, i.e., characters with ASCII codes in [0, 127] are encoded as a single byte with the same value.
Introduction of C Character Set
There are two main character sets in the C language.
- Source Character Set: This is the set of characters that can be used to write source code. Before the preprocessing phase, the very first translation step converts the source file's encoding into the Source Character Set (SCS). Eg: A, Tab, B, SPACE, \n, etc.
- Execution Character Set: This is the set of characters that can be interpreted by the running program. After the preprocessing phase, the encoding of character constants and string literals is converted into the Execution Character Set (ECS). Eg: A, B, \a, etc.
Basic Character Set
The Source and Execution Character Sets have a few characters in common. This set of common characters is called the Basic Character Set. Let's discuss it in more detail below:
Alphabets: Includes both uppercase and lowercase letters. The ASCII codes of uppercase letters are in the range [65, 90], whereas the ASCII codes of lowercase letters are in the range [97, 122]. Eg: A, B, a, b, etc.
- In ASCII, an uppercase letter and its lowercase counterpart differ by just one bit (0x20, i.e., 32).
- Utility functions: isalpha, islower, and isupper (declared in <ctype.h>) check whether a character is a letter, a lowercase letter, or an uppercase letter respectively. tolower and toupper convert a letter to the corresponding case.
Digits: Includes the digits from 0 to 9 (inclusive). The ASCII codes of the digits are in the range [48, 57]. Eg: 0, 1, 2, etc.
Utility functions: isdigit checks whether the input character is a digit. isalnum checks whether a character is an alphanumeric character.
Punctuation/Special Characters: The default C locale classifies the characters below as punctuation characters.
Utility functions: ispunct checks whether a character is a punctuation character. The table below lists all the punctuation characters along with their ASCII codes and use cases.