#5[ 4 String Data Types] char, nchar, varchar, nvarchar

Article #5: Everything you need to know about the "String" data type

Dec 10, 2024

This article contains 5 sections:
1. Introduction to Basics
2. Simplified Technical Description
3. String Data Types Explained
4. Detailed Technical Understanding
5. Technical Fact Sheet - for real world performance engineering

1. Introduction to Basics
Before you can learn about varchar, nvarchar, char, and nchar data types; there are 4 fundamental data types in which you, first - need to become familiar with.

4 fundamental data types
- Integer: stores numerical values (whole numbers), which can be either a negative, 0, or positive
- Decimal: stores a number which contains a fractional component (example: 1.8 or 1 plus 4/5)
- String: stores any textual data, which can includes numbered values
- Boolean: stores values that can be expressed as one of two possibilities (True/false, on/off, yes/no, 1/0)

Now that we have learned our 4 fundamental data types, you are ready to understand the 4 string data types - which are [varchar, nvcarchar, char, and nchar].

2. Simplified Technical Description

Char (character)
- Stores a fixed-length, non-unicode string

Nchar (national character)
- Stores a fixed-length, unicode string

Varchar (variable character)
- Stores a variable-length, non-unicode string

Nvarchar (national variable character)
- Stores a variable-length, unicode string

3. String Data Types Explained
Fixed-length strings: across various contexts (database systems, programming languages), fixed-length strings specify a set limit in which textual data can be contained and stored (max limit for allowable characters). The stored data cannot go beyond the specified, predetermined, and predefined storage capacity. In addition, this limit (fixed-length string) will be used up (occupy storage capacity) regardless if the contained textual data (characters) uses up the maximum capacity. For example: In SQL (programming language of data), the word “sign” is made up of 4 characters, which takes 4 bytes of storage to represent. But if the defined limit is 10 bytes [char(10)], there will be 10 bytes of storage used up, even if the string value (“sign”) only requires 4 bytes of storage to represent.

Non-unicode strings: non-unicode strings are textual data (characters which are made of glyphs, emoji’s, numbers, signs, and other kinds of symbols) that do not use a globally or universally recognized standard of encoding. “Encoding” refers to the processes and schemes used to implement representations of textual data (representing characters).

Unicode strings: “Unicode” is an internationally recognized standard for character encoding schemes. The most common standards for character encoding is UTF-8, UTF-16, and UTF-32. Character encoding schemes differ between unicode and non-unicode standards.

4. Detailed Technical Description
What relevant information do you need to know about the 4 string data types? You need to know how string data is stored and how string characters are encoded.

1st Concept - 32 Bits: in Microsoft SQL Server Management Studio (MSSQL), a 32-bit numerical system is commonly used to represent string and character data. If you calculate all possible unique values in a 32 bit string (1 value), you would calculate 2^32 for a total of 4,294,967,296 possible distinct/unique values. 32 bit strings require 4 bytes of data to store (logic: 32 bits / 8 bits = 4 bytes). Units of 32 bits (4 bytes of data), with 4.3E9 unique potential values, are used to represent unique character-to-byte mappings found in unicode and non-unicode character encoding schemes and standards.

2nd Concept - 32 bits for character encoding: This paragraph describes the mechanism for character encoding. If you have a 32-bit value (1 numerical value which can be around 4.3E9 possible distinct values), you have 1 character you can represent. If you have a string with a multiple characters; you will need one 32-bit value for each character, and each character will take 1 byte of storage. For example: if you needed to store the word “hello” into a char string data type, you would need FIVE [32-bit] numerical values to represent the word and 5 bytes to store the word.

For character encoding, string data types are treated the same as [“Signed” 32-bit integers (data type)] when calculating for {|1| maximum storage capacity for each string value, |2| for the storage used per given string, |3| the technical character limit for each string data type.}. Each character can be single-byte and/or multi-byte encoding. Single-byte encoding means that 1 character represented requires 1 byte of data to represent.
Signed vs Unsigned 32-bit integers storage capacity
Signed: an integer data type which has a negative or positive indicator
- 1 bit is taken out of the 32 bits in order to symbolize and represent a [positive / negative] integer
- Implication: 1 bit is either 0 [signifies negative integer] or 1 [signifies positive integer]
- Calculation: -(2^31) for max negative integer value and 2^31-1 for max positive integer value
- Range: -2,147,483,648 to 2,147,483,647
Unsigned: an integer data type without a positive or negative indicator, making all values positive
- Unsigned 32 bit integers use all bits to represent potential values
- Calculation: 2^32
- Range: 0 to 4,294,967,295
In summary, calculations for string data types in MSSQL are treated the same as “signed” 32-bit integers.

3rd Concept - Non-unicode character encoding
Non-unicode string data types such as char, varchar, and varchar(max) do not encode characters using the unicode standard. Non-unicode character encoding schemes use a combination of 2 parts in order to successfully representing unique character data contained in strings.
- Collation Settings: Collation setting are references which give you a set of rules for sorting and comparing character data. Examples includes setting rules for case sensitivity, accent sensitivity, and character ordering. Collation settings are independent from the character encoding process - they are references which sets the rules for sorting and comparing character data.
- Codepages: Codepages is the actual character-to-byte mapping which dictates the representation of non-unicode string characters to byte values (31-bit numerical values). Codepages are included in and are not independent from the collation settings.

Collation settings in the real world:
- single-byte [SQL_Latin1_General_CP1_CI_AS] collation with Codepage 1252 (Windows-1252)
- multi-byte [Japanese_CI_AS] collation with Codepage: 932 (Shift_JIS)

5. Technical Fact Sheet

Key facts to remember:
- char(n) and varchar(n) strings define n in bytes, not characters
- nchar(n) and nvarchar(n) define n in unicode characters, with each character requiring 2 bytes per character to store (n * 2 = nchar(n) and nvarchar(n) bytes)
- varchar(max) and nvarchar(max) can store a max total storage capacity of 2^31-1 bytes
Overhead bytes rules:
- Overhead bytes are leading byte length indicators, helping the computer calculate the character length of a given string
- Overhead bytes do not apply for char and nchar
- Overhead bytes are 1 byte for varchar(255) and nvarchar(255), assuming character length is 255 or less.
- Overhead bytes are 2 bytes for varchar(8000) and nvarchar(4000), assuming character length is 256 or more.
- Overhead bytes are 2 bytes for varchar(max) and nvarchar(max), consistently - regardless of character length

Calculations Fact Sheet

char(n) max storage capacity calculation
- n = 8000 bytes, overhead not applicable
- char(n) is fixed-length, always n bytes. Max n=8000. No overhead.

char(n) storage used calculation
- n = 1 to 8000 bytes, overhead not applicable
- For char(n), the entire length is always allocated. If you define char(8000), you always store 8000 bytes. No overhead needed.

nchar(n) max storage capacity calculation
- n = 4000 characters with 2 bytes each = 8000 bytes
- nchar(n) stores Unicode chars at 2 bytes/char. Max n=4000 → 4000 * 2 = 8000 bytes. No overhead.

nchar(n) max storage used calculation
- n = 1 to 4000 characters with 2 bytes per unicode character = 8000 bytes
- nchar(n) is fixed-length. If defined as 4000, it always uses 8000 bytes, regardless of the number of characters actually stored. No overhead.

varchar(n) max storage capacity calculation
- n = 8000 bytes max, 1 byte per non-Unicode char = 8000 bytes
- varchar(n) is defined in bytes. Max n=8000 bytes. Overhead exists but doesn't reduce n.

varchar(n) storage used calculation
- n = 1 to 8000 bytes + overhead. For max length (8000 chars), overhead = 2 bytes, total = 8000 + 2 = 8002 bytes
- If you store more than 255 bytes, overhead is 2 bytes. Data (8000 bytes) + overhead (2 bytes) = 8002 bytes total. The n itself is not reduced by overhead.

nvarchar(n) max storage capacity calculation
- n = 4000 characters with 2 bytes each = 8000 bytes
- Same logic as nchar(n). nvarchar(n) = n Unicode chars * 2 bytes/char. Max n=4000 → 8000 bytes.

nvarchar(n) storage used calculation
- n = 1 to 4000 chars at 2 bytes each = up to 8000 bytes; overhead = 2 bytes if >255 bytes, total at max cap = 8000 + 2 = 8002 bytes
- At smaller lengths ≤255 bytes, overhead would be 1 byte. At max length (4000 chars), definitely >255 bytes, so overhead is 2 bytes total: 8000 (data) + 2 (overhead) = 8002 bytes.

varchar(max) max storage capacity calculation
- 2^31-1 bytes total
- varchar(max) can store up to 2^31-1 bytes (~2GB), including overhead.

varchar(max) storage used calculation
- Minimum 1 char (1 byte) + 2 byte overhead = 3 bytes; up to 2^31-1 bytes total (max data = 2^31-3 bytes because of overhead)
- Overhead is always 2 bytes. If you store only 1 char: 1 byte data + 2 bytes overhead = 3 bytes total. At max capacity, 2^31-1 total bytes - 2 overhead = 2^31-3 bytes for characters.

varchar(max) max character limit calculation
- 2^31-3 bytes because 2 overhead bytes are included within the 2^31-1 bytes
- (single-byte scenario) For single-byte characters, the overhead reduces available character bytes by 2, resulting in 2^31-3 available bytes for characters. For multi-byte encodings, the character count would be fewer.

nvarchar(max) max storage capacity calculation
- 2^31-1 bytes total
- nvarchar(max) also has the same total byte limit, but remember each character is 2 bytes.

nvarchar(max) storage used calculation
- Minimum usage: 1 char (2 bytes) + 2 overhead = 4 bytes. At max, still 2^31-1 bytes total. (Note: nvarchar(max) is always Unicode, so each char is 2 bytes.)

nvarchar(max) max character limit calculation
- Total available bytes for characters = 2^31-3 after subtracting 2 overhead bytes.
- But each character is 2 bytes, so the max character count is (2^31-3)/2.

Beau’s Substack

Discussion about this post