Reworking my earlier notes on Chapter 3: Scanning in Thain 2e (2020); and the classic Dragon Book 2e (2006) by Aho, Lam, Sethi, and Ullman.
I. Simplified take
- Big picture: Symbols live inside strings; strings live inside languages; regex’s describe languages.
- Alternately:
- Alphabet Σ (Capital Sigma) is a finite set of symbols.
- A symbol is an element of Alphabet Σ.
- A string is a finite sequence of symbols from Σ.
- A language is a set of strings (language L is a subset of set Σ*).
- Regex’s are syntactic descriptions that denote languages using operations like:
- union (alternation)
- concatenation
- Kleene star / Kleene plus
II. Complete Hierarchy
Level 0 - Alphabet
- Alphabet Σ
- A finite, nonempty set of symbols.
- Example: Σ = {a, b}.
- Important: * Σ is a set of symbols.
- The elements in Σ are symbols, NOT strings.
Level 1 - Symbols
- An element of Σ. Example: a ∈ Σ.
- aka Symbol / character / atom.
Level 2 - Strings
- A string (or word) is a finite sequence of symbols drawn from alphabet Σ.
- Examples:
a;abba; the empty string with length = 0, denoted with lower-case epsilon 𝜖. - Σ* (Sigma Star) is the formal notation for “the set of all strings over alphabet Σ”.
- The main operation that generates Σ* is repeated concatenation.
- Σ* includes all those strings plus the empty string 𝜖.
- Key facts about empty string 𝜖:
- 𝜖 is a string.
- 𝜖 is an element of Σ* (pronounced “Sigma Star”).
- 𝜖 is not a set.
Level 3 - Languages
- A set of strings.
- Formally: a language L ⊆ Σ*.
- i.e., Any language L is a subset of the Sigma Star Σ* of the related Alphabet Σ.
- Examples:
- A language that only contains one element—the empty string: { 𝜖 }.
- A language with only 3 elements: (
a,ab,abb}. - An empty language (aka no members at all): denoted
{}or ∅.
Critical reminder about Levels 0-3
- Alphabet Σ is a set of symbols; NOT strings.
- Language L and Sigma Star Σ* are sets of strings.
Level 4 - Regex
- Regular expressions (regex) are syntactic objects
- Regex’s denote languages
- Regex’s are not:
- sets
- strings (in the language-theoretic sense)
- languages in and of themselves
- Regex are descriptions
- Example of regex a*. It describes a language L = { 𝜖,
a,aa,aaa, … } over the alphabet Σ.
III. What Level are Operators?
- This is a source of confusion. In fact, an operator can work on two different levels: on Strings (Level 2), and on Languages (Level 3).
- See Sections III.A (on Strings) and B (on Languages) below.
A. Operators on Strings
- At the level of strings (Level 2), operators work on individual sequences of symbols; NOT on sets.
- Consider concatenation, length, and common structural predicates on strings below.
1. Concatenation on Strings
- Given two strings x and y, the concatenation x ⋅ y is the string formed by appending y to x
- Concatenation is associative but not commutative.
- (x ⋅ y) ⋅ z = x ⋅ (y ⋅ z)
- x ⋅ y ≠ y ⋅ x
- The empty string 𝜖 is the identity element.
- x ⋅ 𝜖 = 𝜖 ⋅ x = x
2. Length on Strings
- The length of a string x is the number of symbols it contains.
- Length is denoted by |x|.
- For the empty string, |𝜖| =
0. - For any strings x, y: |x ⋅ y| = |x| + |y|
3. Predicates on Strings
- What about things like substring, suffix, prefix, subsequence, etc.?
- These familiar “string functions” from programming are—in this mathematical context—better called predicates on strings.
- In other words, these “functions familiar from programming” are not operators and are not mathematical functions.
- These predicates do not construct new strings; instead, they assert truth-valued properties about one string relative to another.
- e.g., “Is it true that String b is substring of String a?”
- Formally, these “programming functions” are binary predicates (aka “relations”) on the set Σ*.
- Again they are not operators or operations on the set Σ*.
B. Operators on Languages
- At the level of languages (Level 3), operators work on sets of strings and produce new languages.
1. Union on Languages
- Given two languages L and M, their union L ∪ M contains all strings that belong to either language.
2. Concatenation on Languages
- The concatenation of L and M is the new language: L ⋅ M = { xy | x ∈ L, y ∈ M }
- In other words, this new language contains all strings xy where x ∈ L and y ∈ M.
- This new language is similar to Cartesian Product of L and M; except that the output are not ordered pairs exactly, but are individual strings.
- Language concatenation applies string concatenation collectively to the entire set of strings in L and in M.
3. Kleene Star on Languages
- Given a language L, the Kleene Star operator L* is the set of all strings formed by concatenating zero or more strings from L.
- By definition, the empty string 𝜖 ∈ L*.
4. Kleene Plus on Languages
- Given a language L, the Kleene Plus operator L+ denotes one or more concatenations of strings from L.
- Formally, L+ = L ⋅ L*.
C. One sentence summary on Operators
- Operators on strings (Level 2) define how symbols combine; Operators on languages (Level 3) lift those definitions to sets of strings.
IV. In Brief
- Σ is the alphabet; a set of symbols.
- Σ* is a set of strings (including empty string 𝜖)—but not symbols—generated by any possible concatenation of symbols available in alphabet Σ.
- L ⊆ Σ* is a language.
- r is a syntactic object (a regex) denoting L.