More notes I’ve synthesized from Chapter 3: Scanning in Thain 2e (2020); and the classic Dragon Book 2e (2006) by Aho, Lam, Sethi, and Ullman.
Aho 2006: 3.3.1, p. 117
- An Alphabet is any finite set of symbols. The set
{0,1}is the binary alphabet. Unicode, which includes approximately 100,000 characters is another example of an alphabet. - A String over an alphabet is a finite set of symbols drawn from that alphabet.
- The length of the string s is the number of occurrences in symbols in s.
- Length of string s is denoted by |s|.
- A Language is any countable set of strings over some fixed alphabet. Examples:
- L(𝜖) contains only one element: a string with length 0. Aka an “empty string”.
- L( {} ) has no elements. Called the empty set. Also denoted by L(∅).
- The set of all syntatically well-formed C programs. Quite hard to specify precisely and formally.
- The set of all grammatically correct English sentences. Quite hard to specify precisely and formally.
- Note that a language just denotes syntax. There is no meaning (semantics) attached to them. Semantics and meaning are covered in Aho Chapter 5.
- p. 119: definitions for prefix, suffix, substring, proper versions of same, and subsequence.
Interlude on Hierarchy
- Big picture: Symbols live inside strings; strings live inside languages; regex’s describe languages.
Simple hierarchy
Level 0 - Alphabet
Level 1 - Symbols
Level 2 - Strings
Level 3 - Languages
Level 4 - Regex
Rules that define regular expressions
- These rules define regex’s over some alphabet Σ and the resulting languages those expressions denote.
Basis
- Two rules form a basis:
- 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of
length = 0. - If a is a symbol in alphabet Σ, then a is a regular expression.
- Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has
length = 1.
- Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has
- 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of
- By convention, Aho uses italics for symbols; bold for the corresponding regexs.
Induction
- There are 4 parts to induction whereby we use smaller regexs to build larger regexs. Or conversely, decompose larger regexs into smaller sub-regexs.
- Let r, s be regular expressions denoting the languages L(r) and L(s) respectively.
registers
lowercase epsilon 𝜖 Union symbol ∪ Upper case sigma Σ
“ep = 𝜖 “up = ∪ “sp = Σ
“lp = L “dp = D “mp = element of; ∈
macros
@e = turn exactly one character bold