Regular Expressions, Part 2

More notes I’ve synthesized from Chapter 3: Scanning in Thain 2e (2020); and the classic Dragon Book 2e (2006) by Aho, Lam, Sethi, and Ullman.

Aho 2006: 3.3.1, p. 117

An Alphabet is any finite set of symbols. The set {0,1} is the binary alphabet. Unicode, which includes approximately 100,000 characters is another example of an alphabet.
A String over an alphabet is a finite set of symbols drawn from that alphabet.
- The length of the string s is the number of occurrences in symbols in s.
- Length of string s is denoted by |s|.
A Language is any countable set of strings over some fixed alphabet. Examples:
1. L(𝜖) contains only one element: a string with length 0. Aka an “empty string”.
2. L( {} ) has no elements. Called the empty set. Also denoted by L(∅).
3. The set of all syntatically well-formed C programs. Quite hard to specify precisely and formally.
4. The set of all grammatically correct English sentences. Quite hard to specify precisely and formally.
Note that a language just denotes syntax. There is no meaning (semantics) attached to them. Semantics and meaning are covered in Aho Chapter 5.
p. 119: definitions for prefix, suffix, substring, proper versions of same, and subsequence.

Big picture: Symbols live inside strings; strings live inside languages; regex’s describe languages.

These rules define regex’s over some alphabet Σ and the resulting languages those expressions denote.

Two rules form a basis:
1. 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of length = 0.
2. If a is a symbol in alphabet Σ, then a is a regular expression.
  - Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has length = 1.
By convention, Aho uses italics for symbols; bold for the corresponding regexs.

There are 4 parts to induction whereby we use smaller regexs to build larger regexs. Or conversely, decompose larger regexs into smaller sub-regexs.
Let r, s be regular expressions denoting the languages L(r) and L(s) respectively.

lowercase epsilon 𝜖 Union symbol ∪ Upper case sigma Σ

“ep = 𝜖 “up = ∪ “sp = Σ

“lp = L “dp = D “mp = element of; ∈

@e = turn exactly one character bold