Regular Expressions, Part 1

Here are some notes I’ve synthesized from Chapter 3: Scanning in Thain 2e (2020); and the classic Dragon Book 2e (2006) by Aho, Lam, Sethi, and Ullman.

Aho 2006: 3.3.1, p. 117

An Alphabet is any finite sequence of symbols. The set {0,1} is the binary alphabet. Unicode, which includes approximately 100,000 characters is another example of an alphabet.
A String over an alphabet is a finite set of symbols drawn from that alphabet.
- The length of the string s is the number of occurrences in symbols in s.
- Length of string s is denoted by |s|.
A Language is any countable set of strings over some fixed alphabet. Examples:
1. L(𝜖) contains only one element: a string with length 0. Aka an “empty string”.
2. L( {} ) has no elements. Called the empty set. Also denoted by L(∅).
3. The set of all syntactically well-formed C programs. Quite hard to specify precisely and formally.
4. The set of all grammatically correct English sentences. Quite hard to specify precisely and formally.
Note that a language just denotes syntax. There is no meaning (semantics) attached to them. Semantics and meaning are covered in Aho Chapter 5.
p. 119: definitions for prefix, suffix, substring, proper versions of same, and subsequence.

Concatenation p. 119

If x and y are strings, then the concatenation of x and y is the string formed by appending y to x.
Concatenation is denoted xy.
This is not a commutative operation. i.e., xy ≠ yx
Let x = dog, y = house. Then xy = doghouse.
The empty string is the identity under the concatenation operation.
- For any string s, 𝜖s = s𝜖 = s.
Further, we can carry out the concatenation operation multiple times up to infinity. This is called exponentiation of strings (this is infinite Union operation).
- Define s⁰ = 𝜖 the empty string.
- For all i > 0, define sⁱ = s^i-1s.
- Since 𝜖s = s, s¹ = s.
- Since 𝜖s = s, s² = ss.
- Since 𝜖s = s, s³ = sss.
- and so on.

Aho 2006: 3.3.2, p. 119

In lexical analysis, the most important operations on languages are:
1. Union
2. Concatenation
3. Closure

Operations on Languages, Fig. 3.6, p. 120

Operation	Definition and Notation
Union of L and M	L ∪ M = { s \| s is in L or s is in M }
Concatenation of L and M	LM = { st \| s is in L and t is in M }
Kleene Closure of L	L^* = U^∞_i=0 Lⁱ
Positive Closure of L	L⁺ = U^∞_i=1 Lⁱ

Example 3.3

Consider the two sets L and D.
- L is the set of letters {A, B, …, Z, a, b, …, z}
- D is the set of base 10 digits {0, 1, …, 9}.
On the one hand, we can simply think of these as sets of the lowercase/uppercase letters and of the digits, respectively.
On the other hand, we can also think of them as Languages as defined above. The languages L and D have several strings as members; all of these strings have length = 1.
Here are some other languages we can build using L and D:
1. L ∪ D is a language containing 62 strings. Each string has a length = 1. Each of these strings is one of the English letters (lowercase and uppercase) plus 10 digits.
2. LD is the set of 520 strings of length = 2. Each string consists of one letter followed by one digit.
3. L⁴ is the set of all 4-letter strings.
4. L^* is the set of every string of letters, with length ∈ {1, 2, 3, ...}. I guess since it contains 𝜖, the string of length zero, I should say length ∈ {0, 1, 2, 3, ...}.
5. L( L ∪ D )^* is the set of all strings of letters and digits beginning with a letter.
6. D⁺ is the set of all strings that consist of one or more numbers (aka digits 0-9).

Aho 2006: 3.3.3, p. 120

Regular Expressions

Consider the set of valid C language identifiers. This is almost exactly the language described above as L( L ∪ D )^*;.
- Pretty much all strings of any length consisting of letters and/or numbers but starting with at least 1 letter.
- Just need to include the possibility of using underscore (_) character.
Let the identifier letter_ represent any member of language L OR the underscore character. Aka, any upper or lowercase English letter or _.
Now, let the identifier digit represent any digit (aka any member of language D).
Further, let the pipe (|) character represent the Union operation; let parentheses ( ) group subexpressions, and let the star (^*) character represent zero or more occurrences of.
Finally, let the juxtaposition of letter_ with ( letter_ | digit )^* signify concatenation.
Regex’s are built recursively out of smaller regular expressions.
- In the same way, we can think of each regex r as a specific language L(r).
- Thus, each language L(r) is defined recursively from other sublanguages which are denoted by the subexpressions of r.

Rules that define regular expressions

These rules define regex’s over some alphabet Σ and the resulting languages those expressions denote.

Basis

Two rules form a basis:
1. 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of length = 0.
2. If a is a symbol in alphabet Σ, then a is a regular expression.
  - Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has length = 1.
By convention, Aho uses italics for symbols; bold for the corresponding regexs.

Induction

There are 4 parts to induction whereby we use smaller regexs to build larger regexs. Or conversely, decompose larger regexs into smaller sub-regexs.
Let r, s be regular expressions denoting the languages L(r) and L(s) respectively.