Here are some notes I’ve synthesized from Chapter 3: Scanning in Thain 2e (2020); and the classic Dragon Book 2e (2006) by Aho, Lam, Sethi, and Ullman.
Aho 2006: 3.3.1, p. 117
- An Alphabet is any finite set of symbols. The set
{0,1}is the binary alphabet. Unicode, which includes approximately 100,000 characters is another example of an alphabet. - A String over an alphabet is a finite set of symbols drawn from that alphabet.
- The length of the string s is the number of occurrences in symbols in s.
- Length of string s is denoted by |s|.
- A Language is any countable set of strings over some fixed alphabet. Examples:
- L(𝜖) contains only one element: a string with length 0. Aka an “empty string”.
- L( {} ) has no elements. Called the empty set. Also denoted by L(∅).
- The set of all syntatically well-formed C programs. Quite hard to specify precisely and formally.
- The set of all grammatically correct English sentences. Quite hard to specify precisely and formally.
- Note that a language just denotes syntax. There is no meaning (semantics) attached to them. Semantics and meaning are covered in Aho Chapter 5.
- p. 119: definitions for prefix, suffix, substring, proper versions of same, and subsequence.
Concatenation p. 119
- If x and y are strings, then the concatenation of x and y is the string formed by appending y to x.
- Concatenation is denoted xy.
- This is not a commutative operation. i.e., xy ≠ yx
- Let x = dog, y = house. Then xy = doghouse.
- The empty string is the identity under the concatenation operation.
- For any string s, 𝜖s = s𝜖 = s.
- Further, we can carry out the concatenation operation multiple times up to infinity. This is called exponentiation of strings (this is infinite Union operation).
- Define s0 = 𝜖 the empty string.
- For all i > 0, define si = si-1s.
- Since 𝜖s = s, s1 = s.
- Since 𝜖s = s, s2 = ss.
- Since 𝜖s = s, s3 = sss.
- and so on.
Aho 2006: 3.3.2, p. 119
- In lexical analysis, the most important operations on languages are:
- Union
- Concatenation
- Closure
Operations on Languages, Fig. 3.6, p. 120
| Operation | Definition and Notation |
|---|---|
| Union of L and M | L ∪ M = { s | s is in L or s is in M } |
| Concatenation of L and M | LM = { st | s is in L and t is in M } |
| Kleene Closure of L | L* = U∞i=0 Li |
| Positive Closure of L | L+ = U∞i=1 Li |
Example 3.3
- Consider the two sets L and D.
- L is the set of letters {
A,B, …,Z,a,b, …,z} - D is the set of base 10 digits {
0,1, …,9}.
- L is the set of letters {
- On the one hand, we can simply think of these as sets of the lowercase/uppercase letters and of the digits, respectively.
- On the other hand, we can also think of them as Languages as defined above. The languages L and D have several strings as members; all of these strings have
length = 1. - Here are some other languages we can build using L and D:
- L ∪ D is a language containing 62 strings. Each string has a length = 1. Each of these strings is one of the English letters (lowercase and uppercase) plus 10 digits.
- LD is the set of 520 strings of
length = 2. Each string consists of one letter followed by one digit. - L4 is the set of all 4-letter strings.
- L* is the set of every string of letters, with
length∈{1, 2, 3, ...}. I guess since it contains 𝜖, the string of length zero, I should saylength∈{0, 1, 2, 3, ...}. - L( L ∪ D )* is the set of all strings of letters and digits beginning with a letter.
- D+ is the set of all strings that consist of one or more numbers (aka digits 0-9).
Aho 2006: 3.3.3, p. 120
Regular Expressions
- Consider the set of valid C language identifiers. This is almost exactly the language described above as L( L ∪ D )*;.
- Pretty much all strings of any length consisting of letters and/or numbers but starting with at least 1 letter.
- Just need to include the possibility of using underscore (
_) character.
- Let the identifier
let_represent any member of language L OR the underscore character. Aka, any upper or lowercase English letter or-. - Now, let the identifier
dig_represent any digit (aka any member of language D). - Further, let the pipe (
|) character represent the Union operation; let parentheses( )group subexpressions, and let the star (*) character represent zero or more occurrences of. - Finally, let the juxtaposition of
let_with(let_|dig_)*signify concatenation. - Regex’s are built recursivley out of smaller regular expressions.
- In the same way, we can think of each regex r as a specific language L(r).
- Thus, each language L(r) is defined recursively from other sublanguages which are denoted by the subexpressions of r.
Rules that define regular expressions
- These rules define regex’s over some alphabet Σ and the resulting languages those expressions denote.
Basis
- Two rules form a basis:
- 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of
length = 0. - If a is a symbol in alphabet Σ, then a is a regular expression.
- Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has
length = 1.
- Then L(a) = {a}. i.e., L(a) is a language with one element, the string a which has
- 𝜖 is a regular expression, and L(𝜖) is the language whose sole member is a string of
- By convention, Aho uses italics for symbols; bold for the corresponding regexs.
Induction
- There are 4 parts to induction whereby we use smaller regexs to build larger regexs. Or conversely, decompose larger regexs into smaller sub-regexs.
- Let r, s be regular expressions denoting the languages L(r) and L(s) respectively.
Example 3.4, Aho p. 122
registers
lowercase epsilon 𝜖 Union symbol ∪ Upper case sigma Σ
“ep = 𝜖 “up = ∪ “sp = Σ
“lp = L “dp = D “mp = element of; ∈
macros
@e = turn exactly one character bold