# Lecture: Regular Operations, Regular Expressions, and RegExp/NFA equivalence

Continuing from last time, we’ve shown that NFAs and DFAs are equivalent. We’re now well prepared to discuss what operations other than union and intersection that the regular languages are closed under. Having two different ways of representing the regular languages means that we can choose to present our constructions in terms of DFAs or NFAs, depending on which is easier.

So other than union and intersection, other operations that the regular languages are closed under are concatenation, Kleene star, and complement. We’ll go through and define each of these in turn and prove, by construction, that the regular languages are closed under each of them.

First concatenation: we define, for languages $L_1$ and $L_2$ the concatenation $L_1 \circ L_2 = \{ w_1 w_2 | w_1 \in L_1, w_2 \in L_2\}$. In words, the concatenation of two languages is a language that consists of strings of the first language followed by strings of the other. To get some intuition, let’s talk about what some simple concatenations are:

• $\emptyset \circ L = L \circ \emptyset = \emptyset$, why? Because there are no strings in the $\emptyset$ and thus there is no string that can be first or second (respectively) in the concatenation.
• $\{ \epsilon \} \circ L = L \circ \{ \epsilon \} = L$, because there is only the empty string in $\{ \epsilon \}$ and we know that for any $w \in L$ that $w \epsilon = \epsilon w = w$.Now, to prove that the regular languages are closed under concatenation we will assume that we have two regular languages $L_1$ and $L_2$ and that we have NFAs $M$ which decides $L_1$ and $N$ which decides $L_2$. We’ll describe in words how we take these two NFAs and make a new NFA that decides the concatenation. First, we take every accept state of $M$ and draw an $\epsilon$ transition from it to the start state of $N$. Then we take the old accept states of $M$ and demote them to regular states. That’s it! Pictorially, we can see it as being something like if $M$ is

and $N$ is something like

then after step one the concatenation looks like

and after step two will look like

The next construction we’ll look at is complement. Complement is probably what it sounds like, if you have a language $L$ then the complement $\bar{L} = \{ w | w \notin L\}$. Now, this might make you uncomfortable a touch. After all, I don’t find it inherently obvious that just because you can computably tell if a string is in a language that you can computably tell if it’s not in the language. In the case of regular languages, it’s actually pretty easy as we can show using DFAs. If we have a regular language $L$ and a DFA that decides it $M$, then we can construct a new DFA that decides the complement just by taking the complement of the set of accept states and leaving everything else the same. In other words, if a state was an accept state in $M$ then it’s not an accept state in $\bar{M}$ and visa-versa. We can see somewhat intuitively that this is the complement: if a string would end in an accept state of $M$, then it won’t be in an accept state of $M'$ and if it would not end in an accept state of $M$ then it will end in an accept state of $M'$. The NFA case is less simple, but the nice thing about knowing that NFAs and DFAs describe the exact same languages is that we can use whichever representation is the simplest for our purposes.

Finally, we need to describe the Kleene star. This one is slightly more complicated to describe but very simple to construct. For a language $L$, the Kleene star is $L^* = \{ w_1 \ldots w_n | n \geq 0, w_1 \ldots w_n \in L\}$. In words, the Kleene star operation takes a language and returns a new language that’s the concatenation of 0 or more strings in the language. Since “0” is an option, this means that whether or not the language $L$ contains the empty string $\epsilon$, the Kleene star of the language $L^*$ does contain $\epsilon$.

We’ll show the regular languages are closed under this operation using NFAs. In words, what we do is for our NFA $N$ we attach a new state and make it the start state and also an accept state, we make an $\epsilon$ transition from the new start state to the old starte state, and then we make $\epsilon$ transitions from each of the other accept states to the new start state. Essentially, we are making a loop out of our NFA that can be executed an arbitrary number of times. Why do we make the new start state also an accept state though? Well, it’s because we’ve insisted that the Kleene star always include the empty string and this is an easy way to guarantee that our new NFA represents will accept the empty string.

Now given that we have all of these operations, maybe there’s another way we can encode the regular languages in a way that is a bit more familiar: the regular expressions. Essentially, the idea of regular expressions is that we describe the entirety of the regular languages with an inductive type that includes only things that are obviously regular. So we’ll define the regular expressions as being made out of

• $a$ where $a$ is a character in $L$
• $R \circ R'$ where $R$ and $R'$ are regular expressions
• $R \cup R'$ where $R$ and $R'$ are regular expressions
• $R^*$ where $R$ is a regular expression
• $\emptyset$
• $\epsilon$

Now, intuitively we want the regular expressions to be exactly the regular languages. First, though, we should have a way to describe what it means for a regular expression to match a string. We can describe it in terms of expansions and we’ll do so inductively:

• $a$ expands into the literal character $a$
• $R \circ R'$ expands into $w w'$ where $w$ is an expansion of $R$ and $w'$ is an expansion of $R'$
• $R \cup R'$ expands into an either an expansion of $R$ or an expansion of $R'$
• $R^*$ expands into $\epsilon$ or it expands into $w w'$ where $w$ is an expansion of $R$ and $w'$ is an expansion of $R^*$
• $\emptyset$ expands into nothing
• $\epsilon$ expands into the empty string $\epsilon$

Now, we say that a string $w$ is accepted by a regular expression $R$ when there exists some expansion of $R$ that is equal to the string $w$. For example, if we have a regular expression $0^* \cup 1^*$ and we want to match it against the string $000$ we can expand the regular expressions as follows $0^* \cup 1^* \to 0^* \to 00^* \to 000^* \to 0000^* \to 00000^* \to 0000 \epsilon = 0000$. Let’s make sure that this notion of “expansion” is computable according to the informal criterion we’ve been having to use so far. As we can see, expansion only has a finite set of rules so we’re good on that front, and since we can terminate our expansion whenever we’re out of options or we’ve exceeded the length of the target string we only need finite data and finite time. This means that our ability to test whether a string is generated by a regular expression is computable.

So while we can intuitively believe that our definition of regular expressions does, in fact, describe regular languages we want to actually prove it. In order to prove it, we need do what we did for the DFA/NFA correspondence: we first show that we can take any regular expression and turn into into an NFA, then go back the other direction and take any NFA and show we can convert it into a regular expression that decides the same language.

We’ll start, again, with the easy direction: converting a regular expression to an NFA. We’ll define this inductively, that is case by case, over the structure of regular expressions.

• $a$ becomes the NFA that accepts the single character $a$
• $R \circ R'$ becomes the concatenation of the NFAs for $R$ and $R'$
• $R \cup R'$ becomes the union of the NFAs for $R$ and $R'$
• $R^*$ becomes the Kleene star of the NFA for $R$
• $\emptyset$ becomes the NFA for the empty set
• $\epsilon$ becomes the NFA for the language that only has the empty string

So, now for the hard direction which is converting NFAs to RegExps. The way we’ll do this is with the path $NFA \to DFA \to GNFA \to RegExp$. Gosh, GNFAs aren’t something we’ve seen yet are they? Let’s defined them. Informally, they are NFAs where we are allowed to have regular expressions as labels rather than simply characters. The idea being that the transition occurs when some prefix of the input string can be “consumed” as an expansion of the regular expression that labels the transition. We follow Sipser in our insistence that all our GNFAs meet the following conditions

• The start state has transition arrows going to every other state but no incoming arrows
• There is only a single accept state, distinct from the start state, and there is a transition from every other state to it
• Every other state has one transition to every other non-start/non-accept state including itself

Wow, those conditions might feel kinda weird, but they’re meant to make the construction as easy as possible. So the way our construction works is that we can take NFAs to DFAs with the powerset construction we’ve seen earlier, then we can turn DFAs into GNFAs, and ultimately turn GNFAs into RegExps in a principled way. We’ll cover that and the pumping lemma for next time.