Suffix Automaton Study Note
This is a review & note for studying suffix automaton.
What is a Suffix Automaton?
In computer science, a suffix automaton is the smallest partial deterministic finite automaton that recognizes the set of suffixes of a given string(computer_science).
The above definition is a short excerpt from Wikipedia about suffix automaton. From the definiton of deterministic finite automaton, we let $\mathrm{trans(s,ch)}$ be the state transferred from current state by adding a new character $\mathrm{ch}$; we let $\mathrm{null}$ be a empty state; we let $\mathrm{trans(s,str)}$ be the state transferred from current state by adding a new string $\mathrm{str}$. Then, we know that we can obtain $\mathrm{trans(s,str)}$ by adding $\mathrm{ch_i}\in\mathrm{str}$ one by one into the automaton.
From the definition on Wikipedia, it is obvious that a suffix automaton can recognize all suffixes of a given string. The word recognize is defined as a existence of a path in the automaton that the word formed by the path, by adding the characters on the path, is a suffix. Meanwhile, the suffix automaton can also be used to recognize all substrings.
Important Properties of Suffix Automaton
A obvious way is to insert all of the suffixes into a Trie tree. But sometime it may take $\mathrm{O(N^2)}$ space. Lets consider how to simplify the states (reduce the number of vertices). We denote $\mathrm{ST(str)}$ the state / vertex $\mathrm{trans(init,str)}$, where $\mathrm{init}$ is the initial state. Obviously, if $\mathrm{str}$ is not a substring of the original string, then $\mathrm{ST(str)} = \mathrm{null}$, since no suffix will be formed by adding character after the vertex.
Shall we build nodes for all substrings? The anwser is also NO, since the total number of states will also be $O(N^2)$. Lets denote the set of end position, which is the occurence index of substrings in the original string, for each vertex as $\mathrm{right(v)}$. Then, we can see that if $\mathrm{right(a)} = \mathrm{right(b)}$, then $\mathrm{ST(a)} = \mathrm{ST(b)}$, this simplifies the overall states. Let the length of substrings in the vertex be in range $[\min(\mathrm{v}),\max(\mathrm{v})]$. Consider two different state $\mathrm{a},\mathrm{b}$ and their right set $\mathrm{Ra},\mathrm{Rb}$. Then, the following property holds:
Property 1: If $\mathrm{Ra},\mathrm{Rb}$ are intersecting then one set is another’s proper subset.
Lets define a parent tree. where $\mathrm{fa} = \mathrm{Parent(v)}$. Let $\mathrm{right(fa)} = \bigcup_{\mathrm{v}\in \mathrm{child(fa)}}\mathrm{right(v)}$. We also have following properties.
Property 2: $\max(\mathrm{Parent(v)}) = \min(\mathrm{v}) – 1$.
Property 3: $\mathrm{trans(a,c) = b}$, then an edge from $\mathrm{a}\rightarrow \mathrm{b}$ marked with character $\mathrm{c}$.
Property 4: If $\mathrm{ST(s)}\neq \mathrm{null}$, then $\mathrm{s}$ must be a substring of the original string. Otherwise, $s$ is not a substring.
Property 5: The string denoted by $\mathrm{Parent(v)}$ is a suffix of the string denoted by $\mathrm{v}$.
Property 6: If $\mathrm{trans(v,ch)}\neq \mathrm{null}$, then $\mathrm{trans(Parent(v),ch)} \neq \mathrm{null}$.
Property 7: Within a node $\mathrm{v}$, the lengths of strings denoted by the vertex are in range $[\min(\mathrm{v}),\max(\mathrm{v})]$, and all of the lengths of strings denoted by the vertex within such range exist.
To be added on…