character and we mark the last node as final. From then on all words are added
using a four step approach described below. Pseudocode for this algorithm can
be found in Listing~\ref{pseudodawg} named as the function
-\texttt{generate\_dawg(words)}
+\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be found
+in Listing~\ref{dawg.py}.
\begin{enumerate}
\item
Say we add word $w$ to the grahp. Step one is finding the
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}.
\begin{itemize}
- \item Initial\\
+ \item No words added yet\\
Initially we begin with the null graph. This graph is show in
the figure as SG0. This DAWG does not yet accept any words.
- \item \texttt{abcd}\\
+ \item Adding \texttt{abcd}\\
Adding the first entry \texttt{abcd} is trivial because we can
just create a single path which does not require any hard work.
This is because the common prefix we find in Step 1 is empty
back into the graph is also not possible since there are no
nodes except for the first node. The result of adding the
first word is visible in subgraph SG1.
- \item \texttt{aecd}\\
+ \item Adding \texttt{aecd}\\
For the second entry we will have to do some
extra work. The common prefix found in Step 1 is \texttt{a}
which we add to the graph. This leaves us in Step 2 with the
a common suffix \texttt{cd} and we can merge these nodes. In
this way we can reuse the transition from $q_3$ to $q_4$. This
leaves us with subgraph SG2.
- \item \texttt{aecf}\\
+ \item Adding \texttt{aecf}\\
We now add the last entry which is the word \texttt{aecf}. When
we do this without the confluence node checking we encounter an
unwanted extra word. In Step 1 we find the common prefix
\begin{figure}[H]
\label{dawg1}
\centering
- \includegraphics[width=0.9\linewidth]{inccons.eps}
+ \includegraphics[width=0.8\linewidth]{inccons.eps}
\strut\\\strut\\
\caption{Incrementally constructing a DAWG}
\end{figure}
\subsection{Appliance on extraction of patterns}
The text data in combination with the user markings can not be converted
automatically to a DAWG using the algorithm we described. This is because the
-user markings are not necessarily a single character or word. User markings are
-basically one or more characters. When we add a user marking we insert a
-character that is not in the alphabet and later on we change the marking to a
-kind of subgraph. When this is applied it can be possible that non determinism
-is added to the graph. Non determinism is the fact that a single node has
-multiple edges with the same transition, in practise this means that a word can
-be present in the graph in multiple paths. This is shown in
-Figure~\ref{nddawg} with the following words: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+user markings are not necessarily a single character or word. Currently user
+markings are basically multiple random characters. When we add a user marking,
+we are inserting a kind of subgraph in the place of the node with the marking.
+By doing this we can introduce non determinism to the graph. Non determinism is
+the fact that a single node has multiple edges with the same transition, in
+practise this means it could happen that a word can be present in the graph in
+multiple paths. An example of non determinism in one of our DAWGs is shown in
+Figure~\ref{nddawg}. This figure represents a generated DAWG with the following
+entries: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+
In this graph the word \texttt{abdc} will be accepted and the user pattern
-\texttt{<1>} will be filled with the word \texttt{d}. However if we try the
+\texttt{<1>} will be filled with the subword \texttt{d}. However if we try the
word \texttt{abdddbc} both paths can be chosen. In the first case the user
pattern \texttt{<1>} will be filled with \texttt{dddb} and in the second case
with \texttt{bddd}. In such a case we need to choose the hopefully smartest
\subsection{Minimality and non-determinism}
The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
graphs accepting the same language there is a single graph with the least
-amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
-minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic transitions from nodes and therefore one
-can argue about Myhill-Nerodes theorem and Mihovs proof holding.. Due to the
-nature of the determinism this is not the case and both hold. In reality the
-graph itself is only non-deterministic when expanding the categories and thus
-only during matching.
+amount of states. Mihov\cite{Mihov1998} has proven that the algorithm for
+generating DAWGs is minimal in its original form. Our program converts the
+node-lists to DAWGs that can possibly contain non deterministic transitions
+from nodes and therefore one can argue about Myhill-Nerodes theorem and Mihovs
+proof holding. Due to the nature of the determinism this is not the case and
+both hold. In reality the graph itself is only non-deterministic when expanding
+the categories and thus only during matching.
Choosing the smartest path during matching the program has to choose
deterministically between possibly multiple path with possibly multiple