converted to a graph representation.
\section{Minimizing DAWGs}
-As a representation of the patterns we use slightly altered DAWGs. Normally
-DAWGs have as edgelabels letters from an alphabet, in our case the DAWGs
-alphabet contains all letters, whitespace and punctuation but also the
-specified user markers. DAWGs are a graph but by using them as an automaton we
-can check if a word is accepted by the automaton, or in other words, if the
-word matches the specified pattern. The first algorithm to generate DAWGs from
-node-lists was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an
-incremental approach in generating the graph. Meaning that entry by entry the
-graph was built. The only constraint that the algorithm has is that the entries
-must be sorted lexicographically. Later on Daciuk et al.\cite{Daciuk2000}
-improved on the original algorithm and their algorithm is the algorithm we used
-to minimize or optimize our DAWGs.
+We represent the user-generated patterns as DAWGs by converting the node-lists
+to DAWGs. Normally the edge labels of a DAWG are letters from an alphabet; in
+our case the alphabet contains all letters, whitespace and punctuation, but
+also the user-specified markers. A DAWG is a graph, but by using it as an
+automaton we can check whether a word is accepted by the automaton or, in other
+words, whether the word matches the specified pattern. The first algorithm to
+generate DAWGs from node-lists was proposed by Hopcroft et
+al.~\cite{Hopcroft1971}. It is an incremental approach to generating the graph,
+meaning that the graph is built entry by entry. The only constraint of the
+algorithm is that the entries must be sorted lexicographically. Later on,
+Daciuk et al.~\cite{Daciuk2000} improved on the original algorithm, and theirs
+is the algorithm we used to minimize our DAWGs.
Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
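To make the idea of using the graph as an automaton concrete, the following is a minimal sketch of the acceptance check. The transition table, the state names and the function `accepts` are hypothetical and chosen only for illustration; they describe a small graph that accepts the two words `a,bc` and `a.bc`, and are not taken from the actual implementation.

```python
# Hypothetical transition table: (state, symbol) -> next state.
# It encodes a small graph accepting exactly "a,bc" and "a.bc".
TRANSITIONS = {
    ("q0", "a"): "q1",
    ("q1", ","): "q2",
    ("q1", "."): "q5",
    ("q2", "b"): "q3",
    ("q5", "b"): "q3",
    ("q3", "c"): "q4",
}
START = "q0"
FINAL = frozenset({"q4"})


def accepts(word, transitions=TRANSITIONS, start=START, final=FINAL):
    """Return True if following the word's symbols ends in a final state."""
    state = start
    for symbol in word:
        state = transitions.get((state, symbol))
        if state is None:
            # No outgoing edge with this label: the word is rejected.
            return False
    return state in final


if __name__ == "__main__":
    for word in ["a,bc", "a.bc", "a.bd", "a,b"]:
        print(word, accepts(word))
```

A word is accepted only if every symbol has a matching edge and the walk ends in a final state, so a proper prefix such as `a,b` is rejected as well.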
+Node-lists are added incrementally to create the graph. For example, the
+subgraphs in Figure~\ref{dawg1} visualize the construction of the DAWG from the
+entries \texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd}.
-extended the algorithm and created an incremental one without increasing the
-computational complexity. The non incremental algorithm from Daciuk et al. is
-used to convert the nodelists to a graph.
+In SG0 the graph is only the null graph, described by
+$G=(\{q_0\},\{q_0\},\{\},\{\})$, and does not contain any entries.
-For example constructing a graph that from the entry: \textit{a,bc} and
-\textit{a.bc} goes in the following steps:
+Adding the first entry \texttt{a,bc} is not a hard task: there is neither a
+common prefix nor a common suffix, so the entry becomes a chain of nodes
+starting from $q_0$, as visualized in subgraph SG1.
-\begin{figure}[H]
- \caption{Sample DAG, first entry}
- \label{fig:f22}
- \centering
- \digraph[]{graph22}{
- rankdir=LR;
- 1,2,3,5 [shape="circle"];
- 5 [shape="doublecircle"];
- 1 -> 2 [label="a"];
- 2 -> 3 [label="."];
- 3 -> 4 [label="b"];
- 4 -> 5 [label="c"];
- }
-\end{figure}
+For the second entry \texttt{a.bc} some optimization has to be applied.
+The common prefix is \texttt{a} and the common suffix is \texttt{bc}. Therefore
+the node at which the common suffix starts needs to be copied and split off to
+make room for an alternative route. Node $q_2$ is copied to the new node $q_5$,
+which gets a connection to node $q_3$. From the last node of the common prefix,
+$q_1$, a route is built towards the copied node, resulting in the extra path
+$q_1\rightarrow q_5\rightarrow q_3$. This results in subgraph SG2.
+
+Adding the third entry \texttt{a.bd} in the same naive way as the second entry
+results in subgraph SG3 and introduces a bad side effect: a word that was
+never added, \texttt{a,bd}, is now accepted by the DAWG. This is because
+within the common prefix there is a node that has another arrow leading to it.
+Therefore, when a new path is added, the algorithm checks the common prefix and
+suffix for confluence nodes. Confluence nodes are nodes that have multiple
+arrows leading in; because of the multiple arrows they can lead to unwanted
+additional words. The first confluence node found must be copied and detached,
+and the specific suffix from the entry must be copied with it to separate the
+path from the existing paths. Applying this technique in the example leads to
+subgraph SG4. With common prefix \texttt{a.b} and an empty common suffix, node
+$q_3$ is found as a confluence node in the common prefix. Therefore node $q_3$
+is copied to the new node $q_6$, taking the paths leading to the final state
+with it. In this way no new words are added to the DAWG and the DAWG is still
+optimal.
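As an illustration of incremental construction, the sketch below builds a minimal DAWG from lexicographically sorted entries. It follows the general idea of the sorted-input variant described by Daciuk et al.: a state is only merged with an equivalent registered state once no later entry can pass through it, so in this variant no confluence node ever has to be copied. The names `Node`, `build_dawg`, `accepts` and `count_states` are hypothetical, and the code is a sketch under these assumptions, not the implementation used in this work.

```python
class Node:
    """A DAWG state: outgoing edges by label plus a final flag."""

    def __init__(self):
        self.edges = {}
        self.final = False

    def key(self):
        # Two states are equivalent if they agree on finality and on
        # their (label, target) pairs; targets are already canonical.
        return (self.final,
                tuple(sorted((c, id(n)) for c, n in self.edges.items())))


def build_dawg(words):
    """Build a minimal DAWG from words given in sorted order."""
    root = Node()
    register = {}    # key -> canonical state
    unchecked = []   # (parent, label, child) on the path of the last word
    previous = ""

    def minimize(down_to):
        # Replace states on the unchecked path by registered equivalents.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            k = child.key()
            if k in register:
                parent.edges[label] = register[k]
            else:
                register[k] = child

    for word in words:
        if word < previous:
            raise ValueError("entries must be sorted lexicographically")
        common = 0
        for a, b in zip(word, previous):
            if a != b:
                break
            common += 1
        minimize(common)
        node = unchecked[-1][2] if unchecked else root
        for label in word[common:]:
            child = Node()
            node.edges[label] = child
            unchecked.append((node, label, child))
            node = child
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    node = root
    for label in word:
        node = node.edges.get(label)
        if node is None:
            return False
    return node.final


def count_states(root):
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.edges.values())
    return len(seen)


if __name__ == "__main__":
    dawg = build_dawg(["a,bc", "a.bc", "a.bd"])
    print(count_states(dawg), accepts(dawg, "a,bd"))
```

For the three example entries the resulting automaton has seven states, the same number as subgraph SG4, and it does not accept the unwanted word `a,bd`.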
\begin{figure}[H]
- \caption{Sample DAG, second entry}
- \label{fig:f23}
+ \caption{Incremental construction of the DAWG, subgraphs SG0 to SG4}
+ \label{dawg1}
\centering
- \digraph[]{graph23}{
+ \digraph[width=\linewidth]{inccons}{
rankdir=LR;
- 1,2,3,5,6 [shape="circle"];
- 5 [shape="doublecircle"];
- 1 -> 2 [label="a"];
- 2 -> 3 [label="."];
- 3 -> 4 [label="b"];
- 4 -> 5 [label="c"];
-
- 2 -> 6 [label=","];
- 6 -> 4 [label="b"];
+ n4 [style=invis];
+ q40 [label="q0"];
+ q41 [label="q1"];
+ q42 [label="q2"];
+ q43 [label="q3"];
+ q44 [label="q4",shape=doublecircle];
+ q45 [label="q5"];
+ q46 [label="q6"];
+ n4 -> q40[label="SG4"];
+ q40 -> q41[label="a"];
+ q41 -> q42[label=","];
+ q42 -> q43[label="b"];
+ q43 -> q44[label="c"];
+ q41 -> q45[label="."];
+ q45 -> q46[label="b"];
+ q46 -> q44[label="d"];
+ q46 -> q44[label="c"];
+
+ n3 [style=invis];
+ q30 [label="q0"];
+ q31 [label="q1"];
+ q32 [label="q2"];
+ q33 [label="q3"];
+ q34 [label="q4",shape=doublecircle];
+ q35 [label="q5"];
+ n3 -> q30[label="SG3"];
+ q30 -> q31[label="a"];
+ q31 -> q32[label=","];
+ q32 -> q33[label="b"];
+ q33 -> q34[label="c"];
+ q33 -> q34[label="d",constraint=false];
+ q31 -> q35[label="."];
+ q35 -> q33[label="b"];
+
+ n2 [style=invis];
+ q20 [label="q0"];
+ q21 [label="q1"];
+ q22 [label="q2"];
+ q23 [label="q3"];
+ q24 [label="q4",shape=doublecircle];
+ q25 [label="q5"];
+ n2 -> q20[label="SG2"];
+ q20 -> q21[label="a"];
+ q21 -> q22[label=","];
+ q22 -> q23[label="b"];
+ q23 -> q24[label="c"];
+ q21 -> q25[label="."];
+ q25 -> q23[label="b"];
+
+ n1 [style=invis];
+ q10 [label="q0"];
+ q11 [label="q1"];
+ q12 [label="q2"];
+ q13 [label="q3"];
+ q14 [label="q4",shape=doublecircle];
+ n1 -> q10[label="SG1"];
+ q10 -> q11[label="a"];
+ q11 -> q12[label=","];
+ q12 -> q13[label="b"];
+ q13 -> q14[label="c"];
+
+ n [style=invis];
+ q0 [label="q0"];
+ n -> q0[label="SG0"];
}
\end{figure}
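The difference between subgraphs SG3 and SG4 can be checked mechanically by enumerating all words each graph accepts. The sketch below transcribes the edges of both subgraphs from the figure into transition tables; the helper `language` and the table names are hypothetical and serve only as an illustration.

```python
def language(transitions, start, finals):
    """Enumerate every word accepted by an acyclic automaton."""
    words = set()
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            words.add(prefix)
        for (source, symbol), target in transitions.items():
            if source == state:
                stack.append((target, prefix + symbol))
    return words


# Edges of subgraph SG3: the naive insertion of "a.bd".
SG3 = {
    ("q0", "a"): "q1",
    ("q1", ","): "q2",
    ("q2", "b"): "q3",
    ("q3", "c"): "q4",
    ("q3", "d"): "q4",
    ("q1", "."): "q5",
    ("q5", "b"): "q3",
}

# Edges of subgraph SG4: the confluence node q3 copied to q6.
SG4 = {
    ("q0", "a"): "q1",
    ("q1", ","): "q2",
    ("q2", "b"): "q3",
    ("q3", "c"): "q4",
    ("q1", "."): "q5",
    ("q5", "b"): "q6",
    ("q6", "d"): "q4",
    ("q6", "c"): "q4",
}

if __name__ == "__main__":
    print(sorted(language(SG3, "q0", {"q4"})))
    print(sorted(language(SG4, "q0", {"q4"})))
```

The only word in the language of SG3 that is missing from the language of SG4 is `a,bd`, which is exactly the entry that was never added.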