From: Mart Lubbers
Date: Mon, 15 Dec 2014 11:54:17 +0000 (+0100)
Subject: long method part added
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=9d24cb5b6ec7435b2d5ba86ce3a458c5b30c8c37;p=bsc-thesis1415.git

long method part added
---

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index fc5bd6d..812cff0 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -42,59 +42,127 @@ then added as a word. The nodelists are then sent to the actual algorithm to be
 converted to a graph representation.
 
 \section{Minimizing DAWGs}
-As a representation of the patterns we use slightly altered DAWGs. Normally
-DAWGs have as edgelabels letters from an alphabet, in our case the DAWGs
-alphabet contains all letters, whitespace and punctuation but also the
-specified user markers. DAWGs are a graph but by using them as an automaton we
-can check if a word is accepted by the automaton, or in other words, if the
-word matches the specified pattern. The first algorithm to generate DAWGs from
-node-lists was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an
-incremental approach in generating the graph. Meaning that entry by entry the
-graph was built. The only constraint that the algorithm has is that the entries
-must be sorted lexicographically. Later on Daciuk et al.\cite{Daciuk2000}
-improved on the original algorithm and their algorithm is the algorithm we used
-to minimize or optimize our DAWGs.
+We represent the user-generated patterns as DAWGs by converting the node-lists
+to DAWGs. Normally the edge labels of a DAWG are letters from an alphabet; in our
+case the DAWG's alphabet contains all letters, whitespace and punctuation but
+also the specified user markers. A DAWG is a graph, but by using it as an
+automaton we can check if a word is accepted by the automaton, or in other
+words, if the word matches the specified pattern. The first algorithm to
+generate DAWGs from node-lists was proposed by Hopcroft et
+al.~\cite{Hopcroft1971}.
It is an incremental approach to generating the graph,
+meaning that the graph is built entry by entry. The only constraint of the
+algorithm is that the entries must be sorted lexicographically. Later on
+Daciuk et al.~\cite{Daciuk2000} improved on the original algorithm, and their
+algorithm is the one we used to minimize or optimize our DAWGs.
 
 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
 
+Node-lists are added incrementally to create a graph. For example, the
+subgraphs in Figure~\ref{dawg1} visualize the construction of the
+DAWG from the entries \texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd}.
 
-extended the algorithm and created an incremental one without increasing the
-computational complexity. The non incremental algorithm from Daciuk et al. is
-used to convert the nodelists to a graph.
+In SG0 the graph is only the null graph, described by
+$G=(\{q_0\},\{q_0\},\{\},\{\})$ and does not contain any entries.
 
-For example constructing a graph that from the entry: \textit{a,bc} and
-\textit{a.bc} goes in the following steps:
+Adding the first entry \texttt{a,bc} is not a
+hard task because there is neither a common prefix nor a common suffix, so the entry becomes a
+chain of nodes starting from $q_0$, visualized in subgraph SG1.
 
-\begin{figure}[H]
-    \caption{Sample DAG, first entry}
-    \label{fig:f22}
-    \centering
-    \digraph[]{graph22}{
-        rankdir=LR;
-        1,2,3,5 [shape="circle"];
-        5 [shape="doublecircle"];
-        1 -> 2 [label="a"];
-        2 -> 3 [label="."];
-        3 -> 4 [label="b"];
-        4 -> 5 [label="c"];
-    }
-\end{figure}
+For the second entry \texttt{a.bc} some smart optimization has to be applied.
+The common prefix is \texttt{a} and the common suffix is \texttt{bc}. Therefore
+the first node of the common suffix needs to be copied and split
+off to make room for an alternative route. This is exactly what happens in SG2.
+Node $q_2$ is copied to $q_5$, and the copy gets a connection to node $q_3$.
From the last node of
+the common prefix a route is built towards the first node, that is the copied
+node, of the common suffix, giving the extra path $q_1\rightarrow
+q_5\rightarrow q_3$. This results in subgraph SG2.
+
+Adding the third entry \texttt{a.bd} in the same naive way as the second entry
+results in subgraph SG3 and introduces a bad side effect, namely that a
+word that was never explicitly added, \texttt{a,bd}, is now accepted by the DAWG.
+This is because within the common prefix there is a node that has some other
+arrow leading to it. When a new path is added the algorithm checks the common
+prefix and suffix for confluence nodes. Confluence nodes are nodes that have
+multiple arrows leading in, and because of the multiple arrows they can lead to
+unwanted additional words. The first confluence node found must be copied and
+detached, and the specific suffix from the entry must be copied with it to
+separate the path from the existing paths. Applying this technique in the
+example leads to subgraph SG4. With common prefix \texttt{a.b} and an empty
+common suffix, node $q_3$ is found as a confluence node in the common prefix and
+therefore node $q_3$ is copied to the new node $q_6$, taking the paths leading
+to the final state with it. In this way no new words are added to the DAWG and
+the DAWG is still optimal.
 
 \begin{figure}[H]
-    \caption{Sample DAG, second entry}
-    \label{fig:f23}
+    \caption{Step 1}
+    \label{dawg1}
     \centering
-    \digraph[]{graph23}{
+    \digraph[width=\linewidth]{inccons}{
         rankdir=LR;
-        1,2,3,5,6 [shape="circle"];
-        5 [shape="doublecircle"];
-        1 -> 2 [label="a"];
-        2 -> 3 [label="."];
-        3 -> 4 [label="b"];
-        4 -> 5 [label="c"];
-
-        2 -> 6 [label=","];
-        6 -> 4 [label="b"];
+        n4 [style=invis];
+        q40 [label="q0"];
+        q41 [label="q1"];
+        q42 [label="q2"];
+        q43 [label="q3"];
+        q44 [label="q4" shape=doublecircle];
+        q45 [label="q5"];
+        q46 [label="q6"];
+        n4 -> q40[label="SG4"];
+        q40 -> q41[label="a"];
+        q41 -> q42[label=","];
+        q42 -> q43[label="b"];
+        q43 -> q44[label="c"];
+        q41 -> q45[label="."];
+        q45 -> q46[label="b"];
+        q46 -> q44[label="d"];
+        q46 -> q44[label="c"];
+
+        n3 [style=invis];
+        q30 [label="q0"];
+        q31 [label="q1"];
+        q32 [label="q2"];
+        q33 [label="q3"];
+        q34 [label="q4",shape=doublecircle];
+        q35 [label="q5"];
+        n3 -> q30[label="SG3"];
+        q30 -> q31[label="a"];
+        q31 -> q32[label=","];
+        q32 -> q33[label="b"];
+        q33 -> q34[label="c"];
+        q33 -> q34[label="d",constraint=false];
+        q31 -> q35[label="."];
+        q35 -> q33[label="b"];
+
+        n2 [style=invis];
+        q20 [label="q0"];
+        q21 [label="q1"];
+        q22 [label="q2"];
+        q23 [label="q3"];
+        q24 [label="q4",shape=doublecircle];
+        q25 [label="q5"];
+        n2 -> q20[label="SG2"];
+        q20 -> q21[label="a"];
+        q21 -> q22[label=","];
+        q22 -> q23[label="b"];
+        q23 -> q24[label="c"];
+        q21 -> q25[label="."];
+        q25 -> q23[label="b"];
+
+        n1 [style=invis];
+        q10 [label="q0"];
+        q11 [label="q1"];
+        q12 [label="q2"];
+        q13 [label="q3"];
+        q14 [label="q4",shape=doublecircle];
+        n1 -> q10[label="SG1"];
+        q10 -> q11[label="a"];
+        q11 -> q12[label=","];
+        q12 -> q13[label="b"];
+        q13 -> q14[label="c"];
+
+        n [style=invis];
+        q0 [label="q0"];
+        n -> q0[label="SG0"];
     }
 \end{figure}
 
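
The side effect described for SG3 can be checked mechanically. The following is a minimal sketch, not part of the commit and not the thesis implementation: it copies the edges of subgraphs SG3 and SG4 from the figure source into Python dictionaries and lists every word each automaton accepts. The helper name accepted_words and the dictionary layout are assumptions made for illustration, and the three intended entries are taken to be a,bc, a.bc and a.bd as in the step-by-step description.

```python
def accepted_words(edges, start, finals):
    """Enumerate all words accepted by an acyclic automaton.

    edges maps (source state, edge label) to the target state.
    """
    words = set()
    stack = [(start, "")]
    while stack:
        state, prefix = stack.pop()
        if state in finals:
            words.add(prefix)
        for (src, label), dst in edges.items():
            if src == state:
                stack.append((dst, prefix + label))
    return words


# SG3: third entry added naively, q3 is a confluence node.
SG3 = {
    ("q0", "a"): "q1",
    ("q1", ","): "q2",
    ("q2", "b"): "q3",
    ("q3", "c"): "q4",
    ("q3", "d"): "q4",
    ("q1", "."): "q5",
    ("q5", "b"): "q3",
}

# SG4: confluence node q3 copied to q6 before adding the new suffix.
SG4 = {
    ("q0", "a"): "q1",
    ("q1", ","): "q2",
    ("q2", "b"): "q3",
    ("q3", "c"): "q4",
    ("q1", "."): "q5",
    ("q5", "b"): "q6",
    ("q6", "d"): "q4",
    ("q6", "c"): "q4",
}

entries = {"a,bc", "a.bc", "a.bd"}

# Words accepted but never added: SG3 leaks one, SG4 leaks none.
print(sorted(accepted_words(SG3, "q0", {"q4"}) - entries))  # ['a,bd']
print(sorted(accepted_words(SG4, "q0", {"q4"}) - entries))  # []
```

Enumerating all words is only practical for small examples like this one; it is meant as a sanity check on the figure, not as a replacement for the confluence-node test in the construction algorithm.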