rcrc1

[bsc-thesis1415.git] / thesis2 / 3.methods.tex
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex

index 250ef7e..5168de1 100644 (file)
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -106,7 +106,8 @@ which is a path graph. We just create a new node for every transition of
  character and we mark the last node as final. From then on all words are added
  using a four step approach described below. Pseudocode for this algorithm can
  be found in Listing~\ref{pseudodawg} named as the function
-\texttt{generate\_dawg(words)}
+\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be found
+in Listing~\ref{dawg.py}.
  \begin{enumerate}
         \item 
                 Say we add word $w$ to the grahp. Step one is finding the
@@ -168,10 +169,10 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
  \texttt{abcd}, \texttt{aecd} and \texttt{aecf}.
  
  \begin{itemize}
-       \item Initial\\
+       \item No words added yet\\
                 Initially we begin with the null graph. This graph is show in
                 the figure as SG0. This DAWG does not yet accept any words.
-       \item \texttt{abcd}\\
+       \item Adding \texttt{abcd}\\
                 Adding the first entry \texttt{abcd} is trivial because we can
                 just create a single path which does not require any hard work.
                 This is because the common prefix we find in Step 1 is empty
@@ -179,7 +180,7 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
                 back into the graph is also not possible since there are no
                 nodes except for the first node.  The result of adding the
                 first word is visible in subgraph SG1.
-       \item \texttt{aecd}\\
+       \item Adding \texttt{aecd}\\
                 For the second entry we will have to do some
                 extra work. The common prefix found in Step 1 is \texttt{a}
                 which we add to the graph. This leaves us in Step 2 with the
@@ -190,7 +191,7 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
                 a common suffix \texttt{cd} and we can merge these nodes. In
                 this way we can reuse the transition from $q_3$ to $q_4$. This
                 leaves us with subgraph SG2.
-       \item \texttt{aecf}\\
+       \item Adding \texttt{aecf}\\
                 We now add the last entry which is the word \texttt{aecf}. When
                 we do this without the confluence node checking we encounter an
                 unwanted extra word. In Step 1 we find the common prefix
@@ -213,7 +214,7 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
  \begin{figure}[H]
         \label{dawg1}
         \centering
-       \includegraphics[width=0.9\linewidth]{inccons.eps}
+       \includegraphics[width=0.8\linewidth]{inccons.eps}
         \strut\\\strut\\
         \caption{Incrementally constructing a DAWG}
  \end{figure}
@@ -221,16 +222,18 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
  \subsection{Appliance on extraction of patterns}
  The text data in combination with the user markings can not be converted
  automatically to a DAWG using the algorithm we described. This is because the
-user markings are not necessarily a single character or word. User markings are
-basically one or more characters. When we add a user marking we insert a
-character that is not in the alphabet and later on we change the marking to a
-kind of subgraph. When this is applied it can be possible that non determinism
-is added to the graph. Non determinism is the fact that a single node has
-multiple edges with the same transition, in practise this means that a word can
-be present in the graph in multiple paths. This is shown in
-Figure~\ref{nddawg} with the following words: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+user markings are not necessarily a single character or word. Currently user
+markings are basically multiple random characters. When we add a user marking,
+we are inserting a kind of subgraph in the place of the node with the marking.
+By doing this we can introduce non determinism to the graph. Non determinism is
+the fact that a single node has multiple edges with the same transition, in
+practise this means it could happen that a word can be present in the graph in
+multiple paths. An example of non determinism in one of our DAWGs is shown in
+Figure~\ref{nddawg}. This figure represents a generated DAWG with the following
+entries: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+
  In this graph the word \texttt{abdc} will be accepted and the user pattern
-\texttt{<1>} will be filled with the word \texttt{d}. However if we try the
+\texttt{<1>} will be filled with the subword \texttt{d}. However if we try the
  word \texttt{abdddbc} both paths can be chosen. In the first case the user
  pattern \texttt{<1>} will be filled with \texttt{dddb} and in the second case
  with \texttt{bddd}. In such a case we need to choose the hopefully smartest
@@ -247,13 +250,13 @@ choice.
  \subsection{Minimality and non-determinism}
  The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
  graphs accepting the same language there is a single graph with the least
-amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
-minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic transitions from nodes and therefore one
-can argue about Myhill-Nerodes theorem and Mihovs proof holding.. Due to the
-nature of the determinism this is not the case and both hold. In reality the
-graph itself is only non-deterministic when expanding the categories and thus
-only during matching. 
+amount of states. Mihov\cite{Mihov1998} has proven that the algorithm for
+generating DAWGs is minimal in its original form. Our program converts the
+node-lists to DAWGs that can possibly contain non deterministic transitions
+from nodes and therefore one can argue about Myhill-Nerodes theorem and Mihovs
+proof holding. Due to the nature of the determinism this is not the case and
+both hold. In reality the graph itself is only non-deterministic when expanding
+the categories and thus only during matching. 
  
  Choosing the smartest path during matching the program has to choose
  deterministically between possibly multiple path with possibly multiple