From: Mart Lubbers <mart@martlubbers.net>
Date: Mon, 2 Mar 2015 19:13:02 +0000 (+0100)
Subject: v0.95
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=3144ae7c010d7f7fd069f190ca4bf0360acda928;p=bsc-thesis1415.git

v0.95
---

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index be8c4b4..b98fce0 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -229,9 +229,6 @@ added.
 	}
 \end{figure}
 
-\subsection{Minimality of the algorithm}
-
-
 \subsection{Appliance on extraction of patterns}
 The text data in combination with the user markings can not be converted
 automatically to a DAWG using the algorithm we described. This is because the
@@ -254,6 +251,7 @@ choice.
 	\label{nddawg}
 	\centering
 	\includegraphics[width=\linewidth]{nddawg.eps}
+	\strut\\
 	\caption{Example non determinism}
 \end{figure}
 
@@ -262,8 +260,38 @@ The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
 graphs accepting the same language there is a single graph with the least
 amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
 minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic nodes and therefore one can argue about
-the minimality. Due to the nature of the determinism this is not the case. The
-non determinism is only visible when matching the data and not in the real
-graph since in the real graph we ...
+can contain non-deterministic transitions from nodes, so one might question
+whether the Myhill-Nerode theorem and Mihov's proof still hold. Due to the
+nature of the non-determinism they both do. The graph itself only becomes
+non-deterministic when the categories are expanded, and thus only during
+matching.
+
+To choose the best path during matching, the program has to choose
+deterministically between possibly multiple paths with possibly multiple
+results. There are several heuristics to choose from.
+\begin{itemize}
+	\item Maximum fields heuristic\\
+
+		This heuristic prefers the result that has the highest number
+		of categories filled with actual text. With this method the
+		largest possible number of data fields is filled at all
+		times. The downside is that some data might not end up in the
+		right field, because a suboptimal split can put data in two
+		separate fields where it should have been placed in a single
+		field.
+	\item Maximum path heuristic\\
+
+		The maximum path heuristic tries to find a match with the
+		highest number of fixed path transitions. Fixed path
+		transitions are transitions that do not occur within a
+		category. The philosophy behind it is that, because these
+		paths are hard coded in the graph, they must be important. The
+		downside of this method shows when hard coded paths overlap
+		with information within the categories. For example, a band
+		called \texttt{Location} could interfere greatly with a hard
+		coded path that marks a location using the same words.
+\end{itemize}
 
+If we knew more about the categories, the maximum path heuristic would
+automatically become the best heuristic. When, as in our implementation, there
+is very little information, both heuristics perform about the same.