From: Mart Lubbers
Date: Mon, 2 Mar 2015 19:13:02 +0000 (+0100)
Subject: v0.95
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=3144ae7c010d7f7fd069f190ca4bf0360acda928;p=bsc-thesis1415.git

v0.95
---

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index be8c4b4..b98fce0 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -229,9 +229,6 @@ added.
 	}
 \end{figure}
 
-\subsection{Minimality of the algorithm}
-
-
 \subsection{Appliance on extraction of patterns}
 The text data in combination with the user markings can not be converted
 automatically to a DAWG using the algorithm we described. This is because the
@@ -254,6 +251,7 @@ choice.
 	\label{nddawg}
 	\centering
 	\includegraphics[width=\linewidth]{nddawg.eps}
+	\strut\\
 	\caption{Example non determinism}
 \end{figure}
 
@@ -262,8 +260,38 @@ The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
 graphs accepting the same language there is a single graph with the least
 amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
 minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic nodes and therefore one can argue about
-the minimality. Due to the nature of the determinism this is not the case. The
-non determinism is only visible when matching the data and not in the real
-graph since in the real graph we ...
+can possibly contain non-deterministic transitions from nodes and therefore one
+can argue whether the Myhill-Nerode theorem and Mihov's proof still hold. Due to
+the nature of the non-determinism both do hold. In reality the graph itself is
+only non-deterministic when expanding the categories and thus only during
+matching.
+
+To choose the smartest path during matching, the program has to choose
+deterministically between possibly multiple paths with possibly multiple
+results. There are several heuristics to choose from.
+\begin{itemize}
+	\item Maximum fields heuristic\\
+
+		This heuristic prefers the result that has the highest number
+		of categories filled with actual text. Using this method the
+		highest number of data fields is filled at all times. The
+		downside of this method is that some data might not be put in
+		the right field because a suboptimal split has occurred that
+		puts the data in two separate fields whereas it should be in
+		one field.
+	\item Maximum path heuristic\\
+
+		The maximum path heuristic tries to find a match with the
+		highest number of fixed path transitions. Fixed path
+		transitions are transitions that do not occur within a
+		category. The philosophy behind it is that because the paths
+		are hard coded in the graph they must be important. The
+		downside of this method shows when overlap occurs between hard
+		coded paths and information within the categories. For example
+		a band that is called \texttt{Location} could interfere greatly
+		with a hard coded path that marks a location using the same words.
+\end{itemize}
+If we knew more about the categories, the best heuristic would automatically
+become the maximum path heuristic. When, as in our implementation, there is
+very little information, both heuristics perform about the same.
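The two heuristics added in this patch can be illustrated with a minimal Python sketch. It is not taken from the thesis code: the `MatchResult` type, its `fields` and `fixed_transitions` attributes, the category names `what` and `where`, and the sample candidates are all hypothetical and only serve to show how a deterministic choice between several match results could be made.

```python
from dataclasses import dataclass, field


@dataclass
class MatchResult:
    """One candidate result of matching an entry (hypothetical type)."""
    # category name -> text captured for that category
    fields: dict = field(default_factory=dict)
    # number of transitions followed outside any category (assumed bookkeeping)
    fixed_transitions: int = 0


def filled_fields(result):
    """Count the categories that contain actual, non-blank text."""
    return sum(1 for text in result.fields.values() if text.strip())


def maximum_fields(candidates):
    """Prefer the candidate with the most filled categories."""
    return max(candidates, key=filled_fields)


def maximum_path(candidates):
    """Prefer the candidate that followed the most fixed path transitions."""
    return max(candidates, key=lambda c: c.fixed_transitions)


# Made-up candidates for a single entry: the first reads "Location" as a
# band name, the second reads it as part of a hard coded path.
candidates = [
    MatchResult({"what": "Location", "where": "Paradiso"}, fixed_transitions=1),
    MatchResult({"what": "", "where": "Paradiso"}, fixed_transitions=2),
]

by_fields = maximum_fields(candidates)
by_path = maximum_path(candidates)
```

In this sketch the two heuristics disagree on the sample data, which mirrors the `Location` overlap described in the patch. Python's `max` returns the first maximal element, so ties are resolved by candidate order and the choice stays deterministic.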