Merge branch 'master' of github.com:dopefishh/bachelor

author Mart Lubbers <mart@martlubbers.net>

Sun, 8 Mar 2015 08:15:36 +0000 (09:15 +0100)

committer Mart Lubbers <mart@martlubbers.net>

Sun, 8 Mar 2015 08:15:36 +0000 (09:15 +0100)
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index f72b2e3..b98fce0 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -229,9 +229,6 @@ added.
         }
  \end{figure}
  
-\subsection{Minimality of the algorithm}
-
-
  \subsection{Appliance on extraction of patterns}
  The text data in combination with the user markings can not be converted
  automatically to a DAWG using the algorithm we described. This is because the
@@ -263,8 +260,38 @@ The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
  graphs accepting the same language there is a single graph with the least
  amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
  minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic nodes and therefore one can argue about
-the minimality. Due to the nature of the determinism this is not the case. The
-non determinism is only visible when matching the data and not in the real
-graph since in the real graph we ...
+can contain non-deterministic transitions from nodes, and therefore one could
+question whether the Myhill-Nerode theorem and Mihov's proof still hold. Due to
+the nature of this non-determinism they both do. In reality the graph itself
+is only non-deterministic when the categories are expanded, and thus only
+during matching.
+
+During matching the program has to choose deterministically between
+possibly multiple paths, each with possibly different results. There are
+several heuristics to choose from.
+\begin{itemize}
+       \item Maximum fields heuristic\\
+
+               This heuristic prefers the result that has the largest number
+               of categories filled with actual text. Using this method the
+               largest possible number of data fields is filled at all
+               times. The downside of this method is that some data might
+               not be put in the right field, because a suboptimal split
+               occurred that put the data in two separate fields whereas it
+               should be in one field.
+       \item Maximum path heuristic\\
+
+               The maximum path heuristic tries to find a match with the
+               largest number of fixed path transitions. Fixed path
+               transitions are transitions that do not occur within a
+               category. The rationale is that because these paths are hard
+               coded in the graph they must be important. The downside shows
+               when hard coded paths overlap with information within the
+               categories. For example, a band that is called
+               \texttt{Location} could interfere greatly with a hard coded
+               path that marks a location using the same word.
+\end{itemize}
  
+If we knew more about the categories, the maximum path heuristic would
+automatically become the best heuristic. When, as in our implementation, there
+is very little information, both heuristics perform about the same.
diff --git a/thesis2/5.appendices.tex b/thesis2/5.appendices.tex
index 9e2a611..5614e5a 100644
--- a/thesis2/5.appendices.tex
+++ b/thesis2/5.appendices.tex
@@ -30,6 +30,7 @@
                 }
         }
         \caption{Generating DAWGs pseudocode}
+       \label{pseudodawg}
  \end{algorithm}
  
  \section{Schemes}
diff --git a/thesis2/Makefile b/thesis2/Makefile
index ad0c690..3d00019 100644
--- a/thesis2/Makefile
+++ b/thesis2/Makefile
@@ -1,5 +1,5 @@
  SHELL:=/bin/bash
-VERSION:=0.95
+VERSION:=1.0RC1
  
  all: thesis
  
diff --git a/thesis2/abstract.tex b/thesis2/abstract.tex
index e69de29..9f5e594 100644
--- a/thesis2/abstract.tex
+++ b/thesis2/abstract.tex
@@ -0,0 +1,12 @@
+Within the leisure activity field, information is often bundled badly and
+contains empty or wrong data. Hyperleap tries to solve this problem by bundling
+the information from various sources including RSS feeds. Currently the
+feedback loop for fixing site-specific crawlers requires multiple steps which
+demand someone with a computer science background to perform. We introduce a
+new adaptable crawler generation system using subword matching via an adapted
+form of directed acyclic word graphs. The application allows users with no
+particular computer science background to create, edit and test crawlers for
+RSS feeds. In this way the feedback loop for broken crawlers is shortened, new
+sources can be incorporated in the database quicker and, most importantly, the
+information about the latest movie show, theater production or conference will
+reach the people looking for it as fast as possible.
diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex
index 6f94dde..58e6d2e 100644
--- a/thesis2/thesis.tex
+++ b/thesis2/thesis.tex
@@ -1,6 +1,4 @@
-\documentclass[twopage,a4paper,titlepage]{book}
-
-%\usepackage[british]{babel}
+\documentclass[twopage,titlepage]{book}
  
  \usepackage{algorithm2e}
  \usepackage{a4wide}
@@ -14,6 +12,7 @@
  \usepackage{amssymb}
  \usepackage{amsmath}
  \usepackage{marvosym}
+\usepackage{setspace}
  
  % Set listings settings
  \definecolor{mintedbackground}{rgb}{0.95,0.95,0.95}
@@ -52,6 +51,8 @@ leisure activity RSS feeds}
  \author{
         Mart Lubbers\\
         s4109053\\
+       Artificial Intelligence\\
+       Radboud University Nijmegen\\
         \strut\\
         External supervisor: Alessandro Paula\\
         Internal supervisor: Franc Grootjen\\
@@ -66,12 +67,13 @@ leisure activity RSS feeds}
  % Surrogate abstract
  \chapter*{
         \centering 
-       \begin{normalsize}
+       \begin{large}
                 Abstract
-       \end{normalsize}
+       \end{large}
  }
  \begin{quotation}
         \noindent
+       \onehalfspacing
         \input{abstract.tex}
  \end{quotation}
  \clearpage
@@ -80,7 +82,7 @@ leisure activity RSS feeds}
  \input{1.introduction.tex}
  
  \chapter{Requirements and design}
-\input{2.requirementsanddesign}
+\input{2.requirementsanddesign.tex}
  
  \chapter{Algorithm}
  \input{3.methods.tex}
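
Note on the heuristics added to thesis2/3.methods.tex: the two selection rules described there can be sketched roughly as follows. This is only an illustration, not the thesis implementation. The Match structure, its field names, and the example values are assumptions made up for this sketch. It assumes each candidate match records the text captured per category and the number of fixed path transitions taken outside any category.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Match:
    """One candidate result of matching an entry (hypothetical structure)."""
    fields: Dict[str, str]   # category name -> text captured for that category
    fixed_transitions: int   # transitions taken outside any category


def filled_fields(match: Match) -> int:
    """Count the categories that captured non-empty text."""
    return sum(1 for text in match.fields.values() if text.strip())


def maximum_fields(candidates: List[Match]) -> Match:
    """Prefer the candidate with the most categories filled with text."""
    # max() returns the first maximal element, so ties are broken by input
    # order and the choice stays deterministic.
    return max(candidates, key=filled_fields)


def maximum_path(candidates: List[Match]) -> Match:
    """Prefer the candidate that followed the most fixed path transitions."""
    return max(candidates, key=lambda m: m.fixed_transitions)


# Two made-up candidates for the same entry: one splits the text over two
# categories, the other keeps it in one category but follows more fixed paths.
split = Match({"what": "Foo", "where": "Bar"}, fixed_transitions=2)
whole = Match({"what": "Foo Bar", "where": ""}, fixed_transitions=5)
```

With these made-up candidates, maximum_fields picks the split result and maximum_path picks the unsplit one, which mirrors the suboptimal-split downside named for the maximum fields heuristic.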