From c61463256df7ce79738c659081c3d50b676ae7a0 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Fri, 1 Aug 2014 20:55:49 +0200
Subject: [PATCH] final commit before laptop

---
 thesis/introduction.tex | 32 ++++++++++++++++----------------
 thesis/methods.tex      | 26 ++++++++++++++++++++------
 thesis/thesis.tex       |  1 +
 3 files changed, 37 insertions(+), 22 deletions(-)

diff --git a/thesis/introduction.tex b/thesis/introduction.tex
index 5d19ce3..553f1dc 100644
--- a/thesis/introduction.tex
+++ b/thesis/introduction.tex
@@ -3,17 +3,17 @@ Within the entertainment business there is no consistent style of informing
 people about the events. Different venues display their, often incomplete,
 information in entirely different ways. Because of this, converting raw
 information from venues to structured consistent data is a challenging and,
-relevant problem.
+a relevant problem.
 
 \section{HyperLeap}
-Hyperleap is a small company that is specialized in infotainment
-(information + entertainment) and administrates several websites which bundle
-information about entertainment in an ordered way and as complete as possible.
-Right now, most of the input data is added to the database by by hand which is
-very labor intensive. Therefore Hyperleap is looking for a smart solution to
-automate a part of the data injection in the database, the crux however is that
-the system must not be too complicated from the outside and be useable for a
-non IT professional(NIP).
+Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company that is
+specialized in infotainment (information + entertainment) and administrates
+several websites which bundle information about entertainment in an ordered way
+and as complete as possible. Right now, most of the input data is added to the
+database by by hand which is very labor intensive. Therefore Hyperleap is
+looking for a smart solution to automate a part of the data injection in the
+database, the crux however is that the system must not be too complicated from
+the outside and be usable for a non IT professional(NIP).
 
 \section{Research question and practical goals}
 This brings up the main research question: \textit{How can we make an adaptive,
@@ -21,18 +21,18 @@ autonomous and programmable data mining program that can be set up by a NIP
 which is able to transform raw data into structured data.}\\
 
 In practice the goal and aim of the project is to create an application that
-can, with NIP input, give computer parseable patterns which a separate crawler
+can, with NIP input, give computer parsable patterns which a separate crawler
 can periodically crawl. The NIP has to be able to enter the information about
 the data source in a user friendly interface which sends the information
 together with the data source to the data processing application. The
-dataprocessing application then in turn processes the data into a extraction
+data processing application then in turn processes the data into a extraction
 pattern which is sent to the crawler. The crawler can visit sources specified
-by the NIP accompanied by the extraction pattern created by the dataprocessing
-application. This workflow is described in graph~\ref{fig:ig1}.
+by the NIP accompanied by the extraction pattern created by the data processing
+application. This work flow is described in graph~\ref{fig:ig1}.
 
 \begin{figure}[H]
 	\centering
-	\caption{Workflow within the applications}
+	\caption{Work flow within the applications}
 	\label{fig:ig1}
 	\includegraphics[width=150mm]{./dots/graph3.png}
 \end{figure}
@@ -42,11 +42,11 @@ sources without too much technical knowledge. The main goal of this project is
 to extract the underlying structure rather then to extract the substructures.
 The project is in principle a continuation of a past project done by Wouter
 Roelofs\cite{Roelofs2009} which was also supervised by Franc Grootjen and
-Alessandro Paula, however it was neven taken out of the experimental phase. The
+Alessandro Paula, however it was never taken out of the experimental phase. The
 techniques described by Roelofs et al. are more focussed on extracting data
 from substructures so it can be an addition to the current project.\\
 
-As a very important sidenote, the crawler needs to notify the administrators if
+As a very important side note, the crawler needs to notify the administrators if
 a source has become problematic to crawl, in this way the NIP can very easily
 retrain the application to fit the latest structural patterns.
 
diff --git a/thesis/methods.tex b/thesis/methods.tex
index 7919c4e..05bc2ed 100644
--- a/thesis/methods.tex
+++ b/thesis/methods.tex
@@ -1,4 +1,15 @@
+The program can be divided into three components: input, data processing and
+the crawler. The applications have separate tasks within the workflow, the
+input application defines together with the NIP the patterns for the source,
+the data processing application processes the patterns it is given by the input
+application and compiles them into computer interpretable patterns and the
+crawler interprets the patterns and visits the sources from time to time to
+extract the information.
+
 \section{Input application}
+The purpose of the input application is to define the patterns together with
+the user so that the information can be transferred to the data processing
+application.
 The user input all goes through the familiar interface of the user's preferred
 web browser. By visiting the crawler's train website the user can specify the
 metadata of the source it wants to be periodically crawled through simple web
@@ -11,12 +22,15 @@ forms as seen in figure~\ref{fig:mf1}
 \end{figure}
 
 \section{Data processing application}
-\subsection{Directed acyclic graphs and finiti automata}
-Directed acyclic graphs(DAG) and finite state automaton(FSA) have a lot in
-common concerning pattern recognition and information extraction. By feeding
-words into an algorithm a DAG can be generated so that it matches certain
-patters present in the given words. Figure~\ref{fig:mg1} for example shows a
-FSA that matches on the words \textit{ab} and \textit{ac}.
+\subsection{Directed acyclic graphs and finite state automata} Directed acyclic
+graphs(DAG) and finite state automata(FSA) have a lot in common concerning
+pattern recognition and information extraction. By feeding words\footnote{A
+word is a finite combination of letters from the graphs alphabet, thus a word
+is not limited to linguistic words but can be anything as long as the
+components are in the graphs alphabet} into an algorithm a DAG can be generated
+so that it matches certain patters present in the given words.
+Figure~\ref{fig:mg1} for example shows a FSA that matches on the words
+\textit{ab} and \textit{ac}.
 \begin{figure}[H]
 	\centering
 	\caption{Example DAG/FSA}
diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index 95d6898..98ff026 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -4,6 +4,7 @@
 \usepackage{graphicx}
 \usepackage{float}
 \usepackage{listings}
+\usepackage{hyperref}
 
 \lstset{
 	basicstyle=\scriptsize,
-- 
2.20.1