people about the events. Different venues display their, often incomplete,
information in entirely different ways. Because of this, converting raw
information from venues into structured, consistent data is a challenging and
relevant problem.
\section{HyperLeap}
Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company specialized
in infotainment (information + entertainment). It administers several websites
that bundle information about entertainment in a way that is as ordered and
complete as possible. Right now, most of the input data is added to the
database by hand, which is very labor intensive. Therefore Hyperleap is
looking for a smart solution to automate part of the data injection into the
database; the crux, however, is that the system must not be too complicated
from the outside and must be usable by a non-IT professional (NIP).
\section{Research question and practical goals}
This brings up the main research question: \textit{How can we make an adaptive
system which is able to transform raw data into structured data?}\\
In practice the goal of the project is to create an application that can,
with NIP input, produce computer-parsable patterns with which a separate
crawler can periodically crawl the sources. The NIP has to be able to enter
the information about the data source in a user-friendly interface which sends
the information together with the data source to the data processing
application. The data processing application then in turn processes the data
into an extraction
pattern which is sent to the crawler. The crawler can visit sources specified
by the NIP, accompanied by the extraction pattern created by the data
processing application. This workflow is described in graph~\ref{fig:ig1}.
\begin{figure}[H]
\centering
	\caption{Workflow within the applications}
\label{fig:ig1}
\includegraphics[width=150mm]{./dots/graph3.png}
\end{figure}
to extract the underlying structure rather than to extract the substructures.
The project is in principle a continuation of a past project done by Wouter
Roelofs\cite{Roelofs2009} which was also supervised by Franc Grootjen and
Alessandro Paula; however, it was never taken out of the experimental phase.
The techniques described by Roelofs et al. are more focused on extracting data
from substructures, so they can serve as an addition to the current project.\\
As a very important side note, the crawler needs to notify the administrators
if a source has become problematic to crawl; in this way the NIP can easily
retrain the application to fit the latest structural patterns.
The program can be divided into three components: the input application, the
data processing application and the crawler. These applications have separate
tasks within the workflow: the input application defines, together with the
NIP, the patterns for the source; the data processing application compiles the
patterns it is given by the input application into computer-interpretable
patterns; and the crawler interprets those patterns and visits the sources
from time to time to extract the information.

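To make this division concrete, the sketch below outlines in Python the kind
of data that could flow between the three components. All names and fields are
hypothetical illustrations of the workflow described above, not Hyperleap's
actual interfaces.
\begin{verbatim}
# Hypothetical sketch of the three components and the data passed
# between them; names and fields are illustrative only.
from dataclasses import dataclass
from typing import List

@dataclass
class SourceDescription:
    """What the NIP enters in the input application."""
    url: str                       # where the raw data lives
    annotated_examples: List[str]  # example entries marked up by the NIP

@dataclass
class ExtractionPattern:
    """What the data processing application hands to the crawler."""
    source_url: str
    pattern: str                   # computer-interpretable pattern

def process(description: SourceDescription) -> ExtractionPattern:
    # Placeholder: the real application would generalize the annotated
    # examples into a pattern (for example a DAG/FSA, see below).
    return ExtractionPattern(description.url,
                             "|".join(description.annotated_examples))

def crawl(pattern: ExtractionPattern) -> List[str]:
    # Placeholder: fetch the source, apply the pattern and return the
    # matching entries; on repeated failure the crawler should notify
    # the administrators so the NIP can retrain the application.
    return []
\end{verbatim}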
\section{Input application}
The purpose of the input application is to define the patterns together with
the user so that the information can be transferred to the data processing
application.
All user input goes through the familiar interface of the user's preferred
web browser. By visiting the crawler's training website the user can specify
the metadata of the source they want to be periodically crawled through simple web
\end{figure}
\section{Data processing application}
\subsection{Directed acyclic graphs and finite state automata}
Directed acyclic graphs (DAG) and finite state automata (FSA) have a lot in
common concerning pattern recognition and information extraction. By feeding
words\footnote{A word is a finite combination of letters from the graph's
alphabet; a word is thus not limited to linguistic words but can be anything
as long as its components are in the graph's alphabet.} into an algorithm, a
DAG can be generated so that it matches certain patterns present in the given
words. Figure~\ref{fig:mg1} for example shows an FSA that matches the words
\textit{ab} and \textit{ac}.
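As a minimal illustration of this idea, the sketch below builds, in Python, a
trie-shaped automaton from a list of words and uses it to match input. The
code is purely illustrative and not part of the project itself; a minimal DAG
would additionally merge equivalent states.
\begin{verbatim}
# Illustrative sketch: build an automaton that matches exactly the
# given words, here "ab" and "ac" as in the example figure.

def build_fsa(words):
    # States are dicts mapping a letter to the next state; the special
    # key "accept" marks accepting states.
    start = {}
    for word in words:
        state = start
        for letter in word:
            state = state.setdefault(letter, {})
        state["accept"] = True
    return start

def matches(fsa, word):
    # Follow the transition for each letter; accept only when the word
    # ends in an accepting state.
    state = fsa
    for letter in word:
        if letter not in state:
            return False
        state = state[letter]
    return state.get("accept", False)

fsa = build_fsa(["ab", "ac"])
print(matches(fsa, "ab"), matches(fsa, "ac"), matches(fsa, "ad"))
# -> True True False
\end{verbatim}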
\begin{figure}[H]
\centering
\caption{Example DAG/FSA}