Within the entertainment business there is no consistent style of informing
people about events. Different venues display their, often incomplete,
information in entirely different ways. Because of this, converting raw
information from venues into structured, consistent data is a challenging and
relevant problem.
\section{Hyperleap}
Hyperleap is a small company specialized in infotainment
(information + entertainment) that administrates several websites which bundle
information about entertainment in an ordered and as complete as possible way.
Right now, most of the input data is added to the database by hand, which is
very labor intensive. Therefore Hyperleap is looking for a smart solution to
automate part of the data injection into the database. The crux, however, is
that the system must not be too complicated from the outside and must be
usable for a non-IT professional (NIP).
\section{Research question and practical goals}
This brings up the main research question: \textit{How can we make an adaptive,
autonomous and programmable data mining program, which can be set up by a NIP,
that is able to transform raw data into structured data?}\\
In practice the goal and aim of the project is to create an application that
can, with NIP input, produce computer-parseable patterns which a separate
crawler can use to periodically crawl sources. The NIP has to be able to enter
the information about the data source in a user-friendly interface, which
sends this information together with the data source to the data processing
application. The data processing application in turn processes the data into
an extraction pattern, which is sent to the crawler. The crawler can then
visit the sources specified by the NIP, accompanied by the extraction pattern
created by the data processing application. This workflow is described in
graph~\ref{fig:ig1}.

\begin{figure}[H]
	\centering
	\caption{Workflow within the applications}
	\label{fig:ig1}
	\includegraphics[width=150mm]{./dots/graph3.png}
\end{figure}

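To make this workflow concrete, the hand-offs between the three applications can be sketched in Python. This is a minimal illustration, not the project's actual design: the class, field names, and pattern format are all hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass
class SourceMetadata:
    """Information the NIP enters about a data source (hypothetical fields)."""
    url: str
    crawl_interval_hours: int
    marked_examples: list  # components the NIP marked in an example entry

def build_extraction_pattern(meta: SourceMetadata) -> str:
    """Data processing application: generalise the NIP-marked examples into a
    machine-readable extraction pattern (placeholder logic)."""
    return "|".join(meta.marked_examples)

def crawl(meta: SourceMetadata, pattern: str) -> str:
    """Crawler application: would periodically fetch meta.url and apply the
    pattern; this stub only reports what it would do."""
    return f"crawl {meta.url} every {meta.crawl_interval_hours}h with <{pattern}>"

meta = SourceMetadata("http://example.com/agenda", 24, ["what", "when", "where"])
print(crawl(meta, build_extraction_pattern(meta)))
```

The point of the separation is that only `build_extraction_pattern` needs to understand the NIP's markings; the crawler merely applies whatever pattern it receives.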
In this way the NIP can train the crawler to periodically crawl different data
sources without much technical knowledge. The main goal of this project is to
extract the underlying structure rather than the substructures. The project is
in principle a continuation of a past project done by Wouter
Roelofs\cite{Roelofs2009}, which was also supervised by Franc Grootjen and
Alessandro Paula; however, it was never taken out of the experimental phase.
The techniques described by Roelofs et al. are more focused on extracting data
from substructures, so they can be an addition to the current project.\\

As a very important side note, the crawler needs to notify the administrators
if a source has become problematic to crawl; in this way the NIP can easily
retrain the application to fit the latest structural patterns.
\section{Scientific relevance}
Currently the techniques for conversion from non-structured data to structured
\section{Input application}
The user input all goes through the familiar interface of the user's preferred
web browser. By visiting the crawler's training website, the user can specify
the metadata of the source that needs to be periodically crawled, through
simple web forms as seen in figure~\ref{fig:mf1}.
\begin{figure}[H]
	\centering
	\caption{Webforms for source metadata}
	\label{fig:mf1}
	\includegraphics[width=80mm]{./img/img1.png}
\end{figure}

\section{Data processing application}
\subsection{Directed acyclic graphs and finite automata}
Directed acyclic graphs (DAG) and finite state automata (FSA) have a lot in
common concerning pattern recognition and information extraction. By feeding
words into an algorithm, a DAG can be generated so that it matches certain
patterns present in the given words. Figure~\ref{fig:mg1} for example shows a
DAG that matches the patterns in the words it was generated from. And with a
little adaptation we can extract dynamic information from semi-structured
data.\\
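As an illustration of this idea, the sketch below builds a prefix-tree style graph from a set of example words and checks whether new words match it. This is an assumption-laden simplification, not the project's actual algorithm: merging equivalent nodes, which would turn the tree into a true minimal DAG, is omitted for brevity.

```python
def add_word(dag, word):
    """Insert a word, creating one node (a dict) per letter along the path."""
    node = dag
    for letter in word:
        node = node.setdefault(letter, {})
    node["$"] = True  # marker for an accepting node

def matches(dag, word):
    """Check whether the graph accepts the given word."""
    node = dag
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return "$" in node

dag = {}
for w in ["abd", "abe", "acd"]:
    add_word(dag, w)

print(matches(dag, "abe"))  # a word the graph was built from
print(matches(dag, "abx"))  # a word it was not built from
```

With nodes merged where their suffixes coincide, the same structure generalizes beyond the training words, which is the property the project relies on.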

\subsection{Back to DAGs and FSAs}
Nodes in this data structure can be single letters but also bigger
constructions. The example in Figure~\ref{fig:mg2} describes different
separator patterns for event data with its three components: what, when,
where. In this example the nodes with the labels \textit{what, when, where}
can also be complete subgraphs. In this way data on a larger level can be
extracted using the NIP markings, and data within the categories can be
processed autonomously.
\begin{figure}[H]
	\centering
	\caption{Example event data}
	\label{fig:mg2}
	\includegraphics[width=\linewidth]{./dots/graph2.png}
\end{figure}
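Assuming a separator pattern has already been derived from the NIP's markings, applying it to a raw entry could look like the sketch below. The ` | ` separator and the entry format are hypothetical examples, not an actual venue's format.

```python
import re

# Hypothetical learned pattern: the three components separated by " | ".
pattern = re.compile(r"(?P<what>.+?) \| (?P<when>.+?) \| (?P<where>.+)")

def extract_event(raw):
    """Split a raw entry into its what/when/where components, or return None
    if the entry does not fit the learned separator pattern."""
    m = pattern.match(raw)
    return m.groupdict() if m else None

print(extract_event("Jazz night | 2024-05-01 20:00 | Nijmegen"))
```

An entry that does not fit the pattern yields `None`, which is exactly the signal the crawler could use to notify the administrators that a source needs retraining.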
\subsection{Algorithm}

\section{Crawler application}