--- /dev/null
+\section{Application overview and workflow}
+The program is divided into two main components: the \textit{Crawler
+application} and the \textit{Input application}. The components are strictly
+separated by task and by application. The crawler is dedicated to the sole
+task of periodically crawling the sources asynchronously. The input
+application is a web interface to a set of tools for creating, editing,
+removing and testing crawlers through simple point-and-click user interfaces
+that can be operated by someone without a computer science background.
+
+\section{Minimizing DAWGs}
+The first algorithm to generate minimal DAWGs was proposed by
+Hopcroft~\cite{Hopcroft1971}. The algorithm he described was not incremental
+and had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et
+al.~\cite{Daciuk2000} later extended the algorithm into an incremental one
+without increasing the computational complexity. The non-incremental
+algorithm from Daciuk et al. is used to convert the node-lists to a graph.
+
+For example, constructing a graph from the entries \textit{a.bc} and
+\textit{a,bc} proceeds in the following steps: Figure~\ref{fig:f22} shows the
+graph after the first entry has been added and Figure~\ref{fig:f23} shows it
+after the second entry has been added.
+
+\begin{figure}[H]
+ \caption{Sample DAG, first entry}
+ \label{fig:f22}
+ \centering
+ \digraph[]{graph22}{
+ rankdir=LR;
+ node [shape="circle"];
+ 5 [shape="doublecircle"];
+ 1 -> 2 [label="a"];
+ 2 -> 3 [label="."];
+ 3 -> 4 [label="b"];
+ 4 -> 5 [label="c"];
+ }
+\end{figure}
+
+\begin{figure}[H]
+ \caption{Sample DAG, second entry}
+ \label{fig:f23}
+ \centering
+ \digraph[]{graph23}{
+ rankdir=LR;
+ node [shape="circle"];
+ 5 [shape="doublecircle"];
+ 1 -> 2 [label="a"];
+ 2 -> 3 [label="."];
+ 3 -> 4 [label="b"];
+ 4 -> 5 [label="c"];
+
+ 2 -> 6 [label=","];
+ 6 -> 4 [label="b"];
+ }
+\end{figure}
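+
+The following sketch illustrates how a minimal graph could be built from
+node-lists in the spirit of the algorithm of Daciuk et al.~\cite{Daciuk2000}.
+It is an illustration only, not the implementation used by the crawler: it
+shows the sorted-input incremental variant, and all names in it were chosen
+for this example. A word in a node-list may be a single character or a
+category code.
+
+\begin{verbatim}
+class State:
+    def __init__(self):
+        self.edges = {}      # word -> State
+        self.final = False
+
+    def key(self):
+        # A state is identified by its finality and its outgoing edges,
+        # so that equivalent states can be merged.
+        return (self.final,
+                tuple(sorted((w, id(s)) for w, s in self.edges.items())))
+
+class Dawg:
+    def __init__(self):
+        self.root = State()
+        self.register = {}   # key -> canonical State
+        self.previous = ()   # the previously inserted node-list
+
+    def add(self, words):
+        words = tuple(words)
+        assert words > self.previous, "insert node-lists in sorted order"
+        # 1. Follow the prefix that is already present in the graph.
+        state, i = self.root, 0
+        while i < len(words) and words[i] in state.edges:
+            state = state.edges[words[i]]
+            i += 1
+        # 2. Minimize the now finished suffix of the previous node-list.
+        if state.edges:
+            self._replace_or_register(state)
+        # 3. Append the remaining words as a chain of new states.
+        for word in words[i:]:
+            state.edges[word] = State()
+            state = state.edges[word]
+        state.final = True
+        self.previous = words
+
+    def finish(self):
+        # Minimize whatever is still unregistered and return the root.
+        if self.root.edges:
+            self._replace_or_register(self.root)
+        return self.root
+
+    def _replace_or_register(self, state):
+        last = list(state.edges)[-1]         # most recently added edge
+        child = state.edges[last]
+        if child.edges:
+            self._replace_or_register(child)
+        twin = self.register.get(child.key())
+        if twin is not None:
+            state.edges[last] = twin         # merge equivalent states
+        else:
+            self.register[child.key()] = child
+\end{verbatim}
+
+Feeding the two example entries, in sorted order, would then look as follows:
+
+\begin{verbatim}
+dawg = Dawg()
+for entry in sorted([list("a.bc"), list("a,bc")]):
+    dawg.add(entry)
+root = dawg.finish()
+\end{verbatim}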
+
+\section{Input application}
+\subsection{Components}
+The input application offers four main functions:
+\begin{itemize}
+    \item Adding a new crawler
+    \item Editing or removing existing crawlers
+    \item Testing a crawler
+    \item Generating XML
+\end{itemize}
+
+
+\section{Crawler application}
+\subsection{Interface}
+
+\subsection{Preprocessing}
+When the crawler receives the data, it is embedded as POST data in an HTTP
+request. The POST data consists of several fields with information about the
+feed and a container that holds the table with the user markers embedded.
+The entries are then extracted from this table and processed line by line.
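+
+A minimal sketch of this interface, assuming Flask and made-up field names
+(\texttt{feed} and \texttt{content}), could look as follows; the actual
+fields and routing of the crawler application may differ.
+
+\begin{verbatim}
+from flask import Flask, request
+
+app = Flask(__name__)
+
+@app.route("/crawler", methods=["POST"])
+def receive():
+    feed_info = request.form["feed"]      # fields describing the feed
+    container = request.form["content"]   # container with the marked table
+    # Hand every table row to the line processing described below.
+    rows = [r + "</tr>" for r in container.split("</tr>") if r.strip()]
+    for row in rows:
+        process_line(row)
+    return "OK"
+
+def process_line(row_html):
+    pass  # strip the tags and collect the markers (see below)
+\end{verbatim}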
+
+The line processing converts the raw HTML data of a table row into a plain
+string. The string is stripped of all HTML tags and is accompanied by a list
+of marker items. Entries that do not contain any markers are left out of the
+next processing step. All data, including the entries without user markers,
+is still stored in the object for possible later reference, for example when
+editing the patterns.
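+
+A possible way to strip the tags from one table row, using only the Python
+standard library (the handling of the marker items is left out), is sketched
+below.
+
+\begin{verbatim}
+from html.parser import HTMLParser
+
+class RowStripper(HTMLParser):
+    """Collects only the text content of a piece of HTML."""
+    def __init__(self):
+        super().__init__()
+        self.parts = []
+
+    def handle_data(self, data):
+        self.parts.append(data)
+
+def strip_row(row_html):
+    parser = RowStripper()
+    parser.feed(row_html)
+    return "".join(parser.parts)
+
+# strip_row("<tr><td>Lecture</td><td>12:00</td></tr>") == "Lecture12:00"
+\end{verbatim}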
+
+In the last step the entries with markers are processed to build node-lists.
+Node-lists are lists of words that, when concatenated, form the original
+entry. Here a word is not a word in the linguistic sense: it can be a single
+character or a category. The node-list is generated by adding the separate
+characters to the list one by one; when a user marking is encountered, the
+marking is translated to its category code and that code is added as a single
+word. The node-lists are then passed to the actual algorithm to be
+converted to a graph representation.
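+
+As an illustration, node-lists could be built from an entry and its markers
+as in the sketch below; the marker representation (start, end, category) is
+an assumption made for this example.
+
+\begin{verbatim}
+def build_node_list(entry, markers):
+    """entry: stripped string; markers: non-overlapping
+    (start, end, category) tuples sorted by start position."""
+    nodes, pos = [], 0
+    for start, end, category in markers:
+        nodes.extend(entry[pos:start])    # single characters as words
+        nodes.append("<%s>" % category)   # the category code as one word
+        pos = end
+    nodes.extend(entry[pos:])
+    return nodes
+
+# build_node_list("Lecture 12:00", [(8, 13, "time")]) gives
+# ['L', 'e', 'c', 't', 'u', 'r', 'e', ' ', '<time>']
+\end{verbatim}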
+
+\subsection{Defining categories}
+pass
+
+\subsection{Process}
+The project went through the following stages:
+\begin{itemize}
+    \item The proposal was written.
+    \item Initially HTML, e-mail, fax and RSS were considered as source
+          formats, with RSS as the worst case.
+    \item After some research and after determining the scope of the project,
+          we decided to support only RSS. RSS tends to force structure on the
+          data because the feeds are usually generated by the website and are
+          therefore reliable and consistent. We found a couple of good RSS
+          feeds.
+    \item At first the general framework was designed and implemented,
+          without an extraction method.
+    \item Work started on a method for recognizing separators.
+    \item A research paper was found about an algorithm that can create
+          directed acyclic graphs from strings; although it was designed to
+          compress word lists, it can be (mis)used to extract information.
+    \item An implementation of the DAG algorithm was found and tied into the
+          program.
+    \item The command line program was ready. After a conversation with both
+          supervisors, a GUI had to be made.
+    \item A step-by-step GUI was created: a web interface that acts as a
+          control center for the crawlers.
+    \item The GUI was optimized.
+    \item We concluded that the program does not reach a wide audience due to
+          the lack of well-structured RSS feeds.
+\end{itemize}