by definition a programmer, can be instructed to change the structure and this
can also greatly reduce programmer intervention time.
\section{RSS/Atom}
RSS/Atom feeds, from now on simply called RSS feeds, are publishing feeds in
the XML format\footnote{\url{http://www.w3.org/TR/REC-xml/}} that are used to
publish events. Every event or entry consists of several standardized fields.
The main data fields are the \textit{title} and the \textit{description}
field; they store the raw data that describes the event. Furthermore there
are several auxiliary fields that store the link to the full article, the
publication date, possibly some media in the form of an image or video URL
and a \textit{Globally Unique Identifier} (GUID).

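To make the field layout concrete, the following sketch parses a single RSS
item with Python's standard \texttt{xml.etree.ElementTree} module; the venue
URL and all field values are invented for illustration:

```python
import xml.etree.ElementTree as ET

# A minimal, invented RSS item used only for illustration.
RSS_ITEM = """
<item>
  <title>Concert: Example Band</title>
  <description>Example Band plays live, doors open 20:00.</description>
  <link>http://venue.example/events/42</link>
  <pubDate>Sat, 01 May 2010 12:00:00 GMT</pubDate>
  <guid>http://venue.example/events/42</guid>
</item>
"""

def parse_item(xml_text):
    """Extract the main and auxiliary fields from one RSS item."""
    item = ET.fromstring(xml_text)
    return {field: item.findtext(field) for field in
            ("title", "description", "link", "pubDate", "guid")}

entry = parse_item(RSS_ITEM)
print(entry["title"])  # main data field describing the event
print(entry["guid"])   # used by feed readers to detect duplicates
```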
RSS feeds are mostly used by news sites to publish their articles. The feed
then contains only the headlines; an interested reader can follow the
so-called deeplink to the website with the full article. Users often use
programs that bundle user-specified RSS feeds into one big summary feed that
makes it easy to sift through many news feeds. Such a feed reader uses the
GUID to skip entries that are already present in its internal database.

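The GUID check of such a feed reader can be sketched in a few lines; the
entry dictionaries and GUID values below are invented for illustration:

```python
# Hypothetical sketch of how a feed reader could skip entries it has
# already seen, keyed on the GUID field of each entry.
def new_entries(entries, seen_guids):
    """Return entries whose GUID is not yet known; update seen_guids."""
    fresh = []
    for entry in entries:
        guid = entry["guid"]
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append(entry)
    return fresh

seen = set()
batch1 = [{"guid": "ev-1", "title": "Concert"}, {"guid": "ev-2", "title": "Play"}]
batch2 = [{"guid": "ev-2", "title": "Play"}, {"guid": "ev-3", "title": "Film"}]
print([e["guid"] for e in new_entries(batch1, seen)])  # ['ev-1', 'ev-2']
print([e["guid"] for e in new_entries(batch2, seen)])  # ['ev-3']
```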
Generating RSS feeds by hand is a tedious task, but usually the feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the RSS feeds are generally
very consistent in their structure. In the entertainment industry venues
often use a CMS for their website so that users with no programming or web
background are able to post news items and event information.

\section{Why RSS/Atom}
Information from venues comes in various formats, each with its own positive
and negative points. For this project we chose to focus on
RSS feeds. RSS feeds are, as explained earlier, very consistent in their
structure, which basically only changes when the CMS is migrated or upgraded.
Because of this, the extraction patterns rarely have to be retrained.

In comparison to websites, RSS feeds do not have a structural dimension in
the data: all the information in an entry is essentially two fields of plain
text. Therefore an entirely new strategy has to be applied to train the
patterns.
\section{Scientific relevance and similar research}
Currently, techniques for converting unstructured data into structured data
are scarce, while there is a great need for data mining in unstructured data:
the data within companies and on the internet is piling up and is usually
left to gather dust.
This project is a follow-up of the earlier work by Roelofs et
al.~\cite{Roelofs2008,Roelofs2009}, which describes techniques for
recognizing patterns in website data, among them an adapted Levenshtein
distance algorithm that recognizes data in already isolated text. These
techniques can still be used to interpret the isolated parts extracted in
this project.

\section{Directed Acyclic Graphs}
A graph is a mathematical structure described by the ordered pair $G=(V,E)$,
where $V$ is the set of nodes and $E$ is a set of unordered pairs of nodes
that we will call \textit{undirected edges}.

Directed graphs are a special kind of graph. A directed graph is described by
the ordered pair $G=(V,A)$, where $A$ is a set of ordered pairs of nodes that
we will call \textit{directed edges}. Directed edges can be visualized as
arrows in a graph, whereas undirected edges are just lines connecting the
nodes with no particular direction.

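As a small concrete example, the ordered pair
\[
    G = \bigl(\{1,2,3\},\ \{(1,2),(1,3),(2,3)\}\bigr)
\]
describes a directed graph with three nodes and arrows $1\to 2$, $1\to 3$ and
$2\to 3$; because it contains no directed cycle it is also a DAG.
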
Directed Acyclic Graphs (DAGs) are in turn a special kind of directed graph:
a DAG contains no directed cycles, so no path leads from a node back to that
same node. DAGs have the nice property that checking whether a sequence
appears in the graph has a computational complexity of only $\mathcal{O}(L)$,
where $L$ is the length of the sequence.

The type of graph used in the project is yet another special kind of DAG
called the Directed Acyclic Word Graph (DAWG). This type of graph can be used
to store large dictionaries of words. The nodes of a DAWG basically represent
positions within a word and each edge is labelled with the character that is
added when traversing it. For example the graph in Figure~\ref{fig:2.1.1}
describes a dictionary that holds the words \textit{abd}, \textit{bad} and
\textit{bae}. Testing whether a word is present in a DAWG, just as for a DAG,
falls in the computational complexity class $\mathcal{O}(L)$, meaning that
the cost grows linearly with the length of the word.

\begin{figure}[H]
	\caption{Example DAWG}
	\label{fig:2.1.1}
	\centering
	\digraph[]{graph21}{
		rankdir=LR;
		node [shape="circle"];
		6 [shape="doublecircle"];
		1 -> 2 [label="a"];
		2 -> 3 [label="b"];
		3 -> 6 [label="d"];
		1 -> 4 [label="b"];
		4 -> 5 [label="a"];
		5 -> 6 [label="d"];
		5 -> 6 [label="e"];
	}
\end{figure}
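As a sketch of how such a constant-time-per-character lookup works, the
example DAWG for the words \textit{abd}, \textit{bad} and \textit{bae} can be
encoded as a transition table; the node numbering follows
Figure~\ref{fig:2.1.1}, with node 1 as the start node and node 6 (the double
circle) taken to be the only accepting node:

```python
# Transition table for the example DAWG: node -> {edge label -> next node}.
# Node 1 is the start node; node 6 (the double circle) is accepting.
EDGES = {
    1: {"a": 2, "b": 4},
    2: {"b": 3},
    3: {"d": 6},
    4: {"a": 5},
    5: {"d": 6, "e": 6},
}
ACCEPTING = {6}

def contains(word):
    """Follow one edge per character: O(L) in the length of the word."""
    node = 1
    for ch in word:
        node = EDGES.get(node, {}).get(ch)
        if node is None:
            return False
    return node in ACCEPTING

print([w for w in ("abd", "bad", "bae", "ba", "abe") if contains(w)])
# -> ['abd', 'bad', 'bae']
```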