\subsection{Functional requirements}
\subsubsection{Original functional requirements}
\begin{itemize}
- \item[I1:] Be able to crawl several source types.
+ \item[I1:] The system should be able to crawl several source types.
\begin{itemize}
\item[I1a:] Fax/email.
\item[I1b:] XML feeds.
\item[I1d:] Websites.
\end{itemize}
\item[I2:] Apply low level matching techniques on isolated data.
- \item[I3:] Insert the data in the database.
- \item[I4:] User interface to train crawlers that is usable someone
- without a particular computer science background.
+ \item[I3:] Insert data in the database.
+ \item[I4:] The system should have a user interface to train crawlers that is
+ usable by someone without a particular computer science background.
\item[I5:] The system should be able to report to the user or
maintainer when a source has been changed too much for
successful crawling.
system we have built is tested and can provide the necessary tools for a user
with no particular programming skills to generate crawlers and thus the number
of interventions where a programmer is needed is greatly reduced. Although we
-have solved the problem we stated the results are not strictly positive. For a
-problem to be solved the problem must be present.
+have solved the problem we stated, the results are not strictly positive. This
+is because if the problem space is not large, the interest in solving the
+problem is not large either; in practice, there is not much data to apply the
+solution to.
Although the research question is answered the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
extraction performance.
The second most common misuse is to use HTML-formatted text in the RSS
-feeds text fields. Our algorithm is designed to detect and extract information
+feeds text fields. The algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is much worse than on
plain text. A text field with HTML is almost useless for gathering information
-from. Via a small study on available RSS feeds we found that about $50\%$ of
-the RSS feeds misuse the protocol in such a way that extraction of data is
-almost impossible. This reduces the domain of good RSS feeds to less then $5\%$
-of the venues.
+from because it usually includes all kinds of information in modalities other
+than text. Via a small study on a selection of RSS feeds ($N=10$) we found that
+about $50\%$ of the RSS feeds misuse the protocol in such a way that extraction
+of data is almost impossible. This reduces the domain of good RSS feeds to less
+than $5\%$ of the venues.
\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
-The application we created does not apply any techniques on the isolated
-chunks. The application is built only to extract and not to process the labeled
-chunks of text. When we would combine the information about the global
-structure and information about structure in a marked area we increase
+The application we created does not apply any techniques to the extracted
+data fields. The application is built only to extract and not to process the
+labeled data fields with text. If we combined the information about the global
+structure with information about structure in a marked area, we would increase
performance in two ways. First, higher levels of performance are reached due to
the structural information of marked areas, which acts as an extra
constraint while matching the data in marked areas. The second increase in