\section{Goal \& Research question}
Maintaining the automated crawlers and the infrastructure that provides the
\textit{Temporum} and its matching aid automation are the parts within the
data flow that require the most resources. Both of these parts require
a programmer to execute and are therefore costly. In the case of the automated
crawlers a programmer is required because the crawlers are scripts or programs
that are website-specific. When a website changes, such a script or program has
to be adapted to the new structure before it produces good data again. This feedback
loop, shown in Figure~\ref{feedbackloop}, can take days and can be the reason
for gaps and faulty information in the database. The figure shows information
flow with arrows. The solid and dotted lines form the current feedback loop.
\begin{figure}[H]
\centering
% missing image: diagram of the information flow in the feedback loop
\caption{The current feedback loop (solid and dotted lines) and the shortened
	feedback loop (dashed line)}
\label{feedbackloop}
\end{figure}

The goal of this project is to relieve the programmer, who now spends a
lot of time repairing crawlers, and to make the task of adapting, editing and
removing crawlers feasible for someone without programming experience. In practice this
means shortening the feedback loop. The shorter feedback loop is also shown in
Figure~\ref{feedbackloop}. The dashed line shows the shorter feedback loop that
relieves the programmer.
For this project a system has been developed that provides an interface for
creating, editing and testing crawlers, usable by someone without a particular
computer science background.
nodes $n_2$ and $n_3$ are final. Finally, $v_0$ describes the initial node; this is
visualized in figures as an incoming arrow. Because of the property of labeled
edges, data can be stored in a DAWG. When traversing a DAWG and saving all the
edge labels one can construct words. Using graph minimisation, large sets of
words can be stored using a small amount of storage because edges can be
re-used to specify transitions. For example, the graph in
Figure~\ref{exampledawg} describes the language $L$ in which all accepted words
satisfy $w\in\{abd, bad, bae\}$. Testing whether a word is present in the DAWG
is done by following its labeled edges from the initial node; the word is in the
language exactly when this traversal ends in a final node.
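To make the example concrete, the sketch below encodes this DAWG as a small
Python dictionary and tests membership by following labeled edges from the
initial node. The node names $q_0$ to $q_5$ and the dictionary representation
are illustrative assumptions only and do not mirror the implementation of the
system itself.
\begin{verbatim}
# Minimal sketch of the example DAWG for the language {abd, bad, bae}.
# Node names q0..q5 are placeholders, not the labels used in the figure.
dawg = {
    'q0': {'a': 'q1', 'b': 'q2'},
    'q1': {'b': 'q3'},
    'q2': {'a': 'q4'},
    'q3': {'d': 'q5'},
    'q4': {'d': 'q5', 'e': 'q5'},  # both edges end in the shared final node
    'q5': {},
}
initial, final = 'q0', {'q5'}

def accepts(word):
    """A word is accepted iff its labeled path ends in a final node."""
    node = initial
    for label in word:
        if label not in dawg[node]:
            return False
        node = dawg[node][label]
    return node in final

assert accepts('abd') and accepts('bad') and accepts('bae')
assert not accepts('abe') and not accepts('ba')
\end{verbatim}
Sharing the single final node between the three words is what graph
minimisation provides here: common suffixes end in the same node instead of in
separate ones.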
\item[F2:] Export the data to a strict XML feed.
This requirement is an adapted version of requirement I3; this
is also done to limit the scope. We chose not to interact
directly with the database or the \textit{Temporum}. The
application, however, is able to output XML data that is
formatted following a strict XSD schema so that it is easy to
import into the database; a minimal sketch of such an export is
given after this list.
This requirement is formed from I4. Initially the user
interface for adding and training crawlers was done via a
web interface that was user friendly and usable by someone
without a particular computer science background, as the
requirement stated. However, in the first prototypes the control
center that could test, edit and remove crawlers was a command
this can be due to any reason; a message is sent to the people
using the program so that they can edit or remove the faulty
crawler. Updating without the need for a programmer is essential
in shortening the feedback loop shown in
Figure~\ref{feedbackloop}.
\end{itemize}
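The sketch below, referred to from requirement F2, only illustrates the kind of
strictly structured XML output that such a schema would validate. The element
names (\texttt{feed}, \texttt{event}, \texttt{title}, \texttt{date},
\texttt{location}) are hypothetical; the actual XSD schema of the application
is not reproduced here.
\begin{verbatim}
# Hypothetical example of exporting extracted data as an XML feed.
# The element names are made up; the real XSD schema is not shown here.
import xml.etree.ElementTree as ET

def event_to_xml(title, date, location):
    """Serialise one extracted event as an XML element."""
    event = ET.Element('event')
    ET.SubElement(event, 'title').text = title
    ET.SubElement(event, 'date').text = date
    ET.SubElement(event, 'location').text = location
    return event

feed = ET.Element('feed')
feed.append(event_to_xml('Example show', '2015-06-01', 'Example venue'))
print(ET.tostring(feed, encoding='unicode'))
\end{verbatim}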
extensions are discussed in Section~\ref{sec:discuss}.
\item[N2:] Operate standalone on a server.
Non-functional requirement O1 is dropped because we want to
keep the program as modular as possible; via an XML
interface we still have a close connection with the
database without having to maintain a direct connection. The
When looking for an activity in a bar or trying to find a good movie to watch
it often seems difficult to find complete and correct information about the
event. Hyperleap tries to solve the problem of bad information by bundling
the information from various sources and investing in good quality checking.
Currently information retrieval is performed using site-specific crawlers;
when a crawler breaks, the feedback loop for fixing it consists of several
steps and requires someone with a computer science background. A crawler
generation system has been created that uses directed acyclic word graphs to
assist in solving the feedback loop problem. The system allows users with no
particular computer science background to create, edit and test crawlers for
\textit{RSS} feeds. In this way the feedback loop for broken crawlers is
shortened, new sources can be incorporated in the database more quickly and,
most importantly, the information about the latest movie show, theater
production or conference will reach the people looking for it as fast as
possible.