The backend consists of several processing steps that the input must go through
before it is converted to a crawler specification. These steps are visualized
in Figure~\ref{appinternals}. Each node represents a milestone in the
processing of the user data. Arrows indicate information transfer
between these steps. The figure is a detailed view of the
\textit{Backend} node in Figure~\ref{appoverview}.
\end{figure}
\section{HTML data}
The raw data from the frontend with the user markings enters the backend as an
HTTP \textit{POST} request. This \textit{POST} request consists of several
data fields. These data fields are either fields from the static
description boxes in the frontend or raw \textit{HTML} data from the table
showing the processed RSS feed entries, which contains the markings made by the
user. The table is sent in whole precisely at the time the user presses the
submit button. Before sending, markers are placed within the \textit{HTML}
data of the table. These markers make parsing the table easier and
remove the need for an advanced \textit{HTML} parser to extract the markings.
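The marker-based approach can be sketched as follows. This is a minimal illustration, not the actual backend code; the marker strings (`[[mark]]`, `[[/mark]]`) are assumptions chosen for the example.

```python
import re

# Hypothetical marker strings inserted by the frontend before submission;
# the real marker format used by the application is an assumption here.
START, END = "[[mark]]", "[[/mark]]"

def extract_markings(html: str) -> list[str]:
    """Return the user-marked fragments between marker pairs.

    Because the markers are plain, unambiguous strings, a simple regular
    expression suffices and no full HTML parser is needed.
    """
    pattern = re.escape(START) + r"(.*?)" + re.escape(END)
    return re.findall(pattern, html, re.S)

row = '<tr><td>[[mark]]Title here[[/mark]]</td><td>other</td></tr>'
print(extract_markings(row))  # ['Title here']
```

Because the markers are injected client-side, the backend can treat the table as an opaque string and still recover exactly the spans the user selected.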
in Listing~\ref{dawg.py}.
\begin{enumerate}
\item
  Say we add word $w$ to the graph. Step one is finding the
  common prefix of the word with the words already in the graph. The
  common prefix is defined as the longest prefix $w'$ of $w$ for which
  $\delta^*(q_0, w')$ is defined. When the common prefix is found we
via patterns in plain text, and the performance on HTML is very bad compared to
plain text. A text field with HTML is almost useless to gather information
from, because it usually includes all kinds of information in other modalities
than text. Via a small study on a selection of RSS feeds ($N=10$) we found that
about $50\%$ of the RSS feeds misuse the protocol in such a way that extraction
of data is almost impossible. This reduces the domain of good RSS feeds to less
than $5\%$ of the venues.
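The degradation on HTML input can be illustrated with a small sketch. The date pattern below is an assumption chosen for the example, not one of the application's actual extraction patterns; the point is that inline tags break a pattern that works fine on plain text.

```python
import re

# Illustrative date pattern (dd-mm-yyyy); the real patterns are
# generated per feed, so this regex is an assumption.
date_pat = re.compile(r"\d{2}-\d{2}-\d{4}")

plain = "Concert on 01-02-2024 at the venue"
html = 'Concert on <span class="d">01</span>-02-2024'

print(date_pat.findall(plain))  # ['01-02-2024']
print(date_pat.findall(html))   # [] -- the inline tag splits the date
```

This is why feeds that embed raw HTML in their text fields effectively fall outside the domain the pattern-based extraction can handle.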