From 9d5489aa08c7352c471ad988a7b0cbf9afc2a430 Mon Sep 17 00:00:00 2001 From: Mart Lubbers Date: Wed, 13 May 2015 18:32:38 +0200 Subject: [PATCH] fixed up to and including 1.4 --- thesis2/1.introduction.tex | 133 +++++++++++++++++++------------------ 1 file changed, 70 insertions(+), 63 deletions(-) diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex index b15167b..a23705d 100644 --- a/thesis2/1.introduction.tex +++ b/thesis2/1.introduction.tex @@ -79,7 +79,7 @@ the nodes are processing steps and the arrows denote information transfer or flow. \begin{figure}[H] - \label{informationflow} +\label{informationflow} \centering \includegraphics[width=\linewidth]{informationflow.pdf} \caption{Information flow Hyperleap database} @@ -165,83 +165,90 @@ change in the source malformed data can pass through. As a safety net and final check the information first goes to the \textit{Temporum} before it will be entered in the database. -\subsection*{Temporum} +\subsection{Temporum} The \textit{Temporum} is a big bin that contains raw data extracted from -different sources and has to be post processed to be suitable enough for the -actual database. This post-processing encompasses several possible tasks. The -first task is to check the validity of the entry. This is not done for every -entry but random samples are taken and these entries will be tested to see -if the crawler is not malfunctioning and there is no nonsense in the -data. - -The second step is matching the entry to several objects. Objects in the -broadest sense can basically be anything, in the case of a theater show -performance the matched object can be the theater show itself, describing the -show for all locations. In case of a concert it can be that the object is a -tour the band is doing. When the source is a ticket vendor the object is the -venue where the event takes place since the source itself is not a venue. -Many of these tasks are done by a human or with aid of a computer program. 
-The \textit{Temporum} acts as a safety net because no data is going straight -through to the database. It is first checked, matched and validated. - -\subsection*{Database \& Publication} -When the data is post processed it is entered in the final database. The -database contains all the events that happened in the past and all the events -that are going to happen in the future. The database also contains information -about the venues. The database is linked to several categorical websites that -offer the information to users and accompany it with the other part of the -\textit{infotainment} namely the subjectual information that is usually +different sources using automated crawlers. The information in the +\textit{Temporum} is not necessarily suitable for the final database and therefore +has to be post-processed. The post-processing encompasses several different +steps. + +The first step is to check the validity of the event entries from a certain +source. Validity checking is useful to detect faulty automated crawlers +before the data can leak into the database. Validity is checked on randomly +sampled event entries. + +An event entry usually contains one occurrence of an event. In a lot of cases +there is parent information that the event entry is part of. For example, in the +case of a concert tour the parent information is the concert tour and the event +entry is a certain performance. The second step in post-processing is matching +the event entries to possible parent information. This parent information can +be a venue, a tour, a showing, a tournament and much more. + +Both of the post-processing tasks are done by people with the aid of automated +functionality. Within the two post-processing steps malformed data can be +spotted quickly and the \textit{Temporum} thus acts as a safety net to keep +the probability of malformed data leaking into the database as low as possible. 
+ +\subsection{Database \& Publication} +Post-processed data that leaves the \textit{Temporum} will enter the final +database. This database contains all the information about all the events that +happened in the past and the events that will happen in the future. The +database also contains the parent information such as information about venues. +Several categorical websites use the database to offer the information to users +and accompany it with the second part of \textit{infotainment}, namely the +subjective information. The \textit{entertainment} part will usually be presented in the form of trivia, photos, interviews, reviews, previews and much more. \section{Goal \& Research question} -Crawling the sources and applying the preprocessing is the most expensive task -in the entire information flow because it requires a programmer to perform -these tasks. Programmers are needed because all the crawlers are scripts or -programs that were created specifically for a website. Changing a crawler or -preprocessing job means changing the code. -A big group of sources often changes and that makes that this task -becomes very expensive and has a long feedback loop. When a source changes the -source is first preprocessed in the old way, send to the \textit{Temporum} and -checked by a human and matched. The human then notices the error in the data -and has to contact the programmer. The programmer then has to reprogram the -specific crawler or job to the new structure. This feedback loop, shown in -Figure~\ref{feedbackloop} can take days and can be the reason for gaps and -faulty information in the database. In the figure the dotted arrows denote the -current feedback loop for crawlers. +Maintaining the automated crawlers and the infrastructure that provides the +\textit{Temporum} and its matching aid automation are the parts within the +data flow that require the most resources. Both of these parts require +a programmer and are therefore costly. 
In the case of the automated +crawlers a programmer is required because the crawlers are scripts or programs +that are website-specific. Changing such a script or program requires +knowledge about the source, the programming framework and the +\textit{Temporum}. In practice both of the tasks mean changing code. + +A large group of sources often changes in structure. Because of such changes +the task of reprogramming crawlers has to be repeated a lot. The detection of +malfunctioning crawlers happens in the \textit{Temporum} and not in an earlier +stage. Late detection lengthens the feedback loop because there is not always +tight communication between the programmers and the \textit{Temporum} workers. +In the case of a malfunction the source is first crawled. Most likely the +malformed data will get processed and will produce rubbish that is sent to the +\textit{Temporum}. Within the \textit{Temporum} the error is detected after a +while and the programmers have to be contacted. Finally the crawler will be +adapted to the new structure and will produce good data again. This feedback +loop, shown in Figure~\ref{feedbackloop}, can take days and can be the reason +for gaps and faulty information in the database. The figure shows information +flow with arrows. The solid and dotted lines form the current feedback loop. \begin{figure}[H] - \label{feedbackloop} +\label{feedbackloop} \centering \includegraphics[width=0.8\linewidth]{feedbackloop.pdf} - \strut\\\strut\\ \caption{Feedback loop for malfunctioning crawlers} \end{figure} -\strut\\ -The goal of this project is specifically to relieve the programmer of repairing -crawlers all the time and make the task of adapting, editing and removing -crawlers doable for someone without programming experience. In practice this -means in Figure~\ref{feedbackloop} that the project will remove the replace -part of the feedback loop that contains dotted arrows with the shorter loop -with the dashed arrows. 
- -For this project an application has been developed that can provide an + +The specific goal of this project is to relieve the programmer of spending a +lot of time repairing +crawlers and to make the task of adapting, editing and removing +crawlers feasible for someone without programming experience. In practice this +means shortening the feedback loop. The shorter feedback loop is also shown in +Figure~\ref{feedbackloop}. The dashed line shows the shorter feedback loop that +relieves the programmer. + +For this project a system has been developed that provides an interface to a crawler generation system that is able to crawl RSS\cite{Rss} and Atom\cite{Atom} publishing feeds. The interface provides the user with point and click interfaces meaning that no computer science background is needed to use the interface and to create, modify, test and remove crawlers. -The current Hyperleap backend system that handles the data can, via an query, -generate XML feeds that contain the crawled data. The structure of the -application is very modular and generic and therefore it is easy to change -things in the program without having to know a lot about the programming -language used. To achieve this all visual things such as buttons are defined in -a human readable text files. In practice it means that one person, not by -definition a programmer, can be instructed to change the structure and this can -also greatly reduce programmer intervention time. - -The actual problem statement then becomes: +The current Hyperleap backend system that handles the data can query XML feeds that contain the crawled data. 
+ +The actual research question can then be formulated as: \begin{center} - \textit{Is it possible to shorten the feedback loop for repairing and% -adding crawlers by making a system that can create, add and maintain crawlers% + \textit{Is it possible to shorten the feedback loop for repairing and % +adding crawlers by making a system that can create, add and maintain crawlers % for RSS feeds} \end{center} -- 2.20.1