From: Mart Lubbers
Date: Tue, 9 Dec 2014 13:29:20 +0000 (+0100)
Subject: stuff added in methods
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=1f6d0e9b8152b86c1b008be849c9abadafd50679;p=bsc-thesis1415.git

stuff added in methods
---

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index d8c0e3a..8670cd0 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -1,11 +1,45 @@
-\section{Application overview and workflow}
-The program can be divided into two main components namely the \textit{Crawler
-application} and the \textit{Input application}. The components are strictly
-separated by task and by application. The crawler is an application dedicated
-to the sole task of periodically crawling the sources asynchronously. The input
-is a web interface to a set of tools that can create, edit, remove and test
-crawlers via simple point and click user interfaces that can be worked with by
-someone without a computer science background.
+\section{Internals of the crawler generation module}
+The data marked by the user is in principle just raw HTML data that contains
+a table with a table row for every RSS feed entry. Within the HTML code
+several markers are placed to make the parsing easier and to remove the need
+for an advanced HTML parser. When the user presses submit, an HTTP POST
+request is prepared that sends all the gathered data to the backend for
+processing. The sending does not happen asynchronously, so the user has to
+wait for the data to be processed. Because of the synchronous data
+transmission the user is notified immediately when the crawler has been added
+successfully. The data preparation that enables the backend to process the
+data is done entirely on the client side using JavaScript.
+
+When the backend receives the HTTP POST data it saves all the raw data and
+the data from the information fields. The raw data is processed line by line
+to extract the entries that contain user markings. The entries containing
+user markings are stripped of all HTML while a data structure is built that
+contains the locations of the markings and the original text. All data is
+stored, including the entries without user markings, but not all data is
+processed. This seemingly useless data is kept because when a crawler later
+needs to be edited, the old data can be used to update it.
+
+When the entries have been isolated and processed they are converted to
+node-lists. A node-list is a literal list of words, where a word is
+interpreted in the broadest sense: a word can be a character, a single byte,
+a string or basically anything. In this case the node-lists are the original
+entry strings in which all the user markings are replaced by single nodes. As
+an example, take the following entry and its corresponding node-list,
+assuming that the user marked the time, title and date correctly. Markers are
+visualized by enclosing the name of the marker in angle brackets.
+\begin{flushleft}
+    Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
+    Node-list: \texttt{['
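
The node-list construction described in the added section can be made concrete
with a minimal Python sketch. The marker syntax (a pseudo-tag such as
<marking name="time">19:00</marking>), the helper names and the regular
expressions below are illustrative assumptions; the excerpt does not show the
actual markers that the interface's JavaScript inserts.

    import re

    # Assumed marker syntax: each user marking is wrapped in a pseudo-tag
    # like <marking name="time">19:00</marking>.  The real format may differ.
    MARKING_RE = re.compile(r'<marking name="(?P<name>\w+)">(?P<text>.*?)</marking>')
    TAG_RE = re.compile(r'<[^>]+>')

    def strip_html(text):
        """Drop every remaining HTML tag from a fragment of raw table data."""
        return TAG_RE.sub('', text)

    def to_node_list(entry):
        """Convert one marked entry into a node-list.

        The result is the entry's plain text split into nodes, with every
        user marking replaced by a single node of the form '<name>'.
        """
        nodes, pos = [], 0
        for match in MARKING_RE.finditer(entry):
            before = strip_html(entry[pos:match.start()])
            if before:
                nodes.append(before)
            nodes.append('<{}>'.format(match.group('name')))
            pos = match.end()
        tail = strip_html(entry[pos:])
        if tail:
            nodes.append(tail)
        return nodes

    raw_lines = [
        '<tr><td><marking name="time">19:00</marking>, '
        '<marking name="date">2014-11-12</marking> - '
        '<marking name="title">Foobar</marking></td></tr>',
        '<tr><td>An entry the user did not mark</td></tr>',
    ]
    # Only the lines that actually contain markings are processed further.
    marked = [line for line in raw_lines if MARKING_RE.search(line)]
    print([to_node_list(line) for line in marked])
    # prints: [['<time>', ', ', '<date>', ' - ', '<title>']]

For the example entry "19:00, 2014-11-12 - Foobar" this yields the node-list
with the time, date and title markings as single nodes and the literal
separators between them as the remaining nodes.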
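The storage strategy described in the second paragraph (keep all raw data,
process only the entries that carry user markings, and answer the POST
synchronously) can be sketched in the same style. The handler name, the field
names and the dict-based store are assumptions for illustration and reuse the
same pseudo-marker syntax as above; they are not the thesis code's actual
interface.

    def handle_crawler_submission(form, storage):
        """Hypothetical handler for the synchronous crawler-creation POST.

        'form' stands in for the decoded POST data (information fields plus
        the raw marked HTML); 'storage' is any dict-like persistent store.
        """
        raw_lines = form['raw_html'].splitlines()
        storage[form['name']] = {
            'url': form['url'],        # information fields, saved verbatim
            'raw_lines': raw_lines,    # everything is kept for later edits
            # ...but only lines carrying user markings are processed further
            'marked_lines': [l for l in raw_lines if '<marking' in l],
        }
        # The request is handled synchronously, so this reply reaches the
        # user immediately after submitting.
        return 'crawler added successfully'

    storage = {}
    print(handle_crawler_submission(
        {'name': 'paradiso',
         'url': 'http://example.org/feed.rss',
         'raw_html': '<tr><td><marking name="time">19:00</marking></td></tr>\n'
                     '<tr><td>unmarked entry</td></tr>'},
        storage))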