From 4a0746eeaf3f526e718c6395d3b20b6104dfadc3 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Sat, 7 Feb 2015 18:02:37 +0100
Subject: [PATCH] update

---
 thesis2/3.methods.tex       | 129 +++++-----
 thesis2/Makefile            |   1 +
 thesis2/nodelistexample.dot |  13 +
 thesis2/nodelistexample.eps | 465 ++++++++++++++++++++++++++++++++++++
 4 files changed, 546 insertions(+), 62 deletions(-)
 create mode 100644 thesis2/nodelistexample.dot
 create mode 100644 thesis2/nodelistexample.eps

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index fb94604..c878087 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -7,70 +7,75 @@ is ready.
 	\centering
 	\includegraphics[width=\linewidth]{backend.eps}
 	\strut\\
-	\caption{Backend overview}
+	\caption{Main module internals}
 \end{figure}
 
-\section{Internals of the crawler generation module}
-Data marked by the user is, in principle, just raw HTML data containing a
-table with a table row for every RSS feed entry. Within the HTML, several
-markers are placed to make parsing easier and to remove the need for an
-advanced HTML parser. When the user presses submit, an HTTP POST request is
-prepared that sends all the gathered data to the backend for processing. The
-sending does not happen asynchronously, so the user has to wait for the data
-to be processed. Because of the synchronous data transmission, the user is
-notified immediately when the crawler is successfully added. The data
-preparation that enables the backend to process it is done entirely on the
-client side using Javascript.
-
-When the backend receives the HTTP POST data, it saves all the raw data and
-the data from the information fields. The raw data is processed line by line
-to extract the entries that contain user markings. The entries containing
-user markings are stripped of all HTML while a data structure is built that
-records the locations of the markings and the original text. All data is
-stored, including the entries without user data, but not all of it is
-processed. This at first sight useless data is stored because, when the
-crawler needs editing later, the old data can be used to update it.
-
-When the entries are isolated and processed, they are converted to
-node-lists. A node-list is a literal list of words, where a word is
-interpreted in the broadest sense: a word can be a character, a single byte,
-a string or basically anything. These node-lists are, in this case, the
-original entry string in which all the user markers are replaced by single
-nodes. As an example, take the following entry and its corresponding
-node-list, assuming that the user marked the time, title and date correctly.
-Markers are visualized by enclosing the name of the marker in angle
-brackets.
-\begin{flushleft}
-	Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
-	Node-list: \texttt{['
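
The second removed paragraph describes stripping an entry of all HTML while
recording where the user markings sit in the resulting plain text. A minimal
Python sketch of that step follows; the assumption that markings arrive as
<mark class="..."> spans, the class names, and the EntryStripper name are
invented for this illustration and are not taken from the project's code.

    from html.parser import HTMLParser

    class EntryStripper(HTMLParser):
        """Strip all HTML from one entry while recording, per marking,
        its name and its start/end offsets in the stripped text."""
        def __init__(self):
            super().__init__()
            self.parts = []      # plain-text pieces of the entry
            self.length = 0      # length of the stripped text so far
            self.markings = []   # (name, start, end) tuples
            self._open = []      # stack of (name, start) for open marks

        def handle_starttag(self, tag, attrs):
            if tag == 'mark':    # assumed marking element
                self._open.append((dict(attrs).get('class', '?'),
                                   self.length))

        def handle_endtag(self, tag):
            if tag == 'mark' and self._open:
                name, start = self._open.pop()
                self.markings.append((name, start, self.length))

        def handle_data(self, data):
            self.parts.append(data)
            self.length += len(data)

    # Hypothetical usage:
    # p = EntryStripper()
    # p.feed('<td><mark class="time">19:00</mark>, 2014-11-12 - Foobar</td>')
    # ''.join(p.parts)  -> '19:00, 2014-11-12 - Foobar'
    # p.markings        -> [('time', 0, 5)]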
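
The node-list conversion in the last removed paragraph (each marked substring
collapses into a single marker node, everything in between stays literal
text) could look roughly as below. The angle-bracket marker names follow the
visualisation used in the text, but the concrete wire format, the regex and
the function name are assumptions of this sketch, not the thesis' actual
implementation.

    import re

    # Assumed wire format: marked spans appear as <name>...</name>.
    MARKER_RE = re.compile(r'<(time|date|title)>.*?</\1>')

    def to_node_list(entry):
        """Turn a marked entry string into a node-list: literal text
        chunks interleaved with single nodes such as '<time>'."""
        nodes, pos = [], 0
        for m in MARKER_RE.finditer(entry):
            if m.start() > pos:                 # literal text chunk
                nodes.append(entry[pos:m.start()])
            nodes.append('<%s>' % m.group(1))   # one node per marking
            pos = m.end()
        if pos < len(entry):                    # trailing literal text
            nodes.append(entry[pos:])
        return nodes

    # For the entry from the text, with time, date and title marked:
    # to_node_list('<time>19:00</time>, <date>2014-11-12</date>'
    #              ' - <title>Foobar</title>')
    # -> ['<time>', ', ', '<date>', ' - ', '<title>']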