From: Mart Lubbers
Date: Wed, 29 Oct 2014 15:29:20 +0000 (+0100)
Subject: part of algo explained
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=5cc599b4cbfcaef87ebdb72bd0f352ea616beda4;p=bsc-thesis1415.git

part of algo explained
---

diff --git a/thesis2/2.methods.tex b/thesis2/2.methods.tex
index 477190d..3f67cb3 100644
--- a/thesis2/2.methods.tex
+++ b/thesis2/2.methods.tex
@@ -23,3 +23,25 @@ Generate xml
 
 \subsection{Interface}
 \subsection{Algorithm}
+\subsection{Preprocessing}
+The crawler receives the data embedded as POST data in an HTTP request. The
+POST data consists of several fields with information about the feed and a
+container that holds the table with the user markers embedded. The entries
+are then extracted and processed line by line.
+
+The line processing converts the raw HTML string of a table row to a plain
+string. This string is stripped of all HTML tags and is accompanied by a
+list of marker items.
+
+Entries that do not contain any markers are left out of the next processing
+step. All data, including entries without user markers, is still stored in
+the object for possible later reference, for example when editing patterns.
+
+In the last step, the entries with markers are processed to build
+node-lists. A node-list is essentially a string in which the user markers
+are replaced by patterns, so that the variable data, i.e. the isolated
+data, is not used in the node-list.
+
+\subsection{Directed acyclic graphs}
+
+
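
The preprocessing steps described in the added subsection (strip the HTML from each table row, collect the user markers, keep every entry but pass on only the marked ones, and build node-lists) could be sketched roughly as follows. This is only an illustration and not the thesis implementation. The text does not say what a user marker looks like, so the sketch assumes a marker is a `<span>` whose `class` attribute names the field, and that the pattern substituted into a node-list is simply `<field>`. The names `RowParser` and `preprocess` are hypothetical as well.

```python
from html.parser import HTMLParser


class RowParser(HTMLParser):
    """Strip the tags from one table row and collect its user markers.

    Assumption: a marker is a <span> whose class attribute names the
    field, e.g. <span class="date">...</span>. The real marker format
    is not given in the text.
    """

    def __init__(self):
        super().__init__()
        self.text = []      # plain-text pieces of the row
        self.nodes = []     # node-list pieces: literal text or "<field>"
        self.markers = []   # (field, marked text) pairs
        self._field = None  # field name while inside a marker span

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._field = dict(attrs).get("class")
            if self._field is not None:
                self.markers.append((self._field, ""))
                # Assumed pattern syntax: the marker becomes "<field>".
                self.nodes.append("<" + self._field + ">")

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None

    def handle_data(self, data):
        self.text.append(data)
        if self._field is None:
            # Fixed text outside a marker stays in the node-list.
            self.nodes.append(data)
        else:
            # Variable text inside a marker is recorded, not kept.
            field, seen = self.markers[-1]
            self.markers[-1] = (field, seen + data)


def preprocess(rows):
    """Return (all_entries, node_lists) for a list of raw HTML rows."""
    entries, node_lists = [], []
    for row in rows:
        parser = RowParser()
        parser.feed(row)
        parser.close()
        # Every entry is stored, marked or not, for later reference.
        entries.append(("".join(parser.text), parser.markers))
        # Only entries with at least one marker yield a node-list.
        if parser.markers:
            node_lists.append("".join(parser.nodes))
    return entries, node_lists
```

Called on a list of raw row strings, `preprocess` returns every stripped entry together with its marker list, plus one node-list per marked entry. Nested marker spans are not handled in this sketch.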