\section{Application overview and workflow}
The program is divided into two main components: the \textit{Crawler
application} and the \textit{Input application}. The components are strictly
separated by task. The crawler is an application dedicated to the sole task
of periodically and asynchronously crawling the sources. The input
application is a web interface to a set of tools for creating, editing,
removing and testing crawlers through simple point-and-click user interfaces
that can be operated by someone without a computer science background.

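A minimal sketch of such a periodic, asynchronous crawl loop is given below.
It is written in Python for illustration; the names \texttt{crawl\_source},
\texttt{main} and \texttt{INTERVAL} are hypothetical and not taken from the
actual implementation.

\begin{verbatim}
import asyncio

INTERVAL = 3600  # assumed: seconds between two crawl rounds

async def crawl_source(source):
    # Placeholder: fetch the source and process its entries.
    print('crawling', source)

async def main(sources):
    while True:
        # Crawl all sources concurrently, then sleep until
        # the next scheduled round.
        await asyncio.gather(*(crawl_source(s) for s in sources))
        await asyncio.sleep(INTERVAL)

asyncio.run(main(['http://example.com/feed']))
\end{verbatim}
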
\section{Input application}
\subsection{Components}
The input application consists of four tools:
\begin{itemize}
	\item Add a new crawler
	\item Edit or remove existing crawlers
	\item Test a crawler
	\item Generate XML
\end{itemize}

\section{Crawler application}
\subsection{Interface}

\subsection{Algorithm}
\subsection{Preprocessing}
When the crawler receives the data, it is embedded as POST data in an HTTP
request. The POST data consists of several fields with information about the
feed and a container that holds the table with the user markers embedded.
The entries are then extracted and processed line by line.

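The sketch below shows how such a POST body could be unpacked. The field
names \texttt{feedurl} and \texttt{table} are assumptions made for the
example, not the actual field names used.

\begin{verbatim}
from urllib.parse import parse_qs

def parse_post(body):
    # Unpack the url-encoded POST body into its fields.
    fields = parse_qs(body)
    return {
        'feed': fields['feedurl'][0],  # information about the feed
        'table': fields['table'][0],   # HTML table with user markers
    }

data = parse_post('feedurl=http%3A%2F%2Fexample.com'
                  '&table=%3Ctable%3E...%3C%2Ftable%3E')
\end{verbatim}
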
The line processing converts the raw HTML data of a table row to a plain
string. The string is stripped of all HTML tags and is accompanied by a list
of marker items.

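A sketch of this step follows. It assumes, purely for illustration, that a
user marker appears in the text as \texttt{\{name\}}; the real marker syntax
may differ.

\begin{verbatim}
import re
from html.parser import HTMLParser

class RowStripper(HTMLParser):
    # Collects only the text content of a row, dropping the tags.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def process_row(row_html):
    stripper = RowStripper()
    stripper.feed(row_html)
    line = ''.join(stripper.chunks)
    # Collect the marker items that occur in the stripped string.
    markers = re.findall(r'\{(\w+)\}', line)
    return line, markers

line, markers = process_row(
    '<tr><td>10:00 {title} at {location}</td></tr>')
\end{verbatim}
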
Entries that do not contain any markers are left out of the next processing
step. All data, including the entries without user markers, is nevertheless
stored in the object for possible later reference, for example when editing
the patterns.

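One way to realise this, sketched below under the assumption that entries
are the \texttt{(line, markers)} pairs from the previous step, is an object
that keeps the full list while exposing only the marked subset. The class
and method names are illustrative.

\begin{verbatim}
class CrawlerData:
    # Keeps every parsed entry, so that entries without markers
    # remain available later, e.g. when the patterns are edited.
    def __init__(self, entries):
        self.entries = entries  # all (line, markers) pairs

    def marked(self):
        # Only entries with at least one marker continue to the
        # node-list step.
        return [e for e in self.entries if e[1]]
\end{verbatim}
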
In the last step the entries with markers are processed to build node-lists.
Node-lists are essentially strings in which the user markers are replaced by
patterns, so that the variable data, the isolated data, does not appear in
the node-lists.

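The sketch below illustrates this replacement. Representing a node-list as a
sequence of \texttt{(kind, value)} pairs is an assumption of the example;
only the principle, replacing marker occurrences by generic pattern nodes,
is taken from the text.

\begin{verbatim}
import re

def to_node_list(line):
    # Split the line into alternating literal text and marker
    # parts, then replace every marker by a generic pattern node
    # so the variable data itself is not stored.
    parts = re.split(r'(\{\w+\})', line)
    return [('pattern', p[1:-1]) if p.startswith('{')
            else ('text', p)
            for p in parts if p]

print(to_node_list('10:00 {title} at {location}'))
# [('text', '10:00 '), ('pattern', 'title'),
#  ('text', ' at '), ('pattern', 'location')]
\end{verbatim}
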
\subsection{Directed acyclic graphs}