\section{Application overview and workflow}
The program consists of two main components: the \textit{Crawler
application} and the \textit{Input application}. The two components are
strictly separated by task. The crawler is dedicated to the sole task of
periodically and asynchronously crawling the sources. The input application
is a web interface to a set of tools for creating, editing, removing and
testing crawlers through simple point-and-click user interfaces that can be
operated by someone without a computer science background.
\section{Input application}
\subsection{Components}
The input application provides the following components:
\begin{itemize}
	\item Add a new crawler
	\item Edit or remove existing crawlers
	\item Test a crawler
	\item Generate XML
\end{itemize}
\section{Crawler application}
\subsection{Interface}

\subsection{Algorithm}

\subsection{Preprocessing}
The crawler receives its data embedded as POST data in an HTTP request. The
POST data consists of several fields with information about the feed and a
container holding the table in which the user markers are embedded. The
entries are then extracted from this table and processed line by line.
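A minimal sketch of this step, in Python and with hypothetical field names
(\texttt{name}, \texttt{url} and \texttt{content}); the actual field layout
of the application may differ:

\begin{verbatim}
import re

# Sketch of the preprocessing entry point. The POST field names
# ("name", "url", "content") are assumptions for illustration.
def preprocess(post_data):
    feed_info = {key: post_data[key] for key in ("name", "url")}
    table_html = post_data["content"]  # container with the user markers
    # Extract the table rows; every row is one entry to process.
    rows = re.findall(r"<tr[^>]*>.*?</tr>", table_html, flags=re.S)
    return feed_info, rows
\end{verbatim}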

The line processing converts the raw string of HTML data from a table row
into a plain string. The string is stripped of all HTML tags and is
accompanied by a list of marker items. Entries that do not contain any
markers are left out of the next processing step. All data, including the
entries without user markers, is nevertheless stored in the object for
possible later reference, for example when editing the patterns.
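A sketch of the per-row processing; how the user markers are encoded in the
HTML is not specified here, so the marker markup below is an assumption:

\begin{verbatim}
import re

# Sketch of the per-row processing. The marker markup
# (<span class="marker-...">) is assumed for illustration.
def process_row(row_html):
    # Collect the marker categories present in this row.
    markers = re.findall(r'<span class="marker-(\w+)">', row_html)
    # Strip all HTML tags to obtain the plain entry string.
    text = re.sub(r"<[^>]+>", "", row_html).strip()
    return text, markers
\end{verbatim}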
38
39 The last step is when the entries with markers are then processed to build
40 node-lists. Node-lists are basically lists of words that, when concatenated,
41 form the original entry. A word isn't a word in the linguistic sense. A word
42 can be one letter or a category. The node-list is generated by putting all the
43 separate characters one by one in the list and when a user marking is
44 encountered, this marking is translated to the category code and that code is
45 then added as a word. The nodelists are then sent to the actual algorithm to be
46 converted to a graph representation.
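The following sketch illustrates this construction; the input format (marked
regions as \texttt{(start, end, category)} tuples) and the category name are
assumptions made for the example:

\begin{verbatim}
# Sketch of node-list construction: individual characters become
# words, and a marked region collapses into its category code.
def to_nodelist(text, marked_spans):
    # marked_spans: list of (start, end, category) tuples
    nodelist, i = [], 0
    for start, end, category in sorted(marked_spans):
        nodelist.extend(text[i:start])  # one word per character
        nodelist.append(category)       # whole marking as one word
        i = end
    nodelist.extend(text[i:])
    return nodelist

# Example: "20:00 Concert" with "20:00" marked as a time category:
# to_nodelist("20:00 Concert", [(0, 5, "<time>")])
# -> ["<time>", " ", "C", "o", "n", "c", "e", "r", "t"]
\end{verbatim}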

\subsection{Directed acyclic graphs (DAG)}
Directed acyclic graphs are a special kind of graph used to store large sets
of words. Looking up a word has a complexity of $\mathcal{O}(L)$, where $L$
is the length of the word being searched. Figure~\ref{fig:f21} shows a DAG
that accepts the words \textit{aa} and \textit{ab}. Final states are marked
with a double circle and intermediate states with a single circle.
\begin{figure}[H]
\caption{Sample DAG}
\label{fig:f21}
\centering
\digraph[]{graph2}{
rankdir=LR;
node [shape="circle"]; 1; 2;
node [shape="doublecircle"]; 3; 4;
1 -> 2 [label="a"];
2 -> 3 [label="a"];
2 -> 4 [label="b"];
}
\end{figure}
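To illustrate the $\mathcal{O}(L)$ lookup, the DAG of Figure~\ref{fig:f21}
can be encoded as a transition table; the following sketch (in Python,
chosen purely for illustration) checks membership by following one
transition per character:

\begin{verbatim}
# The sample DAG as a transition table: state -> {word: state}.
transitions = {1: {"a": 2}, 2: {"a": 3, "b": 4}}
final_states = {3, 4}

def accepts(word):
    # Follow one transition per character: O(L) in the word length.
    state = 1
    for char in word:
        state = transitions.get(state, {}).get(char)
        if state is None:
            return False
    return state in final_states

assert accepts("aa") and accepts("ab") and not accepts("ba")
\end{verbatim}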

Using the algorithm described by Hopcroft~\cite{Hopcroft1971}, the
node-lists are converted into minimal directed acyclic graphs with a low
complexity ($\mathcal{O}(N\log N)$, where $N$ is the number of states).
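Hopcroft's partition-refinement algorithm itself is more involved than can
be shown here; as an illustrative stand-in (in Python, with assumed data
structures, and explicitly not the algorithm the application uses), the
following sketch builds a trie from the node-lists and merges identical
suffixes, which produces the same minimal DAG for acyclic input:

\begin{verbatim}
# Build a trie of the node-lists, then merge states whose
# "signature" (finality plus outgoing transitions) is identical.
# For acyclic automata this suffix merging yields the minimal DAG;
# it is a simple stand-in for Hopcroft's algorithm, not that
# algorithm itself.
def build_trie(nodelists):
    trie, finals, next_id = {0: {}}, set(), 1
    for words in nodelists:
        state = 0
        for word in words:
            if word not in trie[state]:
                trie[state][word] = next_id
                trie[next_id] = {}
                next_id += 1
            state = trie[state][word]
        finals.add(state)
    return trie, finals

def minimize(trie, finals, root=0):
    registry = {}  # signature -> canonical state id
    merged = {}    # canonical state id -> {word: canonical child}

    def canonical(state):
        children = tuple(sorted(
            (word, canonical(child))
            for word, child in trie[state].items()))
        signature = (state in finals, children)
        if signature not in registry:
            registry[signature] = state
            merged[state] = dict(children)
        return registry[signature]

    start = canonical(root)
    return merged, {s for s in merged if s in finals}, start

# Example: the node-lists ["a", "a"] and ["a", "b"] yield a DAG in
# which the two final states of Figure 2 are merged into one.
trie, finals = build_trie([["a", "a"], ["a", "b"]])
dag, dag_finals, start = minimize(trie, finals)
\end{verbatim}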