\section{Internals of the crawler generation module}
The data marked by the user is in principle just raw HTML containing a table
with one table row per RSS feed entry. Within the HTML several markers are
placed to simplify the parsing and to remove the need for a full-fledged HTML
parser. When the user presses submit, an HTTP POST request is prepared that
sends all the gathered data to the backend to get processed. The transmission
does not happen asynchronously, so the user has to wait for the data to be
processed. Because of the synchronous transmission the user is notified
immediately when the crawler is successfully added. The data preparation that
enables the backend to process it is all done on the client side using
Javascript.

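A minimal sketch of what the resulting payload could look like; the field
names are assumptions for illustration only and are not prescribed by this
section:
\begin{verbatim}
# Hypothetical shape of the POST payload assembled by the Javascript
# frontend; all field names are assumptions for illustration.
payload = {
    'name': 'Example venue crawler',      # user-supplied information fields
    'feedurl': 'http://example.com/rss',
    'rawdata': '<marked html of the feed entries>',
}
\end{verbatim}
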
When the backend receives the HTTP POST data it saves all the raw data and
the data from the information fields. The raw data is processed line by line
to extract the entries that contain user markings. These entries are
stripped of all HTML while a data structure is created that contains the
locations of the markings and the original text. All data is stored,
including the entries without user data, but not all data is processed. This
at first sight useless data is stored because it can be used later to update
the crawler when the crawler needs editing.

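A minimal sketch of such a per-entry structure, with hypothetical field
names, using the example entry that appears later in this section:
\begin{verbatim}
# Hypothetical shape of one processed entry: the stripped original text
# plus the position and category of every user marking.
entry = {
    'text': '19:00, 2014-11-12 - Foobar',
    'markings': [                     # [start, end) offsets into 'text'
        {'category': 'time',  'start': 0,  'end': 5},
        {'category': 'date',  'start': 7,  'end': 17},
        {'category': 'title', 'start': 20, 'end': 26},
    ],
}
\end{verbatim}
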
When the entries are isolated and processed they are converted to node-lists.
A node-list is a literal list of words, where a word is interpreted in the
broadest sense: a word can be a character, a single byte, a string or
basically anything. These node-lists are in this case the original entry
string in which all the user markings are replaced by single nodes. As an
example, take the following entry and its corresponding node-list, assuming
that the user marked the time, title and date correctly. Markers are
visualized by enclosing the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build these
node-lists. A node-list is constructed such that concatenating all its words
reproduces the original entry; a word is not a word in the linguistic sense
and can be a single character or a whole category code. The node-list is
generated by putting the separate characters one by one into the list; when
a user marking is encountered, the marking is translated to its category
code and that code is added as a single word. The node-lists are then sent
to the actual algorithm to be converted to a graph representation.

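A minimal sketch of this conversion, assuming the hypothetical entry
structure from the earlier sketch:
\begin{verbatim}
def to_nodelist(entry):
    # Walk the entry text; every marked region collapses into a single
    # category node, every unmarked character becomes its own node.
    nodes, pos = [], 0
    marks = {m['start']: m for m in entry['markings']}
    while pos < len(entry['text']):
        if pos in marks:
            mark = marks[pos]
            nodes.append('<%s>' % mark['category'])  # one node per marking
            pos = mark['end']
        else:
            nodes.append(entry['text'][pos])         # one node per character
            pos += 1
    return nodes

# With the entry from the previous sketch this yields:
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
\end{verbatim}
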
\section{Minimizing DAWGs}
The first algorithm to generate DAGs was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
later extended the algorithm and created an incremental version without
increasing the computational complexity. The non-incremental algorithm from
Daciuk et al.\ is used to convert the node-lists to a graph.

For example, constructing a graph from the entries \textit{a.bc} and
\textit{a,bc} goes in the following steps:

\begin{figure}[H]
\caption{Sample DAG, first entry}
\label{fig:f22}
\centering
\digraph[]{graph22}{
rankdir=LR;
1,2,3,4 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];
}
\end{figure}

\begin{figure}[H]
\caption{Sample DAG, second entry}
\label{fig:f23}
\centering
\digraph[]{graph23}{
rankdir=LR;
1,2,3,4,6 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];

2 -> 6 [label=","];
6 -> 4 [label="b"];
}
\end{figure}

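A compact sketch of such a construction: it first builds a trie from the
node-lists and then merges equivalent states bottom-up into a minimal DAG.
This two-phase, non-incremental approach only approximates the behaviour of
the algorithm of Daciuk et al.\ and is not the actual implementation of the
module:
\begin{verbatim}
class Node:
    def __init__(self):
        self.final = False
        self.edges = {}   # word -> Node

    def signature(self):
        # Two nodes are equivalent when they agree on finality and have
        # identical outgoing transitions (same words to the same nodes).
        return (self.final,
                tuple(sorted((w, id(n)) for w, n in self.edges.items())))

def add_word_list(root, words):
    # Insert one node-list into the trie.
    node = root
    for word in words:
        node = node.edges.setdefault(word, Node())
    node.final = True

def minimize(node, registry):
    # Replace every child by its canonical representative, then register
    # this node itself; equivalent subtrees collapse into one node.
    for word, child in node.edges.items():
        node.edges[word] = minimize(child, registry)
    return registry.setdefault(node.signature(), node)

root = Node()
for nodelist in (['a', '.', 'b', 'c'], ['a', ',', 'b', 'c']):
    add_word_list(root, nodelist)
root = minimize(root, {})
\end{verbatim}
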
%\subsection{Process}
%Proposal was written
%
%
%First html/mail/fax/rss, worst case rss
%
%
%After some research and determining the scope of the project we decided only to
%do RSS, this because RSS tends to force structure in the data because RSS feeds
%are often generated by the website and thus reliable and consistent. We found a
%couple of good RSS feeds.
%
%
%At first the general framework was designed and implemented, no method yet.
%
%
%Started with method for recognizing separators.
%
%
%Found research paper about algorithm that can create directed acyclic graphs
%from string, although it was designed to compress word lists it can be
%(mis)used to extract information.
%
%
%Implementation of DAG algorithm found and tied to the program.
%
%
%Command line program ready. Conversation with both supervisors, gui had to be
%made.
%
%Step by step gui created. Web interface as a control center for the crawlers.
%
%
%Gui optimized.
%
%
%Concluded that the program doesn't reach wide audience due to lack of well
%structured rss feeds.