thesis2/3.methods.tex

   1 \section{Internals of the crawler generation module}
   2 Data marked by the user is in principle just raw html data that contains a
   3 table with for every RSS feed entry a table row. Within the html the code
   4 severel markers are placed to make the parsing more easy and removing the need
   5 of an advanced HTML parser. When the user presses submit a http POST request is
   6 prepared to send all the gathered data to the backend to get processed. The
   7 sending does not happen asynchronously, so the user has to wait for the data to
   8 be processed. Because of the sychronous data transmission the user is notified
   9 immediatly when the crawler is succesfully added. The data preparation that
  10 makes the backend able to process it is all done on the user side using
  11 Javascript.
  12
  13 When the backend receives the http POST data it saves all the raw data and the
  14 data from the information fields. The raw data is processed line by line to
  15 extract the entries that contain user markings. The entries containing user
  16 markings are stripped from all html while creating an data structure that
  17 contains the locations of the markings and the original text. All data is
  18 stored, for example also the entries without user data, but not all data is
  19 processed. The storage of, at first sight, useless data is done because later
  20 when the crawlers needs editing the old data can be used to update the crawler.
  21
  22 When the entries are isolated and processed they are converted to node-lists. A
  23 node-list is a literal list of words where a word is interpreted in the
  24 broadest sense, meaning that a word can be a character, a single byte, a string
  25 or basically anything. These node-lists are in this case the original entry
  26 string where all the user markers are replaced by single nodes. As an example
  27 lets take the following entry and its corresponding node-list assuming that the
  28 user marked the time, title and date correctly. Markers are visualized by
  29 enclosing the name of the marker in angle brackets.
  30 \begin{flushleft}
  31         Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
  32         Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
  33 \end{flushleft}
  34
  35 The last step is when the entries with markers are then processed to build
  36 node-lists. Node-lists are basically lists of words that, when concatenated,
  37 form the original entry. A word isn't a word in the linguistic sense. A word
  38 can be one letter or a category. The node-list is generated by putting all the
  39 separate characters one by one in the list and when a user marking is
  40 encountered, this marking is translated to the category code and that code is
  41 then added as a word. The nodelists are then sent to the actual algorithm to be
  42 converted to a graph representation.
  43
  44 \section{Minimizing DAWGs}
  45 \subsection{Datastructure}
  46 We represent the user generated patterns as DAWGs by converting the node-lists
  47 to DAWGS. Normally DAWGs have as edgelabels letters from an alphabet, in our
  48 case the DAWGs alphabet contains all letters, whitespace and punctuation but
  49 also the specified user markers which can be multiple characters in length.
  50
  51 DAWGs are graphs but due to the constraints we can check if a match occurs by
  52 checking if a path exists that creates the word by concatenating all the edge
  53 labels. The first algorithm to
  54 generate DAWGs from words was proposed by Hopcroft et
  55 al\cite{Hopcroft1971}. It is an incremental approach in generating the graph.
  56 Meaning that entry by entry the graph will be expanded. Hopcrofts algorithm has
  57 the constraint of lexicographical ordering. Later on Daciuk et
  58 al.\cite{Daciuk2000} improved on the original algorithm and their algorithm is
  59 the algorithm we used to minimize our DAWGs. A minimal graph is a graph $G$ for
  60 which there is no graph $G'$ that has less paths and $\mathcal{L}(G)=\mathcal{L
  61 }(G')$ where $\mathcal{L}(G)$ is the set of all words present in the DAWG.
  62
  63 \subsection{Algorithm}
  64 The algorithm of building DAWGs is an iterative process that goes roughly in
  65 three steps. We start with the null graph that can be described by
  66 $G_0=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any edges, one node and
  67 $\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph will be
  68 added in a naive way. We just create a new node for every transition of
  69 character and we mark the last node as final. From then on all words are added
  70 using a stepwise approach.
  71 \begin{itemize}
  72         \item
  73
  74
  75
  76 Incrementally node-lists are added to create a graph. For example in
  77 {Subgraphs in Figure}~\ref{dawg1} visualizes the construction of the
  78 DAWG from the entries: \texttt{aibc}, \texttt{aibc} and \texttt{ajbd}.
  79
  80 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
  81
  82 In SG0 the graph is only the null graph, described by
  83 $G=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any entries.
  84
  85 Adding the first entry \texttt{a,bc} is not a
  86 hard task because there is no common prefix nor suffix so the entry becomes a
  87 strain of nodes starting from $q_0$ and is visualized in subgraph SG1.
  88
  89 For the second entry \texttt{a.bc} some smart optimization has to be applied.
  90 The common prefix is \texttt{a} and the common suffix is \texttt{bc}. Therefore
  91 the first node in which the common prefix starts needs to be copied and split
  92 off to make room for an alternative route. This is exactly what happens in SG2.
  93 Node $q_2$ is copied and gets a connection to node $q_3$. From the last node of
  94 the common prefix a route is built towards the first node, that is the copied
  95 node, of the common suffix resulting in an extra path $q_1\rightarrow
  96 q_5\rightarrow q_3$. This results in subgraph SG2.
  97
  98 Adding the thing node \texttt{a.bd} in the same naive way as the second node
  99 results in subgraph SG3 and introduces a bad side effect. Namely that a new
 100 word, that was not specifically added was introduced, is added to the DAWG.
 101 This is because within the common prefix there is a node that has some other
 102 arrow leading to it. When a new path is added the algorithm checks the common
 103 prefix and suffix for confluence nodes. Confluence nodes are nodes that have
 104 multiple arrows leading in and because of the multiple arrows they can lead to
 105 unwanted additional words. The first confluence node found must be copied and
 106 detached and the specific suffix from the entry must be copied with it to
 107 separate the path from the existing paths. Applying this technique in the
 108 example leads to subgraph SG4. With common prefix \texttt{a.b} and an empty
 109 common suffix node $q_3$ is found as a confluence node in the common prefix and
 110 therefore node $q_3$ is copied to the new node $q_6$ taking the paths leading
 111 to the final state with it. In this way no new words are added to the DAWG and
 112 the DAWG is still optimal.
 113
 114 \begin{figure}[H]
 115         \caption{Step 1}
 116         \label{dawg1}
 117         \centering
 118         \digraph[width=\linewidth]{inccons}{
 119                 rankdir=LR;
 120                 n4 [style=invis];
 121                 q40 [label="q0"];
 122                 q41 [label="q1"];
 123                 q42 [label="q2"];
 124                 q43 [label="q3"];
 125                 q44 [label="q4" shape=doublecircle];
 126                 q45 [label="q5"];
 127                 q46 [label="q6"];
 128                 n4 -> q40[label="SG4"];
 129                 q40 -> q41[label="a"];
 130                 q41 -> q42[label="i"];
 131                 q42 -> q43[label="b"];
 132                 q43 -> q44[label="c"];
 133                 q41 -> q45[label="j"];
 134                 q45 -> q46[label="b"];
 135                 q46 -> q44[label="d"];
 136                 q46 -> q44[label="c"];
 137
 138                 n3 [style=invis];
 139                 q30 [label="q0"];
 140                 q31 [label="q1"];
 141                 q32 [label="q2"];
 142                 q33 [label="q3"];
 143                 q34 [label="q4"ishape=doublecircle];
 144                 q35 [label="q5"];
 145                 n3 -> q30[label="SG3"];
 146                 q30 -> q31[label="a"];
 147                 q31 -> q32[label="i"];
 148                 q32 -> q33[label="b"];
 149                 q33 -> q34[label="c"];
 150                 q33 -> q34[label="d"iconstraint=false];
 151                 q31 -> q35[label="j"];
 152                 q35 -> q33[label="b"];
 153
 154                 n2 [style=invis];
 155                 q20 [label="q0"];
 156                 q21 [label="q1"];
 157                 q22 [label="q2"];
 158                 q23 [label="q3"];
 159                 q24 [label="q4"ishape=doublecircle];
 160                 q25 [label="q5"];
 161                 n2 -> q20[label="SG2"];
 162                 q20 -> q21[label="a"];
 163                 q21 -> q22[label="i"];
 164                 q22 -> q23[label="b"];
 165                 q23 -> q24[label="c"];
 166                 q21 -> q25[label="j"];
 167                 q25 -> q23[label="b"];
 168
 169                 n1 [style=invis];
 170                 q10 [label="q0"];
 171                 q11 [label="q1"];
 172                 q12 [label="q2"];
 173                 q13 [label="q3"];
 174                 q14 [label="q4"ishape=doublecircle];
 175                 n1 -> q10[label="SG1"];
 176                 q10 -> q11[label="a"];
 177                 q11 -> q12[label="i"];
 178                 q12 -> q13[label="b"];
 179                 q13 -> q14[label="c"];
 180
 181                 n [style=invis];
 182                 q0 [label="q0"];
 183                 n -> q0[label="SG0"];
 184         }
 185 \end{figure}