\section{Internals of the crawler generation module}
The data marked by the user is in essence just raw HTML that contains a table
with one table row per RSS feed entry. Within the HTML several markers are
placed to make the parsing easier and to remove the need for an advanced HTML
parser. When the user presses submit, an HTTP POST request is prepared that
sends all the gathered data to the backend to be processed. The request is
not sent asynchronously, so the user has to wait for the data to be
processed. Because of the synchronous transmission the user is notified
immediately when the crawler has been added successfully. The preparation of
the data that enables the backend to process it is all done on the client
side using Javascript.

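As an indication, the POST body could look roughly like the sketch below; the
field names are hypothetical and not taken from the actual application.
\begin{verbatim}
import urllib.parse

# Hypothetical field names; the real frontend may use different ones.
post_body = urllib.parse.urlencode({
    'name':    'Example venue feed',      # information fields
    'url':     'http://example.org/feed.rss',
    'rawdata': '<tr><td>...marked html of one entry...</td></tr>',
})
\end{verbatim}
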
When the backend receives the HTTP POST data it saves all the raw data and
the data from the information fields. The raw data is processed line by line
to extract the entries that contain user markings. The entries containing
user markings are stripped of all HTML while a data structure is created that
contains the locations of the markings and the original text. All data is
stored, including the entries without user markings, but not all data is
processed. This at first sight useless data is stored because, when the
crawler later needs editing, the old data can be used to update the crawler.

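The following Python sketch illustrates this step under some assumptions: the
marker syntax (\texttt{\{\%name\%\}} pseudo-tags) and the function name are
invented here for illustration, since the exact format the frontend uses is
not essential to the idea.
\begin{verbatim}
import re

# Assumed marker syntax: the frontend wraps each marked field in
# {%name%}...{%name%} pseudo-tags; the real format may differ.
MARKER  = re.compile(r'\{%(\w+)%\}(.*?)\{%\1%\}')
HTMLTAG = re.compile(r'<[^>]+>')   # crude tag stripper, no full parser

def process_entry(raw):
    """Strip the HTML from one entry and record the marking locations."""
    text = HTMLTAG.sub('', raw)
    clean, markings, pos, out = [], [], 0, 0
    for m in MARKER.finditer(text):
        clean.append(text[pos:m.start()])
        out += m.start() - pos
        value = m.group(2)
        markings.append((m.group(1), out, out + len(value)))
        clean.append(value)
        out += len(value)
        pos = m.end()
    clean.append(text[pos:])
    return ''.join(clean), markings

entry = ('<td>{%time%}19:00{%time%}, {%date%}2014-11-12{%date%}'
         ' - {%title%}Foobar{%title%}</td>')
print(process_entry(entry))
# ('19:00, 2014-11-12 - Foobar',
#  [('time', 0, 5), ('date', 7, 17), ('title', 20, 26)])
\end{verbatim}
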
When the entries have been isolated and processed they are converted to
node-lists. A node-list is a literal list of words, where a word is
interpreted in the broadest sense, meaning that a word can be a character, a
single byte, a string or basically anything. In this case the node-lists are
the original entry string in which all user markers are replaced by single
nodes. As an example, take the following entry and its corresponding
node-list, assuming that the user marked the time, date and title correctly.
Markers are visualized by enclosing the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build these
node-lists. The node-list is generated by putting the separate characters of
the entry one by one into the list, so that the words, when concatenated,
form the original entry; when a user marking is encountered, the marking is
translated to its category code and that code is added as a single word. The
node-lists are then sent to the actual algorithm to be converted to a graph
representation.

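A minimal sketch of this conversion, reusing the \texttt{(name, start, end)}
markings produced above; the \texttt{<name>} nodes stand in for the real
category codes, which may differ.
\begin{verbatim}
def to_nodelist(text, markings):
    """Turn a cleaned entry plus its markings into a node-list:
    one node per plain character, one '<name>' node per marking."""
    nodes, i = [], 0
    for name, start, end in sorted(markings, key=lambda m: m[1]):
        nodes.extend(text[i:start])   # single-character words
        nodes.append('<%s>' % name)   # the category code as one word
        i = end
    nodes.extend(text[i:])
    return nodes

print(to_nodelist('19:00, 2014-11-12 - Foobar',
                  [('time', 0, 5), ('date', 7, 17), ('title', 20, 26)]))
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
\end{verbatim}
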
\section{Minimizing DAWGs}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally the edge labels of a DAWG are letters from an
alphabet; in our case the alphabet contains all letters, whitespace and
punctuation, but also the specified user markers. A DAWG is a graph, but by
using it as an automaton we can check whether a word is accepted by the
automaton, or in other words, whether the word matches the specified pattern.
The first algorithm to generate such DAWGs from word lists was proposed by
Hopcroft et al.~\cite{Hopcroft1971}. It is an incremental approach to
generating the graph, meaning that the graph is built entry by entry. The
only constraint the algorithm has is that the entries must be sorted
lexicographically. Later on Daciuk et al.~\cite{Daciuk2000} improved on the
original algorithm, and their algorithm is the one we used to minimize our
DAWGs.

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

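For illustration only, the following Python sketch implements an incremental
construction of this kind for lexicographically sorted input; the class and
method names are our own invention, and the pseudocode in
Listing~\ref{pseudodawg} remains the authoritative description of the
algorithm we used. A register of node signatures ensures that equivalent
suffixes are shared.
\begin{verbatim}
class DawgNode(object):
    next_id = 0
    def __init__(self):
        self.id = DawgNode.next_id
        DawgNode.next_id += 1
        self.final = False
        self.edges = {}
    def signature(self):
        # Identifies a node by its finality and outgoing edges; valid
        # because children are minimized before their parents.
        return (self.final,
                tuple(sorted((label, node.id)
                             for label, node in self.edges.items())))

class Dawg(object):
    def __init__(self):
        self.previous = ''
        self.root = DawgNode()
        self.unchecked = []   # (parent, label, child) not yet minimized
        self.register = {}    # signature -> canonical node

    def insert(self, word):
        assert word >= self.previous, 'entries must be sorted'
        # Reuse the common prefix with the previously inserted entry.
        common = 0
        while (common < len(word) and common < len(self.previous)
               and word[common] == self.previous[common]):
            common += 1
        self._minimize(common)
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for symbol in word[common:]:
            child = DawgNode()
            node.edges[symbol] = child
            self.unchecked.append((node, symbol, child))
            node = child
        node.final = True
        self.previous = word

    def finish(self):
        self._minimize(0)

    def _minimize(self, down_to):
        # Replace unchecked nodes by registered equivalents, deepest first.
        while len(self.unchecked) > down_to:
            parent, symbol, child = self.unchecked.pop()
            key = child.signature()
            if key in self.register:
                parent.edges[symbol] = self.register[key]
            else:
                self.register[key] = child

    def accepts(self, word):
        node = self.root
        for symbol in word:
            if symbol not in node.edges:
                return False
            node = node.edges[symbol]
        return node.final

dawg = Dawg()
for entry in ['a,bc', 'a.bc', 'a.bd']:   # lexicographically sorted
    dawg.insert(entry)
dawg.finish()
print(dawg.accepts('a.bd'), dawg.accepts('a,bd'))   # True False
\end{verbatim}
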
Node-lists are added incrementally to create a graph. For example, the
subgraphs in Figure~\ref{dawg1} visualize the construction of the DAWG from
the entries \texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd}.

In SG0 the graph is just the null graph, described by
$G=(\{q_0\},\{q_0\},\{\},\{\})$, and does not contain any entries.

Adding the first entry \texttt{a,bc} is not a hard task because there is no
common prefix nor suffix, so the entry becomes a chain of nodes starting from
$q_0$, as visualized in subgraph SG1.

For the second entry, \texttt{a.bc}, some optimization has to be applied. The
common prefix is \texttt{a} and the common suffix is \texttt{bc}. Therefore
the node at which the common suffix starts needs to be copied and split off
to make room for an alternative route. This is exactly what happens in SG2:
node $q_2$ is copied into $q_5$, which gets a connection to node $q_3$. From
the last node of the common prefix a route is built through the copied node
towards the common suffix, resulting in the extra path
$q_1\rightarrow q_5\rightarrow q_3$.

Adding the third entry, \texttt{a.bd}, in the same naive way as the second
entry results in subgraph SG3 and introduces a bad side effect: a new word
that was never explicitly added, \texttt{a,bd}, is now also accepted by the
DAWG. This happens because within the common prefix there is a node that has
another arrow leading into it. When a new path is added, the algorithm
therefore checks the common prefix and suffix for confluence nodes.
Confluence nodes are nodes with multiple incoming arrows, and because of
those multiple arrows they can give rise to unwanted additional words. The
first confluence node found must be copied and detached, and the specific
suffix of the entry must be copied with it to separate the new path from the
existing paths. Applying this technique in the example leads to subgraph SG4.
With common prefix \texttt{a.b} and an empty common suffix, node $q_3$ is
found as a confluence node in the common prefix; therefore $q_3$ is copied to
the new node $q_6$, taking the paths leading to the final state with it. In
this way no new words are added to the DAWG and the DAWG is still optimal.

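The side effect can be made concrete by transcribing the transition tables of
SG3 and SG4 from Figure~\ref{dawg1} and querying them; the sketch below
(state 4 is the accepting state) is for illustration only.
\begin{verbatim}
# Transition tables read off from subgraphs SG3 and SG4 in the figure.
SG3 = {0: {'a': 1}, 1: {',': 2, '.': 5}, 2: {'b': 3},
       3: {'c': 4, 'd': 4}, 5: {'b': 3}}
SG4 = {0: {'a': 1}, 1: {',': 2, '.': 5}, 2: {'b': 3},
       3: {'c': 4}, 5: {'b': 6}, 6: {'c': 4, 'd': 4}}

def accepts(delta, word, final=(4,)):
    state = 0
    for symbol in word:
        if symbol not in delta.get(state, {}):
            return False
        state = delta[state][symbol]
    return state in final

for word in ['a,bc', 'a.bc', 'a.bd', 'a,bd']:
    print(word, accepts(SG3, word), accepts(SG4, word))
# a,bc True True
# a.bc True True
# a.bd True True
# a,bd True False   <- the unwanted word, only accepted by SG3
\end{verbatim}
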
\begin{figure}[H]
\caption{Incremental construction of the DAWG from the entries
\texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd} (subgraphs SG0--SG4)}
\label{dawg1}
\centering
\digraph[width=\linewidth]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4",shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label=","];
q42 -> q43[label="b"];
q43 -> q44[label="c"];
q41 -> q45[label="."];
q45 -> q46[label="b"];
q46 -> q44[label="d"];
q46 -> q44[label="c"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4",shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label=","];
q32 -> q33[label="b"];
q33 -> q34[label="c"];
q33 -> q34[label="d",constraint=false];
q31 -> q35[label="."];
q35 -> q33[label="b"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4",shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label=","];
q22 -> q23[label="b"];
q23 -> q24[label="c"];
q21 -> q25[label="."];
q25 -> q23[label="b"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4",shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label=","];
q12 -> q13[label="b"];
q13 -> q14[label="c"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}

%\subsection{Process}
%Proposal was written
%
%
%First html/mail/fax/rss, worst case rss
%
%
%After some research and determining the scope of the project we decided only to
%do RSS, this because RSS tends to force structure in the data because RSS feeds
%are often generated by the website and thus reliable and consistent. We found a
%couple of good RSS feeds.
%
%
%At first the general framework was designed and implemented, no method yet.
%
%
%Started with method for recognizing separators.
%
%
%Found research paper about algorithm that can create directed acyclic graphs
%from string, although it was designed to compress word lists it can be
%(mis)used to extract information.
%
%
%Implementation of DAG algorithm found and tied to the program.
%
%
%Command line program ready. Conversation with both supervisors, gui had to be
%made.

%Step by step gui created. Web interface as a control center for the crawlers.
%
%
%Gui optimized.
%
%
%Concluded that the program doesn't reach wide audience due to lack of well
%structured rss feeds.