thesis2/3.methods.tex

   1 \section{Application overview}
   2 The backend consists of several processing steps before a crawler specification
   3 is ready.
   4
   5 \begin{figure}[H]
   6         \label{dawg1}
   7         \centering
   8         \includegraphics[width=\linewidth]{backend.eps}
   9         \strut\\
  10         \caption{Main module internals}
  11 \end{figure}
  12
  13 \section{HTML data}
  14 Data marked by the user will be returned as a \texttt{POST} http request and is
  15 basically the information in the description fields accompanied by the raw html
  16 of the table with the patterns. For every RSS feed entry there is a row in the
  17 table and every marker is a \texttt{SPAN} element containing the color of the
  18 text. Within the HTML code of the table markers are placed to make parsing more
  19 easy and remove the need of an advanced HTML parser. The \texttt{POST} request
  20 that will be send when the user presses the submit button will not be sent
  21 asynchronously, the user has to wait for the processing to finish. Because of
  22 this the user will be immediately notified when the processing has finished
  23 successfully. The preparation of the data for the \texttt{POST} request is all
  24 done in Javascript on the user side, the processing of the data in the backend
  25 is all done server side.
  26
  27 \section{Table rows}
  28 When the backend receives the HTTP POST data it saves all the table HTML data and the
  29 data from the description fields. The program first extracts the table rows
  30 using simple pattern matching on the added markers. The table rows are
  31 processed one at the time, the processing mainly consists of finding the
  32 \texttt{SPAN} elements, extract the color, remove the elements and save the
  33 plain text and the location and type of the marking. When the table rows are
  34 processed a datastructure with all the entries accompanied by their markers
  35 will go to the next step. All intermediate data will be saved for later use.
  36
  37 \section{Node lists}
  38 Every entry from the datastructure is processed to convert them to node-lists.
  39 A node-list can be seen as a path graph of every character and marking. A path
  40 graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, n_i\}$
  41 and $E=\{(n_1, n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. A path graph is
  42 basically a graph that is a path of nodes where every node is connected to the
  43 next on and the last on being final. The transitions between two nodes is
  44 either a character or a marking. As an example we take the entry
  45 \texttt{19:00, 2014-11-12 - Foobar} and create the corresponding node-lists and
  46 make it visible in Figure~\ref{nodelistexample}. Characters are denoted with
  47 single quotes, spaces with an underscore and markers with angle brackets.
  48 \begin{figure}[H]
  49         \label{nodelistexample}
  50         \centering
  51         \includegraphics[width=\linewidth]{nodelistexample.eps}
  52         \caption{Node list example}
  53 \end{figure}
  54
  55
  56 \section{DAWGs}
  57 \subsection{Terminology}
  58 \textbf{Parent nodes} are nodes that have an arrow to the child.\\
  59 \textbf{Confluence nodes} are nodes that have multiple parents.
  60
  61 \subsection{Datastructure}
  62 We represent the user generated patterns as DAWGs by converting the node-lists
  63 to DAWGS. Normally DAWGs have single letters from an alphabet as edgelabel but
  64 in our implementation the DAWGs alphabet contains all letters, whitespace and
  65 punctuation but also the specified user markers which can be multiple
  66 characters in length.
  67
  68 DAWGs are graphs but due to the constraints we can use the DAWG to check if a
  69 match occurs by checking if a path exists that creates the word by
  70 concatenating all the edge labels. The first algorithm to generate DAWGs from
  71 words was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an incremental
  72 approach in generating the graph.  Meaning that entry by entry the graph will
  73 be expanded. Hopcrofts algorithm has the constraint of lexicographical
  74 ordering. Later on Daciuk et al.\cite{Daciuk2000} improved on the original
  75 algorithm and their algorithm is the algorithm we used to minimize our DAWGs. A
  76 minimal graph is a graph $G$ for which there is no graph $G'$ that has less
  77 paths and $\mathcal{L}(G)=\mathcal{L }(G')$ where $\mathcal{L}(G)$ is the set
  78 of all words present in the DAWG.
  79
  80 \subsection{Algorithm}
  81 The algorithm of building DAWGs is an iterative process that goes roughly in
  82 three steps. We start with the null graph that can be described by
  83 $G_0=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any edges, one node and
  84 $\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph will be
  85 added in a naive way. We just create a new node for every transition of
  86 character and we mark the last node as final. From then on all words are added
  87 using a stepwise approach.
  88 \begin{enumerate}
  89         \item
  90                 Step one is finding the common prefix of the word in the graph and we chop
  91                 it of the word. When the suffix that remains is empty and the last state of
  92                 the common prefix is a final state we are done adding the word. When the
  93                 suffix is non empty or the common prefix does not end in a final state we
  94                 apply step two.
  95         \item
  96                 We find the first confluence state in the common prefix and from there on
  97                 we add the suffix. When there is a confluence state the route all the way
  98                 back to the starting node will be cloned to make sure no extra words are
  99                 added to the graph unwillingly.
 100         \item
 101                 Finally if in the suffix propagating from the end to the beginning there
 102                 are equal states we merge the state from the suffix to the equal state in
 103                 the graph.
 104 \end{enumerate}
 105
 106 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
 107
 108 \subsection{Example}
 109 We visualize this with an example shown in the {Subgraphs in
 110 Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 111 \texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
 112 graph.
 113
 114 Adding the first entry \texttt{abcd} is trivial and just creates a single path
 115 and does not require any hard work. The result of adding the first word is
 116 visible in subgraph SG1.
 117
 118 For the second entry \texttt{aecd} we will have to follow the steps described
 119 earlier. The common prefix is \texttt{a} and the common suffix is \texttt{bc}.
 120 Therefore the first node in which the common prefix starts needs to be copied
 121 and split off to make room for an alternative route. This is exactly what
 122 happens in SG2. Node $q_2$ is copied and gets a connection to node $q_3$. From
 123 the last node of the common prefix a route is built towards the first node,
 124 that is the copied node, of the common suffix resulting in an extra path
 125 $q_1\rightarrow q_5\rightarrow q_3$. Since there are no confluence nodes we can
 126 leave the prefix as it is. This results in subgraph SG2.
 127
 128 The last entry to add is the word \texttt{aecf}. When this is done without
 129 confluence node checking we create extra paths in the graph that are unwanted.
 130 The common prefix is \texttt{aec} and the common suffix is \texttt{f}.
 131 \texttt{f} can not be merged into existing edges and thus a new edge will be
 132 created from the end of the common suffix to the end. This results in subgraph
 133 SG3. Note that we did not take care of confluence nodes in the common prefix
 134 and because of that the word \texttt{abcf} is also accepted even when it was
 135 not added to the graph. This is unwanted behaviour. Going back to the common
 136 prefix \texttt{aec}. When we check for confluence nodes we see that the last
 137 node of the common prefix is such a node and that we need to clone the path
 138 and separate the route to the final node. $q_3$ will be cloned into $q_6$ with
 139 the route of the common prefix. Tracking the route back we do not encounter any
 140 other confluence states and therefore we can merge in the suffix which results
 141 in the final DAWG visible in subgraph SG4. No new words are added.
 142
 143 \begin{figure}[H]
 144         \caption{Step 1}
 145         \label{dawg1}
 146         \centering
 147         \digraph[scale=0.4]{inccons}{
 148                 rankdir=LR;
 149                 n4 [style=invis];
 150                 q40 [label="q0"];
 151                 q41 [label="q1"];
 152                 q42 [label="q2"];
 153                 q43 [label="q3"];
 154                 q44 [label="q4" shape=doublecircle];
 155                 q45 [label="q5"];
 156                 q46 [label="q6"];
 157                 n4 -> q40[label="SG4"];
 158                 q40 -> q41[label="a"];
 159                 q41 -> q42[label="b"];
 160                 q42 -> q43[label="c"];
 161                 q43 -> q44[label="d"];
 162                 q41 -> q45[label="e"];
 163                 q45 -> q46[label="c"];
 164                 q46 -> q44[label="d"];
 165                 q46 -> q44[label="f"];
 166
 167                 n3 [style=invis];
 168                 q30 [label="q0"];
 169                 q31 [label="q1"];
 170                 q32 [label="q2"];
 171                 q33 [label="q3"];
 172                 q34 [label="q4"ishape=doublecircle];
 173                 q35 [label="q5"];
 174                 n3 -> q30[label="SG3"];
 175                 q30 -> q31[label="a"];
 176                 q31 -> q32[label="b"];
 177                 q32 -> q33[label="c"];
 178                 q33 -> q34[label="d"];
 179                 q33 -> q34[label="f"iconstraint=false];
 180                 q31 -> q35[label="e"];
 181                 q35 -> q33[label="c"];
 182
 183                 n2 [style=invis];
 184                 q20 [label="q0"];
 185                 q21 [label="q1"];
 186                 q22 [label="q2"];
 187                 q23 [label="q3"];
 188                 q24 [label="q4"ishape=doublecircle];
 189                 q25 [label="q5"];
 190                 n2 -> q20[label="SG2"];
 191                 q20 -> q21[label="a"];
 192                 q21 -> q22[label="b"];
 193                 q22 -> q23[label="c"];
 194                 q23 -> q24[label="d"];
 195                 q21 -> q25[label="e"];
 196                 q25 -> q23[label="c"];
 197
 198                 n1 [style=invis];
 199                 q10 [label="q0"];
 200                 q11 [label="q1"];
 201                 q12 [label="q2"];
 202                 q13 [label="q3"];
 203                 q14 [label="q4"ishape=doublecircle];
 204                 n1 -> q10[label="SG1"];
 205                 q10 -> q11[label="a"];
 206                 q11 -> q12[label="b"];
 207                 q12 -> q13[label="c"];
 208                 q13 -> q14[label="d"];
 209
 210                 n [style=invis];
 211                 q0 [label="q0"];
 212                 n -> q0[label="SG0"];
 213         }
 214 \end{figure}
 215
 216 \subsection{Appliance on extraction of patterns}
 217 The algorithm for minimizing DAWGs can not be applied directly on the user
 218 generated pattern. This is mainly because there is no conclusive information
 219 about the contents of the pattern and therefore non deterministic choices have
 220 to be made. Because of this there can be multiple matches and the best match
 221 has to be determined. There are several possible criteria for determining the
 222 best match. The method we use is ... TODO