thesis2/3.methods.tex

   1 \section{Application overview}
   2 The backend consists of several processing steps before a crawler specification
   3 is ready.
   4
   5 \begin{figure}[H]
   6         \label{dawg1}
   7         \centering
   8         \includegraphics[width=\linewidth]{backend.eps}
   9         \strut\\
  10         \caption{Main module internals}
  11 \end{figure}
  12
  13 \section{HTML data}
  14 Data marked by the user will be returned as a \texttt{POST} http request and is
  15 basically the information in the description fields accompanied by the raw html
  16 of the table with the patterns. For every RSS feed entry there is a row in the
  17 table and every marker is a \texttt{SPAN} element containing the color of the
  18 text. Within the HTML code of the table markers are placed to make parsing more
  19 easy and remove the need of an advanced HTML parser. The \texttt{POST} request
  20 that will be send when the user presses the submit button will not be sent
  21 asynchronously, the user has to wait for the processing to finish. Because of
  22 this the user will be immediately notified when the processing has finished
  23 successfully. The preparation of the data for the \texttt{POST} request is all
  24 done in Javascript on the user side, the processing of the data in the backend
  25 is all done server side.
  26
  27 \section{Table rows}
  28 When the backend receives the HTTP POST data it saves all the table HTML data
  29 and the data from the description fields. The program first extracts the table
  30 rows using simple pattern matching on the added markers. The table rows are
  31 processed one at the time, the processing mainly consists of finding the
  32 \texttt{SPAN} elements, extract the color, remove the elements and save the
  33 plain text and the location and type of the marking. When the table rows are
  34 processed a datastructure with all the entries accompanied by their markers
  35 will go to the next step. All intermediate data will be saved for later use.
  36
  37 \section{Node lists}
  38 Every entry from the datastructure is processed to convert them to node-lists.
  39 A node-list can be seen as a path graph of every character and marking. A path
  40 graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1},
  41 n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. A path graph
  42 is basically a graph that is a path of nodes where every node is connected to
  43 the next on and the last on being final. The transitions between two nodes is
  44 either a character or a marking. As an example we take the entry \texttt{19:00,
  45 2014-11-12 - Foobar} and create the corresponding node-lists and make it
  46 visible in Figure~\ref{nodelistexample}. Characters are denoted with single
  47 quotes, spaces with an underscore and markers with angle brackets.
  48 \begin{figure}[H]
  49         \label{nodelistexample}
  50         \centering
  51         \includegraphics[width=\linewidth]{nodelistexample.eps}
  52         \caption{Node list example}
  53 \end{figure}
  54
  55
  56 \section{DAWGs}
  57 \subsection{Terminology}
  58 \textbf{Parent nodes} are nodes that have an arrow to the child.\\
  59 \textbf{Confluence nodes} are nodes that have multiple parents.
  60
  61 \subsection{Datastructure}
  62 We represent the user generated patterns as DAWGs by converting the node-lists
  63 to DAWGS. Normally DAWGs have single letters from an alphabet as edgelabel but
  64 in our implementation the DAWGs alphabet contains all letters, whitespace and
  65 punctuation but also the specified user markers which can be multiple
  66 characters in length.
  67
  68 DAWGs are graphs but due to the constraints we can use the DAWG to check if a
  69 match occurs by checking if a path exists that creates the word by
  70 concatenating all the edge labels. The first algorithm to generate DAWGs from
  71 words was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an incremental
  72 approach in generating the graph.  Meaning that entry by entry the graph will
  73 be expanded. Hopcrofts algorithm has the constraint of lexicographical
  74 ordering. Later on Daciuk et al.\cite{Daciuk2000} improved on the original
  75 algorithm and their algorithm is the algorithm we used to minimize our DAWGs. A
  76 minimal graph is a graph $G$ for which there is no graph $G'$ that has less
  77 paths and $\mathcal{L}(G)=\mathcal{L }(G')$ where $\mathcal{L}(G)$ is the set
  78 of all words present in the DAWG.
  79
  80 \subsection{Algorithm}
  81 The algorithm of building DAWGs is an iterative process that goes roughly in
  82 two steps. We start with the null graph that can be described by
  83 $G_0=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any edges, one node and
  84 $\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph will be
  85 added in a naive way. We just create a new node for every transition of
  86 character and we mark the last node as final. From then on all words are added
  87 using a stepwise approach.
  88 \begin{enumerate}
  89         \item
  90                 Say we add word $w$ to the grahp. Step one is finding the common prefix of
  91                 the word already in the graph. The common prefix is defined as the longest
  92                 subword $w'$ for which there is a $\delta^*(q_0, w')$.  When the common
  93                 prefix is found we change the starting state to the last state of the
  94                 common prefix and remain with suffix $w''$ where $w=w'w''$.
  95         \item
  96                 We add the suffix to the graph starting from the last node of the common
  97                 prefix.
  98         \item
  99                 When there are confluence nodes present within the common prefix they are
 100                 cloned starting from the last confluence node to avoid adding unwanted
 101                 words.
 102         \item
 103                 From the last node of the suffix up to the first node we replace the nodes
 104                 by checking if there are equivalent nodes present in the graph that we can
 105                 merge with.
 106 \end{enumerate}
 107
 108 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
 109
 110 \subsection{Example}
 111 We visualize this with an example shown in the {Subgraphs in
 112 Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 113 \texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
 114 graph.
 115
 116 Adding the first entry \texttt{abcd} is trivial because just creates a single
 117 path and does not require any hard work. Because the common prefix we find in
 118 Step 1 is empty the suffix will be the entire word. There is also no
 119 possibility of merging nodes because there were no nodes to begin with. The
 120 result of adding the first word is visible in subgraph SG1.
 121
 122 For the second entry \texttt{aecd} we will have to do some extra work.
 123 The common prefix found in Step 1 is \texttt{a} which we add to the graph. This
 124 leaves us in Step 2 with the suffix \texttt{ecd} which we add too. In Step 3 we
 125 see that there are no confluence nodes in our common prefix and therefore we
 126 do not have to clone nodes. In Step 4 we traverse from the last node back to
 127 the beginning of the suffix and find a common suffix \texttt{cd} and we can
 128 merge these nodes. In this way we can reuse the transition from $q_3$ to $q_4$.
 129 This leaves us with subgraph SG2.
 130
 131 We now add the last entry which is the word \texttt{aecf}. When we do this
 132 without the confluence node checking we encounter an unwanted extra word. In
 133 Step 1 we find the common prefix \texttt{aec} and that leaves us with the
 134 suffix \texttt{f} which we will add. This creates subgraph SG3 and we notice
 135 there is an extra word present in the graph, namely the word \texttt{abcf}.
 136 This extra word appeared because of the confluence node $q_3$ which was present
 137 in the common prefix introducing an unwanted path. Therefore in Step 3 when we
 138 check for confluence node we would see that node $q_3$ is a confluence node and
 139 needs to be cloned off the original path, by cloning we take the added suffix
 140 with it to create an entirely new route. Tracking the route back again we do
 141 not encounter any other confluence nodes so we can start Step 4. For this word
 142 there is no common suffix and therefore no merging will be applied. This
 143 results in subgraph SG4 which is the final DAWG containing only the words
 144 added.
 145
 146 \begin{figure}[H]
 147         \caption{Step 1}
 148         \label{dawg1}
 149         \centering
 150         \digraph[scale=0.4]{inccons}{
 151                 rankdir=LR;
 152                 n4 [style=invis];
 153                 q40 [label="q0"];
 154                 q41 [label="q1"];
 155                 q42 [label="q2"];
 156                 q43 [label="q3"];
 157                 q44 [label="q4" shape=doublecircle];
 158                 q45 [label="q5"];
 159                 q46 [label="q6"];
 160                 n4 -> q40[label="SG4"];
 161                 q40 -> q41[label="a"];
 162                 q41 -> q42[label="b"];
 163                 q42 -> q43[label="c"];
 164                 q43 -> q44[label="d"];
 165                 q41 -> q45[label="e"];
 166                 q45 -> q46[label="c"];
 167                 q46 -> q44[label="d"];
 168                 q46 -> q44[label="f"];
 169
 170                 n3 [style=invis];
 171                 q30 [label="q0"];
 172                 q31 [label="q1"];
 173                 q32 [label="q2"];
 174                 q33 [label="q3"];
 175                 q34 [label="q4" shape=doublecircle];
 176                 q35 [label="q5"];
 177                 n3 -> q30[label="SG3"];
 178                 q30 -> q31[label="a"];
 179                 q31 -> q32[label="b"];
 180                 q32 -> q33[label="c"];
 181                 q33 -> q34[label="d"];
 182                 q33 -> q34[label="f" constraint=false];
 183                 q31 -> q35[label="e"];
 184                 q35 -> q33[label="c"];
 185
 186                 n2 [style=invis];
 187                 q20 [label="q0"];
 188                 q21 [label="q1"];
 189                 q22 [label="q2"];
 190                 q23 [label="q3"];
 191                 q24 [label="q4" shape=doublecircle];
 192                 q25 [label="q5"];
 193                 n2 -> q20[label="SG2"];
 194                 q20 -> q21[label="a"];
 195                 q21 -> q22[label="b"];
 196                 q22 -> q23[label="c"];
 197                 q23 -> q24[label="d"];
 198                 q21 -> q25[label="e"];
 199                 q25 -> q23[label="c"];
 200
 201                 n1 [style=invis];
 202                 q10 [label="q0"];
 203                 q11 [label="q1"];
 204                 q12 [label="q2"];
 205                 q13 [label="q3"];
 206                 q14 [label="q4" shape=doublecircle];
 207                 n1 -> q10[label="SG1"];
 208                 q10 -> q11[label="a"];
 209                 q11 -> q12[label="b"];
 210                 q12 -> q13[label="c"];
 211                 q13 -> q14[label="d"];
 212
 213                 n [style=invis];
 214                 q0 [label="q0"];
 215                 n -> q0[label="SG0"];
 216         }
 217 \end{figure}
 218
 219 \subsection{Appliance on extraction of patterns}
 220 The text data in combination with the user markings are not just plain text and
 221 thus we can not create a DAWG out of them without a few adaptations to the
 222 structure.
 223
 224 Adaption for our data
 225
 226 Non determinism
 227
 228 How to get best match
 229
 230 The algorithm for minimizing DAWGs can not be applied directly on the user
 231 generated pattern. This is mainly because there is no conclusive information
 232 about the contents of the pattern and therefore non deterministic choices have
 233 to be made. Because of this there can be multiple matches and the best match
 234 has to be determined. There are several possible criteria for determining the
 235 best match. The method we use is ... TODO