\section{Application overview}
The backend runs several processing steps before a crawler specification is
ready. These steps are visualized in Figure~\ref{appinternals}, where every
node is a milestone in the processing of the user data and the arrows indicate
information transfer between the steps. The figure is a detailed view of the
\textit{Backend} node in Figure~\ref{appoverview}.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{backend.eps}
	\strut\\
	\caption{Main module internals}
	\label{appinternals}
\end{figure}

\section{HTML data}
The raw data from the \textit{Frontend}, including the user markings, enters
the backend as an HTTP \textit{POST} request. The request consists of several
information fields extracted from the description boxes in the frontend,
accompanied by the raw \textit{HTML} data of the table with the feed entries
that contains the markings made by the user at the moment the submit button is
pressed. Before sending, markers are placed within the \textit{HTML} data of
the table to make parsing the tables easier and to remove the need for an
advanced \textit{HTML} parser. The \textit{POST} request is not sent
asynchronously, which means the user has to wait until the server has
processed the request. In this way the user is immediately notified when the
processing has finished successfully and can review and test the resulting
crawler. All data preparation for the \textit{POST} request is done in
Javascript, and thus on the client side; all processing afterwards is done in
the backend, and thus on the server side. All descriptive fields are
transferred to the final aggregating dictionary.

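As a minimal sketch of the receiving side, assuming a form-encoded request
body and a hypothetical field name \texttt{table} for the marked-up rows (the
thesis fixes neither the framework nor the field names):

```python
from urllib.parse import parse_qs


def split_submission(body):
    """Split a form-encoded POST body into the descriptive fields and the
    raw marked-up HTML of the entry table.

    The field name "table" is hypothetical; the thesis does not list the
    exact field names used by the frontend."""
    fields = {k: v[0] for k, v in parse_qs(body).items()}
    html = fields.pop("table", "")
    return fields, html
```

The descriptive fields go straight into the aggregating dictionary; only the
table data needs further processing.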
\section{Table rows}
The first conversion step extracts the rows from the \textit{HTML} data.
Because of the table markers this can be done using very rudimentary matching
techniques that require very little computational power. All \texttt{SPAN}
\textit{HTML} elements have to be found, since the markings of the user are
visualized through colored \texttt{SPAN} elements. To achieve this, for every
row all \texttt{SPAN} elements are extracted, again with simple matching
techniques, to obtain the color; afterwards the elements are removed to
retrieve the original plain text of the \textit{RSS} feed entry. When this
step is done, a data structure containing the text of all entries together
with the markings goes to the next step. All original data, namely the
\textit{HTML} data per row, is transferred to the final aggregating
dictionary.

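As an illustration, the per-row extraction can be sketched in Python with
plain regular expressions. The exact markup the frontend emits is an
assumption here; a \texttt{SPAN} with a background color stands in for a
marking:

```python
import re

# Hypothetical row markup: the frontend visualizes a marking as a colored
# SPAN element around the marked text.
row = 'Concert: <span style="background-color:#ff0000">19:00</span> Foobar'

# Simple (non-general) matching suffices because the markup is generated by
# the frontend itself and is therefore predictable.
SPAN = re.compile(r'<span style="background-color:(#[0-9a-f]{6})">(.*?)</span>')

# Extract the color and text of every marking in the row.
markings = SPAN.findall(row)

# Remove the SPAN elements to recover the original plain feed entry text.
plain = SPAN.sub(lambda m: m.group(2), row)
```

Because the markup is machine-generated, this avoids a full \textit{HTML}
parser, exactly as described above.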
\section{Node lists}
Every entry from the previous step is processed into a so-called node-list. A
node-list can be seen as a path graph over the characters and markings. A
path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \ldots,
n_{i-1}, n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), \ldots, (n_{i-1}, n_i)\}$. A
path graph is a graph consisting of a single path of nodes in which every
node is connected to the next one and the last one is final. The transition
between two nodes is either a character or a marking. As an example we take
the entry \texttt{19:00, 2014-11-12 - Foobar}, create the corresponding
node-list and visualize it in Figure~\ref{nodelistexample}. Characters are
denoted with single quotes, spaces with an underscore and markers with angle
brackets. Node-lists are the basic elements from which the DAWG will be
generated. These node-lists are also available in the final aggregating
dictionary to ensure consistency of the data and the possibility of
regenerating it.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nodelistexample.eps}
	\caption{Node list example}
	\label{nodelistexample}
\end{figure}

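The conversion from a marked entry to a node-list can be sketched as a simple
tokenization. The angle-bracket marker syntax is an assumption based on the
notation used here:

```python
import re

# A token is either a whole marker such as "<time>" (hypothetical marker
# name) or a single character; the alternation order makes markers win.
TOKEN = re.compile(r'<[^>]+>|.')


def node_list(entry):
    """Split a marked entry into the edge labels of its path graph:
    single characters, with each marker kept as one multi-character label."""
    return TOKEN.findall(entry)
```

Each returned label corresponds to one transition in the path graph, so a
marker of arbitrary length still occupies a single edge.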
\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an arrow to the child.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally a DAWG has single letters from an alphabet as
edge labels, but in our implementation the alphabet of the DAWG contains all
letters, whitespace and punctuation, and also the user-specified markers,
which can span multiple characters in the original text but count as a single
transition in the graph.

DAWGs are graphs, but due to these constraints we can use a DAWG to check
whether a word matches: a match occurs if there is a path from the initial
node whose concatenated edge labels spell out the word, found by following
the characters of the entry. The first algorithm to generate DAWGs from words
was proposed by Hopcroft et al.~\cite{Hopcroft1971}. It is an incremental
approach to generating the graph, meaning that the graph is expanded entry by
entry. Hopcroft's algorithm has the constraint that the words must be added
in lexicographical order. Later on, Daciuk et
al.~\cite{Daciuk1998,Daciuk2000} improved on the original algorithm, and
their algorithm is the one we used to generate minimal DAWGs from the
node-lists.

\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that goes roughly in
two steps. We start with the null graph, which can be described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$: it contains a single node, no edges and
$\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph is
added in a naive way: we simply create a new node for every character
transition and mark the last node as final. From then on all words are added
using a stepwise approach.
\begin{enumerate}
	\item
		Say we add word $w$ to the graph. Step one is finding the common
		prefix of the word with the words already in the graph. The common
		prefix is defined as the longest subword $w'$ for which
		$\delta^*(q_0, w')$ exists. When the common prefix is found we
		change the starting state to the last state of the common prefix
		and are left with the suffix $w''$ where $w=w'w''$.
	\item
		We add the suffix to the graph starting from the last node of the
		common prefix.
	\item
		When there are confluence nodes present within the common prefix
		they are cloned, starting from the last confluence node, to avoid
		adding unwanted words.
	\item
		From the last node of the suffix up to the first node we replace
		the nodes by checking whether there are equivalent nodes present in
		the graph that we can merge with.
\end{enumerate}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

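As an illustration of the stepwise approach, the following Python sketch
implements the simpler sorted-input variant of the incremental construction;
our implementation follows the unsorted variant with confluence-node cloning,
and all names in the sketch are our own:

```python
class State:
    """One DAWG node: outgoing edges and a final flag."""

    def __init__(self):
        self.edges = {}   # edge label -> State
        self.final = False

    def signature(self):
        # Two states are equivalent iff they agree on finality and on every
        # outgoing edge (same labels, same target state identities).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))


def build_dawg(words):
    """Incrementally build a minimal DAWG; `words` must be sorted."""
    root = State()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, label, child) triples not yet minimized
    previous = ""

    def minimize(down_to):
        # Merge unchecked states, deepest first, with equivalent ones.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            sig = child.signature()
            if sig in register:
                parent.edges[label] = register[sig]
            else:
                register[sig] = child

    for word in words:
        # Step 1: length of the common prefix with the previous word.
        i = 0
        while i < min(len(word), len(previous)) and word[i] == previous[i]:
            i += 1
        # Step 4 for the previous word: minimize everything past the prefix.
        minimize(i)
        # Step 2: add the remaining suffix as a fresh chain of states.
        node = unchecked[-1][2] if unchecked else root
        for label in word[i:]:
            nxt = State()
            node.edges[label] = nxt
            unchecked.append((node, label, nxt))
            node = nxt
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    """Follow the edge labels; the word matches iff we end in a final state."""
    node = root
    for label in word:
        if label not in node.edges:
            return False
        node = node.edges[label]
    return node.final
```

Building the DAWG from the entries of the example below (\texttt{abcd},
\texttt{aecd}, \texttt{aecf}) accepts exactly those three words; in
particular the unwanted word \texttt{abcf} is rejected.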
\subsection{Example}
We visualize this with an example, shown in the subgraphs in
Figure~\ref{dawg1}, that builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial, because it just creates a
single path and does not require any hard work. Because the common prefix we
find in Step 1 is empty, the suffix is the entire word. There is also no
possibility of merging nodes because there were no nodes to begin with. The
result of adding the first word is visible in subgraph SG1.

For the second entry \texttt{aecd} we have to do some extra work. The common
prefix found in Step 1 is \texttt{a}. This leaves us in Step 2 with the
suffix \texttt{ecd}, which we add to the graph. In Step 3 we see that there
are no confluence nodes in our common prefix, so we do not have to clone
nodes. In Step 4 we traverse from the last node back to the beginning of the
suffix, find the common suffix \texttt{cd} and merge these nodes. In this way
we can reuse the transition from $q_3$ to $q_4$. This leaves us with subgraph
SG2.

We now add the last entry, the word \texttt{aecf}. If we do this without the
confluence-node check we introduce an unwanted extra word. In Step 1 we find
the common prefix \texttt{aec}, which leaves us with the suffix \texttt{f} to
add. This creates subgraph SG3, and we notice that an extra word is present
in the graph, namely \texttt{abcf}. This extra word appeared because the
confluence node $q_3$ in the common prefix introduced an unwanted path.
Therefore, in Step 3, the check for confluence nodes finds that $q_3$ is a
confluence node and needs to be cloned off the original path; by cloning we
take the added suffix with it and create an entirely new route. Tracing the
route back again we do not encounter any other confluence nodes, so we can
start Step 4. For this word there is no common suffix and therefore no
merging is applied. This results in subgraph SG4, the final DAWG containing
only the words added.

\begin{figure}[H]
	\caption{Stepwise construction of the example DAWG}
	\label{dawg1}
	\centering
	\digraph[scale=0.4]{inccons}{
		rankdir=LR;
		n4 [style=invis];
		q40 [label="q0"];
		q41 [label="q1"];
		q42 [label="q2"];
		q43 [label="q3"];
		q44 [label="q4" shape=doublecircle];
		q45 [label="q5"];
		q46 [label="q6"];
		n4 -> q40[label="SG4"];
		q40 -> q41[label="a"];
		q41 -> q42[label="b"];
		q42 -> q43[label="c"];
		q43 -> q44[label="d"];
		q41 -> q45[label="e"];
		q45 -> q46[label="c"];
		q46 -> q44[label="d"];
		q46 -> q44[label="f"];

		n3 [style=invis];
		q30 [label="q0"];
		q31 [label="q1"];
		q32 [label="q2"];
		q33 [label="q3"];
		q34 [label="q4" shape=doublecircle];
		q35 [label="q5"];
		n3 -> q30[label="SG3"];
		q30 -> q31[label="a"];
		q31 -> q32[label="b"];
		q32 -> q33[label="c"];
		q33 -> q34[label="d"];
		q33 -> q34[label="f" constraint=false];
		q31 -> q35[label="e"];
		q35 -> q33[label="c"];

		n2 [style=invis];
		q20 [label="q0"];
		q21 [label="q1"];
		q22 [label="q2"];
		q23 [label="q3"];
		q24 [label="q4" shape=doublecircle];
		q25 [label="q5"];
		n2 -> q20[label="SG2"];
		q20 -> q21[label="a"];
		q21 -> q22[label="b"];
		q22 -> q23[label="c"];
		q23 -> q24[label="d"];
		q21 -> q25[label="e"];
		q25 -> q23[label="c"];

		n1 [style=invis];
		q10 [label="q0"];
		q11 [label="q1"];
		q12 [label="q2"];
		q13 [label="q3"];
		q14 [label="q4" shape=doublecircle];
		n1 -> q10[label="SG1"];
		q10 -> q11[label="a"];
		q11 -> q12[label="b"];
		q12 -> q13[label="c"];
		q13 -> q14[label="d"];

		n [style=invis];
		q0 [label="q0"];
		n -> q0[label="SG0"];
	}
\end{figure}

\subsection{Minimality of the algorithm}


\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted to a
DAWG automatically using the algorithm we described, because the user
markings are not necessarily a single character or word: a user marking is
essentially one or more characters. When we add a user marking we insert a
character that is not in the alphabet, and later on we change the marking
into a kind of subgraph. When this is applied, non-determinism can be
introduced into the graph. Non-determinism means that a single node has
multiple outgoing edges with the same transition; in practice this means that
a word can be present in the graph via multiple paths. This is shown in
Figure~\ref{nddawg} with the following words: \texttt{ab<1>c} and
\texttt{a<1>bbc}. In this graph the word \texttt{abdc} is accepted and the
user pattern \texttt{<1>} is filled with the word \texttt{d}. However, if we
try the word \texttt{abdddbc}, both paths can be chosen. In the first case
the user pattern \texttt{<1>} is filled with \texttt{dddb} and in the second
case with \texttt{bddd}. In such a case we need to make the, hopefully,
smartest choice.

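Enumerating the possible choices makes the ambiguity explicit. A minimal
backtracking sketch, using the simplified hypothetical patterns
\texttt{ab<1>c} and \texttt{a<1>bc} rather than the patterns of the figure,
that returns every possible filling of the markers:

```python
def all_matches(pattern, word):
    """Enumerate every way `word` can follow `pattern`, where the pattern
    is a list of single-character labels and markers such as "<1>"; a
    marker consumes one or more characters of the word."""
    results = []

    def walk(pi, wi, fills):
        if pi == len(pattern):
            if wi == len(word):          # whole word consumed: accept
                results.append(dict(fills))
            return
        token = pattern[pi]
        if token.startswith("<"):        # marker: try every non-empty span
            for end in range(wi + 1, len(word) + 1):
                walk(pi + 1, end, {**fills, token: word[wi:end]})
        elif wi < len(word) and word[wi] == token:
            walk(pi + 1, wi + 1, fills)  # literal character transition

    walk(0, 0, {})
    return results
```

For the word \texttt{abdbc}, the pattern \texttt{ab<1>c} fills \texttt{<1>}
with \texttt{db} while \texttt{a<1>bc} fills it with \texttt{bd}; a DAWG
containing both patterns therefore has two accepting paths for one word and
must pick one of them.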
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nddawg.eps}
	\caption{Example of non-determinism}
	\label{nddawg}
\end{figure}

\subsection{Minimality and non-determinism}
The Myhill--Nerode theorem~\cite{Hopcroft1979} states that among all graphs
accepting the same language there is a single graph with the least number of
states. Mihov~\cite{Mihov1998} has proven that the algorithm is minimal in
its original form. Our program converts the node-lists to DAWGs that can
contain non-deterministic nodes, and therefore one can argue about the
minimality. Due to the nature of the non-determinism this is not the case:
the non-determinism is only visible when matching the data and not in the
real graph, since in the real graph we ...
