\section{Application overview}
The backend consists of several processing steps that have to be executed
before a crawler specification is ready. These steps are visualized in
Figure~\ref{appinternals}, where the nodes are the important milestones in
processing the user data and the arrows indicate information transfer between
the steps. The figure is a detailed view of the \textit{Backend} node in
Figure~\ref{appoverview}.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{backend.eps}
\strut\\
\caption{Main module internals}
\label{appinternals}
\end{figure}

\section{HTML data}
The raw data from the \textit{Frontend} enters the backend as an HTTP
\textit{POST} request. It consists of several information fields extracted
from the description boxes in the frontend, accompanied by the raw
\textit{HTML} data of the table with the feed entries that contains the
markings the user has made just before pressing the submit button. Markers
are placed within the \textit{HTML} data of the table before sending, to make
parsing the tables easier and to remove the need for an advanced
\textit{HTML} parser. The \textit{POST} request is not sent asynchronously,
which means the user has to wait until the server has processed the request.
In this way the user is notified immediately when the processing has finished
successfully and can review and test the resulting crawler. All data
preparation for the \textit{POST} request is done in Javascript and thus on
the client side; all processing afterwards is done in the backend and thus on
the server side. All descriptive fields are transferred to the final
aggregating dictionary.

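To make the shape of this request concrete, the following minimal sketch
separates the descriptive fields from the raw table data in a \textit{POST}
body. It is illustrative only: the backend framework is not named in this
chapter, and the field name \texttt{table} is an assumption, not taken from
the implementation.

```python
from urllib.parse import parse_qs


def split_request(body):
    """Split a URL-encoded POST body into the descriptive fields and
    the raw table HTML carrying the user markings."""
    fields = {k: v[0] for k, v in parse_qs(body).items()}
    # The raw HTML travels in its own field; the name 'table' is a
    # hypothetical choice for this sketch.
    table_html = fields.pop('table', '')
    return fields, table_html
```

The descriptive fields end up in the final aggregating dictionary, while the
table HTML continues into the row-extraction step.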
\section{Table rows}
The first conversion step is to extract the rows from the \textit{HTML} data.
Thanks to the table markers this can be done with very rudimentary matching
techniques that require very little computational power. All \texttt{SPAN}
\textit{HTML} elements have to be found, since the markings of the user are
visualized through colored \texttt{SPAN} elements. To achieve this, all
\texttt{SPAN} elements of every row are extracted, again with simple matching
techniques, first to extract the color and afterwards to remove the element
so that the original plain text of the \textit{RSS} feed entry is recovered.
When this step is done, a data structure containing the text of all entries
together with the markings goes to the next step. All original data, namely
the \textit{HTML} data per row, is transferred to the final aggregating
dictionary.

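As an illustration of how little machinery this requires, the sketch below
extracts the color and text of each \texttt{SPAN} element with a regular
expression and removes the element to recover the plain text. The exact
attribute format (an inline \texttt{background-color} style) is an assumption
for this sketch, not taken from the frontend code.

```python
import re

# Assumed row format: markings are SPAN elements with an inline
# background colour, e.g. <span style="background-color:red">19:00</span>.
SPAN_RE = re.compile(
    r'<span[^>]*background-color:\s*(?P<color>\w+)[^>]*>(?P<text>.*?)</span>',
    re.IGNORECASE | re.DOTALL)


def parse_row(row_html):
    """Return (plain_text, markings) for one marked table row."""
    markings = []  # (color, marked_text) pairs, in document order

    def strip_span(match):
        markings.append((match.group('color'), match.group('text')))
        return match.group('text')  # keep the text, drop the tags

    plain = SPAN_RE.sub(strip_span, row_html)
    return plain, markings
```

Running this over the example entry recovers the original feed text while
recording which fragments were marked and in which color.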
\section{Node lists}
Every entry obtained from the previous step is processed into a so-called
node-list. A node-list can be seen as a path graph over the characters and
markings. A path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1,
n_2, \ldots, n_{i-1}, n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), \ldots,
(n_{i-1}, n_{i})\}$. A path graph is thus a graph that forms a single path of
nodes, where every node is connected to the next one and the last one is
final. The transition between two nodes is either a character or a marking.
As an example we take the entry \texttt{19:00, 2014-11-12 - Foobar}, create
the corresponding node-list and visualize it in Figure~\ref{nodelistexample}.
Characters are denoted with single quotes, spaces with an underscore and
markers with angle brackets. Node-lists are the basic elements from which the
DAWG will be generated. These node-lists are also kept in the final
aggregating dictionary to ensure consistency of the data and the possibility
of regenerating it.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nodelistexample.eps}
\caption{Node list example}
\label{nodelistexample}
\end{figure}

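The construction of a node-list can be sketched as follows: every character
becomes one transition, and every marker becomes a single multi-character
transition. The concrete \texttt{<digit>} marker syntax used below is an
assumption based on the angle-bracket notation of the figure.

```python
import re

# A node-list is just the ordered sequence of edge labels between the
# nodes: single characters, plus markers such as '<1>' that count as
# one transition even though they are several characters long.
MARKER_RE = re.compile(r'<\d+>')


def to_node_list(entry):
    """Split a marked entry into its transitions."""
    tokens, pos = [], 0
    for m in MARKER_RE.finditer(entry):
        tokens.extend(entry[pos:m.start()])  # one token per character
        tokens.append(m.group())             # the marker is one transition
        pos = m.end()
    tokens.extend(entry[pos:])
    return tokens
```

For example, \texttt{ab<1>c} becomes the four transitions \texttt{a},
\texttt{b}, \texttt{<1>}, \texttt{c}.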
\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an edge to the child node.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally the edge labels of a DAWG are single letters
from an alphabet, but in our implementation the alphabet of the DAWG contains
all letters, whitespace and punctuation, and additionally the user-specified
markers. A marker can be multiple characters long, but for the DAWG's sake it
is a single transition in the graph.

Due to its constraints a DAWG can be used to check whether an entry matches:
a match occurs exactly when there is a path from the initial node whose
concatenated edge labels spell out the characters of the entry. The first
algorithm to generate DAWGs from words was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. It is an incremental approach to generating the
graph, meaning that the graph is expanded entry by entry. Hopcroft's
algorithm has the constraint that the words must be added in lexicographical
order. Later on Daciuk et al.~\cite{Daciuk2000} improved on the original
algorithm, and their algorithm is the one we use to minimize our DAWGs. A
minimal graph is a graph $G$ for which there is no graph $G'$ with fewer
nodes such that $\mathcal{L}(G)=\mathcal{L}(G')$, where $\mathcal{L}(G)$ is
the set of all words present in the DAWG.

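The matching procedure can be sketched with the DAWG stored as plain
adjacency dictionaries: the word is accepted exactly when following its
labels from the initial node ends in a final node. The graph below encodes
subgraph SG2 of Figure~\ref{dawg1} (the words \texttt{abcd} and
\texttt{aecd}); the dictionary encoding is an illustrative choice, not the
actual data structure of the implementation.

```python
def matches(graph, finals, start, word):
    """Accept the word iff following its labels edge by edge from the
    start node ends in a final node."""
    node = start
    for label in word:
        node = graph.get(node, {}).get(label)
        if node is None:
            return False
    return node in finals


# The DAWG for {'abcd', 'aecd'}, i.e. subgraph SG2 of the example.
SG2 = {'q0': {'a': 'q1'},
       'q1': {'b': 'q2', 'e': 'q5'},
       'q2': {'c': 'q3'},
       'q5': {'c': 'q3'},
       'q3': {'d': 'q4'}}
FINALS = {'q4'}
```

Note that the unwanted word \texttt{abcf} from the example is rejected: the
path for \texttt{abc} ends in $q_3$, which has no \texttt{f} edge.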
\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that roughly
consists of two phases. We start with the null graph, described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$, which contains a single node, no edges,
and has $\mathcal{L}(G_0)=\emptyset$. The first word is added to the graph in
a naive way: we simply create a new node for every character transition and
mark the last node as final. From then on all words are added using a
stepwise approach.
\begin{enumerate}
\item
Say we add word $w$ to the graph. Step one is finding the common prefix of
$w$ with the words already in the graph. The common prefix is defined as the
longest subword $w'$ for which there is a $\delta^*(q_0, w')$. When the
common prefix is found we move the starting state to the last state of the
common prefix and are left with the suffix $w''$ where $w=w'w''$.
\item
We add the suffix to the graph, starting from the last node of the common
prefix.
\item
When there are confluence nodes present within the common prefix, they are
cloned starting from the last confluence node to avoid adding unwanted
words.
\item
From the last node of the suffix up to the first node we replace the nodes
by checking whether there are equivalent nodes present in the graph that we
can merge with.
\end{enumerate}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

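Alongside the pseudocode, the four steps can be sketched in Python. This is a
simplified illustration and not the actual implementation: instead of
replacing nodes backwards along the suffix, it re-merges equivalent nodes
globally after every word (which yields the same minimal DAWG, only less
efficiently), and for safety it clones every prefix node after the first
confluence node.

```python
class Node:
    """One DAWG state: outgoing labelled edges plus a final flag."""
    def __init__(self):
        self.edges = {}    # label -> Node
        self.final = False


class Dawg:
    def __init__(self):
        self.root = Node()

    def add(self, word):
        indeg = self._indegrees()
        node, i, cloned = self.root, 0, False
        # Step 1: walk the common prefix; Step 3: clone confluence
        # nodes (and everything after the first clone) along the way,
        # so that extending the path cannot introduce unwanted words.
        while i < len(word) and word[i] in node.edges:
            nxt = node.edges[word[i]]
            if cloned or indeg[id(nxt)] > 1:
                clone = Node()
                clone.final = nxt.final
                clone.edges = dict(nxt.edges)
                node.edges[word[i]] = clone
                nxt, cloned = clone, True
            node, i = nxt, i + 1
        # Step 2: append the remaining suffix naively.
        for label in word[i:]:
            node.edges[label] = Node()
            node = node.edges[label]
        node.final = True
        # Step 4: merge equivalent nodes (done here as a global
        # bottom-up minimization for simplicity).
        self._minimize()

    def accepts(self, word):
        node = self.root
        for label in word:
            if label not in node.edges:
                return False
            node = node.edges[label]
        return node.final

    def _indegrees(self):
        """Count incoming edges per node, to detect confluence nodes."""
        indeg, stack, seen = {}, [self.root], set()
        while stack:
            n = stack.pop()
            if id(n) in seen:
                continue
            seen.add(id(n))
            for child in n.edges.values():
                indeg[id(child)] = indeg.get(id(child), 0) + 1
                stack.append(child)
        return indeg

    def _minimize(self):
        """Merge nodes with identical (final, outgoing-edges) signatures,
        children first; the DAWG is acyclic so recursion terminates."""
        register, memo = {}, {}

        def represent(node):
            if id(node) in memo:
                return memo[id(node)]
            for label, child in list(node.edges.items()):
                node.edges[label] = represent(child)
            sig = (node.final,
                   tuple(sorted((l, id(c)) for l, c in node.edges.items())))
            memo[id(node)] = register.setdefault(sig, node)
            return memo[id(node)]

        self.root = represent(self.root)
```

Because \texttt{add} iterates over arbitrary label sequences, the same class
works for node-lists, where a marker such as \texttt{<1>} is one label.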
\subsection{Example}
We visualize the algorithm with an example, shown in the subgraphs in
Figure~\ref{dawg1}, that builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial, because it just creates a
single path and does not require any hard work. Since the common prefix we
find in Step 1 is empty, the suffix is the entire word. There is also no
possibility of merging nodes because there were no nodes to begin with. The
result of adding the first word is visible in subgraph SG1.

For the second entry \texttt{aecd} we have to do some extra work. The common
prefix found in Step 1 is \texttt{a}. This leaves us in Step 2 with the
suffix \texttt{ecd}, which we add to the graph. In Step 3 we see that there
are no confluence nodes in our common prefix, so we do not have to clone
nodes. In Step 4 we traverse from the last node back to the beginning of the
suffix, find the common suffix \texttt{cd} and merge these nodes. In this way
we can reuse the transition from $q_3$ to $q_4$. This leaves us with subgraph
SG2.

We now add the last entry, the word \texttt{aecf}. If we did this without the
confluence-node check we would introduce an unwanted extra word. In Step 1 we
find the common prefix \texttt{aec}, which leaves us with the suffix
\texttt{f} that we add. This creates subgraph SG3, and we notice that an
extra word is present in the graph, namely the word \texttt{abcf}. This extra
word appeared because the confluence node $q_3$ in the common prefix
introduced an unwanted path. Therefore, in Step 3, the check for confluence
nodes finds that node $q_3$ is a confluence node and needs to be cloned off
the original path; by cloning we take the added suffix with it, creating an
entirely new route. Tracing the route back we do not encounter any other
confluence nodes, so we can start Step 4. For this word there is no common
suffix, so no merging is applied. This results in subgraph SG4, which is the
final DAWG containing only the words added.

\begin{figure}[H]
\caption{Incremental construction of the DAWG, subgraphs SG0--SG4}
\label{dawg1}
\centering
\digraph[scale=0.4]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4" shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label="b"];
q42 -> q43[label="c"];
q43 -> q44[label="d"];
q41 -> q45[label="e"];
q45 -> q46[label="c"];
q46 -> q44[label="d"];
q46 -> q44[label="f"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4" shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label="b"];
q32 -> q33[label="c"];
q33 -> q34[label="d"];
q33 -> q34[label="f" constraint=false];
q31 -> q35[label="e"];
q35 -> q33[label="c"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4" shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label="b"];
q22 -> q23[label="c"];
q23 -> q24[label="d"];
q21 -> q25[label="e"];
q25 -> q23[label="c"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4" shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label="b"];
q12 -> q13[label="c"];
q13 -> q14[label="d"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}

\subsection{Minimality of the algorithm}

\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted to a
DAWG automatically using the algorithm we described, because the user
markings are not necessarily a single character or word: a user marking is
essentially one or more characters. When we add a user marking we insert a
character that is not in the alphabet, and later on we replace the marking by
a kind of subgraph. When this is applied, non-determinism can be introduced
into the graph. Non-determinism means that a single node has multiple
outgoing edges with the same label; in practice this means that a word can be
present in the graph via multiple paths. This is shown in
Figure~\ref{nddawg} with the following words: \texttt{ab<1>c},
\texttt{a<1>bbc}. In this graph the word \texttt{abdc} is accepted and the
user pattern \texttt{<1>} is filled with the word \texttt{d}. However, if we
try the word \texttt{abdddbc}, both paths can be chosen. In the first case
the user pattern \texttt{<1>} is filled with \texttt{dddb} and in the second
case with \texttt{bddd}. In such a case we have to choose, hopefully making
the smartest choice.

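One possible strategy for this choice is a backtracking match that always
tries the longest filling for a marker first, which yields \texttt{dddb} for
\texttt{<1>} in the example above. The sketch below assumes the DAWG is
stored as adjacency dictionaries; the node names are invented, the graph is
reconstructed by hand from the two example words, and greedy longest-first is
only one possible notion of the ``smartest choice''.

```python
def fill_markers(graph, finals, node, text):
    """Depth-first match over a token DAWG: a character edge consumes
    exactly one character, a marker edge like '<1>' consumes one or
    more characters, tried longest-first (greedy). Returns a dict of
    marker fillings on success, None on failure."""
    if not text:
        return {} if node in finals else None
    for label, nxt in graph.get(node, {}).items():
        if label.startswith('<'):              # marker edge: try splits
            for i in range(len(text), 0, -1):  # longest filling first
                filled = fill_markers(graph, finals, nxt, text[i:])
                if filled is not None:
                    filled[label] = text[:i]
                    return filled
        elif text[0] == label:                 # ordinary character edge
            filled = fill_markers(graph, finals, nxt, text[1:])
            if filled is not None:
                return filled
    return None


# Hand-built DAWG for the two patterns 'ab<1>c' and 'a<1>bbc';
# the state names q0..q7 are invented for this sketch.
GRAPH = {'q0': {'a': 'q1'},
         'q1': {'b': 'q2', '<1>': 'q5'},
         'q2': {'<1>': 'q3'},
         'q3': {'c': 'q4'},
         'q5': {'b': 'q6'},
         'q6': {'b': 'q7'},
         'q7': {'c': 'q4'}}
FINALS = {'q4'}
```

With this strategy \texttt{abdc} fills \texttt{<1>} with \texttt{d}, and the
ambiguous word \texttt{abdddbc} resolves to the first case of the text,
filling \texttt{<1>} with \texttt{dddb}.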
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nddawg.eps}
\caption{Example of non-determinism}
\label{nddawg}
\end{figure}

\subsection{How to choose a path}
