\section{Application overview}
The backend consists of several processing steps that the input has to go
through before it is converted to a crawler specification. These steps are
visualized in Figure~\ref{appinternals}. Every node is a milestone in the
processing of the user data and the arrows indicate information transfer
between the steps. The figure is a detailed view of the \textit{Backend} node
in Figure~\ref{appoverview}.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{backend.pdf}
\caption{Main module internals}
\label{appinternals}
\end{figure}

\section{HTML data}
The raw data from the frontend with the user markings enters the backend as an
HTTP \textit{POST} request. This \textit{POST} request consists of several
data fields. These data fields are either fields from the static description
boxes in the frontend or raw \textit{HTML} data from the table showing the
processed RSS feed entries, which contains the markings made by the user. The
table is sent in whole at the moment the user presses the submit button.
Before sending, markers are placed within the \textit{HTML} data of the table.
These markers make parsing the tables easier and remove the need for an
advanced \textit{HTML} parser. The \textit{POST} request is not sent
asynchronously. Synchronous sending means that the user has to wait until the
server has processed the request. In this way the user is notified immediately
when the processing has finished successfully and is able to review and test
the resulting crawler. All the data preparation for the \textit{POST} request
is done in \textit{Javascript} and thus happens on the client side. All the
processing afterwards is done in the backend and thus happens on the server
side. All descriptive fields are also put directly into the final aggregating
dictionary. All the raw \textit{HTML} data is sent to the next step.
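
To illustrate the shape of this request, the sketch below shows the parsed
\textit{POST} body as a \textit{Python} dictionary. The field names here are
hypothetical; the actual names used by the frontend may differ.

\begin{lstlisting}[language=Python]
# Hypothetical sketch of the parsed POST body; the actual field
# names used by the frontend may differ.
post_data = {
    # fields from the static description boxes
    'name':    'Venue name',
    'feedurl': 'http://example.org/events.rss',
    # raw HTML of the table, including the markers inserted before
    # sending and the SPAN elements holding the user markings
    'table':   '<table>...</table>',
}
\end{lstlisting}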

\section{Table rows}
The first conversion step is to extract the individual table rows from the
\textit{HTML} data. In the previous step markers were placed to make the
identification of table rows easy. Because of this, extracting the table rows
only requires rudimentary matching techniques that need very little
computational power. User markings are highlights of certain text elements.
The highlighting is done using \texttt{SPAN} elements and therefore all the
\texttt{SPAN} elements have to be found and extracted. For every row all
\texttt{SPAN} elements are extracted, again with simple matching techniques,
first to obtain the color and afterwards to remove the element and retrieve
the original plain text of the \textit{RSS} feed entry. When this step is done
a data structure containing all the text of the entries together with the
markings goes to the next step. All original data, namely the \textit{HTML}
data per row, is transferred to the final aggregating dictionary.
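
A minimal sketch of this extraction step, assuming hypothetical row markers
and a simplified \texttt{SPAN} format, could look as follows:

\begin{lstlisting}[language=Python]
import re

# Hypothetical markers; the frontend places markers around every
# row before sending, but the exact strings may differ.
ROW_START, ROW_END = '<!--ROW-->', '<!--/ROW-->'
SPAN_RE = re.compile(
    r'<span[^>]*background-color:\s*([^;"]+)[^>]*>(.*?)</span>', re.S)

def extract_rows(table_html):
    """Yield (plain text, markings) for every marked table row."""
    for chunk in table_html.split(ROW_START)[1:]:
        row_html = chunk.split(ROW_END)[0]
        markings = SPAN_RE.findall(row_html)  # [(color, marked text), ...]
        plain = SPAN_RE.sub(r'\2', row_html)  # drop the SPANs, keep the text
        yield re.sub(r'<[^>]+>', '', plain), markings
\end{lstlisting}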

\section{Node lists}
Every entry obtained in the previous step is processed into a so-called
node-list. A node-list can be seen as a path graph where every character and
marking has a node. A path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where
$V=\{n_1, n_2, \ldots, n_{i-1}, n_i\}$ and
$E=\{(n_1, n_2), (n_2, n_3), \ldots, (n_{i-1}, n_{i})\}$. A path graph is a
graph consisting of a single linear path of nodes in which every node except
the last is connected to the next node. The last node is the only final node.
The transition between two nodes is either a character or a marking. As an
example we take the entry \texttt{19:00, 2014-11-12 - Foobar} and create the
corresponding node-list, shown in Figure~\ref{nodelistexample}. Characters are
denoted with single quotes, spaces with an underscore and markers with angle
brackets. Node-lists are the basic elements from which the DAWG will be
generated. These node-lists are also stored in the final aggregating
dictionary to ensure consistency of the data and the possibility of
regenerating it.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nodelistexample.pdf}
\caption{Node list example}
\label{nodelistexample}
\end{figure}
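
Building a node-list amounts to tokenizing an entry into single characters
and markings. Below is a minimal sketch, assuming markings are written as
\texttt{<n>} in the text:

\begin{lstlisting}[language=Python]
import re

# A token is either a marking such as <1> or a single character.
TOKEN_RE = re.compile(r'<\d+>|.', re.S)

def to_node_list(entry):
    """Return the transition labels of the path graph for one entry."""
    return TOKEN_RE.findall(entry)

# e.g. an entry with the time, date and title marked as <1>, <2>, <3>:
print(to_node_list('<1>, <2> - <3>'))
# ['<1>', ',', ' ', '<2>', ' ', '-', ' ', '<3>']
\end{lstlisting}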

\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an edge pointing to a child node.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the node-lists
to DAWGs. Normally DAWGs have single letters from an alphabet as edge labels,
but in our implementation the alphabet of the DAWG contains all letters,
whitespace and punctuation as well as the specified user markers, which can be
multiple characters long but for the DAWG's sake form a single transition in
the graph.

DAWGs are graphs, but due to their constraints we can use a DAWG to check
whether a word matches by checking whether a path exists whose concatenated
edge labels form the word. The first algorithm to generate DAWGs from words
was proposed by Hopcroft et al.~\cite{Hopcroft1971}. It is an incremental
approach to generating the graph, meaning that the graph is expanded entry by
entry. Hopcroft's algorithm has the constraint that the words must be added in
lexicographical order. Later on Daciuk et al.~\cite{Daciuk1998,Daciuk2000}
improved on the original algorithm, and their algorithm is the one we use to
generate minimal DAWGs from the node-lists.
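
To illustrate the matching, checking a word against a deterministic DAWG
reduces to following, for every symbol, the outgoing edge carrying that label.
A minimal sketch with a hypothetical node representation:

\begin{lstlisting}[language=Python]
class Node:
    """One DAWG state: outgoing labelled edges and a final flag."""
    def __init__(self, final=False):
        self.edges = {}  # label -> Node
        self.final = final

def accepts(q0, word):
    """True iff delta*(q0, word) exists and ends in a final state."""
    state = q0
    for symbol in word:
        if symbol not in state.edges:
            return False  # no such path: the word is not in the graph
        state = state.edges[symbol]
    return state.final
\end{lstlisting}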

\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that proceeds roughly
in four steps. We start with the null graph, described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$, which contains a single node, no edges, and
accepts no words: $\mathcal{L}(G_0)=\emptyset$. The first word added to the
graph is added in a naive way that essentially replaces the graph by the
node-list, which is a path graph: we create a new node for every character
transition and mark the last node as final. From then on all words are added
using the four step approach described below. Pseudocode for this algorithm
can be found in Listing~\ref{pseudodawg} as the function
\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be found
in Listing~\ref{dawg.py}.
\begin{enumerate}
\item
Say we add word $w$ to the graph. Step one is finding the common prefix of
the word with the words already in the graph. The common prefix is defined as
the longest subword $w'$ for which $\delta^*(q_0, w')$ exists. When the
common prefix is found we move to the last state of the common prefix and are
left with the suffix $w''$ where $w=w'w''$.
\item
We add the suffix to the graph starting from the last node of the common
prefix.
\item
When there are confluence nodes present within the common prefix they are
cloned, starting from the last confluence node, to avoid adding unwanted
words.
\item
From the last node of the suffix up to the first node we replace the nodes by
checking if there are equivalent nodes present in the graph that we can merge
with.
\end{enumerate}

\begin{algorithm}[H]
\SetKwProg{Def}{def}{:}{end}
\Def{generate\_dawg(words)}{
	register := $\emptyset$\;
	\While{there is another word}{%
		word := next word\;
		commonprefix := CommonPrefix(word)\;
		laststate := %
			delta\textsuperscript{*}(q0, commonprefix)\;
		currentsuffix := %
			word[length(commonprefix)\ldots length(word)]\;
		\If{has\_children(laststate)}{%
			replace\_or\_register(laststate)\;
		}
		add\_suffix(laststate, currentsuffix)\;
	}
	replace\_or\_register(q0)\;
}
\Def{replace\_or\_register(state)}{%
	child := last\_child(state)\;
	\If{has\_children(child)}{%
		replace\_or\_register(child)\;
	}
	\eIf{there is an equivalent state q in the register}{%
		last\_child(state) := q\;
		delete(child)\;
	}{%
		register.add(child)\;
	}
}
\caption{Generating DAWGs pseudocode}
\label{pseudodawg}
\end{algorithm}

\newpage\subsection{Example}
We visualize the algorithm with the example shown in the subgraphs of
Figure~\ref{dawg1}, which builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}.

\begin{itemize}
\item No words added yet\\
Initially we begin with the null graph. This graph is shown in the figure as
SG0. This DAWG does not yet accept any words.
\item Adding \texttt{abcd}\\
Adding the first entry \texttt{abcd} is trivial because we can just create a
single path, which requires no extra work. This is because the common prefix
found in Step 1 is empty and the suffix is thus the entire word. Merging the
suffix back into the graph is also not possible since there are no nodes
except for the first node. The result of adding the first word is visible in
subgraph SG1.
\item Adding \texttt{aecd}\\
For the second entry we have to do some extra work. The common prefix found in
Step 1 is \texttt{a}. This leaves us in Step 2 with the suffix \texttt{ecd},
which we add to the graph. In Step 3 we see that there are no confluence nodes
in our common prefix and therefore we do not have to clone nodes. In Step 4 we
traverse from the last node back to the beginning of the suffix and find the
common suffix \texttt{cd}, whose nodes we can merge. In this way we can reuse
the transition from $q_3$ to $q_4$. This leaves us with subgraph SG2.
\item Adding \texttt{aecf}\\
We now add the last entry, the word \texttt{aecf}. If we did this without the
confluence node check we would introduce an unwanted extra word. In Step 1 we
find the common prefix \texttt{aec}, which leaves us with the suffix
\texttt{f} to add. This creates subgraph SG3, and we notice that an extra word
is present in the graph, namely the word \texttt{abcf}. This extra word
appeared because of the confluence node $q_3$, which was present in the common
prefix and introduced an unwanted path. Therefore, in Step 3, when we check
for confluence nodes, we see that node $q_3$ is a confluence node and needs to
be cloned off the original path; by cloning we take the added suffix with it
to create an entirely new route. Tracing the route back again we do not
encounter any other confluence nodes, so we can start Step 4. For this word
there is no common suffix and therefore no merging is applied. This results in
subgraph SG4, which is the final DAWG containing only the words added.
\end{itemize}

\begin{figure}[H]
\centering
\includegraphics[height=20em]{inccons.pdf}
\caption{Incrementally constructing a DAWG}
\label{dawg1}
\end{figure}

\subsection{Application to pattern extraction}
The text data in combination with the user markings cannot be converted to a
DAWG automatically using the algorithm we described. This is because the user
markings are not necessarily a single character or word; currently a user
marking essentially stands for an arbitrary sequence of characters. When we
add a user marking, we insert a kind of subgraph in the place of the node with
the marking. By doing this we can introduce non-determinism into the graph.
Non-determinism means that a single node has multiple outgoing edges with the
same label; in practice this means that a word can be present in the graph
along multiple paths. An example of non-determinism in one of our DAWGs is
shown in Figure~\ref{nddawg}. This figure represents a DAWG generated from the
following entries: \texttt{ab<1>c} and \texttt{a<1>bbc}.

In this graph the word \texttt{abdc} is accepted and the user pattern
\texttt{<1>} is filled with the subword \texttt{d}. However, if we try the
word \texttt{abdddbc} both paths can be chosen. In the first case the user
pattern \texttt{<1>} is filled with \texttt{dddb} and in the second case with
\texttt{bddd}. In such a case the system has to choose one path, hopefully the
smartest choice. When no path matches, the system reports a failed extraction.
The crawling system can be made more forgiving so that it gives partial
information when no full match is possible, but it still needs to report the
error and the data should be handled with extra care.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nddawg.pdf}
\caption{Example of non-determinism}
\label{nddawg}
\end{figure}
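
Matching against such a non-deterministic graph can be sketched as a
backtracking search that collects every possible assignment of subwords to
user patterns; a heuristic, discussed below, then picks one result. The node
representation follows the earlier sketch and is hypothetical; markings are
assumed to consume one or more characters.

\begin{lstlisting}[language=Python]
class Node:
    def __init__(self, final=False):
        self.edges = {}  # label -> Node
        self.final = final

def matches(state, text, fills=()):
    """Yield every way the graph accepts text, as ((marking, subword), ...)."""
    if not text:
        if state.final:
            yield fills
        return
    for label, nxt in state.edges.items():
        if label.startswith('<'):              # a user marking: try every
            for i in range(1, len(text) + 1):  # non-empty subword length
                yield from matches(nxt, text[i:],
                                   fills + ((label, text[:i]),))
        elif text.startswith(label):           # an ordinary character
            yield from matches(nxt, text[len(label):], fills)

# For the graph in the figure above, matching 'abdddbc' yields both
# (('<1>', 'dddb'),) and (('<1>', 'bddd'),).
\end{lstlisting}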

\subsection{Minimality \& non-determinism}
The Myhill-Nerode theorem~\cite{Hopcroft1979} states that among all the graphs
accepting the same language there is a single graph with the least number of
states. Mihov~\cite{Mihov1998} has proven that the algorithm, in its original
form, generates minimal DAWGs. Our program converts the node-lists to DAWGs
that can contain non-deterministic transitions from nodes, and therefore one
can question whether the Myhill-Nerode theorem and Mihov's proof still hold.
Due to the nature of the non-determinism this is not a problem and both hold:
the graph itself is only non-deterministic when the categories are expanded,
and thus only during matching.

When choosing the smartest path during matching, the program has to choose
deterministically between possibly multiple paths with possibly multiple
results. There are several possible heuristics to choose from; a sketch of
both heuristics is given after the list.
\begin{itemize}
\item Maximum fields heuristic\\
This heuristic prefers the result that has the highest number of categories
filled with actual text. Using this method the largest number of data fields
is filled at all times. The downside of this method is that some data may not
be put in the right field because a suboptimal split occurred that put the
data in two separate fields whereas it should have been in one field.
\item Maximum path heuristic\\
The maximum path heuristic tries to find a match with the highest number of
fixed path transitions. Fixed path transitions are transitions that do not
occur within a category. The philosophy behind this is that because these
paths are hard-coded in the graph they must be important. The downside of this
method shows when overlap occurs between hard-coded paths and information
within the categories. For example, a band called \texttt{Location} could
interfere greatly with a hard-coded path that marks a location using the same
word.
\end{itemize}
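
Both heuristics can be phrased as score functions over the candidate results
produced during matching; the sketch below assumes a result is a sequence of
\texttt{(marking, subword)} pairs, as in the earlier matching sketch.

\begin{lstlisting}[language=Python]
def maximum_fields(result, entry):
    """Prefer the result with the most categories filled with text."""
    return sum(1 for _marking, text in result if text)

def maximum_path(result, entry):
    """Prefer the result with the most fixed path transitions, i.e.
    the fewest characters consumed inside the categories."""
    return len(entry) - sum(len(text) for _marking, text in result)

def choose(results, entry, heuristic=maximum_fields):
    """Pick one result deterministically; ties break on a fixed order."""
    return max(sorted(results), key=lambda r: heuristic(r, entry))
\end{lstlisting}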

If we knew more about the categories, the maximum path heuristic would
automatically become the better heuristic. When, as in our implementation,
there is very little information, both heuristics perform about the same.