\section{Application overview}
The backend consists of several processing steps that the input has to go
through before it is converted to a crawler specification. These steps are
visualized in Figure~\ref{appinternals}. Every node is a milestone in the
processing of the user data and the arrows indicate information transfer
between the steps. The figure is a detailed view of the \textit{Backend}
node in Figure~\ref{appoverview}.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{backend.pdf}
	\caption{Main module internals\label{appinternals}}
\end{figure}

\section{HTML data}
The raw data from the frontend, including the user markings, enters the
backend as an HTTP \textit{POST} request. This \textit{POST} request consists
of several data fields. These fields are either fields from the static
description boxes in the frontend or raw \textit{HTML} data from the table
showing the processed RSS feed entries, which contains the markings made by
the user. The table is sent in whole at the moment the user presses the
submit button. Before sending, markers are placed within the \textit{HTML}
data of the table. These markers make parsing the tables easier and remove
the need for an advanced \textit{HTML} parser to extract the markings. The
\textit{POST} request is not sent asynchronously. Synchronous sending means
that the user has to wait until the server has processed the request. In this
way the user is notified immediately when the processing has finished
successfully and can review and test the resulting crawler. All the data
preparation for the \textit{POST} request is done in \textit{Javascript} and
thus happens on the client side. All the processing afterwards is done in the
backend and thus happens on the server side. All descriptive fields are put
directly into the final aggregating dictionary; all the raw \textit{HTML}
data is sent to the next step.

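As a rough illustration of the request described above, the payload can be
thought of as a simple key-value structure. The field names below are purely
hypothetical; the application's actual identifiers are not shown here.

```python
# Illustrative shape of the POST payload: static description fields plus the
# raw HTML of the marked table. Field names are assumptions, not the
# application's actual identifiers.
post_data = {
    "name": "Venue name",                 # static description field
    "rss_url": "http://example.org/rss",  # static description field
    "table": "<table>...</table>",        # raw HTML of the marked entries
}
```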
\section{Table rows}
The first conversion step is to extract the individual table rows from the
\textit{HTML} data. In the previous step markers were placed to make
identifying the table rows easy. Because of this, extracting the table rows
only requires rudimentary matching techniques that need very little
computational power. User markings are highlights of certain text elements.
The highlighting is done using \texttt{SPAN} elements and therefore all
\texttt{SPAN} elements have to be found and extracted. For every row all
\texttt{SPAN} elements are extracted to obtain their color, after which the
elements are removed to recover the original plain text of the \textit{RSS}
feed entry. When this step is done, a data structure containing all the text
of the entries together with the markings goes to the next step. All original
data, namely the \textit{HTML} data per row, is transferred to the final
aggregating dictionary.

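The per-row extraction described above can be sketched with rudimentary
regular-expression matching. The exact \texttt{SPAN} markup is an assumption
here: a marking is taken to be a \texttt{SPAN} whose inline style carries the
highlight color identifying the category.

```python
import re

# Hypothetical frontend markup: each marking is a SPAN whose background
# color identifies the category.
SPAN = re.compile(r'<span style="background-color: *(\w+);?">(.*?)</span>')

def process_row(row_html):
    # collect (color, text) pairs, then strip the SPANs to recover the
    # original plain text of the RSS feed entry
    markings = [(m.group(1), m.group(2)) for m in SPAN.finditer(row_html)]
    plain = SPAN.sub(r'\2', row_html)
    return plain, markings

row = ('19:00, <span style="background-color: green">'
       '2014-11-12</span> - Foobar')
plain, markings = process_row(row)
```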
\section{Node lists}
Every entry from the previous step is processed into a so-called node-list.
A node-list can be seen as a path graph in which every character and marking
has a node. A path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where
$V=\{n_1, n_2, \ldots, n_{i-1}, n_i\}$ and
$E=\{(n_1, n_2), (n_2, n_3), \ldots, (n_{i-1}, n_{i})\}$. A path graph is
basically a graph that is a single linear path of nodes where every node is
connected to the next node, except for the last one. The last node is the
only final node. The transition between two nodes is either a character or a
marking. As an example we take the entry \texttt{19:00, 2014--11--12 {-}
Foobar} and create the corresponding node-list, which is shown in
Figure~\ref{nodelistexample}. Characters are denoted with single quotes,
spaces with an underscore and markings with angle brackets. Node-lists are
the basic elements from which the DAWG will be generated. These node-lists
are also available in the final aggregating dictionary, to ensure consistency
of the data and the possibility of regenerating it.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nodelistexample.pdf}
	\caption{Node list example\label{nodelistexample}}
\end{figure}

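The conversion of a marked entry into the linear list of transitions can be
sketched in Python. The angle-bracket marker syntax follows the figure's
notation; the example entry is illustrative.

```python
import re

# A marking such as <time> is written in angle brackets, as in the figure.
MARKER = re.compile(r'<\w+>')

def to_node_list(entry):
    # Build the list of transitions of the path graph: one transition per
    # character, and one transition per whole marking.
    transitions = []
    pos = 0
    for m in MARKER.finditer(entry):
        transitions.extend(entry[pos:m.start()])  # character transitions
        transitions.append(m.group())             # marking as one transition
        pos = m.end()
    transitions.extend(entry[pos:])
    return transitions

nodes = to_node_list("<time>, <date> - Foobar")
```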
\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an outgoing edge to a child node.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally a DAWG has single letters from an alphabet as
edge labels, but in our implementation the DAWG's alphabet contains all
letters, whitespace and punctuation, as well as the specified user markings.
A marking can consist of multiple characters, but for the DAWG's sake it is
a single transition in the graph.

Due to its constraints a DAWG can be used to check whether a word matches: a
word is accepted when there is a path from the initial node to a final node
whose concatenated edge labels form the word. The first algorithm to generate
DAWGs from words was proposed by Hopcroft et al.~\cite{Hopcroft1971}. It is
an incremental approach: the graph is expanded entry by entry. Hopcroft's
algorithm has the constraint that the words must be added in lexicographical
order. Later Daciuk et al.~\cite{Daciuk1998,Daciuk2000} improved on the
original algorithm, and their algorithm is the one we use to generate minimal
DAWGs from the node-lists.

\subsection{Algorithm}
Building a DAWG is an iterative process that goes roughly in four steps. We
start with the null graph $G_0=(\{q_0\}, q_0, \emptyset, \emptyset)$, which
contains a single node, no edges, and accepts no words:
$\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph is
added in a naive way: the graph is simply replaced by the node-list of the
word, which is a path graph. We just create a new node for every character
transition and mark the last node as final. From then on all words are added
using the four step approach described below. Pseudocode for this algorithm
can be found in Listing~\ref{pseudodawg} as the function
\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be
found in Listing~\ref{dawg.py}.
\begin{enumerate}
	\item
		Say we add word $w$ to the graph. Step one is finding the
		longest common prefix of $w$ and the words already in the
		graph. The common prefix is defined as the longest subword
		$w'$ for which $\delta^*(q_0, w')$ exists. When the common
		prefix is found we move to the last state of the common
		prefix and are left with the suffix $w''$ where $w=w'w''$.
	\item
		We add the suffix to the graph, starting from the last node
		of the common prefix.
	\item
		When there are confluence nodes present within the common
		prefix they are cloned, starting from the last confluence
		node, to avoid adding unwanted words.
	\item
		From the last node of the suffix up to the first node we
		replace the nodes by checking whether there are equivalent
		nodes present in the graph that we can merge with.
\end{enumerate}

\begin{algorithm}[H]
	\SetKwProg{Def}{def}{:}{end}
	\Def{generate\_dawg(words)}{%
		register := $\emptyset$\;
		\While{there is another word}{%
			word := next word\;
			commonprefix := CommonPrefix(word)\;
			laststate := %
				delta\textsuperscript{*}(q0, commonprefix)\;
			currentsuffix := %
				word[length(commonprefix)\ldots length(word)]\;
			\If{has\_children(laststate)}{%
				replace\_or\_register(laststate)\;
			}
			add\_suffix(laststate, currentsuffix)\;
		}
		replace\_or\_register(q0)\;
	}
	\Def{replace\_or\_register(state)}{%
		child := last\_child(state)\;
		\If{has\_children(child)}{%
			replace\_or\_register(child)\;
		}
		\eIf{there is an equivalent state q}{%
			last\_child(state) := q\;
			delete(child)\;
		}{%
			register.add(child)\;
		}
	}
	\caption{Generating DAWGs pseudocode\label{pseudodawg}}
\end{algorithm}

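For illustration, the data structure itself can also be sketched directly in
code. The sketch below is not the incremental algorithm above: it is a
simpler two-phase variant (build a trie, then merge equivalent states
bottom-up using the same equivalence notion as the register) that yields an
automaton with the same accepted words. All names are ours.

```python
class Node:
    def __init__(self):
        self.edges = {}    # edge label -> child Node
        self.final = False

def build_trie(words):
    # naive construction: one path per word, shared prefixes only
    root = Node()
    for w in words:
        node = root
        for ch in w:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root

def minimize(node, register=None):
    # merge equivalent states bottom-up; two states are equivalent when
    # they agree on finality and have identical outgoing edges
    if register is None:
        register = {}
    for label, child in node.edges.items():
        node.edges[label] = minimize(child, register)
    signature = (node.final,
                 tuple(sorted((l, id(c)) for l, c in node.edges.items())))
    return register.setdefault(signature, node)

def accepts(root, word):
    # a word matches when a path exists whose edge labels spell the word
    node = root
    for ch in word:
        if ch not in node.edges:
            return False
        node = node.edges[ch]
    return node.final

words = ["abcd", "aecd", "aecf"]
dawg = minimize(build_trie(words))
```

Running this on the three entries of the example below gives a graph that
accepts exactly those words; in particular the unwanted word \texttt{abcf}
is rejected.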
\newpage\subsection{Example}
The size of the graphs generated from real world data from the leisure
industry grows extremely fast. Therefore the example consists of short
strings instead of real life event information. The algorithm is visualized
with an example shown in the subgraphs in Figure~\ref{dawg1}, which builds a
DAWG from the following entries: \texttt{abcd}, \texttt{aecd} and
\texttt{aecf}.

\begin{itemize}
	\item No words added yet\\
		Initially we begin with the null graph. This graph is shown
		in the figure as SG0. This DAWG does not yet accept any
		words.
	\item Adding \texttt{abcd}\\
		Adding the first entry \texttt{abcd} is trivial because we
		can just create a single path, which does not require any
		hard work. This is because the common prefix we find in
		Step 1 is empty and the suffix is thus the entire word.
		Merging the suffix back into the graph is also not possible
		since there are no nodes except for the first node. The
		result of adding the first word is visible in subgraph SG1.
	\item Adding \texttt{aecd}\\
		For the second entry we have to do some extra work. The
		common prefix found in Step 1 is \texttt{a}. This leaves us
		in Step 2 with the suffix \texttt{ecd}, which we add to the
		graph. In Step 3 we see that there are no confluence nodes in
		our common prefix and therefore we do not have to clone
		nodes. In Step 4 we traverse from the last node back to the
		beginning of the suffix and find the common suffix
		\texttt{cd}, whose nodes we can merge. In this way we can
		reuse the transition from $q_3$ to $q_4$. This leaves us with
		subgraph SG2.
	\item Adding \texttt{aecf}\\
		We now add the last entry, the word \texttt{aecf}. If we did
		this without the confluence node check we would introduce an
		unwanted extra word. In Step 1 we find the common prefix
		\texttt{aec}, which leaves us with the suffix \texttt{f} to
		add. This creates subgraph SG3, and we notice that there is
		an extra word present in the graph, namely the word
		\texttt{abcf}. This extra word appeared because of the
		confluence node $q_3$, which was present in the common prefix
		and introduced an unwanted path. Therefore in Step 3, when we
		check for confluence nodes, we see that node $q_3$ is a
		confluence node and needs to be cloned off the original path;
		by cloning we take the added suffix with it and create an
		entirely new route. Tracing the route back again we do not
		encounter any other confluence nodes, so we can start Step 4.
		For this word there is no common suffix and therefore no
		merging is applied. This results in subgraph SG4, which is
		the final DAWG containing only the words added.
\end{itemize}

\begin{figure}[H]
	\centering
	\includegraphics[height=20em]{inccons.pdf}
	\caption{Incrementally constructing a DAWG\label{dawg1}}
\end{figure}

\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted
automatically to a DAWG using the algorithm we described. This is because the
user markings are not necessarily a single character or word. Currently user
markings are subgraphs that accept any word of any length. When we add a user
marking, we insert a kind of subgraph in the place of the node with the
marking. By doing this we can introduce non-determinism into the graph.
Non-determinism means that a single node has multiple outgoing edges with the
same label; in practice this means that a word can be present in the graph
via multiple paths. An example of non-determinism in one of our DAWGs is
shown in Figure~\ref{nddawg}. This figure represents a generated DAWG with
the following entries: \texttt{ab<1>c} and \texttt{a<1>bc}.

In this graph the word \texttt{abdc} will be accepted and the user pattern
\texttt{<1>} will be filled with the subword \texttt{d}. However, if we try
the word \texttt{abdddbc} both paths can be chosen. In the first case the
user pattern \texttt{<1>} will be filled with \texttt{dddb} and in the second
case with \texttt{bddd}. In such a case we need to make the, hopefully,
smartest choice. When no path matches, the system reports a failed
extraction. The crawling system can be made more forgiving so that it gives
partial information when no full match is possible; however, it still needs
to report the error and the data should be handled with extra care.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nddawg.pdf}
	\caption{Example of non-determinism\label{nddawg}}
\end{figure}

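The ambiguity in Figure~\ref{nddawg} can be reproduced with ordinary regular
expressions by treating the marking \texttt{<1>} as a wildcard group. This is
only an illustration of the two competing paths, not the program's actual
matcher.

```python
import re

# The two paths of the DAWG in the figure, written as regexes:
# entry ab<1>c and entry a<1>bc, with <1> matching any non-empty subword.
paths = [re.compile(r'ab(.+)c'), re.compile(r'a(.+)bc')]

def extract(word):
    # return every possible filling of marking <1>, one per matching path
    results = []
    for p in paths:
        m = p.fullmatch(word)
        if m:
            results.append(m.group(1))
    return results
```

For \texttt{abdc} only one path matches, but for \texttt{abdddbc} both paths
match and yield different fillings, exactly the ambiguity described above.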
\subsection{Minimality \& non-determinism}
The Myhill-Nerode theorem~\cite{Hopcroft1979} states that among all graphs
accepting the same language there is a unique graph with the minimal number
of states. Mihov~\cite{Mihov1998} has proven that the algorithm for
generating DAWGs is minimal in its original form. Our program converts the
node-lists to DAWGs that can contain non-deterministic transitions, so one
can question whether the Myhill-Nerode theorem and Mihov's proof still hold.
Due to the nature of the non-determinism this is not a problem and both
hold: the graph itself is only non-deterministic when expanding the
categories, and thus only during matching.

To choose the smartest path during matching, the program has to decide
deterministically between possibly multiple paths with possibly multiple
results. There are several heuristics to choose from.
\begin{itemize}
	\item Maximum fields heuristic\\

		This heuristic prefers the result that has the highest number
		of categories filled with actual text. Using this method the
		highest number of data fields is filled at all times. The
		downside of this method is that some data may not end up in
		the right field, because a suboptimal split has put the data
		in two separate fields where it should have been in one.
	\item Maximum path heuristic\\

		The maximum path heuristic tries to find a match with the
		highest number of fixed path transitions. Fixed path
		transitions are transitions that do not occur within a
		category. The philosophy behind this is that because these
		paths are hard coded in the graph they must be important. The
		downside of this method shows when overlap occurs between
		hard coded paths and information within the categories. For
		example a band called \texttt{Location} could interfere
		greatly with a hard coded path that marks a location using
		the same words.
\end{itemize}

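Both heuristics reduce to simple scoring functions over candidate matches.
The sketch below uses the entry from the node-list example; the category
names and the candidate splits are hypothetical.

```python
def max_fields_score(match):
    # maximum fields heuristic: count categories filled with actual text
    return sum(1 for text in match.values() if text)

def max_path_score(word, match):
    # maximum path heuristic: count characters matched by fixed path
    # transitions, i.e. characters not consumed by any category
    return len(word) - sum(len(text) for text in match.values())

# hypothetical candidate matches for one entry
word = "19:00, 2014-11-12 - Foobar"
candidates = [
    {"time": "19:00", "date": "2014-11-12", "title": "Foobar"},
    {"time": "19:00", "date": "2014-11-12 - Foobar", "title": ""},
]
best = max(candidates, key=max_fields_score)
```

Here the maximum fields heuristic prefers the first candidate, which fills
three categories instead of two.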
The more one knows about the contents of the categories, the better the
maximum fields heuristic performs. When, as in our current implementation,
the categories do not contain information, both heuristics perform about the
same.