\section{Application overview}
The backend runs several processing steps before a crawler specification is
ready. These steps are visualized in Figure~\ref{appinternals}, where every
node is a milestone in the processing of the user data and the arrows indicate
information transfer between the steps. The figure is a detailed view of the
\textit{Backend} node in Figure~\ref{appoverview}.

\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{backend.eps}
	\strut\\
	\caption{Main module internals}
	\label{appinternals}
\end{figure}

\section{HTML data}
The raw data from the \textit{Frontend}, including the user markings, enters
the backend as an HTTP \textit{POST} request. The request consists of several
information fields extracted from the description boxes in the frontend,
accompanied by the raw \textit{HTML} data of the table with the feed entries
that contains the markings made by the user at the moment the submit button is
pressed. Before sending, markers are placed within the \textit{HTML} data of
the table to make parsing the tables easier and to remove the need for an
advanced \textit{HTML} parser. The \textit{POST} request is not sent
asynchronously, which means the user has to wait until the server has
processed the request. In this way the user is immediately notified when the
processing has finished successfully and can review and test the resulting
crawler. All data preparation for the \textit{POST} request is done in
Javascript, and thus on the client side; all processing afterwards is done in
the backend, and thus on the server side. All descriptive fields are
transferred to the final aggregating dictionary.

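As a minimal sketch of the receiving side, assuming a form-encoded request
body and a hypothetical field name \texttt{table} for the marked-up rows (the
thesis fixes neither the framework nor the field names):

```python
from urllib.parse import parse_qs


def split_submission(body):
    """Split a form-encoded POST body into the descriptive fields and the
    raw marked-up HTML of the entry table.

    The field name "table" is hypothetical; the thesis does not list the
    exact field names used by the frontend."""
    fields = {k: v[0] for k, v in parse_qs(body).items()}
    html = fields.pop("table", "")
    return fields, html
```

The descriptive fields go straight into the aggregating dictionary; only the
table data needs further processing.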
\section{Table rows}
The first conversion step extracts the rows from the \textit{HTML} data.
Because of the table markers this can be done using very rudimentary matching
techniques that require very little computational power. All \texttt{SPAN}
\textit{HTML} elements have to be found, since the markings of the user are
visualized through colored \texttt{SPAN} elements. To achieve this, for every
row all \texttt{SPAN} elements are extracted, again with simple matching
techniques, to obtain the color; afterwards the elements are removed to
retrieve the original plain text of the \textit{RSS} feed entry. When this
step is done, a data structure containing the text of all entries together
with the markings goes to the next step. All original data, namely the
\textit{HTML} data per row, is transferred to the final aggregating
dictionary.

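As an illustration, the per-row extraction can be sketched in Python with
plain regular expressions. The exact markup the frontend emits is an
assumption here; a \texttt{SPAN} with a background color stands in for a
marking:

```python
import re

# Hypothetical row markup: the frontend visualizes a marking as a colored
# SPAN element around the marked text.
row = 'Concert: <span style="background-color:#ff0000">19:00</span> Foobar'

# Simple (non-general) matching suffices because the markup is generated by
# the frontend itself and is therefore predictable.
SPAN = re.compile(r'<span style="background-color:(#[0-9a-f]{6})">(.*?)</span>')

# Extract the color and text of every marking in the row.
markings = SPAN.findall(row)

# Remove the SPAN elements to recover the original plain feed entry text.
plain = SPAN.sub(lambda m: m.group(2), row)
```

Because the markup is machine-generated, this avoids a full \textit{HTML}
parser, exactly as described above.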
\section{Node lists}
Every entry from the previous step is processed into a so-called node-list. A
node-list can be seen as a path graph over the characters and markings. A
path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \ldots,
n_{i-1}, n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), \ldots, (n_{i-1}, n_i)\}$. A
path graph is a graph consisting of a single path of nodes in which every
node is connected to the next one and the last one is final. The transition
between two nodes is either a character or a marking. As an example we take
the entry \texttt{19:00, 2014-11-12 - Foobar}, create the corresponding
node-list and visualize it in Figure~\ref{nodelistexample}. Characters are
denoted with single quotes, spaces with an underscore and markers with angle
brackets. Node-lists are the basic elements from which the DAWG will be
generated. These node-lists are also available in the final aggregating
dictionary to ensure consistency of the data and the possibility of
regenerating it.
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nodelistexample.eps}
	\caption{Node list example}
	\label{nodelistexample}
\end{figure}

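The conversion from a marked entry to a node-list can be sketched as a simple
tokenization. The angle-bracket marker syntax is an assumption based on the
notation used here:

```python
import re

# A token is either a whole marker such as "<time>" (hypothetical marker
# name) or a single character; the alternation order makes markers win.
TOKEN = re.compile(r'<[^>]+>|.')


def node_list(entry):
    """Split a marked entry into the edge labels of its path graph:
    single characters, with each marker kept as one multi-character label."""
    return TOKEN.findall(entry)
```

Each returned label corresponds to one transition in the path graph, so a
marker of arbitrary length still occupies a single edge.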
\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an arrow to the child.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally a DAWG has single letters from an alphabet as
edge labels, but in our implementation the alphabet of the DAWG contains all
letters, whitespace and punctuation, and also the user-specified markers,
which can span multiple characters in the original text but count as a single
transition in the graph.

DAWGs are graphs, but due to these constraints we can use a DAWG to check
whether a word matches: a match occurs if there is a path from the initial
node whose concatenated edge labels spell out the word, found by following
the characters of the entry. The first algorithm to generate DAWGs from words
was proposed by Hopcroft et al.~\cite{Hopcroft1971}. It is an incremental
approach to generating the graph, meaning that the graph is expanded entry by
entry. Hopcroft's algorithm has the constraint that the words must be added
in lexicographical order. Later on, Daciuk et
al.~\cite{Daciuk1998,Daciuk2000} improved on the original algorithm, and
their algorithm is the one we used to generate minimal DAWGs from the
node-lists.

\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that goes roughly in
two steps. We start with the null graph, which can be described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$: it contains a single node, no edges and
$\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph is
added in a naive way: we simply create a new node for every character
transition and mark the last node as final. From then on all words are added
using a stepwise approach.
\begin{enumerate}
	\item
		Say we add word $w$ to the graph. Step one is finding the common
		prefix of the word with the words already in the graph. The common
		prefix is defined as the longest subword $w'$ for which
		$\delta^*(q_0, w')$ exists. When the common prefix is found we
		change the starting state to the last state of the common prefix
		and are left with the suffix $w''$ where $w=w'w''$.
	\item
		We add the suffix to the graph starting from the last node of the
		common prefix.
	\item
		When there are confluence nodes present within the common prefix
		they are cloned, starting from the last confluence node, to avoid
		adding unwanted words.
	\item
		From the last node of the suffix up to the first node we replace
		the nodes by checking whether there are equivalent nodes present in
		the graph that we can merge with.
\end{enumerate}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

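As an illustration of the stepwise approach, the following Python sketch
implements the simpler sorted-input variant of the incremental construction;
our implementation follows the unsorted variant with confluence-node cloning,
and all names in the sketch are our own:

```python
class State:
    """One DAWG node: outgoing edges and a final flag."""

    def __init__(self):
        self.edges = {}   # edge label -> State
        self.final = False

    def signature(self):
        # Two states are equivalent iff they agree on finality and on every
        # outgoing edge (same labels, same target state identities).
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))


def build_dawg(words):
    """Incrementally build a minimal DAWG; `words` must be sorted."""
    root = State()
    register = {}    # signature -> canonical state
    unchecked = []   # (parent, label, child) triples not yet minimized
    previous = ""

    def minimize(down_to):
        # Merge unchecked states, deepest first, with equivalent ones.
        while len(unchecked) > down_to:
            parent, label, child = unchecked.pop()
            sig = child.signature()
            if sig in register:
                parent.edges[label] = register[sig]
            else:
                register[sig] = child

    for word in words:
        # Step 1: length of the common prefix with the previous word.
        i = 0
        while i < min(len(word), len(previous)) and word[i] == previous[i]:
            i += 1
        # Step 4 for the previous word: minimize everything past the prefix.
        minimize(i)
        # Step 2: add the remaining suffix as a fresh chain of states.
        node = unchecked[-1][2] if unchecked else root
        for label in word[i:]:
            nxt = State()
            node.edges[label] = nxt
            unchecked.append((node, label, nxt))
            node = nxt
        node.final = True
        previous = word
    minimize(0)
    return root


def accepts(root, word):
    """Follow the edge labels; the word matches iff we end in a final state."""
    node = root
    for label in word:
        if label not in node.edges:
            return False
        node = node.edges[label]
    return node.final
```

Building the DAWG from the entries of the example below (\texttt{abcd},
\texttt{aecd}, \texttt{aecf}) accepts exactly those three words; in
particular the unwanted word \texttt{abcf} is rejected.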
\subsection{Example}
We visualize this with an example, shown in the subgraphs in
Figure~\ref{dawg1}, that builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial, because it just creates a
single path and does not require any hard work. Because the common prefix we
find in Step 1 is empty, the suffix is the entire word. There is also no
possibility of merging nodes because there were no nodes to begin with. The
result of adding the first word is visible in subgraph SG1.

For the second entry \texttt{aecd} we have to do some extra work. The common
prefix found in Step 1 is \texttt{a}. This leaves us in Step 2 with the
suffix \texttt{ecd}, which we add to the graph. In Step 3 we see that there
are no confluence nodes in our common prefix, so we do not have to clone
nodes. In Step 4 we traverse from the last node back to the beginning of the
suffix, find the common suffix \texttt{cd} and merge these nodes. In this way
we can reuse the transition from $q_3$ to $q_4$. This leaves us with subgraph
SG2.

We now add the last entry, the word \texttt{aecf}. If we do this without the
confluence-node check we introduce an unwanted extra word. In Step 1 we find
the common prefix \texttt{aec}, which leaves us with the suffix \texttt{f} to
add. This creates subgraph SG3, and we notice that an extra word is present
in the graph, namely \texttt{abcf}. This extra word appeared because the
confluence node $q_3$ in the common prefix introduced an unwanted path.
Therefore, in Step 3, the check for confluence nodes finds that $q_3$ is a
confluence node and needs to be cloned off the original path; by cloning we
take the added suffix with it and create an entirely new route. Tracing the
route back again we do not encounter any other confluence nodes, so we can
start Step 4. For this word there is no common suffix and therefore no
merging is applied. This results in subgraph SG4, the final DAWG containing
only the words added.

\begin{figure}[H]
	\caption{Stepwise construction of the example DAWG}
	\label{dawg1}
	\centering
	\digraph[scale=0.4]{inccons}{
		rankdir=LR;
		n4 [style=invis];
		q40 [label="q0"];
		q41 [label="q1"];
		q42 [label="q2"];
		q43 [label="q3"];
		q44 [label="q4" shape=doublecircle];
		q45 [label="q5"];
		q46 [label="q6"];
		n4 -> q40[label="SG4"];
		q40 -> q41[label="a"];
		q41 -> q42[label="b"];
		q42 -> q43[label="c"];
		q43 -> q44[label="d"];
		q41 -> q45[label="e"];
		q45 -> q46[label="c"];
		q46 -> q44[label="d"];
		q46 -> q44[label="f"];

		n3 [style=invis];
		q30 [label="q0"];
		q31 [label="q1"];
		q32 [label="q2"];
		q33 [label="q3"];
		q34 [label="q4" shape=doublecircle];
		q35 [label="q5"];
		n3 -> q30[label="SG3"];
		q30 -> q31[label="a"];
		q31 -> q32[label="b"];
		q32 -> q33[label="c"];
		q33 -> q34[label="d"];
		q33 -> q34[label="f" constraint=false];
		q31 -> q35[label="e"];
		q35 -> q33[label="c"];

		n2 [style=invis];
		q20 [label="q0"];
		q21 [label="q1"];
		q22 [label="q2"];
		q23 [label="q3"];
		q24 [label="q4" shape=doublecircle];
		q25 [label="q5"];
		n2 -> q20[label="SG2"];
		q20 -> q21[label="a"];
		q21 -> q22[label="b"];
		q22 -> q23[label="c"];
		q23 -> q24[label="d"];
		q21 -> q25[label="e"];
		q25 -> q23[label="c"];

		n1 [style=invis];
		q10 [label="q0"];
		q11 [label="q1"];
		q12 [label="q2"];
		q13 [label="q3"];
		q14 [label="q4" shape=doublecircle];
		n1 -> q10[label="SG1"];
		q10 -> q11[label="a"];
		q11 -> q12[label="b"];
		q12 -> q13[label="c"];
		q13 -> q14[label="d"];

		n [style=invis];
		q0 [label="q0"];
		n -> q0[label="SG0"];
	}
\end{figure}

\subsection{Minimality of the algorithm}


\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted to a
DAWG automatically using the algorithm we described, because the user
markings are not necessarily a single character or word: a user marking is
essentially one or more characters. When we add a user marking we insert a
character that is not in the alphabet, and later on we change the marking
into a kind of subgraph. When this is applied, non-determinism can be
introduced into the graph. Non-determinism means that a single node has
multiple outgoing edges with the same transition; in practice this means that
a word can be present in the graph via multiple paths. This is shown in
Figure~\ref{nddawg} with the following words: \texttt{ab<1>c} and
\texttt{a<1>bbc}. In this graph the word \texttt{abdc} is accepted and the
user pattern \texttt{<1>} is filled with the word \texttt{d}. However, if we
try the word \texttt{abdddbc}, both paths can be chosen. In the first case
the user pattern \texttt{<1>} is filled with \texttt{dddb} and in the second
case with \texttt{bddd}. In such a case we need to make the, hopefully,
smartest choice.

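Enumerating the possible choices makes the ambiguity explicit. A minimal
backtracking sketch, using the simplified hypothetical patterns
\texttt{ab<1>c} and \texttt{a<1>bc} rather than the patterns of the figure,
that returns every possible filling of the markers:

```python
def all_matches(pattern, word):
    """Enumerate every way `word` can follow `pattern`, where the pattern
    is a list of single-character labels and markers such as "<1>"; a
    marker consumes one or more characters of the word."""
    results = []

    def walk(pi, wi, fills):
        if pi == len(pattern):
            if wi == len(word):          # whole word consumed: accept
                results.append(dict(fills))
            return
        token = pattern[pi]
        if token.startswith("<"):        # marker: try every non-empty span
            for end in range(wi + 1, len(word) + 1):
                walk(pi + 1, end, {**fills, token: word[wi:end]})
        elif wi < len(word) and word[wi] == token:
            walk(pi + 1, wi + 1, fills)  # literal character transition

    walk(0, 0, {})
    return results
```

For the word \texttt{abdbc}, the pattern \texttt{ab<1>c} fills \texttt{<1>}
with \texttt{db} while \texttt{a<1>bc} fills it with \texttt{bd}; a DAWG
containing both patterns therefore has two accepting paths for one word and
must pick one of them.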
\begin{figure}[H]
	\centering
	\includegraphics[width=\linewidth]{nddawg.eps}
	\caption{Example of non-determinism}
	\label{nddawg}
\end{figure}

\subsection{Minimality and non-determinism}
The Myhill--Nerode theorem~\cite{Hopcroft1979} states that among all graphs
accepting the same language there is a single graph with the least number of
states. Mihov~\cite{Mihov1998} has proven that the algorithm is minimal in
its original form. Our program converts the node-lists to DAWGs that can
contain non-deterministic nodes, and therefore one can argue about the
minimality. Due to the nature of the non-determinism this is not the case:
the non-determinism is only visible when matching the data and not in the
real graph, since in the real graph we ...
