\section{Application overview}
The backend consists of several processing steps that have to be executed
before a crawler specification is ready. These steps are visualized in
Figure~\ref{appinternals}, where the nodes are the important milestones in
processing the user data and the arrows indicate information transfer between
the steps. The figure is a detailed view of the \textit{Backend} node in
Figure~\ref{appoverview}.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{backend.eps}
\strut\\
\caption{Main module internals}
\label{appinternals}
\end{figure}

\section{HTML data}
The raw data from the \textit{Frontend} enters the backend as an HTTP
\textit{POST} request. It consists of several information fields extracted
from the description boxes in the frontend, accompanied by the raw
\textit{HTML} data of the table with the feed entries that contains the
markings the user has made just before pressing the submit button. Markers
are placed within the \textit{HTML} data of the table before sending, to make
parsing the tables easier and to remove the need for an advanced
\textit{HTML} parser. The \textit{POST} request is not sent asynchronously,
which means the user has to wait until the server has processed the request.
In this way the user is notified immediately when the processing has finished
successfully and can review and test the resulting crawler. All data
preparation for the \textit{POST} request is done in Javascript and thus on
the client side; all processing afterwards is done in the backend and thus on
the server side. All descriptive fields are transferred to the final
aggregating dictionary.

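To make the shape of this request concrete, the following minimal sketch
separates the descriptive fields from the raw table data in a \textit{POST}
body. It is illustrative only: the backend framework is not named in this
chapter, and the field name \texttt{table} is an assumption, not taken from
the implementation.

```python
from urllib.parse import parse_qs


def split_request(body):
    """Split a URL-encoded POST body into the descriptive fields and
    the raw table HTML carrying the user markings."""
    fields = {k: v[0] for k, v in parse_qs(body).items()}
    # The raw HTML travels in its own field; the name 'table' is a
    # hypothetical choice for this sketch.
    table_html = fields.pop('table', '')
    return fields, table_html
```

The descriptive fields end up in the final aggregating dictionary, while the
table HTML continues into the row-extraction step.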
\section{Table rows}
The first conversion step is to extract the rows from the \textit{HTML} data.
Thanks to the table markers this can be done with very rudimentary matching
techniques that require very little computational power. All \texttt{SPAN}
\textit{HTML} elements have to be found, since the markings of the user are
visualized through colored \texttt{SPAN} elements. To achieve this, all
\texttt{SPAN} elements of every row are extracted, again with simple matching
techniques, first to extract the color and afterwards to remove the element
so that the original plain text of the \textit{RSS} feed entry is recovered.
When this step is done, a data structure containing the text of all entries
together with the markings goes to the next step. All original data, namely
the \textit{HTML} data per row, is transferred to the final aggregating
dictionary.

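As an illustration of how little machinery this requires, the sketch below
extracts the color and text of each \texttt{SPAN} element with a regular
expression and removes the element to recover the plain text. The exact
attribute format (an inline \texttt{background-color} style) is an assumption
for this sketch, not taken from the frontend code.

```python
import re

# Assumed row format: markings are SPAN elements with an inline
# background colour, e.g. <span style="background-color:red">19:00</span>.
SPAN_RE = re.compile(
    r'<span[^>]*background-color:\s*(?P<color>\w+)[^>]*>(?P<text>.*?)</span>',
    re.IGNORECASE | re.DOTALL)


def parse_row(row_html):
    """Return (plain_text, markings) for one marked table row."""
    markings = []  # (color, marked_text) pairs, in document order

    def strip_span(match):
        markings.append((match.group('color'), match.group('text')))
        return match.group('text')  # keep the text, drop the tags

    plain = SPAN_RE.sub(strip_span, row_html)
    return plain, markings
```

Running this over the example entry recovers the original feed text while
recording which fragments were marked and in which color.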
\section{Node lists}
Every entry obtained from the previous step is processed into a so-called
node-list. A node-list can be seen as a path graph over the characters and
markings. A path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1,
n_2, \ldots, n_{i-1}, n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), \ldots,
(n_{i-1}, n_{i})\}$. A path graph is thus a graph that forms a single path of
nodes, where every node is connected to the next one and the last one is
final. The transition between two nodes is either a character or a marking.
As an example we take the entry \texttt{19:00, 2014-11-12 - Foobar}, create
the corresponding node-list and visualize it in Figure~\ref{nodelistexample}.
Characters are denoted with single quotes, spaces with an underscore and
markers with angle brackets. Node-lists are the basic elements from which the
DAWG will be generated. These node-lists are also kept in the final
aggregating dictionary to ensure consistency of the data and the possibility
of regenerating it.
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nodelistexample.eps}
\caption{Node list example}
\label{nodelistexample}
\end{figure}

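The construction of a node-list can be sketched as follows: every character
becomes one transition, and every marker becomes a single multi-character
transition. The concrete \texttt{<digit>} marker syntax used below is an
assumption based on the angle-bracket notation of the figure.

```python
import re

# A node-list is just the ordered sequence of edge labels between the
# nodes: single characters, plus markers such as '<1>' that count as
# one transition even though they are several characters long.
MARKER_RE = re.compile(r'<\d+>')


def to_node_list(entry):
    """Split a marked entry into its transitions."""
    tokens, pos = [], 0
    for m in MARKER_RE.finditer(entry):
        tokens.extend(entry[pos:m.start()])  # one token per character
        tokens.append(m.group())             # the marker is one transition
        pos = m.end()
    tokens.extend(entry[pos:])
    return tokens
```

For example, \texttt{ab<1>c} becomes the four transitions \texttt{a},
\texttt{b}, \texttt{<1>}, \texttt{c}.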
\section{DAWGs}
\subsection{Terminology}
\textbf{Parent nodes} are nodes that have an edge to the child node.\\
\textbf{Confluence nodes} are nodes that have multiple parents.

\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally the edge labels of a DAWG are single letters
from an alphabet, but in our implementation the alphabet of the DAWG contains
all letters, whitespace and punctuation, and additionally the user-specified
markers. A marker can be multiple characters long, but for the DAWG's sake it
is a single transition in the graph.

Due to its constraints a DAWG can be used to check whether an entry matches:
a match occurs exactly when there is a path from the initial node whose
concatenated edge labels spell out the characters of the entry. The first
algorithm to generate DAWGs from words was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. It is an incremental approach to generating the
graph, meaning that the graph is expanded entry by entry. Hopcroft's
algorithm has the constraint that the words must be added in lexicographical
order. Later on Daciuk et al.~\cite{Daciuk2000} improved on the original
algorithm, and their algorithm is the one we use to minimize our DAWGs. A
minimal graph is a graph $G$ for which there is no graph $G'$ with fewer
nodes such that $\mathcal{L}(G)=\mathcal{L}(G')$, where $\mathcal{L}(G)$ is
the set of all words present in the DAWG.

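The matching procedure can be sketched with the DAWG stored as plain
adjacency dictionaries: the word is accepted exactly when following its
labels from the initial node ends in a final node. The graph below encodes
subgraph SG2 of Figure~\ref{dawg1} (the words \texttt{abcd} and
\texttt{aecd}); the dictionary encoding is an illustrative choice, not the
actual data structure of the implementation.

```python
def matches(graph, finals, start, word):
    """Accept the word iff following its labels edge by edge from the
    start node ends in a final node."""
    node = start
    for label in word:
        node = graph.get(node, {}).get(label)
        if node is None:
            return False
    return node in finals


# The DAWG for {'abcd', 'aecd'}, i.e. subgraph SG2 of the example.
SG2 = {'q0': {'a': 'q1'},
       'q1': {'b': 'q2', 'e': 'q5'},
       'q2': {'c': 'q3'},
       'q5': {'c': 'q3'},
       'q3': {'d': 'q4'}}
FINALS = {'q4'}
```

Note that the unwanted word \texttt{abcf} from the example is rejected: the
path for \texttt{abc} ends in $q_3$, which has no \texttt{f} edge.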
\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that roughly
consists of two phases. We start with the null graph, described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$, which contains a single node, no edges,
and has $\mathcal{L}(G_0)=\emptyset$. The first word is added to the graph in
a naive way: we simply create a new node for every character transition and
mark the last node as final. From then on all words are added using a
stepwise approach.
\begin{enumerate}
\item
Say we add word $w$ to the graph. Step one is finding the common prefix of
$w$ with the words already in the graph. The common prefix is defined as the
longest subword $w'$ for which there is a $\delta^*(q_0, w')$. When the
common prefix is found we move the starting state to the last state of the
common prefix and are left with the suffix $w''$ where $w=w'w''$.
\item
We add the suffix to the graph, starting from the last node of the common
prefix.
\item
When there are confluence nodes present within the common prefix, they are
cloned starting from the last confluence node to avoid adding unwanted
words.
\item
From the last node of the suffix up to the first node we replace the nodes
by checking whether there are equivalent nodes present in the graph that we
can merge with.
\end{enumerate}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

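Alongside the pseudocode, the four steps can be sketched in Python. This is a
simplified illustration and not the actual implementation: instead of
replacing nodes backwards along the suffix, it re-merges equivalent nodes
globally after every word (which yields the same minimal DAWG, only less
efficiently), and for safety it clones every prefix node after the first
confluence node.

```python
class Node:
    """One DAWG state: outgoing labelled edges plus a final flag."""
    def __init__(self):
        self.edges = {}    # label -> Node
        self.final = False


class Dawg:
    def __init__(self):
        self.root = Node()

    def add(self, word):
        indeg = self._indegrees()
        node, i, cloned = self.root, 0, False
        # Step 1: walk the common prefix; Step 3: clone confluence
        # nodes (and everything after the first clone) along the way,
        # so that extending the path cannot introduce unwanted words.
        while i < len(word) and word[i] in node.edges:
            nxt = node.edges[word[i]]
            if cloned or indeg[id(nxt)] > 1:
                clone = Node()
                clone.final = nxt.final
                clone.edges = dict(nxt.edges)
                node.edges[word[i]] = clone
                nxt, cloned = clone, True
            node, i = nxt, i + 1
        # Step 2: append the remaining suffix naively.
        for label in word[i:]:
            node.edges[label] = Node()
            node = node.edges[label]
        node.final = True
        # Step 4: merge equivalent nodes (done here as a global
        # bottom-up minimization for simplicity).
        self._minimize()

    def accepts(self, word):
        node = self.root
        for label in word:
            if label not in node.edges:
                return False
            node = node.edges[label]
        return node.final

    def _indegrees(self):
        """Count incoming edges per node, to detect confluence nodes."""
        indeg, stack, seen = {}, [self.root], set()
        while stack:
            n = stack.pop()
            if id(n) in seen:
                continue
            seen.add(id(n))
            for child in n.edges.values():
                indeg[id(child)] = indeg.get(id(child), 0) + 1
                stack.append(child)
        return indeg

    def _minimize(self):
        """Merge nodes with identical (final, outgoing-edges) signatures,
        children first; the DAWG is acyclic so recursion terminates."""
        register, memo = {}, {}

        def represent(node):
            if id(node) in memo:
                return memo[id(node)]
            for label, child in list(node.edges.items()):
                node.edges[label] = represent(child)
            sig = (node.final,
                   tuple(sorted((l, id(c)) for l, c in node.edges.items())))
            memo[id(node)] = register.setdefault(sig, node)
            return memo[id(node)]

        self.root = represent(self.root)
```

Because \texttt{add} iterates over arbitrary label sequences, the same class
works for node-lists, where a marker such as \texttt{<1>} is one label.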
\subsection{Example}
We visualize the algorithm with an example, shown in the subgraphs in
Figure~\ref{dawg1}, that builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial, because it just creates a
single path and does not require any hard work. Since the common prefix we
find in Step 1 is empty, the suffix is the entire word. There is also no
possibility of merging nodes because there were no nodes to begin with. The
result of adding the first word is visible in subgraph SG1.

For the second entry \texttt{aecd} we have to do some extra work. The common
prefix found in Step 1 is \texttt{a}. This leaves us in Step 2 with the
suffix \texttt{ecd}, which we add to the graph. In Step 3 we see that there
are no confluence nodes in our common prefix, so we do not have to clone
nodes. In Step 4 we traverse from the last node back to the beginning of the
suffix, find the common suffix \texttt{cd} and merge these nodes. In this way
we can reuse the transition from $q_3$ to $q_4$. This leaves us with subgraph
SG2.

We now add the last entry, the word \texttt{aecf}. If we did this without the
confluence-node check we would introduce an unwanted extra word. In Step 1 we
find the common prefix \texttt{aec}, which leaves us with the suffix
\texttt{f} that we add. This creates subgraph SG3, and we notice that an
extra word is present in the graph, namely the word \texttt{abcf}. This extra
word appeared because the confluence node $q_3$ in the common prefix
introduced an unwanted path. Therefore, in Step 3, the check for confluence
nodes finds that node $q_3$ is a confluence node and needs to be cloned off
the original path; by cloning we take the added suffix with it, creating an
entirely new route. Tracing the route back we do not encounter any other
confluence nodes, so we can start Step 4. For this word there is no common
suffix, so no merging is applied. This results in subgraph SG4, which is the
final DAWG containing only the words added.

\begin{figure}[H]
\caption{Incremental construction of the DAWG, subgraphs SG0--SG4}
\label{dawg1}
\centering
\digraph[scale=0.4]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4" shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label="b"];
q42 -> q43[label="c"];
q43 -> q44[label="d"];
q41 -> q45[label="e"];
q45 -> q46[label="c"];
q46 -> q44[label="d"];
q46 -> q44[label="f"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4" shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label="b"];
q32 -> q33[label="c"];
q33 -> q34[label="d"];
q33 -> q34[label="f" constraint=false];
q31 -> q35[label="e"];
q35 -> q33[label="c"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4" shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label="b"];
q22 -> q23[label="c"];
q23 -> q24[label="d"];
q21 -> q25[label="e"];
q25 -> q23[label="c"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4" shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label="b"];
q12 -> q13[label="c"];
q13 -> q14[label="d"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}

\subsection{Minimality of the algorithm}

\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted to a
DAWG automatically using the algorithm we described, because the user
markings are not necessarily a single character or word: a user marking is
essentially one or more characters. When we add a user marking we insert a
character that is not in the alphabet, and later on we replace the marking by
a kind of subgraph. When this is applied, non-determinism can be introduced
into the graph. Non-determinism means that a single node has multiple
outgoing edges with the same label; in practice this means that a word can be
present in the graph via multiple paths. This is shown in
Figure~\ref{nddawg} with the following words: \texttt{ab<1>c},
\texttt{a<1>bbc}. In this graph the word \texttt{abdc} is accepted and the
user pattern \texttt{<1>} is filled with the word \texttt{d}. However, if we
try the word \texttt{abdddbc}, both paths can be chosen. In the first case
the user pattern \texttt{<1>} is filled with \texttt{dddb} and in the second
case with \texttt{bddd}. In such a case we have to choose, hopefully making
the smartest choice.

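One possible strategy for this choice is a backtracking match that always
tries the longest filling for a marker first, which yields \texttt{dddb} for
\texttt{<1>} in the example above. The sketch below assumes the DAWG is
stored as adjacency dictionaries; the node names are invented, the graph is
reconstructed by hand from the two example words, and greedy longest-first is
only one possible notion of the ``smartest choice''.

```python
def fill_markers(graph, finals, node, text):
    """Depth-first match over a token DAWG: a character edge consumes
    exactly one character, a marker edge like '<1>' consumes one or
    more characters, tried longest-first (greedy). Returns a dict of
    marker fillings on success, None on failure."""
    if not text:
        return {} if node in finals else None
    for label, nxt in graph.get(node, {}).items():
        if label.startswith('<'):              # marker edge: try splits
            for i in range(len(text), 0, -1):  # longest filling first
                filled = fill_markers(graph, finals, nxt, text[i:])
                if filled is not None:
                    filled[label] = text[:i]
                    return filled
        elif text[0] == label:                 # ordinary character edge
            filled = fill_markers(graph, finals, nxt, text[1:])
            if filled is not None:
                return filled
    return None


# Hand-built DAWG for the two patterns 'ab<1>c' and 'a<1>bbc';
# the state names q0..q7 are invented for this sketch.
GRAPH = {'q0': {'a': 'q1'},
         'q1': {'b': 'q2', '<1>': 'q5'},
         'q2': {'<1>': 'q3'},
         'q3': {'c': 'q4'},
         'q5': {'b': 'q6'},
         'q6': {'b': 'q7'},
         'q7': {'c': 'q4'}}
FINALS = {'q4'}
```

With this strategy \texttt{abdc} fills \texttt{<1>} with \texttt{d}, and the
ambiguous word \texttt{abdddbc} resolves to the first case of the text,
filling \texttt{<1>} with \texttt{dddb}.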
\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{nddawg.eps}
\caption{Example of non-determinism}
\label{nddawg}
\end{figure}

\subsection{How to choose a path}
