c
[bsc-thesis1415.git] / thesis2 / 3.methods.tex
1 \section{Application overview}
2 The backend consists of several processing steps before a crawler specification
3 is ready.
4
5 \begin{figure}[H]
6 \label{dawg1}
7 \centering
8 \includegraphics[width=\linewidth]{backend.eps}
9 \strut\\
10 \caption{Main module internals}
11 \end{figure}
12
13 \section{HTML data}
14 Data marked by the user will be returned as a \texttt{POST} http request and is
15 basically the information in the description fields accompanied by the raw html
16 of the table with the patterns. For every RSS feed entry there is a row in the
17 table and every marker is a \texttt{SPAN} element containing the color of the
18 text. Within the HTML code of the table markers are placed to make parsing more
19 easy and remove the need of an advanced HTML parser. The \texttt{POST} request
20 that will be send when the user presses the submit button will not be sent
21 asynchronously, the user has to wait for the processing to finish. Because of
22 this the user will be immediately notified when the processing has finished
23 successfully. The preparation of the data for the \texttt{POST} request is all
24 done in Javascript on the user side, the processing of the data in the backend
25 is all done server side.
26
27 \section{Table rows}
28 When the backend receives the HTTP POST data it saves all the table HTML data
29 and the data from the description fields. The program first extracts the table
30 rows using simple pattern matching on the added markers. The table rows are
31 processed one at the time, the processing mainly consists of finding the
32 \texttt{SPAN} elements, extract the color, remove the elements and save the
33 plain text and the location and type of the marking. When the table rows are
34 processed a datastructure with all the entries accompanied by their markers
35 will go to the next step. All intermediate data will be saved for later use.
36
37 \section{Node lists}
38 Every entry from the datastructure is processed to convert them to node-lists.
39 A node-list can be seen as a path graph of every character and marking. A path
40 graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1},
41 n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. A path graph
42 is basically a graph that is a path of nodes where every node is connected to
43 the next on and the last on being final. The transitions between two nodes is
44 either a character or a marking. As an example we take the entry \texttt{19:00,
45 2014-11-12 - Foobar} and create the corresponding node-lists and make it
46 visible in Figure~\ref{nodelistexample}. Characters are denoted with single
47 quotes, spaces with an underscore and markers with angle brackets.
48 \begin{figure}[H]
49 \label{nodelistexample}
50 \centering
51 \includegraphics[width=\linewidth]{nodelistexample.eps}
52 \caption{Node list example}
53 \end{figure}
54
55
56 \section{DAWGs}
57 \subsection{Terminology}
58 \textbf{Parent nodes} are nodes that have an arrow to the child.\\
59 \textbf{Confluence nodes} are nodes that have multiple parents.
60
61 \subsection{Datastructure}
62 We represent the user generated patterns as DAWGs by converting the node-lists
63 to DAWGS. Normally DAWGs have single letters from an alphabet as edgelabel but
64 in our implementation the DAWGs alphabet contains all letters, whitespace and
65 punctuation but also the specified user markers which can be multiple
66 characters in length.
67
68 DAWGs are graphs but due to the constraints we can use the DAWG to check if a
69 match occurs by checking if a path exists that creates the word by
70 concatenating all the edge labels. The first algorithm to generate DAWGs from
71 words was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an incremental
72 approach in generating the graph. Meaning that entry by entry the graph will
73 be expanded. Hopcrofts algorithm has the constraint of lexicographical
74 ordering. Later on Daciuk et al.\cite{Daciuk2000} improved on the original
75 algorithm and their algorithm is the algorithm we used to minimize our DAWGs. A
76 minimal graph is a graph $G$ for which there is no graph $G'$ that has less
77 paths and $\mathcal{L}(G)=\mathcal{L }(G')$ where $\mathcal{L}(G)$ is the set
78 of all words present in the DAWG.
79
80 \subsection{Algorithm}
81 The algorithm of building DAWGs is an iterative process that goes roughly in
82 two steps. We start with the null graph that can be described by
83 $G_0=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any edges, one node and
84 $\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph will be
85 added in a naive way. We just create a new node for every transition of
86 character and we mark the last node as final. From then on all words are added
87 using a stepwise approach.
88 \begin{enumerate}
89 \item
90 Say we add word $w$ to the grahp. Step one is finding the common prefix of
91 the word already in the graph. The common prefix is defined as the longest
92 subword $w'$ for which there is a $\delta^*(q_0, w')$. When the common
93 prefix is found we change the starting state to the last state of the
94 common prefix and remain with suffix $w''$ where $w=w'w''$.
95 \item
96 We add the suffix to the graph starting from the last node of the common
97 prefix.
98 \item
99 When there are confluence nodes present within the common prefix they are
100 cloned starting from the last confluence node to avoid adding unwanted
101 words.
102 \item
103 From the last node of the suffix up to the first node we replace the nodes
104 by checking if there are equivalent nodes present in the graph that we can
105 merge with.
106 \end{enumerate}
107
108 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
109
110 \subsection{Example}
111 We visualize this with an example shown in the {Subgraphs in
112 Figure}~\ref{dawg1} that builds a DAWG with the following entries:
113 \texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
114 graph.
115
116 Adding the first entry \texttt{abcd} is trivial because just creates a single
117 path and does not require any hard work. Because the common prefix we find in
118 Step 1 is empty the suffix will be the entire word. There is also no
119 possibility of merging nodes because there were no nodes to begin with. The
120 result of adding the first word is visible in subgraph SG1.
121
122 For the second entry \texttt{aecd} we will have to do some extra work.
123 The common prefix found in Step 1 is \texttt{a} which we add to the graph. This
124 leaves us in Step 2 with the suffix \texttt{ecd} which we add too. In Step 3 we
125 see that there are no confluence nodes in our common prefix and therefore we
126 do not have to clone nodes. In Step 4 we traverse from the last node back to
127 the beginning of the suffix and find a common suffix \texttt{cd} and we can
128 merge these nodes. In this way we can reuse the transition from $q_3$ to $q_4$.
129 This leaves us with subgraph SG2.
130
131 We now add the last entry which is the word \texttt{aecf}. When we do this
132 without the confluence node checking we encounter an unwanted extra word. In
133 Step 1 we find the common prefix \texttt{aec} and that leaves us with the
134 suffix \texttt{f} which we will add. This creates subgraph SG3 and we notice
135 there is an extra word present in the graph, namely the word \texttt{abcf}.
136 This extra word appeared because of the confluence node $q_3$ which was present
137 in the common prefix introducing an unwanted path. Therefore in Step 3 when we
138 check for confluence node we would see that node $q_3$ is a confluence node and
139 needs to be cloned off the original path, by cloning we take the added suffix
140 with it to create an entirely new route. Tracking the route back again we do
141 not encounter any other confluence nodes so we can start Step 4. For this word
142 there is no common suffix and therefore no merging will be applied. This
143 results in subgraph SG4 which is the final DAWG containing only the words
144 added.
145
146 \begin{figure}[H]
147 \caption{Step 1}
148 \label{dawg1}
149 \centering
150 \digraph[scale=0.4]{inccons}{
151 rankdir=LR;
152 n4 [style=invis];
153 q40 [label="q0"];
154 q41 [label="q1"];
155 q42 [label="q2"];
156 q43 [label="q3"];
157 q44 [label="q4" shape=doublecircle];
158 q45 [label="q5"];
159 q46 [label="q6"];
160 n4 -> q40[label="SG4"];
161 q40 -> q41[label="a"];
162 q41 -> q42[label="b"];
163 q42 -> q43[label="c"];
164 q43 -> q44[label="d"];
165 q41 -> q45[label="e"];
166 q45 -> q46[label="c"];
167 q46 -> q44[label="d"];
168 q46 -> q44[label="f"];
169
170 n3 [style=invis];
171 q30 [label="q0"];
172 q31 [label="q1"];
173 q32 [label="q2"];
174 q33 [label="q3"];
175 q34 [label="q4" shape=doublecircle];
176 q35 [label="q5"];
177 n3 -> q30[label="SG3"];
178 q30 -> q31[label="a"];
179 q31 -> q32[label="b"];
180 q32 -> q33[label="c"];
181 q33 -> q34[label="d"];
182 q33 -> q34[label="f" constraint=false];
183 q31 -> q35[label="e"];
184 q35 -> q33[label="c"];
185
186 n2 [style=invis];
187 q20 [label="q0"];
188 q21 [label="q1"];
189 q22 [label="q2"];
190 q23 [label="q3"];
191 q24 [label="q4" shape=doublecircle];
192 q25 [label="q5"];
193 n2 -> q20[label="SG2"];
194 q20 -> q21[label="a"];
195 q21 -> q22[label="b"];
196 q22 -> q23[label="c"];
197 q23 -> q24[label="d"];
198 q21 -> q25[label="e"];
199 q25 -> q23[label="c"];
200
201 n1 [style=invis];
202 q10 [label="q0"];
203 q11 [label="q1"];
204 q12 [label="q2"];
205 q13 [label="q3"];
206 q14 [label="q4" shape=doublecircle];
207 n1 -> q10[label="SG1"];
208 q10 -> q11[label="a"];
209 q11 -> q12[label="b"];
210 q12 -> q13[label="c"];
211 q13 -> q14[label="d"];
212
213 n [style=invis];
214 q0 [label="q0"];
215 n -> q0[label="SG0"];
216 }
217 \end{figure}
218
219 \subsection{Appliance on extraction of patterns}
220 The text data in combination with the user markings are not just plain text and
221 thus we can not create a DAWG out of them without a few adaptations to the
222 structure.
223
224 Adaption for our data
225
226 Non determinism
227
228 How to get best match
229
230 The algorithm for minimizing DAWGs can not be applied directly on the user
231 generated pattern. This is mainly because there is no conclusive information
232 about the contents of the pattern and therefore non deterministic choices have
233 to be made. Because of this there can be multiple matches and the best match
234 has to be determined. There are several possible criteria for determining the
235 best match. The method we use is ... TODO