update
[bsc-thesis1415.git] / thesis2 / 3.methods.tex
1 \section{Application overview}
2 The backend consists of several processing steps before a crawler specification
3 is ready.
4
5 \begin{figure}[H]
6 \label{dawg1}
7 \centering
8 \includegraphics[width=\linewidth]{backend.eps}
9 \strut\\
10 \caption{Main module internals}
11 \end{figure}
12
13 \section{HTML data}
14 Data marked by the user will be returned as a \texttt{POST} http request and is
15 basically the information in the description fields accompanied by the raw html
16 of the table with the patterns. For every RSS feed entry there is a row in the
17 table and every marker is a \texttt{SPAN} element containing the color of the
18 text. Within the HTML code of the table markers are placed to make parsing more
19 easy and remove the need of an advanced HTML parser. The \texttt{POST} request
20 that will be send when the user presses the submit button will not be sent
21 asynchronously, the user has to wait for the processing to finish. Because of
22 this the user will be immediately notified when the processing has finished
23 successfully. The preparation of the data for the \texttt{POST} request is all
24 done in Javascript on the user side, the processing of the data in the backend
25 is all done server side.
26
27 \section{Table rows}
28 When the backend receives the HTTP POST data it saves all the table HTML data and the
29 data from the description fields. The program first extracts the table rows
30 using simple pattern matching on the added markers. The table rows are
31 processed one at the time, the processing mainly consists of finding the
32 \texttt{SPAN} elements, extract the color, remove the elements and save the
33 plain text and the location and type of the marking. When the table rows are
34 processed a datastructure with all the entries accompanied by their markers
35 will go to the next step. All intermediate data will be saved for later use.
36
37 \section{Node lists}
38 Every entry from the datastructure is processed to convert them to node-lists.
39 A node-list can be seen as a path graph of every character and marking. A path
40 graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, n_i\}$
41 and $E=\{(n_1, n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. A path graph is
42 basically a graph that is a path of nodes where every node is connected to the
43 next on and the last on being final. The transitions between two nodes is
44 either a character or a marking. As an example we take the entry
45 \texttt{19:00, 2014-11-12 - Foobar} and create the corresponding node-lists and
46 make it visible in Figure~\ref{nodelistexample}. Characters are denoted with
47 single quotes, spaces with an underscore and markers with angle brackets.
48 \begin{figure}[H]
49 \label{nodelistexample}
50 \centering
51 \includegraphics[width=\linewidth]{nodelistexample.eps}
52 \caption{Node list example}
53 \end{figure}
54
55
56 \section{DAWGs}
57 \subsection{Terminology}
58 \textbf{Parent nodes} are nodes that have an arrow to the child.\\
59 \textbf{Confluence nodes} are nodes that have multiple parents.
60
61 \subsection{Datastructure}
62 We represent the user generated patterns as DAWGs by converting the node-lists
63 to DAWGS. Normally DAWGs have single letters from an alphabet as edgelabel but
64 in our implementation the DAWGs alphabet contains all letters, whitespace and
65 punctuation but also the specified user markers which can be multiple
66 characters in length.
67
68 DAWGs are graphs but due to the constraints we can use the DAWG to check if a
69 match occurs by checking if a path exists that creates the word by
70 concatenating all the edge labels. The first algorithm to generate DAWGs from
71 words was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an incremental
72 approach in generating the graph. Meaning that entry by entry the graph will
73 be expanded. Hopcrofts algorithm has the constraint of lexicographical
74 ordering. Later on Daciuk et al.\cite{Daciuk2000} improved on the original
75 algorithm and their algorithm is the algorithm we used to minimize our DAWGs. A
76 minimal graph is a graph $G$ for which there is no graph $G'$ that has less
77 paths and $\mathcal{L}(G)=\mathcal{L }(G')$ where $\mathcal{L}(G)$ is the set
78 of all words present in the DAWG.
79
80 \subsection{Algorithm}
81 The algorithm of building DAWGs is an iterative process that goes roughly in
82 three steps. We start with the null graph that can be described by
83 $G_0=(\{q_0\},\{q_0\},\{\}\{\})$ and does not contain any edges, one node and
84 $\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph will be
85 added in a naive way. We just create a new node for every transition of
86 character and we mark the last node as final. From then on all words are added
87 using a stepwise approach.
88 \begin{enumerate}
89 \item
90 Step one is finding the common prefix of the word in the graph and we chop
91 it of the word. When the suffix that remains is empty and the last state of
92 the common prefix is a final state we are done adding the word. When the
93 suffix is non empty or the common prefix does not end in a final state we
94 apply step two.
95 \item
96 We find the first confluence state in the common prefix and from there on
97 we add the suffix. When there is a confluence state the route all the way
98 back to the starting node will be cloned to make sure no extra words are
99 added to the graph unwillingly.
100 \item
101 Finally if in the suffix propagating from the end to the beginning there
102 are equal states we merge the state from the suffix to the equal state in
103 the graph.
104 \end{enumerate}
105
106 Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
107
108 \subsection{Example}
109 We visualize this with an example shown in the {Subgraphs in
110 Figure}~\ref{dawg1} that builds a DAWG with the following entries:
111 \texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
112 graph.
113
114 Adding the first entry \texttt{abcd} is trivial and just creates a single path
115 and does not require any hard work. The result of adding the first word is
116 visible in subgraph SG1.
117
118 For the second entry \texttt{aecd} we will have to follow the steps described
119 earlier. The common prefix is \texttt{a} and the common suffix is \texttt{bc}.
120 Therefore the first node in which the common prefix starts needs to be copied
121 and split off to make room for an alternative route. This is exactly what
122 happens in SG2. Node $q_2$ is copied and gets a connection to node $q_3$. From
123 the last node of the common prefix a route is built towards the first node,
124 that is the copied node, of the common suffix resulting in an extra path
125 $q_1\rightarrow q_5\rightarrow q_3$. Since there are no confluence nodes we can
126 leave the prefix as it is. This results in subgraph SG2.
127
128 The last entry to add is the word \texttt{aecf}. When this is done without
129 confluence node checking we create extra paths in the graph that are unwanted.
130 The common prefix is \texttt{aec} and the common suffix is \texttt{f}.
131 \texttt{f} can not be merged into existing edges and thus a new edge will be
132 created from the end of the common suffix to the end. This results in subgraph
133 SG3. Note that we did not take care of confluence nodes in the common prefix
134 and because of that the word \texttt{abcf} is also accepted even when it was
135 not added to the graph. This is unwanted behaviour. Going back to the common
136 prefix \texttt{aec}. When we check for confluence nodes we see that the last
137 node of the common prefix is such a node and that we need to clone the path
138 and separate the route to the final node. $q_3$ will be cloned into $q_6$ with
139 the route of the common prefix. Tracking the route back we do not encounter any
140 other confluence states and therefore we can merge in the suffix which results
141 in the final DAWG visible in subgraph SG4. No new words are added.
142
143 \begin{figure}[H]
144 \caption{Step 1}
145 \label{dawg1}
146 \centering
147 \digraph[scale=0.4]{inccons}{
148 rankdir=LR;
149 n4 [style=invis];
150 q40 [label="q0"];
151 q41 [label="q1"];
152 q42 [label="q2"];
153 q43 [label="q3"];
154 q44 [label="q4" shape=doublecircle];
155 q45 [label="q5"];
156 q46 [label="q6"];
157 n4 -> q40[label="SG4"];
158 q40 -> q41[label="a"];
159 q41 -> q42[label="b"];
160 q42 -> q43[label="c"];
161 q43 -> q44[label="d"];
162 q41 -> q45[label="e"];
163 q45 -> q46[label="c"];
164 q46 -> q44[label="d"];
165 q46 -> q44[label="f"];
166
167 n3 [style=invis];
168 q30 [label="q0"];
169 q31 [label="q1"];
170 q32 [label="q2"];
171 q33 [label="q3"];
172 q34 [label="q4"ishape=doublecircle];
173 q35 [label="q5"];
174 n3 -> q30[label="SG3"];
175 q30 -> q31[label="a"];
176 q31 -> q32[label="b"];
177 q32 -> q33[label="c"];
178 q33 -> q34[label="d"];
179 q33 -> q34[label="f"iconstraint=false];
180 q31 -> q35[label="e"];
181 q35 -> q33[label="c"];
182
183 n2 [style=invis];
184 q20 [label="q0"];
185 q21 [label="q1"];
186 q22 [label="q2"];
187 q23 [label="q3"];
188 q24 [label="q4"ishape=doublecircle];
189 q25 [label="q5"];
190 n2 -> q20[label="SG2"];
191 q20 -> q21[label="a"];
192 q21 -> q22[label="b"];
193 q22 -> q23[label="c"];
194 q23 -> q24[label="d"];
195 q21 -> q25[label="e"];
196 q25 -> q23[label="c"];
197
198 n1 [style=invis];
199 q10 [label="q0"];
200 q11 [label="q1"];
201 q12 [label="q2"];
202 q13 [label="q3"];
203 q14 [label="q4"ishape=doublecircle];
204 n1 -> q10[label="SG1"];
205 q10 -> q11[label="a"];
206 q11 -> q12[label="b"];
207 q12 -> q13[label="c"];
208 q13 -> q14[label="d"];
209
210 n [style=invis];
211 q0 [label="q0"];
212 n -> q0[label="SG0"];
213 }
214 \end{figure}
215
216 \subsection{Appliance on extraction of patterns}
217 The algorithm for minimizing DAWGs can not be applied directly on the user
218 generated pattern. This is mainly because there is no conclusive information
219 about the contents of the pattern and therefore non deterministic choices have
220 to be made. Because of this there can be multiple matches and the best match
221 has to be determined. There are several possible criteria for determining the
222 best match. The method we use is ... TODO