\section{Application overview}
The backend consists of several processing steps before a crawler specification
is ready.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{backend.eps}
\strut\\
\caption{Backend overview}
\label{backendoverview}
\end{figure}

\section{Internals of the crawler generation module}
The data marked by the user is in principle just raw HTML containing a table
with a table row for every RSS feed entry. Within the HTML code several
markers are placed to make the parsing easier and to remove the need for an
advanced HTML parser. When the user presses submit, an HTTP POST request is
prepared that sends all the gathered data to the backend to be processed. The
sending does not happen asynchronously, so the user has to wait for the data
to be processed. Because of the synchronous data transmission the user is
notified immediately when the crawler is successfully added. All data
preparation that enables the backend to process the data is done on the
client side using Javascript.

When the backend receives the HTTP POST data it saves all the raw data and
the data from the information fields. The raw data is processed line by line
to extract the entries that contain user markings. These entries are stripped
of all HTML while a data structure is created that contains the locations of
the markings and the original text. All data is stored, including the entries
without user markings, but not all of it is processed. This at first sight
useless data is stored because when the crawler later needs editing, the old
data can be used to update the crawler.
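
As an illustration, a minimal sketch in Python of this extraction step could
look as follows; the marker tokens and all names are assumptions for the sake
of the example and do not correspond to the actual marker format used by the
frontend.
\begin{verbatim}
import re

MARKERS = ('<time>', '<date>', '<title>')   # assumed marker tokens
TAG_RE = re.compile(r'<[^>]*>')             # any HTML-like tag

def process_raw_data(raw_lines):
    """Store every line, build a marker structure for the marked ones."""
    stored, marked = [], []
    for line in raw_lines:
        stored.append(line)                 # kept for later crawler edits
        if any(m in line for m in MARKERS):
            # remove all tags except the marker tokens themselves
            text = TAG_RE.sub(
                lambda t: t.group(0) if t.group(0) in MARKERS else '', line)
            positions = [(m, text.index(m)) for m in MARKERS if m in text]
            marked.append({'original': line, 'text': text,
                           'markers': positions})
    return stored, marked
\end{verbatim}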

When the entries are isolated and processed they are converted to node-lists.
A node-list is literally a list of words, where a word is interpreted in the
broadest sense, meaning that a word can be a character, a single byte, a
string or basically anything. In this case the node-lists are the original
entry strings in which all the user markers are replaced by single nodes. As
an example, take the following entry and its corresponding node-list, assuming
that the user marked the time, title and date correctly. Markers are
visualized by enclosing the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build these
node-lists: lists of words that, when concatenated, form the original entry.
As noted, a word is not a word in the linguistic sense; it can be a single
character or a category. The node-list is generated by putting the separate
characters one by one into the list and, whenever a user marking is
encountered, translating that marking to its category code and adding the
code as a single word. The node-lists are then sent to the actual algorithm
to be converted to a graph representation.
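
A sketch of the conversion itself, using the same assumed marker tokens as
category codes, could look as follows:
\begin{verbatim}
import re

MARKER_RE = re.compile(r'<(?:time|date|title)>')   # assumed category codes

def to_node_list(text):
    """Split an entry into single characters, keeping markers as one node."""
    nodes, pos = [], 0
    for m in MARKER_RE.finditer(text):
        nodes.extend(text[pos:m.start()])          # characters become nodes
        nodes.append(m.group(0))                   # a marker becomes one node
        pos = m.end()
    nodes.extend(text[pos:])
    return nodes

# to_node_list('<time>, <date> - <title>') yields
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
\end{verbatim}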

\section{Minimizing DAWGs}
\subsection{Datastructure}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally the edge labels of a DAWG are letters from an
alphabet; in our case the alphabet of the DAWG contains all letters,
whitespace and punctuation, but also the specified user markers, which can be
multiple characters in length.

DAWGs are graphs, but due to the constraints we can check whether a match
occurs by checking whether there is a path whose concatenated edge labels
form the word. The first algorithm to generate DAWGs from words was proposed
by Hopcroft et al.\cite{Hopcroft1971}. It is an incremental approach to
generating the graph, meaning that the graph is expanded entry by entry.
Hopcroft's algorithm has the constraint of lexicographical ordering. Later on
Daciuk et al.\cite{Daciuk2000} improved on the original algorithm, and their
algorithm is the one we used to minimize our DAWGs. A minimal graph is a
graph $G$ for which there is no graph $G'$ that has fewer paths and
$\mathcal{L}(G)=\mathcal{L}(G')$, where $\mathcal{L}(G)$ is the set of all
words present in the DAWG.
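
To make the datastructure concrete, a minimal sketch in Python of such a
graph could look as follows; a state is identified by its outgoing edges and
a finality flag, and a match is found by following edge labels. The names are
illustrative and not taken from our implementation.
\begin{verbatim}
class Node(object):
    def __init__(self):
        self.edges = {}      # edge label (character or marker) -> Node
        self.final = False

def accepts(root, node_list):
    """A node-list is in L(G) iff its labels form a path to a final state."""
    state = root
    for word in node_list:
        if word not in state.edges:
            return False
        state = state.edges[word]
    return state.final
\end{verbatim}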

\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that consists of
roughly three steps. We start with the null graph, which can be described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$, contains one node and no edges, and has
$\mathcal{L}(G_0)=\emptyset$. The first word added to the graph is added in a
naive way: we just create a new node for every character transition and mark
the last node as final. From then on all words are added using a stepwise
approach.
\begin{itemize}
\item
Step one is finding the common prefix of the word that is already present in
the graph and chopping it off the word. When the remaining suffix is empty
and the last state of the common prefix is a final state, we are done adding
the word. When the suffix is non-empty or the common prefix does not end in a
final state, we apply step two.
\item
We find the first confluence state in the common prefix and from there on we
add the suffix. When there is a confluence state, the route all the way back
to the starting node is cloned to make sure no extra words are added to the
graph unintentionally.
\item
Finally, if, propagating from the end of the suffix to the beginning, there
are states equal to states already in the graph, we merge the suffix state
into the equal state in the graph.
\end{itemize}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.
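
As an illustration of the incremental idea in code, the following is a
minimal Python sketch of the simpler variant that assumes the words are added
in lexicographical order, so that no confluence states need to be cloned; the
algorithm of Daciuk et al.\ we actually use also handles unsorted input, and
the class and method names below are illustrative, not taken from our
implementation.
\begin{verbatim}
class Node(object):
    def __init__(self):
        self.edges = {}      # edge label (character or marker) -> Node
        self.final = False

    def signature(self):
        # two states are equal iff their finality and outgoing edges agree
        return (self.final,) + tuple(
            sorted((label, id(node)) for label, node in self.edges.items()))

class Dawg(object):
    def __init__(self):
        self.root = Node()
        self.registry = {}   # signature -> registered (unique) state
        self.unchecked = []  # (parent, label, child) states not yet minimized
        self.last_word = []

    def add_word(self, word):
        # Step 1: find the common prefix with the previously added word.
        prefix = 0
        while (prefix < len(word) and prefix < len(self.last_word)
               and word[prefix] == self.last_word[prefix]):
            prefix += 1
        # Step 3 (applied to the previous word): merge equal states, starting
        # from the end of its suffix.
        self._minimize(prefix)
        # Step 2: append the remaining suffix as a naive chain of new nodes.
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for label in word[prefix:]:
            child = Node()
            node.edges[label] = child
            self.unchecked.append((node, label, child))
            node = child
        node.final = True
        self.last_word = word

    def finish(self):
        # Minimize whatever is left of the last word.
        self._minimize(0)

    def _minimize(self, down_to):
        while len(self.unchecked) > down_to:
            parent, label, child = self.unchecked.pop()
            existing = self.registry.get(child.signature())
            if existing is not None:
                parent.edges[label] = existing   # merge with the equal state
            else:
                self.registry[child.signature()] = child
\end{verbatim}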

\subsection{Example}
We visualize this with an example, shown in the subgraphs of
Figure~\ref{dawg1}, that builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial: it just creates a single path
and does not require any hard work. The result of adding the first word is
visible in subgraph SG1.

For the second entry, \texttt{aecd}, we have to follow the steps described
earlier. The common prefix is \texttt{a} and the common suffix is \texttt{cd}.
Therefore the first node of the common suffix needs to be copied and split
off to make room for an alternative route. This is exactly what happens in
SG2: node $q_2$ is copied into $q_5$, which gets a connection to node $q_3$.
From the last node of the common prefix, $q_1$, a route is built towards this
copied node, resulting in the extra path
$q_1\rightarrow q_5\rightarrow q_3$. Since there are no confluence nodes we
can leave the prefix as it is. This results in subgraph SG2.

The last entry to add is the word \texttt{aecf}. When this is done without
checking for confluence nodes we create extra, unwanted paths in the graph.
The common prefix is \texttt{aec} and the remaining suffix is \texttt{f}.
\texttt{f} can not be merged into existing edges and thus a new edge is
created from the last node of the common prefix to the final node. This
results in subgraph SG3. Note that we did not take care of confluence nodes
in the common prefix and because of that the word \texttt{abcf} is also
accepted even though it was never added to the graph. This is unwanted
behaviour. Going back to the common prefix \texttt{aec}: when we check for
confluence nodes we see that the last node of the common prefix, $q_3$, is
such a node and that we need to clone it and separate the route to the final
node. $q_3$ is cloned into $q_6$, which takes over the route of the common
prefix. Tracking the route back we do not encounter any other confluence
states and therefore we can merge in the suffix, which results in the final
DAWG visible in subgraph SG4. No unintended words are added.
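
Using the illustrative \texttt{Dawg} and \texttt{accepts} sketches from the
previous subsections, this example can be replayed and the absence of
unintended words can be checked:
\begin{verbatim}
dawg = Dawg()
for word in ['abcd', 'aecd', 'aecf']:    # added in lexicographical order
    dawg.add_word(word)
dawg.finish()

print(accepts(dawg.root, 'abcd'))        # True
print(accepts(dawg.root, 'aecf'))        # True
print(accepts(dawg.root, 'abcf'))        # False, no unintended words
\end{verbatim}
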
\begin{figure}[H]
\caption{Incremental construction of the DAWG for the words \texttt{abcd},
\texttt{aecd} and \texttt{aecf}}
\label{dawg1}
\centering
\digraph[scale=0.4]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4" shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label="b"];
q42 -> q43[label="c"];
q43 -> q44[label="d"];
q41 -> q45[label="e"];
q45 -> q46[label="c"];
q46 -> q44[label="d"];
q46 -> q44[label="f"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4" shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label="b"];
q32 -> q33[label="c"];
q33 -> q34[label="d"];
q33 -> q34[label="f" constraint=false];
q31 -> q35[label="e"];
q35 -> q33[label="c"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4" shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label="b"];
q22 -> q23[label="c"];
q23 -> q24[label="d"];
q21 -> q25[label="e"];
q25 -> q23[label="c"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4" shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label="b"];
q12 -> q13[label="c"];
q13 -> q14[label="d"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}

\subsection{Application to pattern extraction}
The algorithm for minimizing DAWGs can not be applied directly to the user
generated patterns. This is mainly because there is no conclusive information
about the contents of a pattern, so non-deterministic choices have to be
made. Because of this there can be multiple matches and the best match has to
be determined. There are several possible criteria for determining the best
match. The method we use is ... TODO