\section{Internals of the crawler generation module}
Data marked by the user is in principle just raw HTML data that contains a
table with a table row for every RSS feed entry. Within the HTML code several
markers are placed to simplify the parsing and to remove the need for an
advanced HTML parser. When the user presses submit, an HTTP POST request is
prepared that sends all the gathered data to the backend for processing. The
data is not sent asynchronously, so the user has to wait for it to be
processed. Because of the synchronous transmission the user is notified
immediately when the crawler has been added successfully. All data preparation
that enables the backend to process the data is done on the client side using
Javascript.

When the backend receives the HTTP POST data it saves all the raw data and the
data from the information fields. The raw data is processed line by line to
extract the entries that contain user markings. These entries are stripped of
all HTML while a data structure is created that contains the locations of the
markings and the original text. All data is stored, including the entries
without user markings, but not all data is processed. This seemingly useless
data is kept because when the crawler needs editing later, the old data can be
used to update the crawler.

When the entries are isolated and processed they are converted to node-lists.
A node-list is a literal list of words, where a word is interpreted in the
broadest sense: a word can be a character, a single byte, a string or
basically anything. In our case the node-lists are the original entry strings
in which all user markers are replaced by single nodes. As an example, take
the following entry and its corresponding node-list, assuming that the user
marked the time, title and date correctly. Markers are visualized by enclosing
the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build the
node-lists. When the words in a node-list are concatenated, the original entry
is reconstructed. The node-list is generated by putting the separate
characters one by one into the list; when a user marking is encountered, the
marking is translated to its category code and that code is added as a single
word. The node-lists are then sent to the actual algorithm to be converted to
a graph representation.

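This conversion can be sketched in Python. The sketch below is a minimal illustration, not the actual module: the helper name \texttt{to\_nodelist} is hypothetical, and we assume the markings appear inline as \texttt{<time>}, \texttt{<date>} and \texttt{<title>}.

```python
import re

# Hypothetical inline marker syntax produced by the frontend.
MARKER = re.compile(r"<(time|date|title)>")

def to_nodelist(entry):
    """Split a marked entry into a node-list: every marker becomes a
    single node, every other character becomes a one-character node."""
    nodes = []
    pos = 0
    for m in MARKER.finditer(entry):
        nodes.extend(entry[pos:m.start()])  # one node per plain character
        nodes.append(m.group(0))            # the whole marker as one node
        pos = m.end()
    nodes.extend(entry[pos:])               # trailing plain characters
    return nodes

print(to_nodelist("<time>, <date> - <title>"))
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
```

Note that concatenating the returned nodes reproduces the marked entry, which is the defining property of a node-list.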
\section{Minimizing DAWGs}
\subsection{Data structure}
We represent the user generated patterns as DAWGs by converting the node-lists
to DAWGs. Normally the edge labels of a DAWG are letters from an alphabet; in
our case the alphabet of the DAWG contains all letters, whitespace and
punctuation, but also the specified user markers, which can be multiple
characters in length.

DAWGs are graphs, but due to their constraints we can check whether a word
matches by checking whether a path exists whose concatenated edge labels form
the word. The first algorithm to generate DAWGs from words was proposed by
Hopcroft et al.\cite{Hopcroft1971}. It is an incremental approach to
generating the graph, meaning that the graph is expanded entry by entry.
Hopcroft's algorithm has the constraint of lexicographical ordering. Later on
Daciuk et al.\cite{Daciuk2000} improved on the original algorithm, and their
algorithm is the one we used to minimize our DAWGs. A minimal graph is a graph
$G$ for which there is no graph $G'$ that has fewer paths and
$\mathcal{L}(G)=\mathcal{L}(G')$, where $\mathcal{L}(G)$ is the set of all
words present in the DAWG.

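The path-existence check can be illustrated with a minimal Python sketch. The nested-dictionary encoding and the state numbering are hypothetical and only serve to show that matching reduces to following edge labels, including multi-character marker labels, from the start state to a final state.

```python
# A DAWG sketched as nested dicts: state -> {edge label: next state}.
# Edge labels may be single characters or multi-character user markers.
dawg = {
    0: {"<time>": 1},
    1: {",": 2},
    2: {" ": 3},
    3: {"<date>": 4},
    4: {},  # no outgoing edges
}
final_states = {4}

def matches(dawg, final_states, nodelist, state=0):
    """A word matches iff a path from the start state spells out the
    node-list and ends in a final state."""
    for label in nodelist:
        if label not in dawg[state]:
            return False
        state = dawg[state][label]
    return state in final_states

print(matches(dawg, final_states, ["<time>", ",", " ", "<date>"]))  # True
print(matches(dawg, final_states, ["<time>", " "]))                 # False
```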
\subsection{Algorithm}
The algorithm for building DAWGs is an iterative process that roughly consists
of three steps. We start with the null graph, which can be described by
$G_0=(\{q_0\},\{q_0\},\{\},\{\})$: it contains no edges, a single node and
$\mathcal{L}(G_0)=\emptyset$. The first word that is added to the graph is
added in a naive way: we simply create a new node for every character
transition and mark the last node as final. From then on all words are added
using a stepwise approach.
\begin{itemize}
\item
Step one is finding the common prefix of the word in the graph and chopping it
off the word. When the suffix that remains is empty and the last state of the
common prefix is a final state, we are done adding the word. When the suffix
is non-empty or the common prefix does not end in a final state, we apply step
two.
\item
We find the first confluence state in the common prefix and from there on we
add the suffix. When there is a confluence state, the route all the way back
to the starting node is cloned to make sure no extra words are added to the
graph unwillingly.
\item
Finally, propagating from the end of the suffix to its beginning, when states
equal to states in the graph are found we merge the state from the suffix
into the equal state in the graph.
\end{itemize}

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

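To complement the pseudocode, the steps can be sketched in Python. The sketch below is a simplified variant in the spirit of Daciuk et al.\cite{Daciuk2000} that assumes lexicographically sorted input, so no confluence-state cloning is needed; it is an illustration under that assumption, not our actual implementation, and all class and method names are hypothetical.

```python
class State:
    """A DAWG state: outgoing edges and a finality flag."""
    def __init__(self):
        self.edges = {}  # edge label -> State
        self.final = False

    def signature(self):
        # Two states are equal when both are (non-)final and have
        # identical outgoing edges leading to identical states.
        return (self.final,
                tuple(sorted((c, id(s)) for c, s in self.edges.items())))


class Dawg:
    """Incremental DAWG construction for lexicographically sorted words."""
    def __init__(self):
        self.root = State()
        self.registry = {}   # signature -> registered (unique) state
        self.unchecked = []  # (parent, label, child) path of the last word
        self.prev = ""

    def insert(self, word):
        assert word > self.prev, "words must be inserted in sorted order"
        # Step one: find the common prefix and chop it off the word.
        cp = 0
        while (cp < len(word) and cp < len(self.prev)
               and word[cp] == self.prev[cp]):
            cp += 1
        # Merge equal states of the previous word beyond the prefix.
        self._minimize(cp)
        node = self.unchecked[-1][2] if self.unchecked else self.root
        # Add the remaining suffix as a fresh path; mark its end final.
        for label in word[cp:]:
            nxt = State()
            node.edges[label] = nxt
            self.unchecked.append((node, label, nxt))
            node = nxt
        node.final = True
        self.prev = word

    def finish(self):
        self._minimize(0)

    def _minimize(self, down_to):
        # Propagate from the end of the suffix towards the beginning,
        # merging states that are equal to already registered states.
        while len(self.unchecked) > down_to:
            parent, label, child = self.unchecked.pop()
            sig = child.signature()
            if sig in self.registry:
                parent.edges[label] = self.registry[sig]
            else:
                self.registry[sig] = child

    def accepts(self, word):
        node = self.root
        for label in word:
            if label not in node.edges:
                return False
            node = node.edges[label]
        return node.final


d = Dawg()
for w in ["abcd", "aecd", "aecf"]:
    d.insert(w)
d.finish()
print([w for w in ["abcd", "aecd", "aecf", "abcf"] if d.accepts(w)])
# ['abcd', 'aecd', 'aecf']  -- abcf is correctly rejected
```

Running the sketch on the entries of the example below yields exactly the three inserted words; in particular \texttt{abcf} is rejected, which is the behaviour the confluence handling guarantees in the general, unsorted case.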
\subsection{Example}
We visualize this with an example shown in the subgraphs in
Figure~\ref{dawg1}, which builds a DAWG from the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}. In SG0 we begin with the null
graph.

Adding the first entry \texttt{abcd} is trivial: it just creates a single path
and does not require any hard work. The result of adding the first word is
visible in subgraph SG1.

For the second entry \texttt{aecd} we have to follow the steps described
earlier. The common prefix is \texttt{a} and the common suffix is \texttt{cd}.
Therefore the node in which the common suffix starts needs to be copied and
split off to make room for an alternative route. This is exactly what happens
in SG2: node $q_2$ is copied into $q_5$, which gets a connection to node
$q_3$. From the last node of the common prefix a route is built towards the
first node of the common suffix, that is the copied node, resulting in the
extra path $q_1\rightarrow q_5\rightarrow q_3$. Since there are no confluence
nodes we can leave the prefix as it is. This results in subgraph SG2.

The last entry to add is the word \texttt{aecf}. When this is done without
checking for confluence nodes, we create unwanted extra paths in the graph.
The common prefix is \texttt{aec} and the common suffix is \texttt{f}.
\texttt{f} can not be merged into existing edges and thus a new edge is
created from the last node of the common prefix to the final node. This
results in subgraph SG3. Note that we did not take care of confluence nodes in
the common prefix, and because of that the word \texttt{abcf} is also accepted
even though it was never added to the graph. This is unwanted behaviour. Going
back to the common prefix \texttt{aec}: when we check for confluence nodes we
see that the last node of the common prefix is such a node and that we need to
clone the path and separate the route to the final node. $q_3$ is cloned into
$q_6$ along the route of the common prefix. Tracking the route back we do not
encounter any other confluence states and therefore we can merge in the
suffix, which results in the final DAWG visible in subgraph SG4. No new words
are added.

\begin{figure}[H]
\caption{Incremental construction of the DAWG for the entries \texttt{abcd},
\texttt{aecd} and \texttt{aecf}}
\label{dawg1}
\centering
\digraph[scale=0.4]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4" shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label="b"];
q42 -> q43[label="c"];
q43 -> q44[label="d"];
q41 -> q45[label="e"];
q45 -> q46[label="c"];
q46 -> q44[label="d"];
q46 -> q44[label="f"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4" shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label="b"];
q32 -> q33[label="c"];
q33 -> q34[label="d"];
q33 -> q34[label="f" constraint=false];
q31 -> q35[label="e"];
q35 -> q33[label="c"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4" shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label="b"];
q22 -> q23[label="c"];
q23 -> q24[label="d"];
q21 -> q25[label="e"];
q25 -> q23[label="c"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4" shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label="b"];
q12 -> q13[label="c"];
q13 -> q14[label="d"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}