\section{Internals of the crawler generation module}
The data marked by the user is in essence just raw HTML that contains a table
with one table row per RSS feed entry. Within the HTML several markers are
placed to make the parsing easier and to remove the need for an advanced HTML
parser. When the user presses submit, an HTTP POST request is prepared that
sends all the gathered data to the backend to be processed. The request is
not sent asynchronously, so the user has to wait for the data to be
processed. Because of the synchronous transmission the user is notified
immediately when the crawler has been added successfully. The preparation of
the data that enables the backend to process it is all done on the client
side using Javascript.

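As an indication, the POST body could look roughly like the sketch below; the
field names are hypothetical and not taken from the actual application.
\begin{verbatim}
import urllib.parse

# Hypothetical field names; the real frontend may use different ones.
post_body = urllib.parse.urlencode({
    'name':    'Example venue feed',      # information fields
    'url':     'http://example.org/feed.rss',
    'rawdata': '<tr><td>...marked html of one entry...</td></tr>',
})
\end{verbatim}
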
When the backend receives the HTTP POST data it saves all the raw data and
the data from the information fields. The raw data is processed line by line
to extract the entries that contain user markings. The entries containing
user markings are stripped of all HTML while a data structure is created that
contains the locations of the markings and the original text. All data is
stored, including the entries without user markings, but not all data is
processed. This at first sight useless data is stored because, when the
crawler later needs editing, the old data can be used to update the crawler.

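The following Python sketch illustrates this step under some assumptions: the
marker syntax (\texttt{\{\%name\%\}} pseudo-tags) and the function name are
invented here for illustration, since the exact format the frontend uses is
not essential to the idea.
\begin{verbatim}
import re

# Assumed marker syntax: the frontend wraps each marked field in
# {%name%}...{%name%} pseudo-tags; the real format may differ.
MARKER  = re.compile(r'\{%(\w+)%\}(.*?)\{%\1%\}')
HTMLTAG = re.compile(r'<[^>]+>')   # crude tag stripper, no full parser

def process_entry(raw):
    """Strip the HTML from one entry and record the marking locations."""
    text = HTMLTAG.sub('', raw)
    clean, markings, pos, out = [], [], 0, 0
    for m in MARKER.finditer(text):
        clean.append(text[pos:m.start()])
        out += m.start() - pos
        value = m.group(2)
        markings.append((m.group(1), out, out + len(value)))
        clean.append(value)
        out += len(value)
        pos = m.end()
    clean.append(text[pos:])
    return ''.join(clean), markings

entry = ('<td>{%time%}19:00{%time%}, {%date%}2014-11-12{%date%}'
         ' - {%title%}Foobar{%title%}</td>')
print(process_entry(entry))
# ('19:00, 2014-11-12 - Foobar',
#  [('time', 0, 5), ('date', 7, 17), ('title', 20, 26)])
\end{verbatim}
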
When the entries have been isolated and processed they are converted to
node-lists. A node-list is a literal list of words, where a word is
interpreted in the broadest sense, meaning that a word can be a character, a
single byte, a string or basically anything. In this case the node-lists are
the original entry string in which all user markers are replaced by single
nodes. As an example, take the following entry and its corresponding
node-list, assuming that the user marked the time, date and title correctly.
Markers are visualized by enclosing the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build these
node-lists. The node-list is generated by putting the separate characters of
the entry one by one into the list, so that the words, when concatenated,
form the original entry; when a user marking is encountered, the marking is
translated to its category code and that code is added as a single word. The
node-lists are then sent to the actual algorithm to be converted to a graph
representation.

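A minimal sketch of this conversion, reusing the \texttt{(name, start, end)}
markings produced above; the \texttt{<name>} nodes stand in for the real
category codes, which may differ.
\begin{verbatim}
def to_nodelist(text, markings):
    """Turn a cleaned entry plus its markings into a node-list:
    one node per plain character, one '<name>' node per marking."""
    nodes, i = [], 0
    for name, start, end in sorted(markings, key=lambda m: m[1]):
        nodes.extend(text[i:start])   # single-character words
        nodes.append('<%s>' % name)   # the category code as one word
        i = end
    nodes.extend(text[i:])
    return nodes

print(to_nodelist('19:00, 2014-11-12 - Foobar',
                  [('time', 0, 5), ('date', 7, 17), ('title', 20, 26)]))
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
\end{verbatim}
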
\section{Minimizing DAWGs}
We represent the user generated patterns as DAWGs by converting the
node-lists to DAWGs. Normally the edge labels of a DAWG are letters from an
alphabet; in our case the alphabet contains all letters, whitespace and
punctuation, but also the specified user markers. A DAWG is a graph, but by
using it as an automaton we can check whether a word is accepted by the
automaton, or in other words, whether the word matches the specified pattern.
The first algorithm to generate such DAWGs from word lists was proposed by
Hopcroft et al.~\cite{Hopcroft1971}. It is an incremental approach to
generating the graph, meaning that the graph is built entry by entry. The
only constraint the algorithm has is that the entries must be sorted
lexicographically. Later on Daciuk et al.~\cite{Daciuk2000} improved on the
original algorithm, and their algorithm is the one we used to minimize our
DAWGs.

Pseudocode for the algorithm can be found in Listing~\ref{pseudodawg}.

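For illustration only, the following Python sketch implements an incremental
construction of this kind for lexicographically sorted input; the class and
method names are our own invention, and the pseudocode in
Listing~\ref{pseudodawg} remains the authoritative description of the
algorithm we used. A register of node signatures ensures that equivalent
suffixes are shared.
\begin{verbatim}
class DawgNode(object):
    next_id = 0
    def __init__(self):
        self.id = DawgNode.next_id
        DawgNode.next_id += 1
        self.final = False
        self.edges = {}
    def signature(self):
        # Identifies a node by its finality and outgoing edges; valid
        # because children are minimized before their parents.
        return (self.final,
                tuple(sorted((label, node.id)
                             for label, node in self.edges.items())))

class Dawg(object):
    def __init__(self):
        self.previous = ''
        self.root = DawgNode()
        self.unchecked = []   # (parent, label, child) not yet minimized
        self.register = {}    # signature -> canonical node

    def insert(self, word):
        assert word >= self.previous, 'entries must be sorted'
        # Reuse the common prefix with the previously inserted entry.
        common = 0
        while (common < len(word) and common < len(self.previous)
               and word[common] == self.previous[common]):
            common += 1
        self._minimize(common)
        node = self.unchecked[-1][2] if self.unchecked else self.root
        for symbol in word[common:]:
            child = DawgNode()
            node.edges[symbol] = child
            self.unchecked.append((node, symbol, child))
            node = child
        node.final = True
        self.previous = word

    def finish(self):
        self._minimize(0)

    def _minimize(self, down_to):
        # Replace unchecked nodes by registered equivalents, deepest first.
        while len(self.unchecked) > down_to:
            parent, symbol, child = self.unchecked.pop()
            key = child.signature()
            if key in self.register:
                parent.edges[symbol] = self.register[key]
            else:
                self.register[key] = child

    def accepts(self, word):
        node = self.root
        for symbol in word:
            if symbol not in node.edges:
                return False
            node = node.edges[symbol]
        return node.final

dawg = Dawg()
for entry in ['a,bc', 'a.bc', 'a.bd']:   # lexicographically sorted
    dawg.insert(entry)
dawg.finish()
print(dawg.accepts('a.bd'), dawg.accepts('a,bd'))   # True False
\end{verbatim}
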
Node-lists are added incrementally to create a graph. For example, the
subgraphs in Figure~\ref{dawg1} visualize the construction of the DAWG from
the entries \texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd}.

In SG0 the graph is just the null graph, described by
$G=(\{q_0\},\{q_0\},\{\},\{\})$, and does not contain any entries.

Adding the first entry \texttt{a,bc} is not a hard task because there is no
common prefix nor suffix, so the entry becomes a chain of nodes starting from
$q_0$, as visualized in subgraph SG1.

For the second entry, \texttt{a.bc}, some optimization has to be applied. The
common prefix is \texttt{a} and the common suffix is \texttt{bc}. Therefore
the node at which the common suffix starts needs to be copied and split off
to make room for an alternative route. This is exactly what happens in SG2:
node $q_2$ is copied into $q_5$, which gets a connection to node $q_3$. From
the last node of the common prefix a route is built through the copied node
towards the common suffix, resulting in the extra path
$q_1\rightarrow q_5\rightarrow q_3$.

Adding the third entry, \texttt{a.bd}, in the same naive way as the second
entry results in subgraph SG3 and introduces a bad side effect: a new word
that was never explicitly added, \texttt{a,bd}, is now also accepted by the
DAWG. This happens because within the common prefix there is a node that has
another arrow leading into it. When a new path is added, the algorithm
therefore checks the common prefix and suffix for confluence nodes.
Confluence nodes are nodes with multiple incoming arrows, and because of
those multiple arrows they can give rise to unwanted additional words. The
first confluence node found must be copied and detached, and the specific
suffix of the entry must be copied with it to separate the new path from the
existing paths. Applying this technique in the example leads to subgraph SG4.
With common prefix \texttt{a.b} and an empty common suffix, node $q_3$ is
found as a confluence node in the common prefix; therefore $q_3$ is copied to
the new node $q_6$, taking the paths leading to the final state with it. In
this way no new words are added to the DAWG and the DAWG is still optimal.

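The side effect can be made concrete by transcribing the transition tables of
SG3 and SG4 from Figure~\ref{dawg1} and querying them; the sketch below
(state 4 is the accepting state) is for illustration only.
\begin{verbatim}
# Transition tables read off from subgraphs SG3 and SG4 in the figure.
SG3 = {0: {'a': 1}, 1: {',': 2, '.': 5}, 2: {'b': 3},
       3: {'c': 4, 'd': 4}, 5: {'b': 3}}
SG4 = {0: {'a': 1}, 1: {',': 2, '.': 5}, 2: {'b': 3},
       3: {'c': 4}, 5: {'b': 6}, 6: {'c': 4, 'd': 4}}

def accepts(delta, word, final=(4,)):
    state = 0
    for symbol in word:
        if symbol not in delta.get(state, {}):
            return False
        state = delta[state][symbol]
    return state in final

for word in ['a,bc', 'a.bc', 'a.bd', 'a,bd']:
    print(word, accepts(SG3, word), accepts(SG4, word))
# a,bc True True
# a.bc True True
# a.bd True True
# a,bd True False   <- the unwanted word, only accepted by SG3
\end{verbatim}
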
\begin{figure}[H]
\caption{Incremental construction of the DAWG from the entries
\texttt{a,bc}, \texttt{a.bc} and \texttt{a.bd} (subgraphs SG0--SG4)}
\label{dawg1}
\centering
\digraph[width=\linewidth]{inccons}{
rankdir=LR;
n4 [style=invis];
q40 [label="q0"];
q41 [label="q1"];
q42 [label="q2"];
q43 [label="q3"];
q44 [label="q4",shape=doublecircle];
q45 [label="q5"];
q46 [label="q6"];
n4 -> q40[label="SG4"];
q40 -> q41[label="a"];
q41 -> q42[label=","];
q42 -> q43[label="b"];
q43 -> q44[label="c"];
q41 -> q45[label="."];
q45 -> q46[label="b"];
q46 -> q44[label="d"];
q46 -> q44[label="c"];

n3 [style=invis];
q30 [label="q0"];
q31 [label="q1"];
q32 [label="q2"];
q33 [label="q3"];
q34 [label="q4",shape=doublecircle];
q35 [label="q5"];
n3 -> q30[label="SG3"];
q30 -> q31[label="a"];
q31 -> q32[label=","];
q32 -> q33[label="b"];
q33 -> q34[label="c"];
q33 -> q34[label="d",constraint=false];
q31 -> q35[label="."];
q35 -> q33[label="b"];

n2 [style=invis];
q20 [label="q0"];
q21 [label="q1"];
q22 [label="q2"];
q23 [label="q3"];
q24 [label="q4",shape=doublecircle];
q25 [label="q5"];
n2 -> q20[label="SG2"];
q20 -> q21[label="a"];
q21 -> q22[label=","];
q22 -> q23[label="b"];
q23 -> q24[label="c"];
q21 -> q25[label="."];
q25 -> q23[label="b"];

n1 [style=invis];
q10 [label="q0"];
q11 [label="q1"];
q12 [label="q2"];
q13 [label="q3"];
q14 [label="q4",shape=doublecircle];
n1 -> q10[label="SG1"];
q10 -> q11[label="a"];
q11 -> q12[label=","];
q12 -> q13[label="b"];
q13 -> q14[label="c"];

n [style=invis];
q0 [label="q0"];
n -> q0[label="SG0"];
}
\end{figure}

%\subsection{Process}
%Proposal was written
%
%
%First html/mail/fax/rss, worst case rss
%
%
%After some research and determining the scope of the project we decided only to
%do RSS, this because RSS tends to force structure in the data because RSS feeds
%are often generated by the website and thus reliable and consistent. We found a
%couple of good RSS feeds.
%
%
%At first the general framework was designed and implemented, no method yet.
%
%
%Started with method for recognizing separators.
%
%
%Found research paper about algorithm that can create directed acyclic graphs
%from string, although it was designed to compress word lists it can be
%(mis)used to extract information.
%
%
%Implementation of DAG algorithm found and tied to the program.
%
%
%Command line program ready. Conversation with both supervisors, gui had to be
%made.

%Step by step gui created. Web interface as a control center for the crawlers.
%
%
%Gui optimized.
%
%
%Concluded that the program doesn't reach wide audience due to lack of well
%structured rss feeds.