\section{Internals of the crawler generation module}
The data marked by the user is in principle just raw HTML containing a table
with one table row per RSS feed entry. Within the HTML several markers are
placed to simplify the parsing and to remove the need for a full-fledged HTML
parser. When the user presses submit, an HTTP POST request is prepared that
sends all the gathered data to the backend to get processed. The transmission
does not happen asynchronously, so the user has to wait for the data to be
processed. Because of the synchronous transmission the user is notified
immediately when the crawler is successfully added. The data preparation that
enables the backend to process it is all done on the client side using
Javascript.

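A minimal sketch of what the resulting payload could look like; the field
names are assumptions for illustration only and are not prescribed by this
section:
\begin{verbatim}
# Hypothetical shape of the POST payload assembled by the Javascript
# frontend; all field names are assumptions for illustration.
payload = {
    'name': 'Example venue crawler',      # user-supplied information fields
    'feedurl': 'http://example.com/rss',
    'rawdata': '<marked html of the feed entries>',
}
\end{verbatim}
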
When the backend receives the HTTP POST data it saves all the raw data and
the data from the information fields. The raw data is processed line by line
to extract the entries that contain user markings. These entries are
stripped of all HTML while a data structure is created that contains the
locations of the markings and the original text. All data is stored,
including the entries without user data, but not all data is processed. This
at first sight useless data is stored because it can be used later to update
the crawler when the crawler needs editing.

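A minimal sketch of such a per-entry structure, with hypothetical field
names, using the example entry that appears later in this section:
\begin{verbatim}
# Hypothetical shape of one processed entry: the stripped original text
# plus the position and category of every user marking.
entry = {
    'text': '19:00, 2014-11-12 - Foobar',
    'markings': [                     # [start, end) offsets into 'text'
        {'category': 'time',  'start': 0,  'end': 5},
        {'category': 'date',  'start': 7,  'end': 17},
        {'category': 'title', 'start': 20, 'end': 26},
    ],
}
\end{verbatim}
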
When the entries are isolated and processed they are converted to node-lists.
A node-list is a literal list of words, where a word is interpreted in the
broadest sense: a word can be a character, a single byte, a string or
basically anything. These node-lists are in this case the original entry
string in which all the user markings are replaced by single nodes. As an
example, take the following entry and its corresponding node-list, assuming
that the user marked the time, title and date correctly. Markers are
visualized by enclosing the name of the marker in angle brackets.
\begin{flushleft}
Entry: \texttt{19:00, 2014-11-12 - Foobar}\\
Node-list: \texttt{['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']}
\end{flushleft}

In the last step the entries with markers are processed to build these
node-lists. A node-list is constructed such that concatenating all its words
reproduces the original entry; a word is not a word in the linguistic sense
and can be a single character or a whole category code. The node-list is
generated by putting the separate characters one by one into the list; when
a user marking is encountered, the marking is translated to its category
code and that code is added as a single word. The node-lists are then sent
to the actual algorithm to be converted to a graph representation.

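A minimal sketch of this conversion, assuming the hypothetical entry
structure from the earlier sketch:
\begin{verbatim}
def to_nodelist(entry):
    # Walk the entry text; every marked region collapses into a single
    # category node, every unmarked character becomes its own node.
    nodes, pos = [], 0
    marks = {m['start']: m for m in entry['markings']}
    while pos < len(entry['text']):
        if pos in marks:
            mark = marks[pos]
            nodes.append('<%s>' % mark['category'])  # one node per marking
            pos = mark['end']
        else:
            nodes.append(entry['text'][pos])         # one node per character
            pos += 1
    return nodes

# With the entry from the previous sketch this yields:
# ['<time>', ',', ' ', '<date>', ' ', '-', ' ', '<title>']
\end{verbatim}
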
\section{Minimizing DAWGs}
The first algorithm to generate DAGs was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
later extended the algorithm and created an incremental version without
increasing the computational complexity. The non-incremental algorithm from
Daciuk et al.\ is used to convert the node-lists to a graph.

For example, constructing a graph from the entries \textit{a.bc} and
\textit{a,bc} goes in the following steps:

\begin{figure}[H]
\caption{Sample DAG, first entry}
\label{fig:f22}
\centering
\digraph[]{graph22}{
rankdir=LR;
1,2,3,4 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];
}
\end{figure}

\begin{figure}[H]
\caption{Sample DAG, second entry}
\label{fig:f23}
\centering
\digraph[]{graph23}{
rankdir=LR;
1,2,3,4,6 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];

2 -> 6 [label=","];
6 -> 4 [label="b"];
}
\end{figure}

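A compact sketch of such a construction: it first builds a trie from the
node-lists and then merges equivalent states bottom-up into a minimal DAG.
This two-phase, non-incremental approach only approximates the behaviour of
the algorithm of Daciuk et al.\ and is not the actual implementation of the
module:
\begin{verbatim}
class Node:
    def __init__(self):
        self.final = False
        self.edges = {}   # word -> Node

    def signature(self):
        # Two nodes are equivalent when they agree on finality and have
        # identical outgoing transitions (same words to the same nodes).
        return (self.final,
                tuple(sorted((w, id(n)) for w, n in self.edges.items())))

def add_word_list(root, words):
    # Insert one node-list into the trie.
    node = root
    for word in words:
        node = node.edges.setdefault(word, Node())
    node.final = True

def minimize(node, registry):
    # Replace every child by its canonical representative, then register
    # this node itself; equivalent subtrees collapse into one node.
    for word, child in node.edges.items():
        node.edges[word] = minimize(child, registry)
    return registry.setdefault(node.signature(), node)

root = Node()
for nodelist in (['a', '.', 'b', 'c'], ['a', ',', 'b', 'c']):
    add_word_list(root, nodelist)
root = minimize(root, {})
\end{verbatim}
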
%\subsection{Process}
%Proposal was written
%
%
%First html/mail/fax/rss, worst case rss
%
%
%After some research and determining the scope of the project we decided only to
%do RSS, this because RSS tends to force structure in the data because RSS feeds
%are often generated by the website and thus reliable and consistent. We found a
%couple of good RSS feeds.
%
%
%At first the general framework was designed and implemented, no method yet.
%
%
%Started with method for recognizing separators.
%
%
%Found research paper about algorithm that can create directed acyclic graphs
%from string, although it was designed to compress word lists it can be
%(mis)used to extract information.
%
%
%Implementation of DAG algorithm found and tied to the program.
%
%
%Command line program ready. Conversation with both supervisors, gui had to be
%made.
%
%Step by step gui created. Web interface as a control center for the crawlers.
%
%
%Gui optimized.
%
%
%Concluded that the program doesn't reach wide audience due to lack of well
%structured rss feeds.