\section{Application overview and workflow}
The program can be divided into two main components: the \textit{Crawler
application} and the \textit{Input application}. The components are strictly
separated by task and by application. The crawler is an application dedicated
to the sole task of periodically and asynchronously crawling the sources. The
input application is a web interface to a set of tools that can create, edit,
remove and test crawlers via simple point-and-click user interfaces that can
be used by someone without a computer science background.


\section{Input application}
\subsection{Components}
The input application consists of the following components:
\begin{itemize}
\item Add a new crawler
\item Edit or remove existing crawlers
\item Test a crawler
\item Generate XML
\end{itemize}


\section{Crawler application}
\subsection{Interface}

\subsection{Preprocessing}
The crawler receives the data embedded as POST data in an HTTP request. The
POST data consists of several fields with information about the feed and a
container that holds the table with the user markers embedded. After that,
the entries are extracted and processed line by line.

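The following is a minimal sketch of this step in Python. The field names
\texttt{feedname} and \texttt{content} are assumptions chosen for
illustration; the actual POST fields may differ:

\begin{verbatim}
from urllib.parse import parse_qs

def parse_post(body):
    """Split the raw POST body into feed metadata and table rows."""
    fields = parse_qs(body)
    feedname = fields['feedname'][0]  # feed metadata (assumed field name)
    content = fields['content'][0]    # container with the marked-up table
    # Extract the entries: one table row per entry.
    rows = content.split('<tr>')[1:]
    return feedname, rows
\end{verbatim}
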
The line processing converts the raw string of HTML data from a table row to
a plain string. The string is stripped of all HTML tags and is accompanied by
a list of marker items. Entries that do not contain any markers are left out
in the next step of processing. All data, including the entries without user
markers, is stored in the object as well for possible later reference, for
example when editing the patterns.

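A sketch of the tag stripping using the Python standard library; the
assumption that markers survive as literal substrings of the row text is
ours, and the actual marker representation may differ:

\begin{verbatim}
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    """Collect the text content of a table row, dropping all HTML tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def process_row(row_html, markers):
    """Return the plain text of a row and the markers found in it."""
    parser = TagStripper()
    parser.feed(row_html)
    text = ''.join(parser.chunks)
    found = [m for m in markers if m in text]
    return text, found
\end{verbatim}
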
In the last step, the entries with markers are processed to build node-lists.
Node-lists are essentially lists of words that, when concatenated, form the
original entry. A word is not a word in the linguistic sense; a word can be a
single character or a category. The node-list is generated by putting the
separate characters one by one in the list and, when a user marking is
encountered, translating this marking to its category code and adding that
code as a word. The node-lists are then sent to the actual algorithm to be
converted to a graph representation.

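A minimal sketch of the node-list construction; the marking syntax and the
mapping in \texttt{category\_codes} are hypothetical:

\begin{verbatim}
def to_nodelist(entry, category_codes):
    """Build a node-list: single characters, with every user marking
    replaced by its category code."""
    nodes = []
    i = 0
    while i < len(entry):
        for marking, code in category_codes.items():
            if entry.startswith(marking, i):
                nodes.append(code)      # a whole category is one word
                i += len(marking)
                break
        else:
            nodes.append(entry[i])      # every other character is one word
            i += 1
    return nodes

# Example: to_nodelist('08:00{time}', {'{time}': 'TIME'})
# yields ['0', '8', ':', '0', '0', 'TIME']
\end{verbatim}
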
\subsection{Directed acyclic graphs (DAG)}
Directed acyclic graphs are a special kind of graph that is used to store big
sets of words. Looking up a word has a complexity of $\mathcal{O}(L)$, where
$L$ is the length of the word being searched. Figure~\ref{fig:f21} shows a
DAG that accepts the words \textit{aa} and \textit{ab}. Final states are
marked with a double circle and intermediate states are marked with a single
circle.

\begin{figure}[H]
\caption{Sample DAG}
\label{fig:f21}
\centering
\digraph[]{graph21}{
rankdir=LR;
node [shape="circle"]; 1; 2;
node [shape="doublecircle"]; 3; 4;
1 -> 2 [label="a"];
2 -> 3 [label="a"];
2 -> 4 [label="b"];
}
\end{figure}

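Looking up a word is a matter of following the edges character by character,
which is what makes the complexity linear in the length of the word. A sketch
of the lookup on the DAG of Figure~\ref{fig:f21}, using a
dictionary-of-dictionaries representation chosen purely for illustration:

\begin{verbatim}
def accepts(dag, word, start, finals):
    """Walk the DAG edge by edge; O(L) in the length of the word."""
    state = start
    for char in word:
        if char not in dag[state]:
            return False
        state = dag[state][char]
    return state in finals

# The DAG of the figure above: states 3 and 4 are final.
dag = {1: {'a': 2}, 2: {'a': 3, 'b': 4}, 3: {}, 4: {}}
assert accepts(dag, 'aa', 1, {3, 4})
assert accepts(dag, 'ab', 1, {3, 4})
assert not accepts(dag, 'ba', 1, {3, 4})
\end{verbatim}
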
The first algorithm to generate DAGs was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
later extended the algorithm and created an incremental one without
increasing the computational complexity. The incremental algorithm from
Daciuk et al. is used to convert the node-lists to a graph.

For example, constructing a graph from the entries \textit{a.bc} and
\textit{a,bc} proceeds in the following steps:

\begin{figure}[H]
\caption{Sample DAG, first entry}
\label{fig:f22}
\centering
\digraph[]{graph22}{
rankdir=LR;
node [shape="circle"]; 1; 2; 3; 4;
node [shape="doublecircle"]; 5;
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];
}
\end{figure}

\begin{figure}[H]
\caption{Sample DAG, second entry}
\label{fig:f23}
\centering
\digraph[]{graph23}{
rankdir=LR;
node [shape="circle"]; 1; 2; 3; 4; 6;
node [shape="doublecircle"]; 5;
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];

2 -> 6 [label=","];
6 -> 4 [label="b"];
}
\end{figure}

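A heavily simplified sketch of adding one entry to the graph, reusing the
common prefix as in the figures above. The suffix merging performed by the
algorithm of Daciuk et al. (the shared \textit{bc} tail in
Figure~\ref{fig:f23}) is omitted here for brevity:

\begin{verbatim}
def add_entry(dag, finals, word, new_state):
    """Add one entry: follow the existing prefix, then create fresh
    states for the remainder of the word (no suffix merging)."""
    state = 1
    i = 0
    while i < len(word) and word[i] in dag[state]:
        state = dag[state][word[i]]     # reuse the common prefix
        i += 1
    for char in word[i:]:
        dag[state][char] = new_state    # create a fresh state
        dag[new_state] = {}
        state = new_state
        new_state += 1
    finals.add(state)
    return new_state

# Building up the graphs of the figures step by step:
dag, finals = {1: {}}, set()
nxt = add_entry(dag, finals, 'a.bc', 2)    # first entry
nxt = add_entry(dag, finals, 'a,bc', nxt)  # second entry
\end{verbatim}
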
\subsection{Defining categories}
\subsection{Process}
The development process went through the following stages:
\begin{itemize}
\item The project proposal was written.
\item Initially HTML, e-mail, fax and RSS were all considered as source
	formats, with RSS as the worst case.
\item After some research and determining the scope of the project we
	decided to restrict ourselves to RSS, because RSS tends to force
	structure on the data: RSS feeds are often generated by the website
	and are thus reliable and consistent. We found a couple of good RSS
	feeds.
\item At first the general framework was designed and implemented, without
	an extraction method yet.
\item We started with a method for recognizing separators.
\item We found a research paper about an algorithm that can create directed
	acyclic graphs from strings; although it was designed to compress
	word lists, it can be (mis)used to extract information.
\item An implementation of the DAG algorithm was found and tied into the
	program.
\item When the command line program was ready, a conversation with both
	supervisors made clear that a GUI had to be made.
\item Step by step a GUI was created: a web interface that serves as a
	control center for the crawlers.
\item The GUI was optimized.
\item We concluded that the program does not reach a wide audience due to
	the lack of well structured RSS feeds.
\end{itemize}