\section{Application overview and workflow}
The program can be divided into two main components, namely the
\textit{Crawler application} and the \textit{Input application}. The
components are strictly separated: each has its own task and runs as its own
application. The crawler is dedicated to the sole task of periodically and
asynchronously crawling the sources. The input application is a web interface
to a set of tools for creating, editing, removing and testing crawlers through
simple point-and-click user interfaces that can be operated by someone without
a computer science background.

\section{Minimizing DAWGs}
The first algorithm to generate DAGs was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
later extended the algorithm and created an incremental one without increasing
the computational complexity. The non-incremental algorithm from Daciuk et
al.\ is used to convert the node-lists to a graph.

For example, constructing a graph from the entries \textit{a.bc} and
\textit{a,bc} proceeds in the following steps:

\begin{figure}[H]
\caption{Sample DAG, first entry}
\label{fig:f22}
\centering
\digraph[]{graph22}{
rankdir=LR;
1,2,3,4 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];
}
\end{figure}

\begin{figure}[H]
\caption{Sample DAG, second entry}
\label{fig:f23}
\centering
\digraph[]{graph23}{
rankdir=LR;
1,2,3,4,6 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];

2 -> 6 [label=","];
6 -> 4 [label="b"];
}
\end{figure}

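The listing below is a minimal sketch, not the implementation used in the
crawler, of how such a graph can be constructed: the entries are first
inserted into a trie, after which nodes with identical right languages are
merged through a register of states, in the spirit of Daciuk et
al.~\cite{Daciuk2000}. Note that a fully minimized graph would also merge
nodes 3 and 6 of figure~\ref{fig:f23}, since both are followed only by
\textit{bc}.

\begin{verbatim}
# Minimal sketch: build a trie from the entries, then merge nodes with
# identical right languages (finality plus outgoing edges).
class Node:
    def __init__(self):
        self.edges = {}      # label -> Node
        self.final = False

def insert(root, word):
    node = root
    for label in word:
        node = node.edges.setdefault(label, Node())
    node.final = True

def minimize(node, register, signatures):
    # Post-order: minimize the children first, then look the node up in
    # the register and replace it by an equivalent node if one exists.
    for label, child in list(node.edges.items()):
        node.edges[label] = minimize(child, register, signatures)
    signature = (node.final,
                 tuple(sorted((label, signatures[id(child)])
                              for label, child in node.edges.items())))
    node = register.setdefault(signature, node)
    signatures[id(node)] = signature
    return node

root = Node()
for entry in ("a.bc", "a,bc"):
    insert(root, entry)
root = minimize(root, {}, {})
\end{verbatim}
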
\section{Input application}
\subsection{Components}
The input application consists of the following components:
\begin{itemize}
\item Add a new crawler
\item Edit or remove existing crawlers
\item Test a crawler
\item Generate XML
\end{itemize}

\section{Crawler application}
\subsection{Interface}

\subsection{Preprocessing}
The data arrives at the crawler embedded as POST data in an HTTP request. The
POST data consists of several fields with information about the feed and a
container that holds the table with the user markers embedded in it. The
entries are then extracted from this table and processed line by line.

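The snippet below is a minimal sketch of how such a request could be
received; the endpoint and the field names (\texttt{feedurl} and
\texttt{table}) are assumptions made for illustration and not the actual
interface of the crawler.

\begin{verbatim}
# Minimal sketch of receiving the POST data; the endpoint and the field
# names are assumptions, not the actual interface of the crawler.
from flask import Flask, request
import re

app = Flask(__name__)

@app.route("/crawler/new", methods=["POST"])
def receive_feed():
    feed_info = request.form["feedurl"]    # information about the feed
    table_html = request.form["table"]     # container with the marked table
    # Every <tr> row of the table is one entry to be processed.
    rows = re.findall(r"<tr[^>]*>(.*?)</tr>", table_html, flags=re.S)
    for row in rows:
        process_line(row)                  # sketched further below
    return "OK"
\end{verbatim}
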
The line processing converts the raw HTML of a table row into a plain string.
The string is stripped of all HTML tags and is accompanied by a list of marker
items. Entries that do not contain any markers are left out of the next
processing step. All data, including the entries without user markers, is
nevertheless stored in the object for possible later reference, for example
when editing the patterns.

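The following is a minimal sketch of this per-row processing. The way the
user markers are encoded in the HTML, here a \texttt{span} element with class
\texttt{marker} and a \texttt{data-category} attribute, is an assumption made
for this example.

\begin{verbatim}
# Minimal sketch of the per-row processing. The marker encoding used
# here (a span element with class "marker" and a data-category
# attribute) is an assumption made for this example.
from html.parser import HTMLParser

class RowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = ""       # row text with all tags stripped
        self.markers = []    # (start, end, category) per user marker
        self._open = None    # start offset and category of open marker

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "marker":
            self._open = (len(self.text), attrs.get("data-category"))

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            start, category = self._open
            self.markers.append((start, len(self.text), category))
            self._open = None

    def handle_data(self, data):
        self.text += data

def process_line(row_html):
    parser = RowParser()
    parser.feed(row_html)
    return parser.text, parser.markers
\end{verbatim}
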
In the last step the entries with markers are processed to build node-lists.
Node-lists are lists of words that, when concatenated, form the original
entry. A word is not a word in the linguistic sense; it can be a single
character or a category. The node-list is generated by adding the separate
characters one by one to the list and, when a user marking is encountered,
translating that marking to its category code and adding the code as a single
word. The node-lists are then sent to the actual algorithm to be converted
into a graph representation.

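The function below is a minimal sketch of this step; the category name used
in the usage example is a placeholder.

\begin{verbatim}
# Minimal sketch of node-list construction: every character becomes a
# word, except that text covered by a user marker is collapsed into a
# single word holding the category code.
def build_node_list(text, markers):
    """markers: list of (start, end, category), as produced above."""
    nodes, position = [], 0
    for start, end, category in sorted(markers):
        nodes.extend(text[position:start])  # plain characters, one per word
        nodes.append(category)              # marked span becomes one word
        position = end
    nodes.extend(text[position:])
    return nodes

# Example with the placeholder category "artist":
# build_node_list("19:00 Artistname", [(6, 16, "artist")])
# -> ['1', '9', ':', '0', '0', ' ', 'artist']
\end{verbatim}
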
\subsection{Defining categories}
pass

\subsection{Process}
The project proceeded roughly in the following steps:
\begin{itemize}
\item The proposal was written.
\item Initially HTML, mail, fax and RSS sources were considered, with RSS as
	the worst case.
\item After some research and after determining the scope of the project, we
	decided to support only RSS. RSS tends to force structure on the data
	because RSS feeds are often generated by the website and are therefore
	reliable and consistent. We found a couple of good RSS feeds.
\item The general framework was designed and implemented first, without an
	extraction method yet.
\item Work started on a method for recognizing separators.
\item We found a research paper about an algorithm that can create directed
	acyclic graphs from strings; although it was designed to compress word
	lists, it can be (mis)used to extract information.
\item An implementation of the DAG algorithm was found and tied to the
	program.
\item The command line program was ready. After a conversation with both
	supervisors it was decided that a GUI had to be made.
\item The GUI was created step by step, as a web interface serving as a
	control center for the crawlers.
\item The GUI was optimized.
\item We concluded that the program does not reach a wide audience due to the
	lack of well-structured RSS feeds.
\end{itemize}