\section{Application overview and workflow}
The program consists of two main components: the \textit{Crawler
application} and the \textit{Input application}. The two components are
strictly separated by task. The crawler is dedicated to the sole task of
periodically and asynchronously crawling the sources. The input application
is a web interface to a set of tools for creating, editing, removing and
testing crawlers through simple point-and-click user interfaces that can be
operated by someone without a computer science background.
\section{Input application}
\subsection{Components}
The input application provides the following components:
\begin{itemize}
	\item Add a new crawler
	\item Edit or remove existing crawlers
	\item Test a crawler
	\item Generate XML
\end{itemize}
\section{Crawler application}
\subsection{Interface}

\subsection{Algorithm}

\subsection{Preprocessing}
The crawler receives its data embedded as POST data in an HTTP request. The
POST data consists of several fields with information about the feed and a
container holding the table in which the user markers are embedded. The
entries are then extracted from this table and processed line by line.
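A minimal sketch of this step, in Python and with hypothetical field names
(\texttt{name}, \texttt{url} and \texttt{content}); the actual field layout
of the application may differ:

\begin{verbatim}
import re

# Sketch of the preprocessing entry point. The POST field names
# ("name", "url", "content") are assumptions for illustration.
def preprocess(post_data):
    feed_info = {key: post_data[key] for key in ("name", "url")}
    table_html = post_data["content"]  # container with the user markers
    # Extract the table rows; every row is one entry to process.
    rows = re.findall(r"<tr[^>]*>.*?</tr>", table_html, flags=re.S)
    return feed_info, rows
\end{verbatim}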

The line processing converts the raw string of HTML data from a table row
into a plain string. The string is stripped of all HTML tags and is
accompanied by a list of marker items. Entries that do not contain any
markers are left out of the next processing step. All data, including the
entries without user markers, is nevertheless stored in the object for
possible later reference, for example when editing the patterns.
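A sketch of the per-row processing; how the user markers are encoded in the
HTML is not specified here, so the marker markup below is an assumption:

\begin{verbatim}
import re

# Sketch of the per-row processing. The marker markup
# (<span class="marker-...">) is assumed for illustration.
def process_row(row_html):
    # Collect the marker categories present in this row.
    markers = re.findall(r'<span class="marker-(\w+)">', row_html)
    # Strip all HTML tags to obtain the plain entry string.
    text = re.sub(r"<[^>]+>", "", row_html).strip()
    return text, markers
\end{verbatim}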
38
39 The last step is when the entries with markers are then processed to build
40 node-lists. Node-lists are basically lists of words that, when concatenated,
41 form the original entry. A word isn't a word in the linguistic sense. A word
42 can be one letter or a category. The node-list is generated by putting all the
43 separate characters one by one in the list and when a user marking is
44 encountered, this marking is translated to the category code and that code is
45 then added as a word. The nodelists are then sent to the actual algorithm to be
46 converted to a graph representation.
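The following sketch illustrates this construction; the input format (marked
regions as \texttt{(start, end, category)} tuples) and the category name are
assumptions made for the example:

\begin{verbatim}
# Sketch of node-list construction: individual characters become
# words, and a marked region collapses into its category code.
def to_nodelist(text, marked_spans):
    # marked_spans: list of (start, end, category) tuples
    nodelist, i = [], 0
    for start, end, category in sorted(marked_spans):
        nodelist.extend(text[i:start])  # one word per character
        nodelist.append(category)       # whole marking as one word
        i = end
    nodelist.extend(text[i:])
    return nodelist

# Example: "20:00 Concert" with "20:00" marked as a time category:
# to_nodelist("20:00 Concert", [(0, 5, "<time>")])
# -> ["<time>", " ", "C", "o", "n", "c", "e", "r", "t"]
\end{verbatim}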

\subsection{Directed acyclic graphs (DAG)}
Directed acyclic graphs are a special kind of graph used to store large sets
of words. Looking up a word has a complexity of $\mathcal{O}(L)$, where $L$
is the length of the word being searched. Figure~\ref{fig:f21} shows a DAG
that accepts the words \textit{aa} and \textit{ab}. Final states are marked
with a double circle and intermediate states with a single circle.
\begin{figure}[H]
\caption{Sample DAG}
\label{fig:f21}
\centering
\digraph[]{graph2}{
rankdir=LR;
node [shape="circle"]; 1; 2;
node [shape="doublecircle"]; 3; 4;
1 -> 2 [label="a"];
2 -> 3 [label="a"];
2 -> 4 [label="b"];
}
\end{figure}
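To illustrate the $\mathcal{O}(L)$ lookup, the DAG of Figure~\ref{fig:f21}
can be encoded as a transition table; the following sketch (in Python,
chosen purely for illustration) checks membership by following one
transition per character:

\begin{verbatim}
# The sample DAG as a transition table: state -> {word: state}.
transitions = {1: {"a": 2}, 2: {"a": 3, "b": 4}}
final_states = {3, 4}

def accepts(word):
    # Follow one transition per character: O(L) in the word length.
    state = 1
    for char in word:
        state = transitions.get(state, {}).get(char)
        if state is None:
            return False
    return state in final_states

assert accepts("aa") and accepts("ab") and not accepts("ba")
\end{verbatim}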

Using the algorithm described by Hopcroft~\cite{Hopcroft1971}, the
node-lists are converted into minimal directed acyclic graphs with a low
complexity ($\mathcal{O}(N\log N)$, where $N$ is the number of states).
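Hopcroft's partition-refinement algorithm itself is more involved than can
be shown here; as an illustrative stand-in (in Python, with assumed data
structures, and explicitly not the algorithm the application uses), the
following sketch builds a trie from the node-lists and merges identical
suffixes, which produces the same minimal DAG for acyclic input:

\begin{verbatim}
# Build a trie of the node-lists, then merge states whose
# "signature" (finality plus outgoing transitions) is identical.
# For acyclic automata this suffix merging yields the minimal DAG;
# it is a simple stand-in for Hopcroft's algorithm, not that
# algorithm itself.
def build_trie(nodelists):
    trie, finals, next_id = {0: {}}, set(), 1
    for words in nodelists:
        state = 0
        for word in words:
            if word not in trie[state]:
                trie[state][word] = next_id
                trie[next_id] = {}
                next_id += 1
            state = trie[state][word]
        finals.add(state)
    return trie, finals

def minimize(trie, finals, root=0):
    registry = {}  # signature -> canonical state id
    merged = {}    # canonical state id -> {word: canonical child}

    def canonical(state):
        children = tuple(sorted(
            (word, canonical(child))
            for word, child in trie[state].items()))
        signature = (state in finals, children)
        if signature not in registry:
            registry[signature] = state
            merged[state] = dict(children)
        return registry[signature]

    start = canonical(root)
    return merged, {s for s in merged if s in finals}, start

# Example: the node-lists ["a", "a"] and ["a", "b"] yield a DAG in
# which the two final states of Figure 2 are merged into one.
trie, finals = build_trie([["a", "a"], ["a", "b"]])
dag, dag_finals, start = minimize(trie, finals)
\end{verbatim}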