\section{Application overview and workflow}
The program is divided into two main components: the \textit{Crawler
application} and the \textit{Input application}. The components are strictly
separated by task. The crawler is an application dedicated to the sole task
of periodically and asynchronously crawling the sources. The input
application is a web interface to a set of tools for creating, editing,
removing and testing crawlers through simple point-and-click user interfaces
that can be operated by someone without a computer science background.

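A minimal sketch of such a periodic, asynchronous crawl loop is given below.
It is written in Python for illustration; the names \texttt{crawl\_source},
\texttt{main} and \texttt{INTERVAL} are hypothetical and not taken from the
actual implementation.

\begin{verbatim}
import asyncio

INTERVAL = 3600  # assumed: seconds between two crawl rounds

async def crawl_source(source):
    # Placeholder: fetch the source and process its entries.
    print('crawling', source)

async def main(sources):
    while True:
        # Crawl all sources concurrently, then sleep until
        # the next scheduled round.
        await asyncio.gather(*(crawl_source(s) for s in sources))
        await asyncio.sleep(INTERVAL)

asyncio.run(main(['http://example.com/feed']))
\end{verbatim}
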
\section{Input application}
\subsection{Components}
The input application consists of four tools:
\begin{itemize}
	\item Add a new crawler
	\item Edit or remove existing crawlers
	\item Test a crawler
	\item Generate XML
\end{itemize}

\section{Crawler application}
\subsection{Interface}

\subsection{Algorithm}
\subsection{Preprocessing}
When the crawler receives the data, it is embedded as POST data in an HTTP
request. The POST data consists of several fields with information about the
feed and a container that holds the table with the user markers embedded.
The entries are then extracted and processed line by line.

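The sketch below shows how such a POST body could be unpacked. The field
names \texttt{feedurl} and \texttt{table} are assumptions made for the
example, not the actual field names used.

\begin{verbatim}
from urllib.parse import parse_qs

def parse_post(body):
    # Unpack the url-encoded POST body into its fields.
    fields = parse_qs(body)
    return {
        'feed': fields['feedurl'][0],  # information about the feed
        'table': fields['table'][0],   # HTML table with user markers
    }

data = parse_post('feedurl=http%3A%2F%2Fexample.com'
                  '&table=%3Ctable%3E...%3C%2Ftable%3E')
\end{verbatim}
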
The line processing converts the raw HTML data of a table row to a plain
string. The string is stripped of all HTML tags and is accompanied by a list
of marker items.

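A sketch of this step follows. It assumes, purely for illustration, that a
user marker appears in the text as \texttt{\{name\}}; the real marker syntax
may differ.

\begin{verbatim}
import re
from html.parser import HTMLParser

class RowStripper(HTMLParser):
    # Collects only the text content of a row, dropping the tags.
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

def process_row(row_html):
    stripper = RowStripper()
    stripper.feed(row_html)
    line = ''.join(stripper.chunks)
    # Collect the marker items that occur in the stripped string.
    markers = re.findall(r'\{(\w+)\}', line)
    return line, markers

line, markers = process_row(
    '<tr><td>10:00 {title} at {location}</td></tr>')
\end{verbatim}
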
Entries that do not contain any markers are left out of the next processing
step. All data, including the entries without user markers, is nevertheless
stored in the object for possible later reference, for example when editing
the patterns.

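One way to realise this, sketched below under the assumption that entries
are the \texttt{(line, markers)} pairs from the previous step, is an object
that keeps the full list while exposing only the marked subset. The class
and method names are illustrative.

\begin{verbatim}
class CrawlerData:
    # Keeps every parsed entry, so that entries without markers
    # remain available later, e.g. when the patterns are edited.
    def __init__(self, entries):
        self.entries = entries  # all (line, markers) pairs

    def marked(self):
        # Only entries with at least one marker continue to the
        # node-list step.
        return [e for e in self.entries if e[1]]
\end{verbatim}
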
In the last step the entries with markers are processed to build node-lists.
Node-lists are essentially strings in which the user markers are replaced by
patterns, so that the variable data, the isolated data, does not appear in
the node-lists.

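The sketch below illustrates this replacement. Representing a node-list as a
sequence of \texttt{(kind, value)} pairs is an assumption of the example;
only the principle, replacing marker occurrences by generic pattern nodes,
is taken from the text.

\begin{verbatim}
import re

def to_node_list(line):
    # Split the line into alternating literal text and marker
    # parts, then replace every marker by a generic pattern node
    # so the variable data itself is not stored.
    parts = re.split(r'(\{\w+\})', line)
    return [('pattern', p[1:-1]) if p.startswith('{')
            else ('text', p)
            for p in parts if p]

print(to_node_list('10:00 {title} at {location}'))
# [('text', '10:00 '), ('pattern', 'title'),
#  ('text', ' at '), ('pattern', 'location')]
\end{verbatim}
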
\subsection{Directed acyclic graphs}