\section{Application overview and workflow}
The program can be divided into two main components: the \textit{Crawler application} and the \textit{Input application}. The components are strictly separated by task. The crawler is an application dedicated to the sole task of periodically and asynchronously crawling the sources. The input application is a web interface to a set of tools for creating, editing, removing and testing crawlers via simple point-and-click user interfaces that can be used by someone without a computer science background.
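The crawler's periodic, asynchronous loop could look roughly like the sketch below. Python and all names here are assumptions made purely for illustration, not the program's actual code:

```python
import asyncio

# Hypothetical source list; the real crawler loads the sources from the
# crawlers defined through the input application.
SOURCES = ["http://example.com/a.rss", "http://example.com/b.rss"]

async def crawl_source(url):
    # Stand-in for the real fetch-and-extract step.
    await asyncio.sleep(0)  # yield control, as a real network fetch would
    return (url, "fetched")

async def crawl_round():
    # All sources in one round are crawled concurrently (asynchronously).
    return await asyncio.gather(*(crawl_source(u) for u in SOURCES))

async def crawl_forever(interval_seconds):
    # The crawler's sole task: repeat a round periodically.
    while True:
        await crawl_round()
        await asyncio.sleep(interval_seconds)
```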
\section{Input application}
\subsection{Components}
Editing or removing crawlers
\section{Crawler application}
\subsection{Interface}
\subsection{Preprocessing}
The crawler receives the data embedded as POST data in an HTTP request. The POST data consists of several fields with information about the feed and a container that embeds the table with the user markers. The entries are then extracted and processed line by line.
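The receiving step could be sketched as follows; the field names and the request body are invented for illustration, since the real layout is defined by the input application:

```python
from urllib.parse import parse_qs

# Hypothetical request body: a few fields about the feed plus a container
# holding the marked-up table.
raw_body = ("name=jobs"
            "&url=http%3A%2F%2Fexample.com%2Ffeed.rss"
            "&content=%3Ctable%3E%3Ctr%3E%3Ctd%3Ecook%3C%2Ftd%3E"
            "%3C%2Ftr%3E%3C%2Ftable%3E")

fields = {k: v[0] for k, v in parse_qs(raw_body).items()}
feed_name = fields["name"]          # information about the feed
table_html = fields["content"]      # container with the marked-up table

# The table is then split into rows, which are processed one by one.
rows = table_html.split("<tr>")[1:]
```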
The line processing converts the raw HTML string of a table row to a plain string. The string is stripped of all HTML tags and is accompanied by a list of marker items. Entries that do not contain any markers are left out of the next processing step. All data, including the entries without user markers, is nevertheless stored in the object for possible later reference, for example when editing the patterns.
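A sketch of this row processing is given below. The marker markup used here is an assumption made for the sketch; the real program uses its own marker encoding:

```python
import re

# Assumed marker convention for this sketch only: a user marker is a span
# whose class names the category.
MARKER_RE = re.compile(r'<span class="marker-(\w+)">(.*?)</span>')
TAG_RE = re.compile(r"<[^>]+>")

def process_row(row_html):
    # Collect a (category, text) pair for every user marker in the row,
    # then strip all HTML tags to obtain the plain-text entry.
    markers = MARKER_RE.findall(row_html)
    text = TAG_RE.sub("", row_html)
    return text, markers

text, markers = process_row(
    '<td>Job: <span class="marker-title">cook</span> in '
    '<span class="marker-city">Delft</span></td>')
# Rows without markers yield an empty marker list and are skipped in the
# next step, though still stored for later pattern editing.
```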
In the last step the entries with markers are processed into node-lists. A node-list is a list of words that, when concatenated, forms the original entry. A word here is not a word in the linguistic sense: it can be a single character or a category. The node-list is generated by appending the separate characters one by one to the list; when a user marking is encountered, the marking is translated to its category code and that code is added as a single word. The node-lists are then passed to the actual algorithm to be converted to a graph representation.
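The node-list construction described above might be sketched as follows; the category codes and the marker representation are assumptions made for illustration:

```python
# Hypothetical category codes; the real program translates each user
# marking to its own category code.
CATEGORY_CODES = {"title": "<TITLE>", "city": "<CITY>"}

def build_node_list(entry, markers):
    """entry: plain-text string; markers: list of (category, marked_text)."""
    nodes = []
    i = 0
    while i < len(entry):
        for cat, marked in markers:
            if entry.startswith(marked, i):
                # A marked substring becomes one word: its category code.
                nodes.append(CATEGORY_CODES[cat])
                i += len(marked)
                break
        else:
            nodes.append(entry[i])  # otherwise a word is one character
            i += 1
    return nodes

nodes = build_node_list("cook in Delft",
                        [("title", "cook"), ("city", "Delft")])
# nodes == ['<TITLE>', ' ', 'i', 'n', ' ', '<CITY>']
```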
\subsection{Directed acyclic graphs (DAG)}
Directed acyclic graphs are a special kind of graph used to store large sets of words with a lookup complexity of $\mathcal{O}(L)$, where $L$ is the length of the word being searched. Figure~\ref{fig:f21} shows a DAG that accepts the words \textit{aa} and \textit{ab}. Final states are marked with a double circle and intermediate states with a single circle.
1,2 [shape="circle"];
3,4 [shape="doublecircle"];
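To make the $\mathcal{O}(L)$ lookup concrete, the DAG of the figure can be written down as a transition table; the encoding below is illustrative, not the program's actual data structure:

```python
# States 1-4 from the figure, with 3 and 4 final (the double circles).
TRANSITIONS = {1: {"a": 2}, 2: {"a": 3, "b": 4}}
FINAL = {3, 4}

def accepts(word):
    # Lookup follows one edge per character, so it costs O(L) for a word
    # of length L, regardless of how many words the DAG stores.
    state = 1
    for ch in word:
        state = TRANSITIONS.get(state, {}).get(ch)
        if state is None:
            return False
    return state in FINAL
```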
The first algorithm to generate DAGs was proposed by Hopcroft et al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and had a complexity of $\mathcal{O}(N \log N)$. Daciuk et al.~\cite{Daciuk2000} later extended the algorithm and created an incremental one without increasing the computational complexity. The non-incremental algorithm from Daciuk et al.\ is used to convert the node-lists to a graph.
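A minimal sketch of the non-incremental idea, in Python for illustration only: first build a trie of all entries (node-lists), then merge equivalent subtrees bottom-up so that, for instance, the two final states for \textit{aa} and \textit{ab} collapse into one:

```python
class Node:
    def __init__(self):
        self.edges = {}     # word -> Node
        self.final = False

def build_trie(entries):
    root = Node()
    for entry in entries:           # an entry is a node-list of words
        node = root
        for word in entry:
            node = node.edges.setdefault(word, Node())
        node.final = True
    return root

def minimize(node, registry):
    # Replace every subtree by a registered representative; two nodes are
    # equivalent when they agree on finality and on all outgoing edges.
    for word, child in node.edges.items():
        node.edges[word] = minimize(child, registry)
    key = (node.final,
           tuple(sorted((w, id(c)) for w, c in node.edges.items())))
    return registry.setdefault(key, node)

def count_nodes(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node.edges.values())

# The trie for "aa" and "ab" has four nodes; after merging the two
# equivalent final states, the DAG has three.
dag = minimize(build_trie([list("aa"), list("ab")]), {})
```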
For example, constructing a graph from the entries \textit{a,bc} and \textit{a.bc} proceeds in the following steps:
\caption{Sample DAG, first entry}
1,2,3,5 [shape="circle"];
5 [shape="doublecircle"];
\caption{Sample DAG, second entry}
1,2,3,5,6 [shape="circle"];
5 [shape="doublecircle"];
\subsection{Defining categories}
The initial plan was to support HTML, e-mail, fax and RSS sources, with RSS-only as the worst case.
After some research, and after determining the scope of the project, we decided to support only RSS. RSS tends to force structure on the data, because RSS feeds are often generated by the website itself and are thus reliable and consistent. We found a couple of good RSS feeds.
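The structural regularity that makes RSS attractive can be illustrated by parsing a small feed with a standard XML parser; the feed content below is invented:

```python
import xml.etree.ElementTree as ET

# Because a generated RSS feed gives every item the same structure,
# extracting fields is simple and reliable.
RSS = """<rss version="2.0"><channel>
  <title>Jobs</title>
  <item><title>cook - Delft</title><link>http://example.com/1</link></item>
  <item><title>waiter - Leiden</title><link>http://example.com/2</link></item>
</channel></rss>"""

titles = [item.findtext("title") for item in ET.fromstring(RSS).iter("item")]
```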
At first the general framework was designed and implemented, without any extraction method yet.
We then started with a method for recognizing separators.
We found a research paper describing an algorithm that can create directed acyclic graphs from strings; although it was designed to compress word lists, it can be (mis)used to extract information.
An implementation of the DAG algorithm was found and tied to the program.
The command line program was then ready. After a conversation with both supervisors it was decided that a GUI had to be added.
A step-by-step GUI was created: a web interface serving as a control center for the crawlers.
We concluded that the program does not reach a wide audience, due to the lack of well-structured RSS feeds.