\section{Application overview and workflow}
The program can be divided into two main components: the \textit{Crawler application} and the \textit{Input application}. The components are strictly separated by task. The crawler is an application dedicated to the sole task of periodically and asynchronously crawling the sources. The input application is a web interface to a set of tools for creating, editing, removing and testing crawlers via simple point-and-click user interfaces that can be used by someone without a computer science background.
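The crawler's periodic, asynchronous loop could look roughly like the sketch below. Python and all names here are assumptions made purely for illustration, not the program's actual code:

```python
import asyncio

# Hypothetical source list; the real crawler loads the sources from the
# crawlers defined through the input application.
SOURCES = ["http://example.com/a.rss", "http://example.com/b.rss"]

async def crawl_source(url):
    # Stand-in for the real fetch-and-extract step.
    await asyncio.sleep(0)  # yield control, as a real network fetch would
    return (url, "fetched")

async def crawl_round():
    # All sources in one round are crawled concurrently (asynchronously).
    return await asyncio.gather(*(crawl_source(u) for u in SOURCES))

async def crawl_forever(interval_seconds):
    # The crawler's sole task: repeat a round periodically.
    while True:
        await crawl_round()
        await asyncio.sleep(interval_seconds)
```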
\section{Input application}
\subsection{Components}
Editing or removing crawlers
\section{Crawler application}
\subsection{Interface}
\subsection{Preprocessing}
The crawler receives the data embedded as POST data in an HTTP request. The POST data consists of several fields with information about the feed and a container that embeds the table with the user markers. The entries are then extracted and processed line by line.
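The receiving step could be sketched as follows; the field names and the request body are invented for illustration, since the real layout is defined by the input application:

```python
from urllib.parse import parse_qs

# Hypothetical request body: a few fields about the feed plus a container
# holding the marked-up table.
raw_body = ("name=jobs"
            "&url=http%3A%2F%2Fexample.com%2Ffeed.rss"
            "&content=%3Ctable%3E%3Ctr%3E%3Ctd%3Ecook%3C%2Ftd%3E"
            "%3C%2Ftr%3E%3C%2Ftable%3E")

fields = {k: v[0] for k, v in parse_qs(raw_body).items()}
feed_name = fields["name"]          # information about the feed
table_html = fields["content"]      # container with the marked-up table

# The table is then split into rows, which are processed one by one.
rows = table_html.split("<tr>")[1:]
```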
The line processing converts the raw HTML string of a table row to a plain string. The string is stripped of all HTML tags and is accompanied by a list of marker items. Entries that do not contain any markers are left out of the next processing step. All data, including the entries without user markers, is nevertheless stored in the object for possible later reference, for example when editing the patterns.
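A sketch of this row processing is given below. The marker markup used here is an assumption made for the sketch; the real program uses its own marker encoding:

```python
import re

# Assumed marker convention for this sketch only: a user marker is a span
# whose class names the category.
MARKER_RE = re.compile(r'<span class="marker-(\w+)">(.*?)</span>')
TAG_RE = re.compile(r"<[^>]+>")

def process_row(row_html):
    # Collect a (category, text) pair for every user marker in the row,
    # then strip all HTML tags to obtain the plain-text entry.
    markers = MARKER_RE.findall(row_html)
    text = TAG_RE.sub("", row_html)
    return text, markers

text, markers = process_row(
    '<td>Job: <span class="marker-title">cook</span> in '
    '<span class="marker-city">Delft</span></td>')
# Rows without markers yield an empty marker list and are skipped in the
# next step, though still stored for later pattern editing.
```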
In the last step the entries with markers are processed into node-lists. A node-list is a list of words that, when concatenated, forms the original entry. A word here is not a word in the linguistic sense: it can be a single character or a category. The node-list is generated by appending the separate characters one by one to the list; when a user marking is encountered, the marking is translated to its category code and that code is added as a single word. The node-lists are then passed to the actual algorithm to be converted to a graph representation.
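The node-list construction described above might be sketched as follows; the category codes and the marker representation are assumptions made for illustration:

```python
# Hypothetical category codes; the real program translates each user
# marking to its own category code.
CATEGORY_CODES = {"title": "<TITLE>", "city": "<CITY>"}

def build_node_list(entry, markers):
    """entry: plain-text string; markers: list of (category, marked_text)."""
    nodes = []
    i = 0
    while i < len(entry):
        for cat, marked in markers:
            if entry.startswith(marked, i):
                # A marked substring becomes one word: its category code.
                nodes.append(CATEGORY_CODES[cat])
                i += len(marked)
                break
        else:
            nodes.append(entry[i])  # otherwise a word is one character
            i += 1
    return nodes

nodes = build_node_list("cook in Delft",
                        [("title", "cook"), ("city", "Delft")])
# nodes == ['<TITLE>', ' ', 'i', 'n', ' ', '<CITY>']
```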
\subsection{Directed acyclic graphs (DAG)}
Directed acyclic graphs are a special kind of graph used to store large sets of words with a lookup complexity of $\mathcal{O}(L)$, where $L$ is the length of the word being searched. Figure~\ref{fig:f21} shows a DAG that accepts the words \textit{aa} and \textit{ab}. Final states are marked with a double circle and intermediate states with a single circle.
1,2 [shape="circle"];
3,4 [shape="doublecircle"];
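To make the $\mathcal{O}(L)$ lookup concrete, the DAG of the figure can be written down as a transition table; the encoding below is illustrative, not the program's actual data structure:

```python
# States 1-4 from the figure, with 3 and 4 final (the double circles).
TRANSITIONS = {1: {"a": 2}, 2: {"a": 3, "b": 4}}
FINAL = {3, 4}

def accepts(word):
    # Lookup follows one edge per character, so it costs O(L) for a word
    # of length L, regardless of how many words the DAG stores.
    state = 1
    for ch in word:
        state = TRANSITIONS.get(state, {}).get(ch)
        if state is None:
            return False
    return state in FINAL
```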
The first algorithm to generate DAGs was proposed by Hopcroft et al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and had a complexity of $\mathcal{O}(N \log N)$. Daciuk et al.~\cite{Daciuk2000} later extended the algorithm and created an incremental one without increasing the computational complexity. The non-incremental algorithm from Daciuk et al.\ is used to convert the node-lists to a graph.
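A minimal sketch of the non-incremental idea, in Python for illustration only: first build a trie of all entries (node-lists), then merge equivalent subtrees bottom-up so that, for instance, the two final states for \textit{aa} and \textit{ab} collapse into one:

```python
class Node:
    def __init__(self):
        self.edges = {}     # word -> Node
        self.final = False

def build_trie(entries):
    root = Node()
    for entry in entries:           # an entry is a node-list of words
        node = root
        for word in entry:
            node = node.edges.setdefault(word, Node())
        node.final = True
    return root

def minimize(node, registry):
    # Replace every subtree by a registered representative; two nodes are
    # equivalent when they agree on finality and on all outgoing edges.
    for word, child in node.edges.items():
        node.edges[word] = minimize(child, registry)
    key = (node.final,
           tuple(sorted((w, id(c)) for w, c in node.edges.items())))
    return registry.setdefault(key, node)

def count_nodes(node, seen=None):
    seen = set() if seen is None else seen
    if id(node) in seen:
        return 0
    seen.add(id(node))
    return 1 + sum(count_nodes(c, seen) for c in node.edges.values())

# The trie for "aa" and "ab" has four nodes; after merging the two
# equivalent final states, the DAG has three.
dag = minimize(build_trie([list("aa"), list("ab")]), {})
```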
For example, constructing a graph from the entries \textit{a,bc} and \textit{a.bc} proceeds in the following steps:
\caption{Sample DAG, first entry}
1,2,3,5 [shape="circle"];
5 [shape="doublecircle"];
\caption{Sample DAG, second entry}
1,2,3,5,6 [shape="circle"];
5 [shape="doublecircle"];
\subsection{Defining categories}
The initial plan was to support HTML, e-mail, fax and RSS sources, with RSS-only as the worst case.
After some research, and after determining the scope of the project, we decided to support only RSS. RSS tends to force structure on the data, because RSS feeds are often generated by the website itself and are thus reliable and consistent. We found a couple of good RSS feeds.
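The structural regularity that makes RSS attractive can be illustrated by parsing a small feed with a standard XML parser; the feed content below is invented:

```python
import xml.etree.ElementTree as ET

# Because a generated RSS feed gives every item the same structure,
# extracting fields is simple and reliable.
RSS = """<rss version="2.0"><channel>
  <title>Jobs</title>
  <item><title>cook - Delft</title><link>http://example.com/1</link></item>
  <item><title>waiter - Leiden</title><link>http://example.com/2</link></item>
</channel></rss>"""

titles = [item.findtext("title") for item in ET.fromstring(RSS).iter("item")]
```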
At first the general framework was designed and implemented, without any extraction method yet.
We then started with a method for recognizing separators.
We found a research paper describing an algorithm that can create directed acyclic graphs from strings; although it was designed to compress word lists, it can be (mis)used to extract information.
An implementation of the DAG algorithm was found and tied to the program.
The command line program was then ready. After a conversation with both supervisors it was decided that a GUI had to be added.
A step-by-step GUI was created: a web interface serving as a control center for the crawlers.
We concluded that the program does not reach a wide audience, due to the lack of well-structured RSS feeds.