\section{Application overview and workflow}
The program can be divided into two main components, namely the
\textit{Crawler application} and the \textit{Input application}. The
components are strictly separated: each has its own task and runs as its own
application. The crawler is dedicated to the sole task of periodically and
asynchronously crawling the sources. The input application is a web interface
to a set of tools for creating, editing, removing and testing crawlers through
simple point-and-click user interfaces that can be operated by someone without
a computer science background.

\section{Minimizing DAWGs}
The first algorithm to generate DAGs was proposed by Hopcroft et
al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
later extended the algorithm and created an incremental one without increasing
the computational complexity. The non-incremental algorithm from Daciuk et
al.\ is used to convert the node-lists to a graph.

For example, constructing a graph from the entries \textit{a.bc} and
\textit{a,bc} proceeds in the following steps:

\begin{figure}[H]
\caption{Sample DAG, first entry}
\label{fig:f22}
\centering
\digraph[]{graph22}{
rankdir=LR;
1,2,3,4 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];
}
\end{figure}

\begin{figure}[H]
\caption{Sample DAG, second entry}
\label{fig:f23}
\centering
\digraph[]{graph23}{
rankdir=LR;
1,2,3,4,6 [shape="circle"];
5 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="."];
3 -> 4 [label="b"];
4 -> 5 [label="c"];

2 -> 6 [label=","];
6 -> 4 [label="b"];
}
\end{figure}

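The listing below is a minimal sketch, not the implementation used in the
crawler, of how such a graph can be constructed: the entries are first
inserted into a trie, after which nodes with identical right languages are
merged through a register of states, in the spirit of Daciuk et
al.~\cite{Daciuk2000}. Note that a fully minimized graph would also merge
nodes 3 and 6 of figure~\ref{fig:f23}, since both are followed only by
\textit{bc}.

\begin{verbatim}
# Minimal sketch: build a trie from the entries, then merge nodes with
# identical right languages (finality plus outgoing edges).
class Node:
    def __init__(self):
        self.edges = {}      # label -> Node
        self.final = False

def insert(root, word):
    node = root
    for label in word:
        node = node.edges.setdefault(label, Node())
    node.final = True

def minimize(node, register, signatures):
    # Post-order: minimize the children first, then look the node up in
    # the register and replace it by an equivalent node if one exists.
    for label, child in list(node.edges.items()):
        node.edges[label] = minimize(child, register, signatures)
    signature = (node.final,
                 tuple(sorted((label, signatures[id(child)])
                              for label, child in node.edges.items())))
    node = register.setdefault(signature, node)
    signatures[id(node)] = signature
    return node

root = Node()
for entry in ("a.bc", "a,bc"):
    insert(root, entry)
root = minimize(root, {}, {})
\end{verbatim}
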
\section{Input application}
\subsection{Components}
The input application consists of the following components:
\begin{itemize}
\item Add a new crawler
\item Edit or remove existing crawlers
\item Test a crawler
\item Generate XML
\end{itemize}

\section{Crawler application}
\subsection{Interface}

\subsection{Preprocessing}
The data arrives at the crawler embedded as POST data in an HTTP request. The
POST data consists of several fields with information about the feed and a
container that holds the table with the user markers embedded in it. The
entries are then extracted from this table and processed line by line.

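The snippet below is a minimal sketch of how such a request could be
received; the endpoint and the field names (\texttt{feedurl} and
\texttt{table}) are assumptions made for illustration and not the actual
interface of the crawler.

\begin{verbatim}
# Minimal sketch of receiving the POST data; the endpoint and the field
# names are assumptions, not the actual interface of the crawler.
from flask import Flask, request
import re

app = Flask(__name__)

@app.route("/crawler/new", methods=["POST"])
def receive_feed():
    feed_info = request.form["feedurl"]    # information about the feed
    table_html = request.form["table"]     # container with the marked table
    # Every <tr> row of the table is one entry to be processed.
    rows = re.findall(r"<tr[^>]*>(.*?)</tr>", table_html, flags=re.S)
    for row in rows:
        process_line(row)                  # sketched further below
    return "OK"
\end{verbatim}
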
The line processing converts the raw HTML of a table row into a plain string.
The string is stripped of all HTML tags and is accompanied by a list of marker
items. Entries that do not contain any markers are left out of the next
processing step. All data, including the entries without user markers, is
nevertheless stored in the object for possible later reference, for example
when editing the patterns.

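The following is a minimal sketch of this per-row processing. The way the
user markers are encoded in the HTML, here a \texttt{span} element with class
\texttt{marker} and a \texttt{data-category} attribute, is an assumption made
for this example.

\begin{verbatim}
# Minimal sketch of the per-row processing. The marker encoding used
# here (a span element with class "marker" and a data-category
# attribute) is an assumption made for this example.
from html.parser import HTMLParser

class RowParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.text = ""       # row text with all tags stripped
        self.markers = []    # (start, end, category) per user marker
        self._open = None    # start offset and category of open marker

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "marker":
            self._open = (len(self.text), attrs.get("data-category"))

    def handle_endtag(self, tag):
        if tag == "span" and self._open is not None:
            start, category = self._open
            self.markers.append((start, len(self.text), category))
            self._open = None

    def handle_data(self, data):
        self.text += data

def process_line(row_html):
    parser = RowParser()
    parser.feed(row_html)
    return parser.text, parser.markers
\end{verbatim}
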
In the last step the entries with markers are processed to build node-lists.
Node-lists are lists of words that, when concatenated, form the original
entry. A word is not a word in the linguistic sense; it can be a single
character or a category. The node-list is generated by adding the separate
characters one by one to the list and, when a user marking is encountered,
translating that marking to its category code and adding the code as a single
word. The node-lists are then sent to the actual algorithm to be converted
into a graph representation.

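The function below is a minimal sketch of this step; the category name used
in the usage example is a placeholder.

\begin{verbatim}
# Minimal sketch of node-list construction: every character becomes a
# word, except that text covered by a user marker is collapsed into a
# single word holding the category code.
def build_node_list(text, markers):
    """markers: list of (start, end, category), as produced above."""
    nodes, position = [], 0
    for start, end, category in sorted(markers):
        nodes.extend(text[position:start])  # plain characters, one per word
        nodes.append(category)              # marked span becomes one word
        position = end
    nodes.extend(text[position:])
    return nodes

# Example with the placeholder category "artist":
# build_node_list("19:00 Artistname", [(6, 16, "artist")])
# -> ['1', '9', ':', '0', '0', ' ', 'artist']
\end{verbatim}
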
\subsection{Defining categories}
pass

\subsection{Process}
The project proceeded roughly in the following steps:
\begin{itemize}
\item The proposal was written.
\item Initially HTML, mail, fax and RSS sources were considered, with RSS as
	the worst case.
\item After some research and after determining the scope of the project, we
	decided to support only RSS. RSS tends to force structure on the data
	because RSS feeds are often generated by the website and are therefore
	reliable and consistent. We found a couple of good RSS feeds.
\item The general framework was designed and implemented first, without an
	extraction method yet.
\item Work started on a method for recognizing separators.
\item We found a research paper about an algorithm that can create directed
	acyclic graphs from strings; although it was designed to compress word
	lists, it can be (mis)used to extract information.
\item An implementation of the DAG algorithm was found and tied to the
	program.
\item The command line program was ready. After a conversation with both
	supervisors it was decided that a GUI had to be made.
\item The GUI was created step by step, as a web interface serving as a
	control center for the crawlers.
\item The GUI was optimized.
\item We concluded that the program does not reach a wide audience due to the
	lack of well-structured RSS feeds.
\end{itemize}