From: Mart Lubbers
Date: Wed, 12 Nov 2014 20:27:31 +0000 (+0100)
Subject: thesis v0.2
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=4b5b9b229886c1a7e5cf25eb227fa95d7cd5294b;p=bsc-thesis1415.git

thesis v0.2
---
diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
new file mode 100644
index 0000000..a1dff98
--- /dev/null
+++ b/thesis2/2.requirementsanddesign.tex
@@ -0,0 +1,3 @@
+\section{Requirements}
+
+\section{Design}
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
new file mode 100644
index 0000000..d8c0e3a
--- /dev/null
+++ b/thesis2/3.methods.tex
@@ -0,0 +1,130 @@
+\section{Application overview and workflow}
+The program consists of two main components: the \textit{Crawler
+application} and the \textit{Input application}. The components are strictly
+separated by task and run as separate applications. The crawler is dedicated
+to the sole task of periodically and asynchronously crawling the sources. The
+input application is a web interface to a set of tools for creating, editing,
+removing and testing crawlers through simple point-and-click interfaces that
+can be operated by someone without a computer science background.
+
+\section{Minimizing DAWGs}
+The first algorithm to generate minimal DAWGs was proposed by Hopcroft et
+al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
+had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
+later extended the algorithm and created an incremental version without
+increasing the computational complexity. This incremental algorithm from
+Daciuk et al. is used to convert the node-lists to a graph.
+
+For example, constructing a graph from the entries \textit{a.bc} and
+\textit{a,bc} proceeds in the following steps:
+
+\begin{figure}[H]
+    \caption{Sample DAG, first entry}
+    \label{fig:f22}
+    \centering
+    \digraph[]{graph22}{
+        rankdir=LR;
+        1,2,3,4 [shape="circle"];
+        5 [shape="doublecircle"];
+        1 -> 2 [label="a"];
+        2 -> 3 [label="."];
+        3 -> 4 [label="b"];
+        4 -> 5 [label="c"];
+    }
+\end{figure}
+
+\begin{figure}[H]
+    \caption{Sample DAG, second entry}
+    \label{fig:f23}
+    \centering
+    \digraph[]{graph23}{
+        rankdir=LR;
+        1,2,3,4,6 [shape="circle"];
+        5 [shape="doublecircle"];
+        1 -> 2 [label="a"];
+        2 -> 3 [label="."];
+        3 -> 4 [label="b"];
+        4 -> 5 [label="c"];
+
+        2 -> 6 [label=","];
+        6 -> 4 [label="b"];
+    }
+\end{figure}
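+
+The following is a minimal sketch in Python of the incremental construction
+for sorted input, in the style of Daciuk et al. It is an illustration only,
+not the implementation used by the crawler: the \texttt{Node}, \texttt{Dawg}
+and \texttt{signature} names and the dictionary-based register are
+assumptions made for this sketch.
+
+\begin{verbatim}
+class Node(object):
+    def __init__(self):
+        self.edges = {}   # label -> Node
+        self.final = False
+
+    def signature(self):
+        # Two nodes are equivalent iff they agree on finality
+        # and have identical outgoing edges.
+        return (self.final,
+                tuple(sorted((l, id(n))
+                             for l, n in self.edges.items())))
+
+class Dawg(object):
+    def __init__(self):
+        self.root = Node()
+        self.register = {}    # signature -> representative Node
+        self.unchecked = []   # (parent, label, child) not yet minimized
+        self.previous = ""
+
+    def insert(self, word):
+        # This variant requires the words in sorted order.
+        assert word >= self.previous
+        common = 0
+        while (common < len(word) and common < len(self.previous)
+                and word[common] == self.previous[common]):
+            common += 1
+        self._minimize(common)
+        node = self.unchecked[-1][2] if self.unchecked else self.root
+        for label in word[common:]:
+            child = Node()
+            node.edges[label] = child
+            self.unchecked.append((node, label, child))
+            node = child
+        node.final = True
+        self.previous = word
+
+    def _minimize(self, down_to):
+        # Merge every unchecked node with an equivalent registered
+        # node, deepest first.
+        while len(self.unchecked) > down_to:
+            parent, label, child = self.unchecked.pop()
+            key = child.signature()
+            if key in self.register:
+                parent.edges[label] = self.register[key]
+            else:
+                self.register[key] = child
+
+    def finish(self):
+        self._minimize(0)
+
+dawg = Dawg()
+for entry in sorted(["a.bc", "a,bc"]):
+    dawg.insert(entry)
+dawg.finish()
+\end{verbatim}
+
+In the resulting graph, nodes 3 and 6 from Figure~\ref{fig:f23} collapse into
+a single state because they share the suffix \textit{bc}.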
+\section{Input application}
+\subsection{Components}
+Add new crawlers
+
+Edit or remove crawlers
+
+Test crawlers
+
+Generate XML
+
+
+\section{Crawler application}
+\subsection{Interface}
+
+\subsection{Preprocessing}
+The crawler receives the data embedded as POST data in an HTTP request. The
+POST data consists of several fields with information about the feed and a
+container holding the table with the user markers embedded. After that the
+entries are extracted and processed line by line.
+
+The line processing converts the raw HTML data of a table row to a plain
+string. The string is stripped of all HTML tags and is accompanied by a list
+of marker items. Entries that do not contain any markers are left out of the
+next processing step. All data, including entries without user markers, is
+stored in the object too for possible later reference, for example for
+editing the patterns.
+
+In the last step the entries with markers are processed to build node-lists.
+Node-lists are essentially lists of words that, when concatenated, form the
+original entry. A word here is not a word in the linguistic sense: a word can
+be a single character or a category. The node-list is generated by putting
+all the separate characters one by one into the list; when a user marking is
+encountered, the marking is translated to its category code and that code is
+added as a single word. A sketch of this construction is given at the end of
+this document. The node-lists are then fed to the actual algorithm to be
+converted to a graph representation.
+
+\subsection{Defining categories}
+pass
+
+\subsection{Process}
+Proposal was written
+
+First HTML/mail/fax/RSS, worst case RSS
+
+After some research and after determining the scope of the project we decided
+to do RSS only, because RSS tends to force structure on the data: RSS feeds
+are often generated by the website and are thus reliable and consistent. We
+found a couple of good RSS feeds.
+
+At first the general framework was designed and implemented, without a method
+yet.
+
+Started with a method for recognizing separators.
+
+Found a research paper about an algorithm that can create directed acyclic
+graphs from strings; although it was designed to compress word lists, it can
+be (mis)used to extract information.
+
+An implementation of the DAG algorithm was found and tied to the program.
+
+Command line program ready. After a conversation with both supervisors it was
+decided that a GUI had to be made.
+
+Step by step the GUI was created. The web interface serves as a control
+center for the crawlers.
+
+GUI optimized.
+
+Concluded that the program does not reach a wide audience due to the lack of
+well structured RSS feeds.
diff --git a/thesis2/4.conclusion.tex b/thesis2/4.conclusion.tex
new file mode 100644
index 0000000..e69de29
diff --git a/thesis2/5.discussion.tex b/thesis2/5.discussion.tex
new file mode 100644
index 0000000..e69de29
diff --git a/thesis2/6.appendices.tex b/thesis2/6.appendices.tex
new file mode 100644
index 0000000..ee8c8b9
--- /dev/null
+++ b/thesis2/6.appendices.tex
@@ -0,0 +1,2 @@
+ \section{Algorithm}
+ \section{Progress}
diff --git a/thesis2/version/mart_thesis_0.2.tar.gz b/thesis2/version/mart_thesis_0.2.tar.gz
new file mode 100644
index 0000000..1378441
Binary files /dev/null and b/thesis2/version/mart_thesis_0.2.tar.gz differ
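
As a companion to the preprocessing subsection above, here is a minimal
sketch of how an entry with user markings could be converted to a node-list.
The {category} marker syntax, the function name and the example entry are
assumptions made for this illustration; the crawler's actual marker encoding
may differ.

    import re

    # Assumed marker syntax: a user marking arrives as a {category}
    # token embedded in the stripped entry text.
    MARKER = re.compile(r'\{(\w+)\}')

    def entry_to_nodelist(entry):
        # One node per plain character, one category-code node per
        # user marking.
        nodelist = []
        pos = 0
        for match in MARKER.finditer(entry):
            # Plain text before the marking: one node per character.
            nodelist.extend(entry[pos:match.start()])
            # The marking itself becomes a single category-code word.
            nodelist.append(match.group(1))
            pos = match.end()
        nodelist.extend(entry[pos:])
        return nodelist

    print(entry_to_nodelist("12 jan: {title}"))
    # ['1', '2', ' ', 'j', 'a', 'n', ':', ' ', 'title']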