thesis v0.2
author Mart Lubbers <mart@martlubbers.net>
Wed, 12 Nov 2014 20:27:31 +0000 (21:27 +0100)
committer Mart Lubbers <mart@martlubbers.net>
Wed, 12 Nov 2014 20:27:31 +0000 (21:27 +0100)
thesis2/2.requirementsanddesign.tex [new file with mode: 0644]
thesis2/3.methods.tex [new file with mode: 0644]
thesis2/4.conclusion.tex [new file with mode: 0644]
thesis2/5.discussion.tex [new file with mode: 0644]
thesis2/6.appendices.tex [new file with mode: 0644]
thesis2/version/mart_thesis_0.2.tar.gz [new file with mode: 0644]

diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
new file mode 100644 (file)
index 0000000..a1dff98
--- /dev/null
@@ -0,0 +1,3 @@
+\section{Requirements}
+
+\section{Design}
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
new file mode 100644 (file)
index 0000000..d8c0e3a
--- /dev/null
@@ -0,0 +1,130 @@
+\section{Application overview and workflow}
+The program is divided into two main components: the \textit{Crawler
+application} and the \textit{Input application}. The components are strictly
+separated, both by task and by application. The crawler application is
+dedicated to the sole task of periodically and asynchronously crawling the
+sources. The input application is a web interface to a set of tools for
+creating, editing, removing and testing crawlers through simple point-and-click
+user interfaces that require no computer science background.
+
+\section{Minimizing DAWGs}
+The first algorithm to generate DAGs was proposed by Hopcroft et
+al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
+had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
+later extended the algorithm and created an incremental one without increasing
+the computational complexity. The non-incremental algorithm from Daciuk
+et al.\ is used to convert the node-lists to a graph.
+
+For example, constructing a graph from the entries \textit{a.bc} and
+\textit{a,bc} proceeds in the following steps:
+
+\begin{figure}[H]
+       \caption{Sample DAG, first entry}
+       \label{fig:f22}
+       \centering
+       \digraph[]{graph22}{
+               rankdir=LR;
+               1,2,3,4 [shape="circle"];
+               5 [shape="doublecircle"];
+               1 -> 2 [label="a"];
+               2 -> 3 [label="."];
+               3 -> 4 [label="b"];
+               4 -> 5 [label="c"];
+       }
+\end{figure}
+
+\begin{figure}[H]
+       \caption{Sample DAG, second entry}
+       \label{fig:f23}
+       \centering
+       \digraph[]{graph23}{
+               rankdir=LR;
+               1,2,3,4,6 [shape="circle"];
+               5 [shape="doublecircle"];
+               1 -> 2 [label="a"];
+               2 -> 3 [label="."];
+               3 -> 4 [label="b"];
+               4 -> 5 [label="c"];
+
+               2 -> 6 [label=","];
+               6 -> 4 [label="b"];
+       }
+\end{figure}
+
+\section{Input application}
+\subsection{Components}
+Add new crawler
+
+Edit or remove crawlers
+
+Test crawler
+
+Generate XML
+
+
+\section{Crawler application}
+\subsection{Interface}
+
+\subsection{Preprocessing}
+The crawler receives the data embedded as POST data in an HTTP request. The
+POST data consists of several fields with information about the feed and a
+container that holds the table with the user markers embedded. The entries
+are then extracted and processed line by line.
+
+The line processing converts the raw string of HTML data from a table row to a
+plain string. The string is stripped of all HTML tags and is accompanied by a
+list of marker items. Entries that do not contain any markers are left out of
+the next processing step. All data, including entries without user markers,
+is also stored in the object for possible later reference, for example when
+editing the patterns.
+
+In the last step the entries with markers are processed to build node-lists.
+A node-list is a list of words that, when concatenated, form the original
+entry. A word here is not a word in the linguistic sense: it can be a single
+letter or a category. The node-list is generated by putting the separate
+characters in the list one by one; when a user marking is encountered, the
+marking is translated to its category code and that code is added as a single
+word. The node-lists are then sent to the actual algorithm to be converted to
+a graph representation.
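+
+% Illustrative sketch only: the function name, the marker representation
+% (start, end, code) and the category code <1> are assumptions and are not
+% taken from the implementation.
+As a minimal sketch of this step, assume that every user marking is available
+as a hypothetical triple of start offset, end offset and category code. Under
+that assumption the node-list could be built roughly as follows:
+\begin{verbatim}
+# Assumed marker format: non-overlapping (start, end, code) triples.
+def build_node_list(entry, markers):
+    nodes, pos = [], 0
+    for start, end, code in sorted(markers):
+        nodes.extend(entry[pos:start])  # unmarked characters, one word each
+        nodes.append(code)              # marked span becomes one category word
+        pos = end
+    nodes.extend(entry[pos:])           # characters after the last marking
+    return nodes
+
+# Hypothetical entry "ab 12" with "12" marked as category <1>:
+print(build_node_list("ab 12", [(3, 5, "<1>")]))
+# prints ['a', 'b', ' ', '<1>']
+\end{verbatim}
+In this sketch every unmarked character ends up as its own word, while a marked
+span collapses into a single category word.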
+
+\subsection{Defining categories}
+pass
+
+\subsection{Process}
+Proposal was written
+
+
+Initial scope: HTML, mail, fax and RSS; in the worst case only RSS.
+
+
+After some research and determining the scope of the project we decided to
+support only RSS. RSS tends to force structure on the data, because RSS feeds
+are often generated by the website and are thus reliable and consistent. We
+found a couple of good RSS feeds.
+
+
+At first the general framework was designed and implemented, without a method.
+
+
+Started with a method for recognizing separators.
+
+
+Found a research paper about an algorithm that can create directed acyclic
+graphs from strings. Although it was designed to compress word lists, it can
+be (mis)used to extract information.
+
+
+An implementation of the DAG algorithm was found and tied into the program.
+
+
+Command line program ready. After a conversation with both supervisors, a GUI
+had to be made.
+
+Step by step GUI created. Web interface as a control center for the crawlers.
+
+
+GUI optimized.
+
+
+Concluded that the program doesn't reach a wide audience due to the lack of
+well-structured RSS feeds.
diff --git a/thesis2/4.conclusion.tex b/thesis2/4.conclusion.tex
new file mode 100644 (file)
index 0000000..e69de29
diff --git a/thesis2/5.discussion.tex b/thesis2/5.discussion.tex
new file mode 100644 (file)
index 0000000..e69de29
diff --git a/thesis2/6.appendices.tex b/thesis2/6.appendices.tex
new file mode 100644 (file)
index 0000000..ee8c8b9
--- /dev/null
@@ -0,0 +1,2 @@
+       \section{Algorithm}
+       \section{Progress}
diff --git a/thesis2/version/mart_thesis_0.2.tar.gz b/thesis2/version/mart_thesis_0.2.tar.gz
new file mode 100644 (file)
index 0000000..1378441
Binary files /dev/null and b/thesis2/version/mart_thesis_0.2.tar.gz differ