From: Mart Lubbers
Date: Wed, 11 Mar 2015 18:49:59 +0000 (+0100)
Subject: update rcrc2
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=78228570ee45c2fadb8fb0fda313c1eae29fbe98;p=bsc-thesis1415.git

update rcrc2
---

diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex
index 652dcb3..ac6e9c5 100644
--- a/thesis2/1.introduction.tex
+++ b/thesis2/1.introduction.tex
@@ -124,8 +124,8 @@ from the title of an entry in a RSS feed. The example has a clear structure and
 almost all information required is available directly from the entry.
 
 \begin{flushleft}
-	\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new%
-CD in combination with a show. Location: small salon.}
+	\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
+new CD in combination with a show. Location: small salon.}
 \end{flushleft}
 
 An example of a terrible item could be for example the following text that
@@ -143,18 +143,19 @@ park tomorrow evening.}
 When the source has been determined and classified the next step is
 periodically crawling the source. At the moment the crawling happens using two
 main methods.\\
-\textbf{Manual crawling:} Manual crawling is basically letting an employee
-access the source and put the information directly in the database. This often
-happens with non digital sources and with very sudden events or event changes
-such as surprise concerts or event cancellation.\\
-\textbf{Automatic crawling:} Some sites are very structured and a programmer
-can create a program that can visit the website systematically and
-automatically to extract all the new information. Not all digital sources are
-suitable to be crawled automatically and will still need manual crawling. The
-programmed crawlers are always specifically created for one or a couple sources
-and when the source changes for example structure the programmer has to adapt
-the crawler which is costly. Information from the all the crawlers goes first
-to the \textit{Temporum}.
+\textbf{Manual crawling:}\\
+Manual crawling is basically letting an employee access the source and put the
+information directly in the database. This often happens with non digital
+sources and with very sudden events or event changes such as surprise concerts
+or event cancellation.\\
+\textbf{Automatic crawling:}\\
+Some sites are very structured and a programmer can create a program that can
+visit the website systematically and automatically to extract all the new
+information. Not all digital sources are suitable to be crawled automatically
+and will still need manual crawling. The programmed crawlers are always
+specifically created for one or a couple sources and when the source changes
+for example structure the programmer has to adapt the crawler which is costly.
+Information from the all the crawlers goes first to the \textit{Temporum}.
 
 \subsection*{Temporum}
 The \textit{Temporum} is a big bin that contains raw data extracted from
@@ -203,11 +204,11 @@ current feedback loop for crawlers.
 \begin{figure}[H]
 	\label{feedbackloop}
 	\centering
-	\includegraphics[scale=0.5]{feedbackloop.eps}
+	\includegraphics[width=0.8\linewidth]{feedbackloop.eps}
 	\strut\\\strut\\
 	\caption{Feedback loop for malfunctioning crawlers}
 \end{figure}
-
+\strut\\
 The goal of this project is specifically to relieve the programmer of repairing
 crawlers all the time and make the task of adapting, editing and removing
 crawlers doable for someone without programming experience. In practice this
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index 5168de1..54b1d65 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -54,12 +54,12 @@ dictionary.
 Every entry gotten from the previous step is going to be processing into so
 called node-lists. A node-list can be seen as a path graph where every
 character and marking has a node. A path graph $G$ is defined as
-$G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, n_i\}$ and $E=\{(n_1,
-n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. A path graph is basically a graph
-that is a single linear path of nodes where every node is connected to the next
-node except for the last one. The last node is the only final node. The
-transitions between two nodes is either a character or a marking. As an example
-we take the entry \texttt{19:00, 2014-11-12 - Foobar} and create the
+$G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, n_i\}$ and
+$E=\{(n_1, n_2), (n_2, n_3), \ldots\\ (n_{i-1}, n_{i})\}$. A path graph is basically
+a graph that is a single linear path of nodes where every node is connected to
+the next node except for the last one. The last node is the only final node.
+The transitions between two nodes is either a character or a marking. As an
+example we take the entry \texttt{19:00, 2014-11-12 - Foobar} and create the
 corresponding node-lists and it is shown in Figure~\ref{nodelistexample}.
 Characters are denoted with single quotes, spaces with an underscore and
 markers with angle brackets. Node-lists are the basic elements from which the
@@ -247,7 +247,7 @@ choice.
 	\caption{Example non determinism}
 \end{figure}
 
-\subsection{Minimality and non-determinism}
+\subsection{Minimality \& non-determinism}
 The Myhill-Nerode theorem~\cite{Hopcroft1979} states that for every number of
 graphs accepting the same language there is a single graph with the least
 amount of states. Mihov\cite{Mihov1998} has proven that the algorithm for
diff --git a/thesis2/5.appendices.tex b/thesis2/5.appendices.tex
index 6720c4c..6c5b30d 100644
--- a/thesis2/5.appendices.tex
+++ b/thesis2/5.appendices.tex
@@ -1,4 +1,4 @@
-\section{scheme.xsd}
+\section{XSD schema}
 \lstinputlisting[language=XML,label={scheme.xsd},caption={XSD scheme for XML%
 output}]{scheme.xsd}
 
diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex
index f718de3..e4074ee 100644
--- a/thesis2/thesis.tex
+++ b/thesis2/thesis.tex
@@ -1,15 +1,15 @@
 \documentclass[twopage,titlepage]{book}
 
-\usepackage{algorithm2e} % Pseudocode
-\usepackage{a4wide} % Paper size
-\usepackage{graphicx} % Eps inclusion
-\usepackage{float} % Floating placement of figures
-\usepackage{listings} % Source code formatting
-\usepackage{setspace} % Line spacing abstract
-\usepackage[dvipdfmx,hidelinks]{hyperref} % Hyperlinks
-\usepackage{amssymb} % nexists and much more
-\usepackage{amsmath} % Rightarrow and much more
-\usepackage{marvosym} % For euro sign
+\usepackage{algorithm2e} % Pseudocode
+\usepackage{a4wide} % Paper size
+\usepackage{graphicx} % Eps inclusion
+\usepackage{float} % Floating placement of figures
+\usepackage{listings} % Source code formatting
+\usepackage{setspace} % Line spacing abstract
+\usepackage[dvipdfmx]{hyperref} % Hyperlinks
+\usepackage{amssymb} % nexists and much more
+\usepackage{amsmath} % Rightarrow and much more
+\usepackage{marvosym} % For euro sign
 
 \lstset{%
 	basicstyle=\footnotesize,
@@ -28,18 +28,22 @@ leisure activity RSS feeds}
 \hypersetup{
 	pdftitle={\cvartitle},
 	pdfauthor={Mart Lubbers},
-	pdfsubject={Artificial Intelligence}
+	pdfsubject={Artificial Intelligence},
+	hidelinks
 }
 
 % Describe the frontpage
 \author{
 	Mart Lubbers\\
 	s4109053\\
-	Artificial Intelligence\\
 	Radboud University Nijmegen\\
 	\strut\\
-	External supervisor: Alessandro Paula\\
-	Internal supervisor: Franc Grootjen
+	Alessandro Paula\footnote{External supervisor}\\
+	Hyperleap, Nijmegen\\
+	\strut\\
+	Franc Grootjen\footnote{Internal supervisor}\\
+	Artificial Intelligence, Nijmegen\\
+	Radboud University Nijmegen
 }
 \title{\cvartitle}
 \date{\today}
@@ -65,7 +69,7 @@ leisure activity RSS feeds}
 \chapter{Introduction}
 \input{1.introduction.tex}
 
-\chapter{Requirements and design}
+\chapter{Requirements \& Application design}
 \input{2.requirementsanddesign.tex}
 
 \chapter{Algorithm}
@@ -77,7 +81,7 @@ leisure activity RSS feeds}
 \chapter{Appendices}
 \input{5.appendices.tex}
 
-\bibliographystyle{ieeetr}
+\bibliographystyle{plain}
 \bibliography{thesis}
 
 \end{document}