From: Mart Lubbers
Date: Thu, 24 Jul 2014 10:04:04 +0000 (+0200)
Subject: last commit before meeting, wrap up thesis
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=c2e07a7c64b014f099f56a2cfccef171b261c41f;p=bsc-thesis1415.git

last commit before meeting, wrap up thesis
---

diff --git a/thesis/Makefile b/thesis/Makefile
index 886cea5..f61a637 100644
--- a/thesis/Makefile
+++ b/thesis/Makefile
@@ -1,7 +1,9 @@
+SHELL:=/bin/bash
+
 all: thesis dots

 dots:
-	bash ./dots/compileall
+	./compileall.sh

 thesis:
 	pdflatex thesis.tex
diff --git a/thesis/appendices.tex b/thesis/appendices.tex
new file mode 100644
index 0000000..fd0132c
--- /dev/null
+++ b/thesis/appendices.tex
@@ -0,0 +1,7 @@
+\section{Input application}
+\lstinputlisting[style=custompy,title=Python front/back-end]
+	{../program/hypfront/hyper.py}
+\lstinputlisting[style=customhtml,title=HTML landing page]
+	{../program/hypfront/index.html}
+\lstinputlisting[style=customjs,title=JavaScript front-end]
+	{../program/hypfront/contextmenu_o.js}
diff --git a/thesis/compileall.sh b/thesis/compileall.sh
old mode 100644
new mode 100755
diff --git a/thesis/dots/graph3.dot b/thesis/dots/graph3.dot
new file mode 100644
index 0000000..c588c11
--- /dev/null
+++ b/thesis/dots/graph3.dot
@@ -0,0 +1,9 @@
+digraph {
+	graph [ dpi = 300 ]
+	rankdir = "LR"
+	"User" -> "Source" [ label = "1. Gather metadata" ]
+	"User" -> "Input application" [ label = "2. Give metadata and source" ]
+	"Input application" -> "Source" [ label = "3. Fetch and present source" ]
+	"Input application" -> "Data processing application" [ label = "4. Transfer pattern" ]
+	"Data processing application" -> "Crawler" [ label = "5. Instruct crawler" ]
+}
diff --git a/thesis/img/img1.png b/thesis/img/img1.png
new file mode 100644
index 0000000..a64aa03
Binary files /dev/null and b/thesis/img/img1.png differ
diff --git a/thesis/introduction.tex b/thesis/introduction.tex
index ea46330..5d19ce3 100644
--- a/thesis/introduction.tex
+++ b/thesis/introduction.tex
@@ -2,31 +2,53 @@
 Within the entertainment business there is no consistent style of informing
 people about the events. Different venues display their, often incomplete,
 information in entirely different ways. Because of this, converting raw
-information from venues to structured consistent data is a relevant problem.
+information from venues to structured consistent data is a challenging and
+relevant problem.

 \section{HyperLeap}
 Hyperleap is a small company that is specialized in infotainment
-(information+entertainment) and administrates several websites which bundle
-information about entertainment in a ordered and complete way. Right now, most
-of the data input is done by hand and takes a lot of time to type in.
+(information + entertainment) and administers several websites which bundle
+information about entertainment in an ordered way and as completely as
+possible. Right now, most of the input data is added to the database by hand,
+which is very labor-intensive. Therefore, Hyperleap is looking for a smart
+solution to automate part of the data entry into the database. The crux,
+however, is that the system must not be too complicated from the outside and
+must be usable by a non-IT professional (NIP).

-\section{Research question}
-The main research question is: \textit{How can we make an adaptive, autonomous
-and programmable data mining program that can be set up by a non IT
-professional(NIP) which is able to transform raw data into structured data.}\\
+\section{Research question and practical goals}
+This brings up the main research question: \textit{How can we make an adaptive,
+autonomous and programmable data mining program that can be set up by a NIP
+and that is able to transform raw data into structured data?}\\

-The practical goal and aim of the project is to make a crawler(web or other
-document types) that can autonomously gather information after it has been
-setup by a, not necessarily IT trained, employer via an intuitive interface.
-Optionally the crawler shouldn't be susceptible by small structure changes in
-the website, be able to handle advanced website display techniques such as
-javascript and should be able to notify the administrator when the site has
-become uncrawlable and the crawler needs to be reprogrammed for that particular
-site. But the main purpose is the translation from raw data to structured data.
-The projects is in principle a continuation of a past project done by Wouter
+In practice the goal of the project is to create an application that can, from
+NIP input, produce machine-parseable extraction patterns for a separate crawler
+to apply periodically. The NIP enters the information about the data source in
+a user-friendly interface, which sends this information together with the data
+source to the data processing application. The data processing application in
+turn processes the data into an extraction pattern, which is sent to the
+crawler. The crawler then visits the sources specified by the NIP, guided by
+the extraction pattern created by the data processing application. This
+workflow is shown in figure~\ref{fig:ig1}.
+
+\begin{figure}[H]
+	\centering
+	\caption{Workflow within the applications}
+	\label{fig:ig1}
+	\includegraphics[width=150mm]{./dots/graph3.png}
+\end{figure}
+
+In this way the NIP can train the crawler to periodically crawl different data
+sources without needing much technical knowledge. The main goal of this project
+is to extract the underlying structure rather than the substructures.
+The project is in principle a continuation of a past project done by Wouter
 Roelofs\cite{Roelofs2009} which was also supervised by Franc Grootjen and
-Alessandro Paula, however it was never taken out of the experimental phase and
-therefore is in need continuation.
+Alessandro Paula; however, it was never taken out of the experimental phase.
+The techniques described by Roelofs et al. are more focused on extracting data
+from substructures, so they can complement the current project.\\
+
+As an important side note, the crawler needs to notify the administrators if a
+source has become problematic to crawl; in this way the NIP can easily retrain
+the application to fit the latest structural patterns.

 \section{Scientific relevance}
 Currently the techniques for conversion from non structured data to structured
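The five-step workflow that graph3.dot and the new introduction describe can be summarized in a short sketch. Everything below is hypothetical: the function names, the dictionary layout of a submission, and the shape of an extraction pattern are invented for illustration and do not appear in hyper.py or anywhere else in this repository.

# Hypothetical sketch of the five-step workflow in graph3.dot.
# All names and data shapes here are invented for illustration.
import urllib.request


def input_application(metadata, source_url):
    # Steps 1-3: the NIP gathers metadata, the input application
    # fetches the source and presents it alongside the metadata.
    source = urllib.request.urlopen(source_url).read().decode("utf-8")
    return {"metadata": metadata, "url": source_url, "source": source}


def data_processing_application(submission):
    # Step 4: turn the NIP's markings into an extraction pattern.
    # The real system would build a DAG/FSA here (see methods.tex);
    # this placeholder simply forwards the markings.
    return {"url": submission["url"], "pattern": submission["metadata"]}


def instruct_crawler(pattern):
    # Step 5: hand the pattern to the crawler, which revisits the
    # source periodically and applies the pattern to new data.
    print("schedule crawl of", pattern["url"], "using", pattern["pattern"])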
diff --git a/thesis/methods.tex b/thesis/methods.tex
index 29ea2b8..7919c4e 100644
--- a/thesis/methods.tex
+++ b/thesis/methods.tex
@@ -1,5 +1,18 @@
-\section{Directed acyclic graphs and finitie automata}
-Directed acyclic graphs(DAG) and finite state automatas(FSA) have a lot in
+\section{Input application}
+All user input goes through the familiar interface of the user's preferred
+web browser. By visiting the crawler's training website the user can specify
+the metadata of the source they want to have crawled periodically, through
+simple web forms as seen in figure~\ref{fig:mf1}.
+\begin{figure}[H]
+	\centering
+	\caption{Web forms for source metadata}
+	\label{fig:mf1}
+	\includegraphics[width=80mm]{./img/img1.png}
+\end{figure}
+
+\section{Data processing application}
+\subsection{Directed acyclic graphs and finite automata}
+Directed acyclic graphs (DAGs) and finite state automata (FSAs) have a lot in
 common concerning pattern recognition and information extraction. By feeding
 words into an algorithm a DAG can be generated so that it matches certain
 patterns present in the given words. Figure~\ref{fig:mg1} for example shows a
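The DAG construction sketched in methods.tex — feeding words into an algorithm so that the resulting graph matches the patterns present in those words — can be illustrated with a few lines of Python. This is a minimal sketch, not the thesis' algorithm: it builds only a prefix tree, the simplest DAG of this kind, and the word list is an invented example rather than the contents of fig:mg1.

# Minimal sketch: grow a DAG from example words, then test whether new
# words fit it. This builds a plain prefix tree; the word list is an
# invented example, not data from the thesis figures.

def build_dag(words):
    root = {}
    for word in words:
        node = root
        for letter in word:
            node = node.setdefault(letter, {})
        node["end"] = True  # mark an accepting state, as in an FSA
    return root


def matches(dag, word):
    node = dag
    for letter in word:
        if letter not in node:
            return False
        node = node[letter]
    return node.get("end", False)


dag = build_dag(["abcd", "abed", "aped"])
print(matches(dag, "abed"))  # True: the word was fed into the builder
print(matches(dag, "abfd"))  # False: no path through the DAG fits

Merging common suffixes would turn this tree into a proper minimal DAG, which keeps the pattern compact when many words share endings.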
@@ -15,14 +28,15 @@ With this FSA we can test whether a word fits the constraints the FSA
 describes. And with a little adaptation we can extract dynamic information
 from semi-structured data.\\

-\section{NIP input}
-\section{Back to DAG's and FSA's}
-Nodes in this datastructure can be single letters but also bigger
+
+\subsection{Back to DAGs and FSAs}
+Nodes in this data structure can be single letters but also bigger
 constructions. The example in Figure~\ref{fig:mg2} describes different
 separator patterns for event data with its three components: what, when, where.
 In this example the nodes with the labels \textit{what, when, where} can also
-be complete subgrahps. In this way data on a larger scale
+be complete subgraphs. In this way data on a larger level can be extracted
+using the NIP markings, and data within the categories can be processed
+autonomously.
 \begin{figure}[H]
 	\centering
 	\caption{Example event data}
 	\label{fig:mg2}
 	\includegraphics[width=\linewidth]{./dots/graph2.png}
 \end{figure}

+\subsection{Algorithm}
-
-\section{Algorithm}
-Hello Wordl
-
+\section{Crawler application}
diff --git a/thesis/thesis.pdf b/thesis/thesis.pdf
index 2d7a68f..43ab6ac 100644
Binary files a/thesis/thesis.pdf and b/thesis/thesis.pdf differ
diff --git a/thesis/thesis.tex b/thesis/thesis.tex
index a7a5ae7..95d6898 100644
--- a/thesis/thesis.tex
+++ b/thesis/thesis.tex
@@ -3,9 +3,34 @@
 \usepackage{lipsum}
 \usepackage{graphicx}
 \usepackage{float}
+\usepackage{listings}
+
+\lstset{
+	basicstyle=\scriptsize,
+	breaklines=true,
+	numbers=left,
+	numberstyle=\tiny,
+	tabsize=2
+}
+
+\lstdefinestyle{custompy}{
+	language=python,
+	keepspaces=true,
+	columns=flexible,
+	showspaces=false
+}
+\lstdefinestyle{customhtml}{
+	language=html
+}
+\lstdefinestyle{customjs}{
+	language=java
+}
+
 \author{Mart Lubbers\\s4109053}
-\title{Non IT congurable adaptive data mining solution used in transforming raw data to structured data}
+\title{Non-IT configurable adaptive data mining solution used in transforming
+raw data to structured data}
 \subtitle{
 	Bachelor's Thesis in Artificial Intelligence\\
 	Radboud University Nijmegen\\
@@ -42,9 +67,12 @@
 \input{methods.tex}

 \chapter{Results}
+\lipsum

 \chapter{Discussion}
+\lipsum

 \chapter{Appendices}
+\input{appendices.tex}

 \end{document}
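Methods.tex notes that, with a little adaptation, the FSA can extract dynamic information rather than merely accept or reject words, with the what/when/where nodes of fig:mg2 standing for whole subgraphs. A regular expression with named groups is the quickest way to sketch that idea; the separator pattern and the sample line below are invented examples rather than Hyperleap data, and the real system would derive its pattern from the NIP markings instead of a hand-written expression.

# Sketch of extraction rather than matching: a separator pattern with
# named slots for the three event components (what, when, where).
# The pattern and the sample line are invented examples.
import re

# Hypothetical separator pattern: "what - when | where".
pattern = re.compile(r"(?P<what>.+) - (?P<when>.+) \| (?P<where>.+)")

match = pattern.match("Concert - 2014-07-24 20:00 | Nijmegen")
if match:
    print(match.groupdict())
    # {'what': 'Concert', 'when': '2014-07-24 20:00', 'where': 'Nijmegen'}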