From: Mart Lubbers Date: Wed, 20 Aug 2014 18:27:47 +0000 (+0200) Subject: intro updated X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=f2d71ee7cbbe5f93d61d1f06ebe163a3a70919d9;p=bsc-thesis1415.git intro updated --- diff --git a/thesis/appendices.tex b/thesis/appendices.tex index fd0132c..366452f 100644 --- a/thesis/appendices.tex +++ b/thesis/appendices.tex @@ -1,7 +1,7 @@ \section{Input application} -\lstinputlisting[style=custompy,title=Python front/back-end] - {../program/hypfront/hyper.py} -\lstinputlisting[style=customhtml,title=HTML landing page] - {../program/hypfront/index.html} -\lstinputlisting[style=customjs,title=Javascript frontend] - {../program/hypfront/contextmenu_o.js} +%\lstinputlisting[style=custompy,title=Python front/back-end] +% {../program/hypfront/hyper.py} +%\lstinputlisting[style=customhtml,title=HTML landing page] +% {../program/hypfront/index.html} +%\lstinputlisting[style=customjs,title=Javascript frontend] +% {../program/hypfront/contextmenu_o.js} diff --git a/thesis/introduction.tex b/thesis/introduction.tex index 553f1dc..edcdf76 100644 --- a/thesis/introduction.tex +++ b/thesis/introduction.tex @@ -1,57 +1,60 @@ \section{Introduction} -Within the entertainment business there is no consistent style of informing -people about the events. Different venues display their, often incomplete, -information in entirely different ways. Because of this, converting raw -information from venues to structured consistent data is a challenging and, -a relevant problem. +People are looking on the internet for information about their favourite +theater show, music group or movie. All the information is scattered around on +the websites, mailing lists, newsfeeds and other sources owned by the venues. +This makes the search for certain events an energy-consuming task. The venues +do not have a consistent way of presenting the information, and to get all the +details, different sources must be consulted. 
Because of this, converting raw +information from venues into structured consistent data is a relevant problem. \section{HyperLeap} -Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company that is -specialized in infotainment (information + entertainment) and administrates -several websites which bundle information about entertainment in an ordered way -and as complete as possible. Right now, most of the input data is added to the -database by by hand which is very labor intensive. Therefore Hyperleap is -looking for a smart solution to automate a part of the data injection in the -database, the crux however is that the system must not be too complicated from -the outside and be usable for a non IT professional(NIP). +Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company based in +Nijmegen that specializes in bundling information from different sources +into a consistent information source about entertainment (infotainment). It +administers several websites for several entertainment categories. Hyperleap +differentiates itself from other companies with the same business goals because +it has the most complete information most of the time. +Right now, most of the data in the database is added in two different fashions. +The first method is inputting the data in the database by hand: an employee +scans the raw inputs gathered from websites and has to separate the entries and +match them to existing events or create new events. This process is very +labour-intensive and therefore costly. +The second way of adding information to the database is by crawlers programmed +specifically for certain websites. Because a programmer is needed to program +all the separate crawlers individually, this is a costly business. This way of +gathering information is also very error-prone: when a source +changes its structure, for example the layout, the crawler stops +functioning. 
When this happens, the programmer has to adapt the crawler to the +new structure, which takes valuable time. \section{Research question and practical goals} -This brings up the main research question: \textit{How can we make an adaptive, -autonomous and programmable data mining program that can be set up by a NIP -which is able to transform raw data into structured data.}\\ +The goal of the project is to create a software solution that enables an employee +with no particular programming background to train or retrain crawlers for +RSS\footnote{\url{http://www.rssboard.org/rss-specification}} or +Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds. This +is done in such a way that the information is categorized and put into the +database. The software will notify the administrator of the program when a +source has changed, so that the new data can be added to the crawler's training set or +the source's crawler can be retrained from scratch. -In practice the goal and aim of the project is to create an application that -can, with NIP input, give computer parsable patterns which a separate crawler -can periodically crawl. The NIP has to be able to enter the information about -the data source in a user friendly interface which sends the information -together with the data source to the data processing application. The -data processing application then in turn processes the data into a extraction -pattern which is sent to the crawler. The crawler can visit sources specified -by the NIP accompanied by the extraction pattern created by the data processing -application. This work flow is described in graph~\ref{fig:ig1}. 
+This brings up the main research question: +\begin{center} + \textit{How can we make an adaptive, autonomous and programmable data mining + program that can be set up by someone without programming experience and that +is capable of transforming raw data into structured data?} +\end{center} -\begin{figure}[H] - \centering - \caption{Work flow within the applications} - \label{fig:ig1} - \includegraphics[width=150mm]{./dots/graph3.png} -\end{figure} +In practice this means that the end product is a software solution which does +the previously described tasks. -In this way the NIP can train the crawler to periodically crawl different data -sources without too much technical knowledge. The main goal of this project is -to extract the underlying structure rather then to extract the substructures. -The project is in principle a continuation of a past project done by Wouter -Roelofs\cite{Roelofs2009} which was also supervised by Franc Grootjen and -Alessandro Paula, however it was never taken out of the experimental phase. The -techniques described by Roelofs et al. are more focussed on extracting data -from substructures so it can be an addition to the current project.\\ - -As a very important side note, the crawler needs to notify the administrators if -a source has become problematic to crawl, in this way the NIP can very easily -retrain the application to fit the latest structural patterns. \section{Scientific relevance} Currently the techniques for conversion from non structured data to structured -data are static and mainly only usable by IT specialists. There is a great need -of data mining in non structured data because the data within companies and on -the internet is piling up and are usually left to catch dust. +data are static and usable mainly by computer science experts. There is a +great need for data mining in non structured data because the data within +companies and on the internet is piling up and is usually left to gather dust. 
+ +The project is a continuation of the past project done by Roelofs et +al.\cite{Roelofs2009}. The techniques described by Roelofs et al. are more +focussed on extracting data from already isolated data so it can be an addition +to the current project. diff --git a/thesis/methods.tex b/thesis/methods.tex index 475d80b..6fab641 100644 --- a/thesis/methods.tex +++ b/thesis/methods.tex @@ -1,3 +1,52 @@ +\section{Software architecture} +\begin{figure} + \centering + + \begin{sequencediagram} + \newthread{u}{:User} + \newinst{i}{:Input} + \newinst{p}{:Data processing} + \newinst{s}{:Source} + \newthread{c}{:Crawler} + \newinst{d}{:Database} + + \begin{sdblock}{Training}{} + \begin{messcall} + {u}{initiate}{i} + \begin{call} + {i}{fetch source} + {s}{source data} + \end{call} + \begin{call} + {i}{ask for markings} + {u}{marked data} + \end{call} + \begin{messcall} + {i}{marked data}{p} + \begin{messcall} + {p}{processed crawler pattern}{c} + \end{messcall} + \end{messcall} + \end{messcall} + \end{sdblock} + + \begin{sdblock}{Crawl}{Correct} + \begin{call} + {c}{visit source} + {s}{source data} + \end{call} + \begin{messcall} + {c}{processed data}{d} + \end{messcall} + \end{sdblock} + + \end{sequencediagram} + + \caption{Workflow of the application} +\end{figure} + + + The program can be divided into three components: input, data processing and the crawler. The applications have separate tasks within the workflow, the input application defines together with the NIP the patterns for the source, diff --git a/thesis/pgf-umlsd.sty b/thesis/pgf-umlsd.sty new file mode 100644 index 0000000..99847db --- /dev/null +++ b/thesis/pgf-umlsd.sty @@ -0,0 +1,329 @@ +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +% Start of pgf-umlsd.sty +% +% Some macros for UML Sequence Diagrams. 
+% Home page of project: http://pgf-umlsd.googlecode.com/ +% Author: Xu Yuan , Southeast University, China +% Contributor: Nobel Huang , Southeast University, China +% +% History: +% v0.7 2012/03/05 +% - unify interface of call and callself +% - non-instantaneous message +% - bugfix: conflits with tikz library backgrounds +% v0.6 2011/07/27 +% - Fix Issue 6 reported by frankmorgner@gmail.com +% - diagram without a thread +% - allows empty diagram +% - New manual +% v0.5 2009/09/30 Fix Issue 2 reported by vlado.handziski +% - Nested callself is supported +% - Rename sdloop and sdframe to sdblock +% v0.4 2008/12/08 Fix Issue 1 reported by MathStuf: +% Nested sdloop environment hides outer loop +% v0.3 2008/11/10 in Berlin, fix for the PGF cvs version: +% - the list items in \foreach are not evaluated by default now, +% the `evaluate' opinion should be used +% v0.2 2008/03/20 create project at http://pgf-umlsd.googlecode.com/ +% - use `shadows' library +% Thanks for Dr. Ludger Humbert's feedback! +% - reduce the parameter numbers, the user can write the content +% of instance (such as no colon) +% - the user can redefine the `inststyle' +% - new option: switch underlining of the instance text +% - new option: switch rounded corners +% v0.1 2008/01/25 first release at http://www.fauskes.net/pgftikzexamples/ +% + +\NeedsTeXFormat{LaTeX2e}[1999/12/01] +\ProvidesPackage{pgf-umlsd}[2011/07/27 v0.6 Some LaTeX macros for UML +Sequence Diagrams.] + +\RequirePackage{tikz} +\usetikzlibrary{arrows,shadows} + +\RequirePackage{ifthen} + +% Options +% ? the instance name under line ? +\newif\ifpgfumlsdunderline\pgfumlsdunderlinetrue +\DeclareOption{underline}{\pgfumlsdunderlinetrue} +\DeclareOption{underline=true}{\pgfumlsdunderlinetrue} +\DeclareOption{underline=false}{\pgfumlsdunderlinefalse} +% ? the instance box with rounded corners ? 
+\newif\ifpgfumlsdroundedcorners\pgfumlsdroundedcornersfalse +\DeclareOption{roundedcorners}{\pgfumlsdroundedcornerstrue} +\DeclareOption{roundedcorners=true}{\pgfumlsdroundedcornerstrue} +\DeclareOption{roundedcorners=false}{\pgfumlsdroundedcornersfalse} +\ProcessOptions + +% new counters +\newcounter{preinst} +\newcounter{instnum} +\newcounter{threadnum} +\newcounter{seqlevel} % level +\newcounter{callevel} +\newcounter{callselflevel} +\newcounter{blocklevel} + +% new an instance +% Example: +% \newinst[edge distance]{var}{name:class} +\newcommand{\newinst}[3][0.2]{ + \stepcounter{instnum} + \path (inst\thepreinst.east)+(#1,0) node[inststyle] (inst\theinstnum) + {\ifpgfumlsdunderline + \underline{#3} + \else + #3 + \fi}; + \path (inst\theinstnum)+(0,-0.5*\unitfactor) node (#2) {}; + \tikzstyle{instcolor#2}=[] + \stepcounter{preinst} +} + +% new an instance thread +% Example: +% \newinst[color]{var}{name}{class} +\newcommand{\newthread}[3][gray!30]{ + \newinst{#2}{#3} + \stepcounter{threadnum} + \node[below of=inst\theinstnum,node distance=0.8cm] (thread\thethreadnum) {}; + \tikzstyle{threadcolor\thethreadnum}=[fill=#1] + \tikzstyle{instcolor#2}=[fill=#1] +} + +% draw running (thick) line, should not call directly +\newcommand*{\drawthread}[2]{ + \begin{pgfonlayer}{umlsd@threadlayer} + \draw[threadstyle] (#1.west) -- (#1.east) -- (#2.east) -- (#2.west) -- cycle; + \end{pgfonlayer} +} + +% a function call +% Example: +% \begin{call}[height]{caller}{function}{callee}{return} +% \end{call} +\newenvironment{call}[5][1]{ +\ifthenelse{\equal{#2}{#4}} +{ + \begin{callself}[#1]{#2}{#3}{#5} +} +{ + \begin{callanother}[#1]{#2}{#3}{#4}{#5} +} +} +{ +\ifthenelse{\equal{\f\thecallevel}{\t\thecallevel}} +{ + \end{callself} +} +{ + \end{callanother} +} +} + +% function call to another instance +% interal use only +\newenvironment*{callanother}[5][1]{ + \stepcounter{seqlevel} + \stepcounter{callevel} % push + \path + (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node 
(cf\thecallevel) {} + (#4.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (ct\thecallevel) {}; + + \draw[->,>=triangle 60] ({cf\thecallevel}) -- (ct\thecallevel) + node[midway, above] {#3}; + \def\l\thecallevel{#1} + \def\f\thecallevel{#2} + \def\t\thecallevel{#4} + \def\returnvalue{#5} + \tikzstyle{threadstyle}+=[instcolor#2] +} +{ + \addtocounter{seqlevel}{\l\thecallevel} + \path + (\f\thecallevel)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rf\thecallevel) {} + (\t\thecallevel.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rt\thecallevel) {}; + \draw[dashed,->,>=angle 60] ({rt\thecallevel}) -- (rf\thecallevel) + node[midway, above]{\returnvalue}; + \drawthread{ct\thecallevel}{rt\thecallevel} + \addtocounter{callevel}{-1} % pop +} + +% a function do not need call others +% interal use only +% Example: +% \begin{callself}[height]{caller}{function}{return} +% \end{callself} +\newenvironment*{callself}[4][1]{ + \stepcounter{seqlevel} + \stepcounter{callevel} % push + \stepcounter{callselflevel} + + \path + (#2)+(\thecallselflevel*0.1-0.1,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (sc\thecallevel) {} + ({sc\thecallevel}.east)+(0,-0.33*\unitfactor) node (scb\thecallevel) {}; + + \draw[->,>=triangle 60] ({sc\thecallevel}.east) -- ++(0.8,0) + node[near start, above right] {#3} -- ++(0,-0.33*\unitfactor) + -- (scb\thecallevel); + \def\l\thecallevel{#1} + \def\f\thecallevel{#2} + \def\t\thecallevel{#2} + \def\returnvalue{#4} + \tikzstyle{threadstyle}+=[instcolor#2] +}{ + \addtocounter{seqlevel}{\l\thecallevel} + \path (\f\thecallevel)+(\thecallselflevel*0.1-0.1,-\theseqlevel*\unitfactor-0.33*\unitfactor) node + (sct\thecallevel) {}; + + \draw[dashed,->,>=angle 60] ({sct\thecallevel}.east) node + (sce\thecallevel) {} -- ++(0.8,0) -- node[midway, right]{\returnvalue} ++(0,-0.33*\unitfactor) -- ++(-0.8,0); + \drawthread{scb\thecallevel}{sce\thecallevel} + \addtocounter{callevel}{-1} % pop + \addtocounter{callselflevel}{-1} +} + 
+% message between threads +% Example: +% \mess[delay]{sender}{message content}{receiver} +\newcommand{\mess}[4][0]{ + \stepcounter{seqlevel} + \path + (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (mess from) {}; + \addtocounter{seqlevel}{#1} + \path + (#4)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (mess to) {}; + \draw[->,>=angle 60] (mess from) -- (mess to) node[midway, above] + {#3}; + + \node (#3 from) at (mess from) {}; + \node (#3 to) at (mess to) {}; +} + +\newenvironment{messcall}[4][1]{ + \stepcounter{seqlevel} + \stepcounter{callevel} % push + \path + (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (cf\thecallevel) {} + (#4.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (ct\thecallevel) {}; + + \draw[->,>=angle 60] ({cf\thecallevel}) -- (ct\thecallevel) + node[midway, above] {#3}; + \def\l\thecallevel{#1} + \def\f\thecallevel{#2} + \def\t\thecallevel{#4} + \tikzstyle{threadstyle}+=[instcolor#2] +} +{ + \addtocounter{seqlevel}{\l\thecallevel} + \path + (\f\thecallevel)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rf\thecallevel) {} + (\t\thecallevel.\threadbias)+(0,-\theseqlevel*\unitfactor-0.3*\unitfactor) node (rt\thecallevel) {}; + \drawthread{ct\thecallevel}{rt\thecallevel} + \addtocounter{callevel}{-1} % pop +} + +% In the situation of multi-threads, some objects are called at the +% same time. Currently, we have to adjust the bias of thread line +% manually. Possible parameters are: center, west, east +\newcommand{\setthreadbias}[1]{\global\def\threadbias{#1}} + +% This function makes the call earlier. +\newcommand{\prelevel}{\addtocounter{seqlevel}{-1}} + +% This function makes the call later. 
+\newcommand{\postlevel}{\addtocounter{seqlevel}{+1}} + +% a block box with caption +% \begin{sdblock}[caption background color]{caption}{comments} +% \end{sdblock} +\newenvironment{sdblock}[3][white]{ + \stepcounter{seqlevel} + \stepcounter{blocklevel} % push + \coordinate (blockbeg\theblocklevel) at (0,-\theseqlevel*\unitfactor-\unitfactor); + \stepcounter{seqlevel} + \def\blockcolor\theblocklevel{#1} + \def\blockname\theblocklevel{#2} + \def\blockcomm\theblocklevel{#3} + \begin{pgfinterruptboundingbox} +}{ + \coordinate (blockend) at (0,-\theseqlevel*\unitfactor-2*\unitfactor); + \path (current bounding box.east)+(0.2,0) node (boxeast) {} + (current bounding box.west |- {blockbeg\theblocklevel}) + (-0.2,0) + node (nw) {}; + \path (boxeast |- blockend) node (se) {}; + + % % title + \node[blockstyle] (blocktitle) at (nw) {\blockname\theblocklevel}; + \path (blocktitle.south east) + (0,0.2) node (set) {} + (blocktitle.south east) + (-0.2,0) node (seb) {} + (blocktitle.north east) + (0.2,0) node (comm) {}; + \draw[fill=\blockcolor\theblocklevel] (blocktitle.north west) -- (blocktitle.north east) -- + (set.center) -- (seb.center) -- (blocktitle.south west) -- cycle; + \node[blockstyle] (blocktitle) at (nw) {\blockname\theblocklevel}; + \node[blockcommentstyle] (blockcomment) at (comm) {\blockcomm\theblocklevel}; + + \coordinate (se) at (current bounding box.south east); + \end{pgfinterruptboundingbox} + + \draw (se) rectangle (nw); + + \addtocounter{blocklevel}{-1} % pop + \stepcounter{seqlevel} +} + +% the environment of sequence diagram +\newenvironment{sequencediagram}{ + % declare layers + \pgfdeclarelayer{umlsd@background} + \pgfdeclarelayer{umlsd@threadlayer} + \pgfsetlayers{umlsd@background,umlsd@threadlayer,main} + + \begin{tikzpicture} + \setlength{\unitlength}{1cm} + \tikzstyle{sequence}=[coordinate] + \tikzstyle{inststyle}=[rectangle, draw, anchor=west, minimum + height=0.8cm, minimum width=1.6cm, fill=white, + drop shadow={opacity=1,fill=black}] + 
\ifpgfumlsdroundedcorners + \tikzstyle{inststyle}+=[rounded corners=3mm] + \fi + \tikzstyle{blockstyle}=[anchor=north west] + \tikzstyle{blockcommentstyle}=[anchor=north west, font=\small] + \tikzstyle{dot}=[inner sep=0pt,fill=black,circle,minimum size=0.2pt] + \global\def\unitfactor{0.6} + \global\def\threadbias{center} + % reset counters + \setcounter{preinst}{0} + \setcounter{instnum}{0} + \setcounter{threadnum}{0} + \setcounter{seqlevel}{0} + \setcounter{callevel}{0} + \setcounter{callselflevel}{0} + \setcounter{blocklevel}{0} + + % origin + \node[coordinate] (inst0) {}; +} +{ + \begin{pgfonlayer}{umlsd@background} + \ifnum\c@instnum > 0 + \foreach \t [evaluate=\t] in {1,...,\theinstnum}{ + \draw[dotted] (inst\t) -- ++(0,-\theseqlevel*\unitfactor-2.2*\unitfactor); + } + \fi + \ifnum\c@threadnum > 0 + \foreach \t [evaluate=\t] in {1,...,\thethreadnum}{ + \path (thread\t)+(0,-\theseqlevel*\unitfactor-0.1*\unitfactor) node (threadend) {}; + \tikzstyle{threadstyle}+=[threadcolor\t] + \drawthread{thread\t}{threadend} + } + \fi + \end{pgfonlayer} +\end{tikzpicture}} + + +%%% End of pgf-umlsd.sty +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \ No newline at end of file diff --git a/thesis/thesis.tex b/thesis/thesis.tex index 98ff026..69f8159 100644 --- a/thesis/thesis.tex +++ b/thesis/thesis.tex @@ -1,11 +1,15 @@ -\documentclass{scrbook} +\documentclass[hidelinks]{scrbook} -\usepackage{lipsum} -\usepackage{graphicx} -\usepackage{float} -\usepackage{listings} -\usepackage{hyperref} +\usepackage{lipsum} % Dummy text +\usepackage{graphicx} % Images +\usepackage{float} % Better placement float figures +\usepackage{listings} % Source code formatting +\usepackage{hyperref} % Hyperlinks +\usepackage{tikz} % Sequence diagrams +\usepackage{pgf-umlsd} +\usepgflibrary{arrows} +% Set listings settings \lstset{ basicstyle=\scriptsize, breaklines=true, @@ -13,8 +17,6 @@ numberstyle=\tiny, tabsize=2 } - - \lstdefinestyle{custompy}{ language=python, keepspaces=true, @@ -28,7 +30,14 @@ 
language=java } +% Setup hyperlink formatting +\hypersetup{ + pdftitle={Non IT configurable adaptive data mining solution used in transforming raw data to structured data}, + pdfauthor={Mart Lubbers}, + pdfsubject={Artificial Intelligence}, +} +% Describe the frontpage \author{Mart Lubbers\\s4109053} \title{Non IT configurable adaptive data mining solution used in transforming raw data to structured data} @@ -41,7 +50,6 @@ data to structured data} RU && Hyperleap \end{tabular} } - \date{\today} \begin{document} @@ -49,6 +57,7 @@ data to structured data} \tableofcontents \newpage +% Surrogate abstract \chapter*{ \centering \begin{normalsize}