\section{Introduction}
-Within the entertainment business there is no consistent style of informing
-people about the events. Different venues display their, often incomplete,
-information in entirely different ways. Because of this, converting raw
-information from venues to structured consistent data is a challenging and,
-a relevant problem.
+People search the internet for information about their favourite theater
+show, music group or movie. This information is scattered across the
+websites, mailing lists, newsfeeds and other sources owned by the venues,
+which makes the search for certain events an energy-consuming task. The venues
+do not present their information in a consistent way, and several sources
+must be consulted to get all the details. Because of this, converting raw
+information from venues into structured, consistent data is a relevant problem.
\section{HyperLeap}
-Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company that is
-specialized in infotainment (information + entertainment) and administrates
-several websites which bundle information about entertainment in an ordered way
-and as complete as possible. Right now, most of the input data is added to the
-database by by hand which is very labor intensive. Therefore Hyperleap is
-looking for a smart solution to automate a part of the data injection in the
-database, the crux however is that the system must not be too complicated from
-the outside and be usable for a non IT professional(NIP).
+Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company based in
+Nijmegen that specializes in bundling information from different sources
+into a consistent information source about entertainment (infotainment). It
+administrates several websites, each covering an entertainment category.
+Hyperleap differentiates itself from other companies with the same business
+goals by having the most complete information most of the time.
+Right now, most of the data in the database is added in one of two ways.
+The first method is entering the data into the database by hand: an employee
+scans the raw inputs gathered from websites, separates the entries and
+matches them to existing events or creates new events. This process is very
+labour intensive and therefore costly.
+The second way of adding information to the database is by crawlers programmed
+specifically for certain websites. Because a programmer is needed to program
+each crawler individually, this method is costly as well. This way of
+gathering information is also very error-prone: when a source
+changes its structure, for example its layout, the crawler stops functioning.
+When this happens the programmer has to adapt the crawler to the
+new structure, which takes valuable time.
\section{Research question and practical goals}
-This brings up the main research question: \textit{How can we make an adaptive,
-autonomous and programmable data mining program that can be set up by a NIP
-which is able to transform raw data into structured data.}\\
+The goal of the project is to create a software solution that enables an
+employee with no particular programming background to train or retrain crawlers
+for RSS\footnote{\url{http://www.rssboard.org/rss-specification}} or
+Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds. This
+is done in such a way that the information is categorized and put into the
+database. The software will notify the administrator of the program when a
+source has changed, so that the new data can be added to the crawler's training
+set or it can be decided to retrain the source's crawler from scratch.
-In practice the goal and aim of the project is to create an application that
-can, with NIP input, give computer parsable patterns which a separate crawler
-can periodically crawl. The NIP has to be able to enter the information about
-the data source in a user friendly interface which sends the information
-together with the data source to the data processing application. The
-data processing application then in turn processes the data into a extraction
-pattern which is sent to the crawler. The crawler can visit sources specified
-by the NIP accompanied by the extraction pattern created by the data processing
-application. This work flow is described in graph~\ref{fig:ig1}.
+This brings up the main research question:
+\begin{center}
+ \textit{How can we make an adaptive, autonomous and programmable data mining
+ program that can be set up by someone without programming experience and that
+ is capable of transforming raw data into structured data?}
+\end{center}
-\begin{figure}[H]
- \centering
- \caption{Work flow within the applications}
- \label{fig:ig1}
- \includegraphics[width=150mm]{./dots/graph3.png}
-\end{figure}
+In practice this means that the end product is a software solution that
+performs the previously described tasks.
-In this way the NIP can train the crawler to periodically crawl different data
-sources without too much technical knowledge. The main goal of this project is
-to extract the underlying structure rather then to extract the substructures.
-The project is in principle a continuation of a past project done by Wouter
-Roelofs\cite{Roelofs2009} which was also supervised by Franc Grootjen and
-Alessandro Paula, however it was never taken out of the experimental phase. The
-techniques described by Roelofs et al. are more focussed on extracting data
-from substructures so it can be an addition to the current project.\\
-
-As a very important side note, the crawler needs to notify the administrators if
-a source has become problematic to crawl, in this way the NIP can very easily
-retrain the application to fit the latest structural patterns.
\section{Scientific relevance}
Currently the techniques for conversion from non structured data to structured
-data are static and mainly only usable by IT specialists. There is a great need
-of data mining in non structured data because the data within companies and on
-the internet is piling up and are usually left to catch dust.
+data are static and mainly usable only by computer science experts. There is a
+great need for data mining in non structured data because the data within
+companies and on the internet is piling up and is usually left to gather dust.
+
+The project is a continuation of the past project done by Roelofs et
+al.~\cite{Roelofs2009}. The techniques described by Roelofs et al.\ are more
+focussed on extracting data from already isolated data, so they can complement
+the current project.
--- /dev/null
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Start of pgf-umlsd.sty
+%
+% Some macros for UML Sequence Diagrams.
+% Home page of project: http://pgf-umlsd.googlecode.com/
+% Author: Xu Yuan <xuyuan.cn@gmail.com>, Southeast University, China
+% Contributor: Nobel Huang <nobel1984@gmail.com>, Southeast University, China
+%
+% History:
+% v0.7 2012/03/05
+% - unify interface of call and callself
+% - non-instantaneous message
+% - bugfix: conflicts with tikz library backgrounds
+% v0.6 2011/07/27
+% - Fix Issue 6 reported by frankmorgner@gmail.com
+% - diagram without a thread
+% - allows empty diagram
+% - New manual
+% v0.5 2009/09/30 Fix Issue 2 reported by vlado.handziski
+% - Nested callself is supported
+% - Rename sdloop and sdframe to sdblock
+% v0.4 2008/12/08 Fix Issue 1 reported by MathStuf:
+% Nested sdloop environment hides outer loop
+% v0.3 2008/11/10 in Berlin, fix for the PGF cvs version:
+% - the list items in \foreach are not evaluated by default now,
+% the `evaluate' option should be used
+% v0.2 2008/03/20 create project at http://pgf-umlsd.googlecode.com/
+% - use `shadows' library
+% Thanks for Dr. Ludger Humbert's <humbert@uni-wuppertal.de> feedback!
+% - reduce the parameter numbers, the user can write the content
+% of instance (such as no colon)
+% - the user can redefine the `inststyle'
+% - new option: switch underlining of the instance text
+% - new option: switch rounded corners
+% v0.1 2008/01/25 first release at http://www.fauskes.net/pgftikzexamples/
+%
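+% Usage sketch (added for illustration, not part of the upstream header).
+% It shows how the macros defined below fit together; the instance names
+% `cr' and `db' and all labels are assumptions made up for this example.
+%
+%   \begin{sequencediagram}
+%     \newthread{cr}{:Crawler}      % instance that owns a running thread
+%     \newinst[2]{db}{:Database}    % second instance, edge distance 2
+%     \begin{call}{cr}{store(event)}{db}{ok}
+%     \end{call}
+%   \end{sequencediagram}
+%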
+
+\NeedsTeXFormat{LaTeX2e}[1999/12/01]
+\ProvidesPackage{pgf-umlsd}[2011/07/27 v0.6 Some LaTeX macros for UML
+Sequence Diagrams.]
+
+\RequirePackage{tikz}
+\usetikzlibrary{arrows,shadows}
+
+\RequirePackage{ifthen}
+
+% Options
+% ? the instance name under line ?
+\newif\ifpgfumlsdunderline\pgfumlsdunderlinetrue
+\DeclareOption{underline}{\pgfumlsdunderlinetrue}
+\DeclareOption{underline=true}{\pgfumlsdunderlinetrue}
+\DeclareOption{underline=false}{\pgfumlsdunderlinefalse}
+% ? the instance box with rounded corners ?
+\newif\ifpgfumlsdroundedcorners\pgfumlsdroundedcornersfalse
+\DeclareOption{roundedcorners}{\pgfumlsdroundedcornerstrue}
+\DeclareOption{roundedcorners=true}{\pgfumlsdroundedcornerstrue}
+\DeclareOption{roundedcorners=false}{\pgfumlsdroundedcornersfalse}
+\ProcessOptions
+
+% new counters
+\newcounter{preinst}
+\newcounter{instnum}
+\newcounter{threadnum}
+\newcounter{seqlevel} % level
+\newcounter{callevel}
+\newcounter{callselflevel}
+\newcounter{blocklevel}
+
+% new an instance
+% Example:
+% \newinst[edge distance]{var}{name:class}
+\newcommand{\newinst}[3][0.2]{
+ \stepcounter{instnum}
+ \path (inst\thepreinst.east)+(#1,0) node[inststyle] (inst\theinstnum)
+ {\ifpgfumlsdunderline
+ \underline{#3}
+ \else
+ #3
+ \fi};
+ \path (inst\theinstnum)+(0,-0.5*\unitfactor) node (#2) {};
+ \tikzstyle{instcolor#2}=[]
+ \stepcounter{preinst}
+}
+
+% new an instance thread
+% Example:
+% \newthread[color]{var}{name:class}
+\newcommand{\newthread}[3][gray!30]{
+ \newinst{#2}{#3}
+ \stepcounter{threadnum}
+ \node[below of=inst\theinstnum,node distance=0.8cm] (thread\thethreadnum) {};
+ \tikzstyle{threadcolor\thethreadnum}=[fill=#1]
+ \tikzstyle{instcolor#2}=[fill=#1]
+}
+
+% draw running (thick) line, should not call directly
+\newcommand*{\drawthread}[2]{
+ \begin{pgfonlayer}{umlsd@threadlayer}
+ \draw[threadstyle] (#1.west) -- (#1.east) -- (#2.east) -- (#2.west) -- cycle;
+ \end{pgfonlayer}
+}
+
+% a function call
+% Example:
+% \begin{call}[height]{caller}{function}{callee}{return}
+% \end{call}
+\newenvironment{call}[5][1]{
+\ifthenelse{\equal{#2}{#4}}
+{
+ \begin{callself}[#1]{#2}{#3}{#5}
+}
+{
+ \begin{callanother}[#1]{#2}{#3}{#4}{#5}
+}
+}
+{
+\ifthenelse{\equal{\f\thecallevel}{\t\thecallevel}}
+{
+ \end{callself}
+}
+{
+ \end{callanother}
+}
+}
+
+% function call to another instance
+% internal use only
+\newenvironment*{callanother}[5][1]{
+ \stepcounter{seqlevel}
+ \stepcounter{callevel} % push
+ \path
+ (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (cf\thecallevel) {}
+ (#4.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (ct\thecallevel) {};
+
+ \draw[->,>=triangle 60] ({cf\thecallevel}) -- (ct\thecallevel)
+ node[midway, above] {#3};
+ \def\l\thecallevel{#1}
+ \def\f\thecallevel{#2}
+ \def\t\thecallevel{#4}
+ \def\returnvalue{#5}
+ \tikzstyle{threadstyle}+=[instcolor#2]
+}
+{
+ \addtocounter{seqlevel}{\l\thecallevel}
+ \path
+ (\f\thecallevel)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rf\thecallevel) {}
+ (\t\thecallevel.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rt\thecallevel) {};
+ \draw[dashed,->,>=angle 60] ({rt\thecallevel}) -- (rf\thecallevel)
+ node[midway, above]{\returnvalue};
+ \drawthread{ct\thecallevel}{rt\thecallevel}
+ \addtocounter{callevel}{-1} % pop
+}
+
+% a function call that does not call another instance
+% internal use only
+% Example:
+% \begin{callself}[height]{caller}{function}{return}
+% \end{callself}
+\newenvironment*{callself}[4][1]{
+ \stepcounter{seqlevel}
+ \stepcounter{callevel} % push
+ \stepcounter{callselflevel}
+
+ \path
+ (#2)+(\thecallselflevel*0.1-0.1,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (sc\thecallevel) {}
+ ({sc\thecallevel}.east)+(0,-0.33*\unitfactor) node (scb\thecallevel) {};
+
+ \draw[->,>=triangle 60] ({sc\thecallevel}.east) -- ++(0.8,0)
+ node[near start, above right] {#3} -- ++(0,-0.33*\unitfactor)
+ -- (scb\thecallevel);
+ \def\l\thecallevel{#1}
+ \def\f\thecallevel{#2}
+ \def\t\thecallevel{#2}
+ \def\returnvalue{#4}
+ \tikzstyle{threadstyle}+=[instcolor#2]
+}{
+ \addtocounter{seqlevel}{\l\thecallevel}
+ \path (\f\thecallevel)+(\thecallselflevel*0.1-0.1,-\theseqlevel*\unitfactor-0.33*\unitfactor) node
+ (sct\thecallevel) {};
+
+ \draw[dashed,->,>=angle 60] ({sct\thecallevel}.east) node
+ (sce\thecallevel) {} -- ++(0.8,0) -- node[midway, right]{\returnvalue} ++(0,-0.33*\unitfactor) -- ++(-0.8,0);
+ \drawthread{scb\thecallevel}{sce\thecallevel}
+ \addtocounter{callevel}{-1} % pop
+ \addtocounter{callselflevel}{-1}
+}
+
+% message between threads
+% Example:
+% \mess[delay]{sender}{message content}{receiver}
+\newcommand{\mess}[4][0]{
+ \stepcounter{seqlevel}
+ \path
+ (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (mess from) {};
+ \addtocounter{seqlevel}{#1}
+ \path
+ (#4)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (mess to) {};
+ \draw[->,>=angle 60] (mess from) -- (mess to) node[midway, above]
+ {#3};
+
+ \node (#3 from) at (mess from) {};
+ \node (#3 to) at (mess to) {};
+}
+
+\newenvironment{messcall}[4][1]{
+ \stepcounter{seqlevel}
+ \stepcounter{callevel} % push
+ \path
+ (#2)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (cf\thecallevel) {}
+ (#4.\threadbias)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (ct\thecallevel) {};
+
+ \draw[->,>=angle 60] ({cf\thecallevel}) -- (ct\thecallevel)
+ node[midway, above] {#3};
+ \def\l\thecallevel{#1}
+ \def\f\thecallevel{#2}
+ \def\t\thecallevel{#4}
+ \tikzstyle{threadstyle}+=[instcolor#2]
+}
+{
+ \addtocounter{seqlevel}{\l\thecallevel}
+ \path
+ (\f\thecallevel)+(0,-\theseqlevel*\unitfactor-0.7*\unitfactor) node (rf\thecallevel) {}
+ (\t\thecallevel.\threadbias)+(0,-\theseqlevel*\unitfactor-0.3*\unitfactor) node (rt\thecallevel) {};
+ \drawthread{ct\thecallevel}{rt\thecallevel}
+ \addtocounter{callevel}{-1} % pop
+}
+
+% In the situation of multi-threads, some objects are called at the
+% same time. Currently, we have to adjust the bias of thread line
+% manually. Possible parameters are: center, west, east
+\newcommand{\setthreadbias}[1]{\global\def\threadbias{#1}}
+
+% This function makes the call earlier.
+\newcommand{\prelevel}{\addtocounter{seqlevel}{-1}}
+
+% This function makes the call later.
+\newcommand{\postlevel}{\addtocounter{seqlevel}{+1}}
+
+% a block box with caption
+% \begin{sdblock}[caption background color]{caption}{comments}
+% \end{sdblock}
+\newenvironment{sdblock}[3][white]{
+ \stepcounter{seqlevel}
+ \stepcounter{blocklevel} % push
+ \coordinate (blockbeg\theblocklevel) at (0,-\theseqlevel*\unitfactor-\unitfactor);
+ \stepcounter{seqlevel}
+ \def\blockcolor\theblocklevel{#1}
+ \def\blockname\theblocklevel{#2}
+ \def\blockcomm\theblocklevel{#3}
+ \begin{pgfinterruptboundingbox}
+}{
+ \coordinate (blockend) at (0,-\theseqlevel*\unitfactor-2*\unitfactor);
+ \path (current bounding box.east)+(0.2,0) node (boxeast) {}
+ (current bounding box.west |- {blockbeg\theblocklevel}) + (-0.2,0)
+ node (nw) {};
+ \path (boxeast |- blockend) node (se) {};
+
+ % % title
+ \node[blockstyle] (blocktitle) at (nw) {\blockname\theblocklevel};
+ \path (blocktitle.south east) + (0,0.2) node (set) {}
+ (blocktitle.south east) + (-0.2,0) node (seb) {}
+ (blocktitle.north east) + (0.2,0) node (comm) {};
+ \draw[fill=\blockcolor\theblocklevel] (blocktitle.north west) -- (blocktitle.north east) --
+ (set.center) -- (seb.center) -- (blocktitle.south west) -- cycle;
+ \node[blockstyle] (blocktitle) at (nw) {\blockname\theblocklevel};
+ \node[blockcommentstyle] (blockcomment) at (comm) {\blockcomm\theblocklevel};
+
+ \coordinate (se) at (current bounding box.south east);
+ \end{pgfinterruptboundingbox}
+
+ \draw (se) rectangle (nw);
+
+ \addtocounter{blocklevel}{-1} % pop
+ \stepcounter{seqlevel}
+}
+
+% the environment of sequence diagram
+\newenvironment{sequencediagram}{
+ % declare layers
+ \pgfdeclarelayer{umlsd@background}
+ \pgfdeclarelayer{umlsd@threadlayer}
+ \pgfsetlayers{umlsd@background,umlsd@threadlayer,main}
+
+ \begin{tikzpicture}
+ \setlength{\unitlength}{1cm}
+ \tikzstyle{sequence}=[coordinate]
+ \tikzstyle{inststyle}=[rectangle, draw, anchor=west, minimum
+ height=0.8cm, minimum width=1.6cm, fill=white,
+ drop shadow={opacity=1,fill=black}]
+ \ifpgfumlsdroundedcorners
+ \tikzstyle{inststyle}+=[rounded corners=3mm]
+ \fi
+ \tikzstyle{blockstyle}=[anchor=north west]
+ \tikzstyle{blockcommentstyle}=[anchor=north west, font=\small]
+ \tikzstyle{dot}=[inner sep=0pt,fill=black,circle,minimum size=0.2pt]
+ \global\def\unitfactor{0.6}
+ \global\def\threadbias{center}
+ % reset counters
+ \setcounter{preinst}{0}
+ \setcounter{instnum}{0}
+ \setcounter{threadnum}{0}
+ \setcounter{seqlevel}{0}
+ \setcounter{callevel}{0}
+ \setcounter{callselflevel}{0}
+ \setcounter{blocklevel}{0}
+
+ % origin
+ \node[coordinate] (inst0) {};
+}
+{
+ \begin{pgfonlayer}{umlsd@background}
+ \ifnum\c@instnum > 0
+ \foreach \t [evaluate=\t] in {1,...,\theinstnum}{
+ \draw[dotted] (inst\t) -- ++(0,-\theseqlevel*\unitfactor-2.2*\unitfactor);
+ }
+ \fi
+ \ifnum\c@threadnum > 0
+ \foreach \t [evaluate=\t] in {1,...,\thethreadnum}{
+ \path (thread\t)+(0,-\theseqlevel*\unitfactor-0.1*\unitfactor) node (threadend) {};
+ \tikzstyle{threadstyle}+=[threadcolor\t]
+ \drawthread{thread\t}{threadend}
+ }
+ \fi
+ \end{pgfonlayer}
+\end{tikzpicture}}
+
+
+%%% End of pgf-umlsd.sty
+%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\ No newline at end of file