From 7f4de91145944fc60deba36aef5bbc5c87212499 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Wed, 5 Nov 2014 16:23:21 +0100
Subject: [PATCH] thesis update

---
 thesis2/1.introduction.tex | 214 ++++++++++++++++++++++++++++---------
 thesis2/thesis.bbl         |  19 ++++
 thesis2/thesis.blg         |  48 +++++++++
 thesis2/thesis.tex         |  26 ++---
 4 files changed, 246 insertions(+), 61 deletions(-)
 create mode 100644 thesis2/thesis.bbl
 create mode 100644 thesis2/thesis.blg

diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex
index 3efa80b..7e05d00 100644
--- a/thesis2/1.introduction.tex
+++ b/thesis2/1.introduction.tex
@@ -1,54 +1,172 @@
-\section{Hyperleap and their methods}
-Hyperleap\footnote{\url{http://hyperleap.nl}} is a small company founded in the
-early years of the internet and is located in Nijmegen. Hyperleap is a company
-that presents \textit{infotainment}. \textit{Infotainment} is a concatenation
-of the words \textit{entertainment} and \textit{information} and is a
-specialized form of information, namely about the entertainment industry.
-Hyperleap manages the largest database containing \textit{infotainment}
-containing over $10.000$ events per week on average. It also manages a database
-containing over $54.000$ venues delivering the entertainment. Next to the
-factual information, Hyperleap also provides reviews, previews, background
-information and more via several popular websites specialized on genres or
-categories.
-
-Hyperleap stands compared to other \textit{infotainment} providers because of
-the quality and completeness of the data is comparatively high. This is because
-all information is checked and matched to existing information before it enters
-the database. To ensure the quality of the databases all information enters the
-database in roughly two steps.
-
-In the first step the information is extracted from the raw data sources using
-crawlers or via venue channels. Crawlers are specialized applications that are
-programmed to extract information from one single source. Venue channels are
- specially made XML feeds that contain already very structured information.
-The extracted information is put in the so called \textit{Temporum}, the
-\textit{Temporum} is a stopping place for the gathered information before it is
-entered in the real database.
-
-The second step in the path of the information is the matching of the data.
-This step is the actual quality checking and matching. Using several
-techniques, employees have to match the incoming information to existing events
-or create new events. This is also a safety net for malfunctioning crawlers,
-when a crawler provides wrong information in the \textit{Temporum} the
-programmer of the crawler has to be informed then. A large amount of the time
-the programmers are busy with repairing crawlers because it is a specialized
-task only doable by people with a computer science background. Because of this
-it is expensive to repair the crawlers.
+\section{Introduction}
+What do people do when they want to catch a movie? Attend a concert? Find out
+which shows are playing in their town theater?
+
+In the early days of the internet, information about entertainment was
+gathered from flyers, books, posters and radio and tv advertisements. People
+had to look pretty hard for this information and could easily miss a show
+simply because it never crossed their path. Now that the internet has grown to
+what it is today, one would think that missing an event is impossible, given
+the amount of information people receive every day.
+The opposite is true.
+
+Nowadays information about entertainment is offered on the internet via two
+main channels: individual venues and combined websites.
+
+Individual venues put a lot of effort and resources into building a beautiful,
+fast and above all modern website that bundles their information with nice
+graphics, animations and gimmicks. There are also companies that bundle the
+information from different websites. Because the bundled information often
+comes from the individual websites, it is usually incomplete. Individual
+organisations tend to think it is obvious what the address of their venue is,
+that their ticket price is always fixed at \EURdig$5.-$ or that a membership
+is required to attend the events, and they usually hide such details in a
+disclaimer or on a separate page.
+
+Combined websites want to bundle this information and offer all the details
+for every event. This proves to be a hard task because these websites lack the
+time and resources to combine the different sources into a complete and
+correct overview of an event. Because of this, there are few websites that
+bundle entertainment information into a database that is complete and
+consistent. Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve
+this goal.
+
+\section{Hyperleap \& Infotainment}
+Hyperleap is an internet company that already existed when the internet was
+not yet widespread. Hyperleap, active since 1993, specializes in producing,
+publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
+combination of the words \textit{information} and \textit{entertainment} and
+here means complete information about entertainment in the broadest sense:
+a theater show, a movie screening in the cinema, the weekly bridge night in
+the local community center, a music concert, etc. Hyperleap manages the
+largest database containing \textit{infotainment}. The database contains over
+$10.000$ events per week on average and its venue database contains over
+$54.000$ venues delivering the entertainment. All the information is quality
+checked and therefore very reliable; Hyperleap is the only company of its kind
+with information of such high quality. The \textit{infotainment} is presented
+via several websites, specialized per genre or category, and is bundled with
+non-factual information such as reviews, previews, background information and
+interviews.
+
+As said before, Hyperleap is the only company of its kind with data of this
+quality. This is because a lot of time and resources are spent on
+cross-comparing, matching and checking the data that enters the database. The
+data is inserted into the database in several steps, described in
+Figure~\ref{fig:1.1.1}.
+
+\begin{figure}[H]
+	\caption{Information flow of the Hyperleap database}
+	\label{fig:1.1.1}
+	\centering
+	\scalebox{0.8}{
+		\digraph[]{graph111}{
+			rankdir=TB;
+			node [shape="rectangle",fontsize=10,nodesep=0.5,ranksep=0.75,width=1]
+			edge [weight=5.]
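+			/* node naming: i* = input sources, p* = processing steps, o* = output websites */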
+			i0 [label="Website"]
+			i1 [label="Email"]
+			i2 [label="Fax"]
+			i3 [label="RSS/Atom"]
+			p1 [label="Crawler: Preprocessing"]
+			p2 [label="Temporum: Postprocessing"]
+			o1 [label="Database: Insertion"]
+			o2 [label="TheAgenda"]
+			o3 [label="BiosAgenda"]
+			o4 [label="..."]
+			p1 [width=5]
+			p2 [width=5]
+			o1 [width=5]
+			i0 -> p1
+			i1 -> p1
+			i2 -> p1
+			i3 -> p1
+			p1 -> p2
+			p2 -> o1
+			o1 -> o2
+			o1 -> o3
+			o1 -> o4
+		}
+	}
+\end{figure}
+
+The first step in the information flow is the source itself. Hyperleap
+processes many different sources, which vary in type, for example website,
+email or fax, and in origin. The sources also vary a lot in reliability: for
+example, private information streams from venues are very reliable, whereas
+other combined websites are not reliable at all. Sources vary in structural
+consistency as well. Venue websites often look very consistent, but the
+entries are usually typed in by hand and key information often appears in
+random places, surrounded by a lot of text. Ticket vendors, on the other hand,
+usually present their information in a structured and consistent way.
+Depending on the amount of consistency and structure, preprocessing takes
+place in step two. All preprocessed data is then sent to the
+\textit{Temporum}.
+
+The \textit{Temporum} is a big bin that contains raw data extracted from the
+different sources; this data has to be post-processed before it is suitable
+for the actual database. The post-processing encompasses several tasks. The
+first task is checking the validity of an entry. The second task is matching:
+entries have to be matched to a venue or organisation in the events database,
+and entries can also be matched to existing events that belong to the same
+tour or series. Many aspects of these two steps are, or can be, done
+automatically, but a lot of user input is still required to match and check
+the data. The \textit{Temporum} functions as a safety net for the data.
+
+When the data has been post-processed it is entered into the final database.
+The database contains all the events that happened in the past and all the
+events that are going to happen. The database is linked to several categorical
+websites that offer the information to users and accompany it with the
+non-factual information discussed earlier.
 
 \section{Goal \& Research question}
+The second step in the information flow is crawling the sources and applying
+the preprocessing. This is an expensive task because it requires a programmer:
+currently all crawlers are programmed for one, or a couple of, specific
+sources. Because a large group of sources changes often, this is very
+expensive and has a long feedback loop. When a source changes, it is first
+preprocessed in the old way, sent to the \textit{Temporum} and checked and
+matched by a human. The human then notices the error in the data and contacts
+the programmer, who has to reprogram the specific crawler for the new
+structure. This feedback loop, shown in Figure~\ref{fig:1.1.2}, can take days
+and can be the reason for gaps in the database.
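+
+To illustrate why a small change in a source can break a crawler, the listing
+below sketches what a single item in a venue's RSS feed could look like. The
+venue, date and price are invented for this example, but the shape is typical:
+all key information sits in the plain text of the \texttt{title} and
+\texttt{description} fields. If the source suddenly starts writing the venue
+before the date, a crawler programmed for the old order extracts wrong data
+until it is repaired.
+
+\begin{lstlisting}
+<!-- invented example item, not taken from a real feed -->
+<item>
+  <title>Tue 05 Nov 2014 20:00 - Jazz Night - Cafe Example, Nijmegen</title>
+  <description>Entrance: 5,00 euro. Doors open at 19:30.</description>
+</item>
+\end{lstlisting}
+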
+\begin{figure}[H]
+	\caption{Feedback loop for malfunctioning crawlers}
+	\label{fig:1.1.2}
+	\centering
+	\scalebox{0.8}{
+		\digraph[]{graph112}{
+			rankdir=LR;
+			node [shape="rectangle"]
+			source [label="Source"]
+			crawler [label="Crawler"]
+			temporum [label="Temporum"]
+			user [label="User"]
+			programmer [label="Programmer"]
+			database [label="Database"]
+			source -> crawler
+			crawler -> temporum
+			temporum -> user
+			user -> database
+			user -> programmer [constraint=false,color="blue"]
+			user -> crawler [constraint=false,color="red"]
+			programmer -> crawler [constraint=false,color="blue"]
+		}
+	}
+\end{figure}
+
 The goal of the project is to relieve the programmer of repairing crawlers all
 the time and make the task of adapting, editing and removing crawlers doable
-for someone without programming experience. In practice this means building an
-application that lets the user create, edit or remove crawlers. For this
-project we focus on RSS\footnote{\url{http://rssboard.org/rss-specification}}
-and Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds
-only. The program will maintain crawlers that are able to isolate categories of
-information so that the information will appear structured in the
-\textit{Temporum} and make the task of matching the data and entering it less
-expensive.
-
-The program is built so that a programmer can easily add fields or categories
-to the application to reduce programming costs and complex modifications.
+for someone without programming experience. In practice this means that the
+blue arrows in Figure~\ref{fig:1.1.2} are replaced by the red arrow.
+
+For this project an application has been developed that provides an interface
+to a crawler system able to crawl
+RSS\footnote{\url{http://rssboard.org/rss-specification}} and
+Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds.
+The interface provides the user with point-and-click tools to create, modify,
+test and remove crawlers. The Hyperleap back end can, via this interface,
+generate XML feeds that contain the crawled data. In theory a programmer is
+not necessary for editing the structure and contents of the application
+either, because everything one may want to change is located in a single
+human-readable file. In practice this means that one person, not necessarily a
+programmer, can be instructed to change the structure, which greatly reduces
+programmer intervention time.
 
 \section{Why RSS/Atom}
 Information from venues comes in various different format with for each format
@@ -66,7 +184,7 @@ websites.
 RSS/Atom feeds are basically two fields with plain text, however the text
 almost always has the same structure and keywords and therefore the
 information can be extracted learning from keywords and structure.
-\section{Scientific relevance}
+\section{Scientific relevance and similar research}
 Currently the techniques for conversion from non structured data to structured
 data are static and mainly only usable by computer science experts. There is a
 great need of data mining in non structured data because the data within
diff --git a/thesis2/thesis.bbl b/thesis2/thesis.bbl
new file mode 100644
index 0000000..f45b3a8
--- /dev/null
+++ b/thesis2/thesis.bbl
@@ -0,0 +1,19 @@
+\begin{thebibliography}{1}
+
+\bibitem{Daciuk2000}
+Jan Daciuk, Stoyan Mihov, Bruce~W. Watson, and Richard~E. Watson.
+\newblock {Incremental Construction of Minimal Acyclic Finite-State Automata}.
+\newblock {\em Computational Linguistics}, 26(1):3--16, March 2000.
+ +\bibitem{Hopcroft1971} +John Hopcroft. +\newblock {An N log N algorithm for minimizing states in a finite automaton}. +\newblock Technical report, 1971. + +\bibitem{Roelofs2009} +Wouter Roelofs, Alessandro~Tadeo Paula, and Franc Grootjen. +\newblock {Programming by Clicking}. +\newblock In {\em Proceedings of the Dutch Information Retrieval Conference}, + pages 2--3, 2009. + +\end{thebibliography} diff --git a/thesis2/thesis.blg b/thesis2/thesis.blg new file mode 100644 index 0000000..3a83e1f --- /dev/null +++ b/thesis2/thesis.blg @@ -0,0 +1,48 @@ +This is BibTeX, Version 0.99d (TeX Live 2015/dev/Debian) +Capacity: max_strings=35307, hash_size=35307, hash_prime=30011 +The top-level auxiliary file: thesis.aux +The style file: plain.bst +Database file #1: thesis.bib +Warning--empty institution in Hopcroft1971 +You've used 3 entries, + 2118 wiz_defined-function locations, + 516 strings with 4464 characters, +and the built_in function-call counts, 993 in all, are: += -- 95 +> -- 48 +< -- 1 ++ -- 19 +- -- 16 +* -- 70 +:= -- 173 +add.period$ -- 9 +call.type$ -- 3 +change.case$ -- 18 +chr.to.int$ -- 0 +cite$ -- 4 +duplicate$ -- 38 +empty$ -- 75 +format.name$ -- 16 +if$ -- 202 +int.to.chr$ -- 0 +int.to.str$ -- 3 +missing$ -- 2 +newline$ -- 18 +num.names$ -- 6 +pop$ -- 18 +preamble$ -- 1 +purify$ -- 14 +quote$ -- 0 +skip$ -- 26 +stack$ -- 0 +substring$ -- 47 +swap$ -- 8 +text.length$ -- 1 +text.prefix$ -- 0 +top$ -- 0 +type$ -- 12 +warning$ -- 1 +while$ -- 11 +width$ -- 4 +write$ -- 34 +(There was 1 warning) diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex index 6f2b15d..b9fa5e1 100644 --- a/thesis2/thesis.tex +++ b/thesis2/thesis.tex @@ -1,7 +1,7 @@ -\documentclass{scrbook} +%\documentclass{scrbook} +\documentclass{book} -%\usepackage{bibtex} -\usepackage{scrhack} +%\usepackage{scrhack} \usepackage{graphicx} % Images \usepackage{float} % Better placement float figures \usepackage{listings} % Source code formatting @@ -10,6 +10,7 @@ \usepackage{pgf-umlsd} % \usepackage{graphviz} % For the DAG diagrams \usepackage{amssymb} +\usepackage{marvosym} \usepgflibrary{arrows} % % Set listings settings @@ -38,21 +39,20 @@ \author{Mart Lubbers\\s4109053} \title{Non IT configurable adaptive data mining solution used in transforming raw data to structured data} -\subtitle{ - Bachelor's Thesis in Artificial Intelligence\\ - Radboud University Nijmegen\\ - \vspace{15mm} - \begin{tabular}{cp{5em}c} - Franc Grootjen && Alessandro Paula\\ - RU && Hyperleap - \end{tabular} - } +%\subtitle{ +% Bachelor's Thesis in Artificial Intelligence\\ +% Radboud University Nijmegen\\ +% \vspace{15mm} +% \begin{tabular}{cp{5em}c} +% Franc Grootjen && Alessandro Paula\\ +% RU && Hyperleap +% \end{tabular} +% } \date{\today} \begin{document} \maketitle \tableofcontents -\newpage % Surrogate abstract \chapter*{ -- 2.20.1