From: Mart Lubbers
Date: Wed, 15 Oct 2014 05:22:19 +0000 (+0200)
Subject: update for thesis intro
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=8c02f1dd62fd040b12c0181354294a674111b95b;p=bsc-thesis1415.git

update for thesis intro
---

diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex
new file mode 100644
index 0000000..2aae8a6
--- /dev/null
+++ b/thesis2/1.introduction.tex
@@ -0,0 +1,78 @@
+\section{Hyperleap and their methods}
+Hyperleap\footnote{\url{http://hyperleap.nl}} is a small company, located in
+Nijmegen, that was founded in the early years of the internet. Hyperleap
+provides \textit{infotainment}. \textit{Infotainment} is a blend of the words
+\textit{information} and \textit{entertainment} and denotes a specialized form
+of information, namely information about the entertainment industry.
+Hyperleap manages the largest \textit{infotainment} database, containing on
+average over $10,000$ events per week, as well as a database of over $54,000$
+venues that deliver the entertainment. In addition to this factual
+information, Hyperleap provides reviews, previews, background information and
+more via several popular websites that specialize in particular genres or
+categories.
+
+Hyperleap stands out among \textit{infotainment} providers because the quality
+and completeness of its data are comparatively high. This is because all
+information is checked and matched against existing information before it
+enters the database. To ensure this quality, information enters the database
+in roughly two steps.
+
+In the first step the information is extracted from the raw data sources,
+either by crawlers or via venue channels. Crawlers are specialized
+applications programmed to extract information from a single source. Venue
+channels are purpose-built XML feeds that already contain highly structured
+information. The extracted information is placed in the so-called
+\textit{Temporum}: a staging area where the gathered information is held
+before it is entered into the real database.
+
+The second step in the path of the information is the matching of the data;
+this is where the actual quality checking takes place. Using several
+techniques, employees match the incoming information to existing events or
+create new events. This step also acts as a safety net for malfunctioning
+crawlers: when a crawler delivers wrong information to the \textit{Temporum},
+the programmer of that crawler has to be informed. Programmers spend a large
+amount of their time repairing crawlers, and because this is a specialized
+task that can only be carried out by people with a computer science
+background, repairing crawlers is expensive.
+
+\section{Goal \& Research question}
+The goal of the project is to relieve the programmers of constantly repairing
+crawlers and to make the task of adding, adapting and removing crawlers
+feasible for someone without programming experience. In practice this means
+building an application that lets the user create, edit or remove crawlers.
+For this project we focus on
+RSS\footnote{\url{http://rssboard.org/rss-specification}} and
+Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds only.
+The application will maintain crawlers that are able to isolate categories of
+information, so that the information arrives in the \textit{Temporum} in a
+structured form and the task of matching and entering the data becomes less
+expensive.
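+
+To give a purely illustrative impression of what isolating categories of
+information means, the sketch below extracts a date, a time and a venue name
+from two plain-text feed fields using simple keywords and patterns. The code,
+the field names and the example entry are hypothetical; they are not taken
+from Hyperleap's systems nor from the application built in this project and
+only serve to make the idea concrete.
+
+\begin{lstlisting}[language=Python]
+# Illustrative sketch only: hypothetical patterns, not the actual crawler.
+import re
+
+def extract_categories(title, description):
+    # Isolate a date, a time and a venue name from two plain-text fields.
+    date = re.search(r'\d{1,2}-\d{1,2}-\d{4}', description)
+    time = re.search(r'\d{1,2}:\d{2}', description)
+    venue = re.search(r'in (.+?)(?:,|$)', description)
+    return {
+        'title': title,
+        'date': date.group(0) if date else None,
+        'time': time.group(0) if time else None,
+        'venue': venue.group(1) if venue else None,
+    }
+
+# Example use on a hypothetical feed entry:
+print(extract_categories(
+    'Concert: The Example Band',
+    'Op 21-11-2014 om 20:30 in De Vereeniging, Nijmegen'))
+\end{lstlisting}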
+
+The program is built in such a way that a programmer can easily add fields or
+categories to the data, keeping it flexible with respect to changes.
+
+\section{Why RSS/Atom}
+Information from venues comes in various formats, each with its own
+advantages and disadvantages. For this project we chose to focus on RSS/Atom
+feeds because they are in general already structured and consistent in their
+structure. Websites, for example, change their structure and layout
+frequently, which makes it hard to keep crawlers up to date. RSS/Atom feeds
+generally only change structure when the website or content management system
+is migrated or upgraded.
+
+In comparison to websites, RSS/Atom feeds do not have a structural dimension
+in the data. Because of this we have to use different techniques to isolate
+the information than the existing techniques for extracting information from
+websites. An RSS/Atom entry basically consists of two plain-text fields;
+however, the text almost always follows the same structure and uses the same
+keywords, so the information can be extracted by learning from those keywords
+and that structure.
+
+\section{Scientific relevance}
+Currently the techniques for converting unstructured data into structured
+data are static and mainly usable only by computer science experts. There is
+a great need for data mining on unstructured data because the data within
+companies and on the internet is piling up and is usually left to gather
+dust.
+
+The project is a continuation of earlier work by Roelofs et
+al.~\cite{Roelofs2009}. The techniques described by Roelofs et al.\ focus
+more on extracting data from websites and/or already isolated data, so they
+can complement the current project.
diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex
index b550ff7..b9349c0 100644
--- a/thesis2/thesis.tex
+++ b/thesis2/thesis.tex
@@ -1,7 +1,6 @@
-\documentclass[hidelink]{scrbook}
+\documentclass{scrbook}
 
 \usepackage{scrhack}
-\usepackage{lipsum} % Dummy text
 \usepackage{graphicx} % Images
 \usepackage{float} % Better placement float figures
 \usepackage{listings} % Source code formatting
@@ -66,8 +65,7 @@ data to structured data}
 
 \clearpage
 \chapter{Introduction}
-  \section{Hyperleap}
-  \section{Problem description}
+\input{1.introduction.tex}
 
 \chapter{Methods}
   \section{Application overview}