\section{Hyperleap and their methods}
Hyperleap\footnote{\url{http://hyperleap.nl}} is a small company located in
Nijmegen that was founded in the early years of the internet. Hyperleap
provides \textit{infotainment}: a blend of the words \textit{information} and
\textit{entertainment}, denoting a specialized form of information, namely
information about the entertainment industry. Hyperleap manages the largest
\textit{infotainment} database, containing over $10{,}000$ events per week on
average, and also manages a database of over $54{,}000$ venues delivering the
entertainment. Besides this factual information, Hyperleap provides reviews,
previews, background information and more via several popular websites, each
specialized in a genre or category.

Hyperleap stands out from other \textit{infotainment} providers because the
quality and completeness of its data are comparatively high. This is because
all information is checked and matched against existing information before it
enters the database. To ensure this quality, all information enters the
database in roughly two steps.

In the first step the information is extracted from the raw data sources,
either by crawlers or via venue channels. Crawlers are specialized
applications programmed to extract information from a single source. Venue
channels are purpose-built XML feeds that already contain highly structured
information. The extracted information is placed in the so-called
\textit{Temporum}: a staging area where the gathered information is held
before it is entered into the real database.

The second step in the path of the information is the matching of the data;
this is where the actual quality checking takes place. Using several
techniques, employees match the incoming information to existing events or
create new events. This step also acts as a safety net for malfunctioning
crawlers: when a crawler delivers wrong information to the \textit{Temporum},
the programmer of the crawler has to be informed. Repairing crawlers is a
specialized task that can only be done by people with a computer science
background, so the programmers spend a large amount of their time on it and
repairs are expensive.

\section{Goal \& Research question}
The goal of the project is to relieve the programmers of constantly repairing
crawlers and to make the task of adapting, editing and removing crawlers
feasible for someone without programming experience. In practice this means
building an application that lets the user create, edit or remove crawlers.
For this project we focus on
RSS\footnote{\url{http://rssboard.org/rss-specification}} and
Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds only.
The application will maintain crawlers that are able to isolate categories of
information, so that the information arrives in the \textit{Temporum} in a
structured form and the task of matching and entering the data becomes less
expensive.

The application is built in such a way that a programmer can easily add fields
or categories to the data, keeping it flexible to change.

\section{Why RSS/Atom}
Information from venues comes in a variety of formats, each with its own
advantages and disadvantages. For this project we chose to focus on RSS/Atom
feeds because they are generally already structured and consistent in that
structure. Websites, for example, change their structure and layout
frequently, which makes it hard to keep crawlers up to date; RSS/Atom feeds
generally only change structure when the website or its content management
system is migrated or upgraded.

In contrast to websites, RSS/Atom feeds have no structural dimension in the
data. Because of this we have to use different techniques for isolating
information than the existing techniques used for extracting information from
websites. An entry in an RSS/Atom feed is essentially two fields of plain
text; however, this text almost always has the same structure and keywords,
so the information can be extracted by learning from those keywords and that
structure.
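As an illustration, the sketch below shows how such a plain-text field could
be split into categories using a fixed keyword/structure pattern. The feed
item, the title pattern and the \texttt{extract} helper are hypothetical and
only suggest the idea; they are not part of Hyperleap's actual crawlers.

\begin{verbatim}
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS item; titles in such a feed tend to follow one
# fixed pattern per source, e.g. "date: artist - venue, city".
RSS_ITEM = """
<item>
  <title>21-06-2015: Band X - Doornroosje, Nijmegen</title>
  <description>Aanvang 20:00, zaal open 19:30.</description>
</item>
"""

# Assumed title pattern for this particular (fictional) feed.
TITLE_PATTERN = re.compile(
    r"(?P<date>\d{2}-\d{2}-\d{4}):\s*"
    r"(?P<artist>.+?)\s*-\s*"
    r"(?P<venue>[^,]+),\s*(?P<city>.+)")

def extract(item_xml):
    """Split the plain-text title of an item into categories."""
    item = ET.fromstring(item_xml)
    title = item.findtext("title", default="")
    match = TITLE_PATTERN.match(title)
    return match.groupdict() if match else None

print(extract(RSS_ITEM))
# {'date': '21-06-2015', 'artist': 'Band X',
#  'venue': 'Doornroosje', 'city': 'Nijmegen'}
\end{verbatim}

In practice such a pattern would be learned or configured per feed rather
than hard-coded, which is exactly what the application described above is
meant to make possible for non-programmers.
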
\section{Scientific relevance}
Currently the techniques for converting unstructured data into structured data
are static and mainly usable only by computer science experts. There is a
great need for data mining of unstructured data, because the data within
companies and on the internet keeps piling up and is usually left to gather
dust.

The project is a continuation of an earlier project by Roelofs et
al.~\cite{Roelofs2009}. The techniques described by Roelofs et al. focus more
on extracting data from websites and/or already isolated data, so they can
serve as an addition to the current project.