\section{Introduction}
People look on the internet for information about their favourite theater
show, music group or movie. This information is scattered across the
websites, mailing lists, news feeds and other sources maintained by the
venues, which makes searching for certain events a tedious, time-consuming
task. The venues do not present their information in a consistent way, and
to get all the details different sources must be consulted. Because of this,
converting raw information from venues into structured, consistent data is a
relevant problem.
\section{HyperLeap}
Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company based in
Nijmegen that specializes in bundling information from different sources
into a consistent information source about entertainment (infotainment).
It administers several websites for several entertainment categories.
Hyperleap differentiates itself from other companies with the same business
goals because it offers the most complete information most of the time.

Right now, most of the data in the database is added in two different ways.
The first method is entering the data into the database by hand: an employee
scans the raw input gathered from websites, separates the entries and
matches them to existing events or creates new events. This process is very
labour-intensive and therefore costly.

The second way of adding information to the database is through crawlers
programmed specifically for certain websites. Because a programmer is needed
to write each crawler individually, this is also costly. Moreover, this way
of gathering information is very error-prone: when a source changes its
structure, for example its layout, the crawler stops working. When this
happens the programmer has to adapt the crawler to the new situation, which
takes valuable time.
\section{Research question and practical goals}
The goal of the project is to create a software solution that enables an
employee without a particular programming background to train or retrain
crawlers for RSS\footnote{\url{http://www.rssboard.org/rss-specification}}
or Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds,
in such a way that the extracted information is categorized and put into the
database. The software will notify the administrator when a source has
changed, so that the new data can be added to the crawler's training set or
it can be decided to retrain the crawler for that source from scratch.
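To give an impression of what such a crawler has to work with, the sketch
below is a minimal example, not Hyperleap's actual implementation and using
a hypothetical feed URL, that fetches an RSS feed with Python's standard
library and prints the raw fields of each entry; these are the strings the
trained crawler would have to turn into structured event records.

\begin{verbatim}
# Minimal sketch, not the actual Hyperleap crawler: fetch a
# (hypothetical) RSS feed and list the raw entries that a trained
# crawler would have to classify and structure.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

FEED_URL = "http://www.example-venue.nl/agenda.rss"  # hypothetical

with urlopen(FEED_URL) as response:
    tree = ET.parse(response)

for item in tree.iter("item"):  # RSS 2.0 wraps each event in <item>
    title = item.findtext("title", default="")
    date = item.findtext("pubDate", default="")
    description = item.findtext("description", default="")
    # In the envisioned system these raw strings are handed to a
    # trained crawler that extracts the venue, date, time and title.
    print(title, date, description, sep=" | ")
\end{verbatim}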
This brings up the main research question:
\begin{center}
\textit{How can we make an adaptive, autonomous and programmable data
mining program that can be set up by someone without programming experience
and that is capable of transforming raw data into structured data?}
\end{center}

In practice this means that the end product is a software solution that
performs the tasks described above.
\section{Scientific relevance}
Currently the techniques for converting unstructured data into structured
data are static and mainly usable only by computer science experts. There is
a great need for data mining of unstructured data, because the data within
companies and on the internet is piling up and is usually left to gather
dust.

The project is a continuation of earlier work by Roelofs et
al.~\cite{Roelofs2009}. The techniques described by Roelofs et al.\ focus on
extracting data from already isolated entries, so they can complement the
current project.
\section{Why RSS}
Websites often change their overall structure, and the way the data is
presented also changes a lot because of new insights and layout redesigns.
RSS feeds, on the other hand, are usually generated from the internal
database of the venue's site and are therefore almost always precise,
structured and consistent. When the structure of an RSS feed does change, it
is usually because the content management system of the website has changed.

Because RSS lacks the structural dimension that websites have, less
information is available: RSS feeds are basically raw strings that contain
all the information. Venues sometimes put HTML in their RSS feeds, but most
of the time this only serves to display large chunks of unstructured text in
a nicer fashion.
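As a small illustration, the sketch below uses a made-up feed entry (not
taken from an actual venue) and strips the presentational HTML with Python's
standard library; what remains is exactly the kind of single raw string,
containing all the event details, that the crawler has to convert into
structured data.

\begin{verbatim}
# Minimal sketch with a made-up feed entry: the embedded HTML only
# formats the text, so stripping the tags leaves one raw string that
# still holds all the event information.
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Hypothetical <description> content from a venue's feed.
description = ("<p><b>The Example Band</b> live, Saturday 14 March "
               "2015, 20:00, tickets 15 euro</p>")

stripper = TagStripper()
stripper.feed(description)
print("".join(stripper.chunks))
# -> The Example Band live, Saturday 14 March 2015, 20:00, tickets 15 euro
\end{verbatim}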