\section{Introduction}
Within the entertainment business there is no consistent way of informing
people about events. Different venues display their information, which is
often incomplete, in entirely different ways. Because of this, converting raw
information from venues into structured, consistent data is a relevant problem.

\section{Hyperleap}
Hyperleap is a small company that specializes in infotainment
(information + entertainment) and administers several websites which bundle
information about entertainment in an ordered and complete way. Right now, most
of the data entry is done by hand, which takes a lot of time.

\section{Research question}
The main research question is: \textit{How can we make an adaptive, autonomous
and programmable data mining program that can be set up by a non-IT
professional and that is able to transform raw data into structured data?}

The practical goal of the project is to make a crawler (for the web or other
document types) that can autonomously gather information after it has been
set up by an employee, not necessarily IT trained, via an intuitive interface.
Optionally, the crawler should be robust against small structural changes in
the website, be able to handle advanced website display techniques such as
JavaScript, and be able to notify the administrator when a site has
become uncrawlable and the crawler needs to be reprogrammed for that particular
site. The main purpose, however, is the translation from raw data to structured
data. The project is in principle a continuation of a past project done by Wouter
Roelofs~\cite{Roelofs2009}, which was also supervised by Franc Grootjen and
Alessandro Paula. However, it was never taken out of the experimental phase and
is therefore in need of continuation.

\section{Scientific relevance}
Currently, the techniques for converting unstructured data into structured
data are static and mainly usable only by IT specialists. There is a great need
for data mining in unstructured data, because the data within companies and on
the internet is piling up and is usually left to gather dust.