\section{Introduction}
People look on the internet for information about their favourite theater
show, music group or movie. This information is scattered across the
websites, mailing lists, news feeds and other sources maintained by the
venues, which makes searching for certain events a tedious, time-consuming
task. The venues do not present their information in a consistent way, and
to get all the details different sources must be consulted. Because of this,
converting raw information from venues into structured, consistent data is a
relevant problem.
\section{HyperLeap}
Hyperleap\footnote{\url{http://hyperleap.nl/}} is a small company based in
Nijmegen that specializes in bundling information from different sources
into a consistent information source about entertainment (infotainment).
It administers several websites for several entertainment categories.
Hyperleap differentiates itself from other companies with the same business
goals because it offers the most complete information most of the time.

Right now, most of the data in the database is added in two different ways.
The first method is entering the data into the database by hand: an employee
scans the raw input gathered from websites, separates the entries and
matches them to existing events or creates new events. This process is very
labour-intensive and therefore costly.

The second way of adding information to the database is through crawlers
programmed specifically for certain websites. Because a programmer is needed
to write each crawler individually, this is also costly. Moreover, this way
of gathering information is very error-prone: when a source changes its
structure, for example its layout, the crawler stops working. When this
happens the programmer has to adapt the crawler to the new situation, which
takes valuable time.
\section{Research question and practical goals}
The goal of the project is to create a software solution that enables an
employee without a particular programming background to train or retrain
crawlers for RSS\footnote{\url{http://www.rssboard.org/rss-specification}}
or Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds,
in such a way that the extracted information is categorized and put into the
database. The software will notify the administrator when a source has
changed, so that the new data can be added to the crawler's training set or
it can be decided to retrain the crawler for that source from scratch.
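To give an impression of what such a crawler has to work with, the sketch
below is a minimal example, not Hyperleap's actual implementation and using
a hypothetical feed URL, that fetches an RSS feed with Python's standard
library and prints the raw fields of each entry; these are the strings the
trained crawler would have to turn into structured event records.

\begin{verbatim}
# Minimal sketch, not the actual Hyperleap crawler: fetch a
# (hypothetical) RSS feed and list the raw entries that a trained
# crawler would have to classify and structure.
from urllib.request import urlopen
import xml.etree.ElementTree as ET

FEED_URL = "http://www.example-venue.nl/agenda.rss"  # hypothetical

with urlopen(FEED_URL) as response:
    tree = ET.parse(response)

for item in tree.iter("item"):  # RSS 2.0 wraps each event in <item>
    title = item.findtext("title", default="")
    date = item.findtext("pubDate", default="")
    description = item.findtext("description", default="")
    # In the envisioned system these raw strings are handed to a
    # trained crawler that extracts the venue, date, time and title.
    print(title, date, description, sep=" | ")
\end{verbatim}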
This brings up the main research question:
\begin{center}
\textit{How can we make an adaptive, autonomous and programmable data
mining program that can be set up by someone without programming experience
and that is capable of transforming raw data into structured data?}
\end{center}

In practice this means that the end product is a software solution that
performs the tasks described above.
\section{Scientific relevance}
Currently the techniques for converting unstructured data into structured
data are static and mainly usable only by computer science experts. There is
a great need for data mining of unstructured data, because the data within
companies and on the internet is piling up and is usually left to gather
dust.

The project is a continuation of earlier work by Roelofs et
al.~\cite{Roelofs2009}. The techniques described by Roelofs et al.\ focus on
extracting data from already isolated entries, so they can complement the
current project.
\section{Why RSS}
Websites often change their overall structure, and the way the data is
presented also changes a lot because of new insights and layout redesigns.
RSS feeds, on the other hand, are usually generated from the internal
database of the venue's site and are therefore almost always precise,
structured and consistent. When the structure of an RSS feed does change, it
is usually because the content management system of the website has changed.

Because RSS lacks the structural dimension that websites have, less
information is available: RSS feeds are basically raw strings that contain
all the information. Venues sometimes put HTML in their RSS feeds, but most
of the time this only serves to display large chunks of unstructured text in
a nicer fashion.
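As a small illustration, the sketch below uses a made-up feed entry (not
taken from an actual venue) and strips the presentational HTML with Python's
standard library; what remains is exactly the kind of single raw string,
containing all the event details, that the crawler has to convert into
structured data.

\begin{verbatim}
# Minimal sketch with a made-up feed entry: the embedded HTML only
# formats the text, so stripping the tags leaves one raw string that
# still holds all the event information.
from html.parser import HTMLParser

class TagStripper(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)

# Hypothetical <description> content from a venue's feed.
description = ("<p><b>The Example Band</b> live, Saturday 14 March "
               "2015, 20:00, tickets 15 euro</p>")

stripper = TagStripper()
stripper.feed(description)
print("".join(stripper.chunks))
# -> The Example Band live, Saturday 14 March 2015, 20:00, tickets 15 euro
\end{verbatim}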