\section{Hyperleap and their methods}
Hyperleap\footnote{\url{http://hyperleap.nl}} is a small company located in
Nijmegen that was founded in the early years of the internet. Hyperleap
provides \textit{infotainment}: a blend of the words \textit{information} and
\textit{entertainment}, denoting a specialized form of information, namely
information about the entertainment industry. Hyperleap manages the largest
\textit{infotainment} database, containing over $10{,}000$ events per week on
average, and also manages a database of over $54{,}000$ venues delivering the
entertainment. Besides this factual information, Hyperleap provides reviews,
previews, background information and more via several popular websites, each
specialized in a genre or category.

Hyperleap stands out from other \textit{infotainment} providers because the
quality and completeness of its data are comparatively high. This is because
all information is checked and matched against existing information before it
enters the database. To ensure this quality, all information enters the
database in roughly two steps.

In the first step the information is extracted from the raw data sources,
either by crawlers or via venue channels. Crawlers are specialized
applications programmed to extract information from a single source. Venue
channels are purpose-built XML feeds that already contain highly structured
information. The extracted information is placed in the so-called
\textit{Temporum}: a staging area where the gathered information is held
before it is entered into the real database.

The second step in the path of the information is the matching of the data;
this is where the actual quality checking takes place. Using several
techniques, employees match the incoming information to existing events or
create new events. This step also acts as a safety net for malfunctioning
crawlers: when a crawler delivers wrong information to the \textit{Temporum},
the programmer of the crawler has to be informed. Repairing crawlers is a
specialized task that can only be done by people with a computer science
background, so the programmers spend a large amount of their time on it and
repairs are expensive.

\section{Goal \& Research question}
The goal of the project is to relieve the programmers of constantly repairing
crawlers and to make the task of adapting, editing and removing crawlers
feasible for someone without programming experience. In practice this means
building an application that lets the user create, edit or remove crawlers.
For this project we focus on
RSS\footnote{\url{http://rssboard.org/rss-specification}} and
Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds only.
The application will maintain crawlers that are able to isolate categories of
information, so that the information arrives in the \textit{Temporum} in a
structured form and the task of matching and entering the data becomes less
expensive.

The application is built in such a way that a programmer can easily add fields
or categories to the data, keeping it flexible to change.

\section{Why RSS/Atom}
Information from venues comes in a variety of formats, each with its own
advantages and disadvantages. For this project we chose to focus on RSS/Atom
feeds because they are generally already structured and consistent in that
structure. Websites, for example, change their structure and layout
frequently, which makes it hard to keep crawlers up to date; RSS/Atom feeds
generally only change structure when the website or its content management
system is migrated or upgraded.

In contrast to websites, RSS/Atom feeds have no structural dimension in the
data. Because of this we have to use different techniques for isolating
information than the existing techniques used for extracting information from
websites. An entry in an RSS/Atom feed is essentially two fields of plain
text; however, this text almost always has the same structure and keywords,
so the information can be extracted by learning from those keywords and that
structure.
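As an illustration, the sketch below shows how such a plain-text field could
be split into categories using a fixed keyword/structure pattern. The feed
item, the title pattern and the \texttt{extract} helper are hypothetical and
only suggest the idea; they are not part of Hyperleap's actual crawlers.

\begin{verbatim}
import re
import xml.etree.ElementTree as ET

# Hypothetical RSS item; titles in such a feed tend to follow one
# fixed pattern per source, e.g. "date: artist - venue, city".
RSS_ITEM = """
<item>
  <title>21-06-2015: Band X - Doornroosje, Nijmegen</title>
  <description>Aanvang 20:00, zaal open 19:30.</description>
</item>
"""

# Assumed title pattern for this particular (fictional) feed.
TITLE_PATTERN = re.compile(
    r"(?P<date>\d{2}-\d{2}-\d{4}):\s*"
    r"(?P<artist>.+?)\s*-\s*"
    r"(?P<venue>[^,]+),\s*(?P<city>.+)")

def extract(item_xml):
    """Split the plain-text title of an item into categories."""
    item = ET.fromstring(item_xml)
    title = item.findtext("title", default="")
    match = TITLE_PATTERN.match(title)
    return match.groupdict() if match else None

print(extract(RSS_ITEM))
# {'date': '21-06-2015', 'artist': 'Band X',
#  'venue': 'Doornroosje', 'city': 'Nijmegen'}
\end{verbatim}

In practice such a pattern would be learned or configured per feed rather
than hard-coded, which is exactly what the application described above is
meant to make possible for non-programmers.
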
\section{Scientific relevance}
Currently the techniques for converting unstructured data into structured data
are static and mainly usable only by computer science experts. There is a
great need for data mining of unstructured data, because the data within
companies and on the internet keeps piling up and is usually left to gather
dust.

The project is a continuation of an earlier project by Roelofs et
al.~\cite{Roelofs2009}. The techniques described by Roelofs et al. focus more
on extracting data from websites and/or already isolated data, so they can
serve as an addition to the current project.