\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in their local theater?

In the early days of the internet, information about entertainment was
gathered from flyers, books, posters and radio/TV advertisements. People had
to search hard for the information and could easily miss a show simply because
it never crossed their path. Now that the internet has grown to its current
size, one would expect that missing an event is impossible because of the
loads of information received every day. The opposite is true.

Nowadays information about entertainment is offered via two main channels on
the internet: individual venues and aggregator websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and, above all, modern website that bundles their information with nice
graphics, animations and gimmicks. There are also companies that bundle the
information from different websites. Because the bundled information often
comes from the individual websites, it is usually incomplete. Individual
organizations tend to think it is obvious what the address of their venue is,
that their ticket price is always fixed at \EURdig$5.-$ or that a membership
is required to attend the events, and they usually put such details in a
disclaimer or on a separate page.

Aggregator websites want to bundle this information: for every event they want
all the details. This proves to be a hard task because these websites lack the
resources and time to combine the different sources into a complete and
correct overview of an event. Because of this, there are few websites that
bundle entertainment information in such a way that the entire database is
complete and consistent. Hyperleap\footnote{\url{http://hyperleap.nl}} tries
to achieve this goal.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Active since 1993, Hyperleap specializes in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}: it
stands for complete information about entertainment in the broadest sense.
Entertainment can be a theater show, a movie screening in the cinema, the
weekly bridge night in the local community center, music concerts and so on.
Hyperleap manages the largest database containing \textit{infotainment}. The
database contains over 10,000 events per week on average and the accompanying
venue database contains over 54,000 venues delivering the entertainment. All
the information is quality checked and therefore very reliable. Hyperleap is
the only company of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites, each specialized per
genre or category, and is bundled with other kinds of non-factual information
such as reviews, previews, background information and interviews.

As said before, Hyperleap is the only company of its kind with such high
quality data. This is because a lot of time and resources are spent
cross-comparing, matching and checking the data that enters the database. To
achieve this, the data is inserted into the database in several steps,
described in Figure~\ref{fig:1.1.1}.

\begin{figure}[H]
	\caption{Information flow Hyperleap database}
	\label{fig:1.1.1}
	\centering
	\scalebox{0.8}{
		\digraph[]{graph111}{
			rankdir=TB;
			node [shape="rectangle",fontsize=10,nodesep=0.5,ranksep=0.75,width=1]
			edge [weight=5.]
			i0 [label="Website"]
			i1 [label="Email"]
			i2 [label="Fax"]
			i3 [label="RSS/Atom"]
			p1 [label="Crawler: Preprocessing"]
			p2 [label="Temporum: Postprocessing"]
			o1 [label="Database: Insertion"]
			o2 [label="TheAgenda"]
			o3 [label="BiosAgenda"]
			o4 [label="..."]
			p1, p2, o1 [width=5];
			i0 -> p1
			i1 -> p1
			i2 -> p1
			i3 -> p1
			p1 -> p2
			p2 -> o1
			o1 -> o2
			o1 -> o3
			o1 -> o4
		}
	}
\end{figure}

The first step in the information flow is the source. Hyperleap processes many
different sources, which vary in type, for example website or fax, and in
origin. The sources also vary a lot in reliability: private information
streams from venues are very reliable, whereas other aggregator websites are
not reliable at all. Sources furthermore vary in structural consistency.
Websites from venues often look very consistent, but the entries are usually
typed in by hand by employees, and key information often appears in random
places surrounded by a lot of text. Ticket vendors, on the other hand, usually
present their information in a structured and consistent way. Depending on the
amount of consistency and structure, preprocessing happens in step two. All
the preprocessed data is then sent to the \textit{Temporum}.

The \textit{Temporum} is a big bin that contains raw data extracted from the
different sources; this data has to be postprocessed to be suitable for the
actual database. The postprocessing encompasses several possible tasks. The
first task is to check the validity of the entry. The second task is matching:
entries have to be matched to a venue or organization in the events database,
and they can also be matched to existing events that belong to the same tour
or series. Many aspects of these two steps are or can be done automatically,
but a lot of user input is still required to match and check the data. The
\textit{Temporum} functions as a safety net for the data.

When the data has been postprocessed, it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen. The database is linked to several categorical
websites that offer the information to users and accompany it with the
non-factual information discussed earlier.

\section{Goal \& Research question}
The second step in the information flow is crawling the sources and applying
the preprocessing. This is an expensive task because it requires a programmer
to be hired, since currently every crawler is programmed for one, or a couple
of, specific sources. Because a big group of sources changes often, this is
very expensive and has a long feedback loop. When a source changes, it is
first preprocessed in the old way, sent to the \textit{Temporum} and checked
and matched by a human. The human then notices the error in the data and
contacts the programmer. The programmer then has to reprogram the specific
crawler for the new structure. This feedback loop, shown in
Figure~\ref{fig:1.1.2}, can take days and can be the cause of gaps in the
database.
\begin{figure}[H]
	\caption{Feedback loop for malfunctioning crawlers}
	\label{fig:1.1.2}
	\centering
	\scalebox{0.8}{
		\digraph[]{graph112}{
			rankdir=LR;
			node [shape="rectangle"]
			source [label="Source"]
			crawler [label="Crawler"]
			temporum [label="Temporum"]
			user [label="User"]
			programmer [label="Programmer"]
			database [label="Database"]
			source -> crawler
			crawler -> temporum
			temporum -> user
			user -> database
			user -> programmer [constraint=false,color="blue"]
			user -> crawler [constraint=false,color="red"]
			programmer -> crawler [constraint=false,color="blue"]
		}
	}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers all
the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{fig:1.1.2} this means replacing the blue arrows by the red arrow.

For this project an application has been developed that provides an interface
to a crawler system that is able to crawl
RSS\footnote{\url{http://rssboard.org/rss-specification}} and
Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds.
The interface also provides the user with point-and-click tools to create,
modify, test and remove crawlers. The Hyperleap back end can, via this
interface, generate XML feeds that contain the crawled data. For editing the
structure and contents of the program a programmer is in theory also not
necessary, because everything someone might want to change is located in a
single human-readable file. In practice this means that one person, not
necessarily a programmer, can be instructed to change the structure, which can
also greatly reduce programmer intervention time.

\section{RSS/Atom}
RSS and Atom feeds, from now on collectively called RSS feeds, are publishing
feeds in the XML format\footnote{\url{http://www.w3.org/TR/REC-xml/}} that are
used to publish events. Every event or entry consists of several standardized
fields. The main data fields are the \textit{title} and the
\textit{description} field; these store the raw data that describes the event.
Furthermore, there are several auxiliary fields that store the link to the
full article, the publication date, possibly some media in the form of an
image or video URL, and a \textit{Globally Unique Identifier} (GUID).
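
As an illustration, the following Python sketch extracts these fields from an
RSS~2.0 feed using only the standard library. The feed URL is a hypothetical
placeholder, and Atom feeds would need different tag names (\texttt{entry},
\texttt{summary}, \texttt{id}) and namespace handling.

\begin{verbatim}
# Minimal sketch: extract the standard fields from an RSS 2.0 feed.
# The feed URL is a hypothetical placeholder.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/events.rss"

with urllib.request.urlopen(FEED_URL) as response:
    tree = ET.parse(response)

for item in tree.iter("item"):
    title = item.findtext("title", default="")
    description = item.findtext("description", default="")
    link = item.findtext("link", default="")
    pubdate = item.findtext("pubDate", default="")
    guid = item.findtext("guid", default="")
    print(title, "|", pubdate, "|", guid)
\end{verbatim}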

RSS feeds are mostly used by news sites to publish their articles. The feed
then only contains the headlines, and if the user reading the feed is
interested, he can click the so-called deeplink and is then sent to the full
website containing the article. Users often use programs that bundle
user-specified RSS feeds into one big summary feed, which can be used to sift
through a lot of news feeds. Such an RSS feed reader uses the GUID to skip
entries that are already present in its internal database.
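
The sketch below illustrates this deduplication, assuming for illustration
that the reader's internal database can be modeled as a set of previously
seen GUIDs.

\begin{verbatim}
# Sketch: skip entries whose GUID has been seen before. The set
# stands in for the feed reader's internal database.
def new_entries(entries, seen):
    """Yield entries with an unseen GUID; remember new GUIDs."""
    for entry in entries:
        guid = entry.get("guid")
        if guid and guid not in seen:
            seen.add(guid)
            yield entry

seen = set()
batch = [{"guid": "ev-1", "title": "Concert"},
         {"guid": "ev-2", "title": "Play"},
         {"guid": "ev-1", "title": "Concert"}]  # duplicate entry
print([e["title"] for e in new_entries(batch, seen)])
# prints: ['Concert', 'Play']
\end{verbatim}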

Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the resulting RSS feeds are
generally very consistent in their structure. In the entertainment industry
venues often use a CMS for their website so that users with no programming or
web-development background are able to post news items and event information.

\section{Why RSS/Atom}
Information from venues comes in various formats, each with its own positive
and negative points. For this project we chose to focus on RSS feeds. RSS
feeds are, as explained earlier, very consistent in their structure; the
structure basically only changes when the CMS changes or is upgraded. Because
of this, patterns do not have to be retrained often.

In comparison to websites, RSS feeds have no structural dimension in the data:
all the information in an entry is basically two fields of plain text.
Therefore an entirely new strategy has to be applied to train the patterns.

\section{Scientific relevance and similar research}
Currently the techniques for converting non-structured data into structured
data are static and mainly usable only by computer science experts. There is a
great need for data mining of non-structured data, because the data within
companies and on the internet keeps piling up and is usually left to gather
dust.

The project is a follow-up of earlier work by Roelofs et
al.~\cite{Roelofs2008,Roelofs2009}. Roelofs et al. described techniques for
recognizing patterns in website data and published an adapted Levenshtein
distance algorithm to recognize data in isolated text. These techniques for
recognizing data in text can still be used to interpret the isolated parts
extracted in this project.
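
For reference, the standard Levenshtein distance can be computed with the
classic dynamic-programming algorithm sketched below; the adapted variant of
Roelofs et al. is not reproduced here.

\begin{verbatim}
# Textbook Levenshtein (edit) distance via dynamic programming.
# Note: this is the standard algorithm, not the adapted version
# described by Roelofs et al.
def levenshtein(a, b):
    prev = list(range(len(b) + 1))  # row for the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("theater", "theatre"))  # prints: 2
\end{verbatim}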

\section{Directed Acyclic Graphs}
A graph is a mathematical structure described by the ordered pair $G=(V,E)$,
where $V$ is the set of nodes and $E$ is a set of unordered pairs of nodes
that we will call \textit{undirected edges}.

Directed graphs are a special kind of graph. A directed graph is described by
the ordered pair $G=(V,A)$, where $A$ is a set of ordered pairs of nodes that
we will call \textit{directed edges}. Directed edges can be visualized as
arrows in a graph, whereas undirected edges are just lines connecting nodes
with no particular direction.

Directed Acyclic Graphs (DAGs) are again a special kind of directed graph.
DAGs do not allow cycles: no path can visit a node more than once, regardless
of the node it starts from. DAGs have the nice property that checking whether
a sequence is present in the graph has a computational complexity of only
$\mathcal{O}(L)$, where $L$ is the length of the sequence.

The type of graph used in this project is another special kind of DAG, the
Directed Acyclic Word Graph (DAWG). This type of graph can be used to store
large dictionaries of words. The nodes of a DAWG basically represent positions
in a word and the edges describe the character that is added when traversing
the edge. For example, the graph in Figure~\ref{fig:2.1.1} describes a
dictionary that holds the words \textit{abd}, \textit{bad} and \textit{bae}.
Testing whether a word is present in a DAWG, just as for a DAG, falls in the
computational complexity class $\mathcal{O}(L)$, meaning that the lookup time
grows linearly with the length of the word.

\begin{figure}[H]
	\caption{Example DAWG}
	\label{fig:2.1.1}
	\centering
	\digraph[]{graph21}{
		rankdir=LR;
		1,2,3,4,5 [shape="circle"];
		6 [shape="doublecircle"];
		1 -> 2 [label="a"];
		2 -> 3 [label="b"];
		3 -> 6 [label="d"];
		1 -> 4 [label="b"];
		4 -> 5 [label="a"];
		5 -> 6 [label="d"];
		5 -> 6 [label="e"];
	}
\end{figure}
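
To make the lookup cost concrete, the sketch below encodes the DAWG of
Figure~\ref{fig:2.1.1} as a plain edge dictionary, a representation assumed
here for illustration. Node 6 is the only accepting node, and the lookup
follows one edge per character, hence $\mathcal{O}(L)$.

\begin{verbatim}
# The DAWG of Figure 2.1.1 as an edge dictionary (a representation
# assumed for illustration). Node 6 is the only accepting node.
EDGES = {
    (1, "a"): 2, (2, "b"): 3, (3, "d"): 6,  # spells "abd"
    (1, "b"): 4, (4, "a"): 5,
    (5, "d"): 6,                            # completes "bad"
    (5, "e"): 6,                            # completes "bae"
}
ACCEPTING = {6}

def contains(word, start=1):
    """Follow one edge per character: O(L) for a word of length L."""
    node = start
    for ch in word:
        node = EDGES.get((node, ch))
        if node is None:        # no outgoing edge with this label
            return False
    return node in ACCEPTING

for w in ["abd", "bad", "bae", "ab", "bda"]:
    print(w, contains(w))   # True for the three dictionary words only
\end{verbatim}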