\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in their local theater?

In the early days of the internet, information about entertainment was
gathered from flyers, books, posters and radio/TV advertisements. People had
to search hard for the information and could easily miss a show simply
because it never crossed their path. Now that the internet has grown to its
current size one would expect that missing an event is impossible because of
the loads of information people receive every day. The opposite is true.

Nowadays information about entertainment is offered on the internet via two
main channels: individual venues and combined websites.

Individual venues put a lot of effort and resources into building a
beautiful, fast and above all modern website that bundles their information
with nice graphics, animations and gimmicks. There are also companies that
bundle the information from different websites. Because the bundled
information often comes from the individual websites, it is usually
incomplete. Individual organisations tend to think it is obvious what the
address of their venue is, that their ticket price is always fixed at
\EURdig$5.-$ and that a membership is needed to attend the events. They
usually put such information in a disclaimer or on a separate page.

Combined websites want to bundle this information: for every event they want
all the details and information. This proves to be a hard task because these
websites do not have the resources and time to combine the different sources
into a good and complete overview of an event. Because of this there are not
many websites that bundle entertainment information so that the entire
database is complete and consistent.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve this goal.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Hyperleap, active since 1995, is specialized in
producing, publishing and maintaining \textit{infotainment}.
\textit{Infotainment} is a portmanteau of \textit{information} and
\textit{entertainment}; it stands for a combination of factual and
subjectual information about a certain category. In the case of Hyperleap
the category is the entertainment industry, which encompasses all its
facets: cinemas, theaters, concerts, conferences and so on. The factual
information includes things such as the date, time, host or location. The
subjectual information can be reviews, previews, photos or background
information.

Hyperleap manages the largest database containing \textit{infotainment}
about the entertainment industry. The database contains over $10.000$
events per week on average and the venue database contains over $54.000$
venues delivering the entertainment. All the subjectual information is
gathered or created by Hyperleap itself, cross-checked against different
sources and quality checked, which makes it very reliable. Hyperleap is the
only one of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites, each specialized
per genre or category.

\section{Information flow}
As said before, Hyperleap is the only one of its kind with such high
quality data. This is because a lot of time and resources are spent to
cross-compare, match and check the data that enters the database. To
achieve this the data is inserted into the database in several steps,
described in Figure~\ref{fig:1.1.1}.

\begin{figure}[H]
\caption{Information flow Hyperleap database}
\label{fig:1.1.1}
\centering
\scalebox{0.7}{
\digraph[]{graph111}{
rankdir=TB;
node [shape="rectangle",fontsize=10,nodesep=0.5,ranksep=0.75,width=1]
edge [weight=5.]
i0 [label="Website"]
i1 [label="Email"]
i2 [label="Fax"]
i3 [label="RSS/Atom"]
p1 [label="Crawler: Preprocessing",width=5]
p2 [label="Temporum: Postprocessing",width=5]
o1 [label="Database: Insertion",width=5]
o2 [label="TheAgenda"]
o3 [label="BiosAgenda"]
o4 [label="..."]
i0 -> p1
i1 -> p1
i2 -> p1
i3 -> p1
p1 -> p2
p2 -> o1
o1 -> o2
o1 -> o3
o1 -> o4
}
}
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There
are several criteria the information has to comply with before it can enter
the database. Hyperleap wants at minimum the following fields filled in
before an event can enter the database:
\begin{itemize}
\item[What]
The \textit{What:} field states the content of the event. Content is a very
broad notion; in practice it can contain the concert tour, theater show
name, movie title, festival title and many more.
\item[Where]
The \textit{Where:} field is the location of the event. This is often
omitted because the organization thinks it is obvious. This field can also
include different sublocations, for example when a pop concert venue has
its own building but organizes a festival in some park in the summer. This
data is often assumed to be trivial and inherent but in practice this is
not the case: in this example the name of the park alone is often not
enough for an outsider.
\item[When]
The \textit{When:} field is the time and date of the event. Hyperleap wants
to have at minimum the date, start time and end time. In practice end times
are often omitted because they are not fixed or the organization thinks
they are obvious.
\end{itemize}
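This minimal completeness requirement translates directly into a shallow
validity check, as the following Python sketch illustrates; the event
record and its field names are hypothetical and not Hyperleap's actual data
model.

\begin{verbatim}
from datetime import date, time

class Event:
    """Hypothetical minimal event record; field names are illustrative."""
    def __init__(self, what=None, where=None, when_date=None,
                 start_time=None, end_time=None):
        self.what = what            # e.g. tour, show or movie title
        self.where = where          # venue, possibly with sublocation
        self.when_date = when_date  # datetime.date of the event
        self.start_time = start_time
        self.end_time = end_time

def is_complete(event):
    # All of What, Where and When (date, start, end) must be filled.
    return None not in (event.what, event.where, event.when_date,
                        event.start_time, event.end_time)

# An entry without an end time is rejected:
e = Event(what="Foobar CD presentation", where="Small salon",
          when_date=date(2015, 5, 20), start_time=time(18, 0))
print(is_complete(e))  # False
\end{verbatim}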
Venues often present incomplete data entries that do not comply with the
requirements explained above.

The first step in the information flow is the source. Hyperleap processes
different sources and every source has different characteristics. Sources
can be modern sources like websites or social media but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as sources
vary in content structure they also vary in reliability. For example the
following entry is a very well structured, probably generated and thus
probably also reliable event description. Such an entry could originate for
example from an RSS feed.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD in
combination with a show. Location: small salon.}
\end{flushleft}

An example of a terrible entry could be the following text, which could
originate from a flyer or a social media post.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the park
tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard
for people to grasp at first, let alone for a machine. When someone wants
the full information he has to tap into different resources, which might
not always be available.
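To illustrate the contrast: for the well structured entry above a single
regular expression already recovers all the required fields. The following
is a minimal Python sketch, not the extraction method developed in this
project.

\begin{verbatim}
import re

entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their new CD "
         "in combination with a show. Location: small salon.")

# Date, times, content and location follow a fixed, generated layout.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}), "
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "
    r"(?P<what>.*)\. Location: (?P<where>.*)\.")

m = pattern.match(entry)
print(m.group("date", "start", "end", "where"))
# ('2015-05-20', '18:00', '23:00', 'small salon')
# The vague flyer text would simply not match: pattern.match returns None.
\end{verbatim}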
\paragraph{Crawler}
The second step is the crawling of the sources, which currently happens
with a couple of methods. Some sites are very structured and a programmer
creates a program that can visit the website systematically and extract the
information. Some other sources are very hard to crawl programmatically and
are visited by employees. The programmed crawlers are always created
specifically for one or a couple of sources and when a source changes, for
example its structure, the programmer has to adapt the crawler, which is
costly. Information from the different crawlers then goes to the
\textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from
different sources; this data has to be post processed to be suitable for
the actual database. This post processing encompasses several possible
tasks.

The first task is to check the validity of the entry. This is a very
shallow test to check that the crawler is not malfunctioning and that there
is no nonsense in the data.

The second task is matching the entry to several objects. For example the
entry has to be matched to a certain venue when its source is a ticket
vendor that sells tickets for multiple venues. Another example is that the
event is a pop concert that is part of a big tour. Many of these tasks are
done by, or with the aid of, a computer program. Almost no data goes
straight through the \textit{Temporum} and this property makes the
\textit{Temporum} a safety net for all the incoming data.
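As an illustration of such a matching task, the following Python sketch
matches an incoming venue name against a list of known venues using fuzzy
string similarity; the venue list and threshold are made up for this
example and the actual matching in the \textit{Temporum} is more involved.

\begin{verbatim}
from difflib import SequenceMatcher

# Hypothetical excerpt of the venue database.
venues = ["Paradiso Amsterdam", "Doornroosje Nijmegen", "013 Tilburg"]

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_venue(name, venues, threshold=0.8):
    """Return the best matching known venue, or None if nothing is close."""
    best = max(venues, key=lambda v: similarity(name, v))
    return best if similarity(name, best) >= threshold else None

print(match_venue("Doornroosje, Nijmegen", venues))  # Doornroosje Nijmegen
print(match_venue("Madison Square Garden", venues))  # None
\end{verbatim}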
\paragraph{Database \& Publication}
When the data has been post processed it is entered into the final
database. The database contains all the events that happened in the past
and all the events that are going to happen in the future. The database
also contains information about the venues. The database is linked to
several categorical websites that offer the information to users and
accompany it with the other part of the \textit{infotainment}, namely the
subjectual information, which is usually presented in the form of trivia,
photos, interviews, reviews, previews and much more.

\section{Goal \& Research question}
The second step in the information flow is crawling the sources and applying
the preprocessing. This is an expensive task because it requires a
programmer to be hired, since currently all crawlers are programmed for
one, or a couple of, specific sources. Because a big group of sources
changes often, this is very expensive and gives a long feedback loop. When
a source changes it is first preprocessed in the old way, sent to the
\textit{Temporum} and checked and matched by a human. The human then
notices the error in the data and contacts the programmer, who has to
reprogram the specific crawler for the new structure. This feedback loop,
shown in Figure~\ref{fig:1.1.2}, can take days and can be the reason for
gaps or faulty information in the database.
\begin{figure}[H]
\caption{Feedback loop for malfunctioning crawlers}
\label{fig:1.1.2}
\centering
\scalebox{0.8}{
\digraph[]{graph112}{
rankdir=LR;
node [shape="rectangle"]
source [label="Source"]
crawler [label="Crawler"]
temporum [label="Temporum"]
user [label="User"]
programmer [label="Programmer"]
database [label="Database"]
source -> crawler
crawler -> temporum
temporum -> user
user -> database
user -> programmer [constraint=false,style="dotted"]
user -> crawler [constraint=false,style="dashed"]
programmer -> crawler [constraint=false,style="dotted"]
}
}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers
all the time and to make the task of adapting, editing and removing
crawlers doable for someone without programming experience. In practice
this means replacing the dotted arrows in Figure~\ref{fig:1.1.2} by the
dashed arrow.

For this project an application has been developed that provides an
interface to a crawler system that is able to crawl
RSS\footnote{\url{http://rssboard.org/rss-specification}} and
Atom\footnote{\url{http://tools.ietf.org/html/rfc5023}} publishing feeds.
The interface also provides the user with point-and-click interfaces to
create, modify, test and remove crawlers. The Hyperleap back end can, via
this interface, generate XML feeds that contain the crawled data. For
editing the structure and contents of the program a programmer is in theory
also not necessary, because all the things someone wants to change are
located in a single human readable file. In practice this means that one
person, not necessarily a programmer, can be instructed to change the
structure, which can greatly reduce programmer intervention time.

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the
XML format\footnote{\url{http://www.w3.org/TR/REC-xml/}} that are used to
publish events. Every event or entry consists of several standardized
fields. The main data fields are the \textit{title} and the
\textit{description} field; these store the raw data that describes the
event. Furthermore there are several auxiliary fields that store the link
to the full article, the publishing date, some media in the form of an
image or video URL and a \textit{Globally Unique Identifier} (GUID).

RSS feeds are mostly used by news sites to publish their articles. The feed
then only contains the headlines; if the user reading the feed is
interested, he can click the so-called deeplink to be sent to the full
website containing the article. Users often use programs that bundle user
specified RSS feeds into big summary feeds that can be used to sift through
a lot of news feeds. The RSS feed reader uses the GUID to skip entries that
are already present in its internal database.
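A feed with this standard layout can be read with a few lines of code. The
following Python sketch uses only the standard library and deduplicates on
the GUID just as a feed reader would; the feed URL is a placeholder.

\begin{verbatim}
import xml.etree.ElementTree as ET
from urllib.request import urlopen

seen_guids = set()  # stands in for the reader's internal database

# Placeholder URL; any RSS 2.0 feed has the same channel/item layout.
with urlopen("http://example.com/events.rss") as feed:
    tree = ET.parse(feed)

for item in tree.iterfind(".//item"):
    guid = item.findtext("guid")
    if guid in seen_guids:
        continue  # entry was already processed earlier
    seen_guids.add(guid)
    # The title and description fields carry the raw event data.
    print(item.findtext("title"), "-", item.findtext("description"))
\end{verbatim}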
Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content
published on the website. Because the process is automatic the RSS feeds
are generally very structured and consistent in their structure. In the
entertainment industry venues often use a CMS for their website so that
users with no programming or web development background are able to post
news items and event information.

\section{Why RSS/Atom}
Information from venues comes in various formats and each format has its
own positive and negative points. For this project we chose to focus on
RSS feeds. RSS feeds are, as explained earlier, very structured and
consistent in their structure. The structure basically only changes when
the CMS changes or is upgraded. Because of this, patterns do not have to be
retrained often.

In comparison to websites RSS feeds have no structural dimension in the
data; all the information in an entry is basically two fields of plain
text. Therefore an entirely new strategy has to be applied to train the
patterns.

\section{Scientific relevance and similar research}
Currently the techniques for converting unstructured data into structured
data are static and mainly only usable by computer science experts. There
is a great need for data mining of unstructured data because the data
within companies and on the internet is piling up and is usually left to
gather dust.

This project is a followup of past projects by Roelofs et
al.~\cite{Roelofs2008,Roelofs2009}. Roelofs et al. described techniques for
recognizing patterns in website data and published an adapted Levenshtein
distance algorithm to recognize data in isolated text. These techniques of
recognizing data in text can still be used to interpret the isolated parts
extracted in this project.

\section{Directed Acyclic Graphs}
A graph ($G$) is a mathematical structure to describe relations between
nodes with edges. A standard graph is defined as the ordered pair
$G=(V,E)$. In this ordered pair $V$ is the set of nodes and $E$ is the set
of undirected edges, where every undirected edge is a tuple of two nodes.
Figure~\ref{fig:graphexample} is specified as:
$$G=(\{n_1, n_2, n_3\}, \{(n_1, n_2), (n_2, n_3), (n_2, n_3)\})$$

\begin{figure}[H]
\caption{Example Graph}
\label{fig:graphexample}
\centering
\digraph[]{graphexample}{
rankdir=LR
n1 -> n2 [dir="none"]
n2 -> n3 [dir="none"]
n2 -> n3 [dir="none"]
}
\end{figure}

Directed graphs are a special kind of graph. A directed graph is described
as the ordered pair $G=(V,A)$, where $A$ is a set of ordered pairs that we
will call \textit{directed edges}. Directed edges can be visualized as
arrows in a graph whereas undirected edges are just lines connecting the
nodes with no particular direction.

Directed Acyclic Graphs (DAGs) are again a special kind of directed graph.
DAGs do not allow cycles: every node is reachable at most once when
starting from an arbitrary node in the graph. DAGs have the nice property
that checking whether a sequence is present in the graph has a
computational complexity of only $\mathcal{O}(L)$, where $L$ is the length
of the sequence.

The type of graph used in the project is another special kind of DAG called
a Directed Acyclic Word Graph (DAWG). This type of graph can be used to
store large dictionaries of words. The nodes of a DAWG basically represent
indices in a word and the edges describe the character added when
traversing. For example the graph in Figure~\ref{fig:2.1.1} describes a
dictionary that holds the words \textit{abd}, \textit{bad} and
\textit{bae}. Testing whether a word is present in the DAWG falls, just as
for a DAG, in the computational complexity class of $\mathcal{O}(L)$,
meaning that the cost grows linearly with the length of the word.

\begin{figure}[H]
\caption{Example DAWG}
\label{fig:2.1.1}
\centering
\digraph[]{graph21}{
rankdir=LR;
node [shape="circle"];
6 [shape="doublecircle"];
1 -> 2 [label="a"];
2 -> 3 [label="b"];
3 -> 6 [label="d"];
1 -> 4 [label="b"];
4 -> 5 [label="a"];
5 -> 6 [label="d"];
5 -> 6 [label="e"];
}
\end{figure}
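The $\mathcal{O}(L)$ membership test on this DAWG can be made concrete in a
few lines of Python; the graph of Figure~\ref{fig:2.1.1} is hard coded here
as an edge dictionary.

\begin{verbatim}
# Edges of the example DAWG: (node, character) -> next node.
edges = {
    (1, 'a'): 2, (2, 'b'): 3, (3, 'd'): 6,
    (1, 'b'): 4, (4, 'a'): 5, (5, 'd'): 6, (5, 'e'): 6,
}
accepting = {6}  # the double circled end node

def contains(word, start=1):
    """Follow one edge per character: O(L) for a word of length L."""
    node = start
    for char in word:
        node = edges.get((node, char))
        if node is None:
            return False  # no outgoing edge for this character
    return node in accepting

print([contains(w) for w in ("abd", "bad", "bae", "ab")])
# [True, True, True, False]
\end{verbatim}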