\section{Introduction}
What do people do when they want to see a movie, attend a concert or find out
which shows are playing in their local theater?

In the early days of the internet, information about entertainment was
gathered from flyers, books, posters and radio/tv advertisements. People had
to search hard for the information and could easily miss a show just because
it never crossed their path. Now that the internet has grown to what it is
today, one would think that missing an event is impossible given the flood of
information received every day. The opposite is true.

Nowadays information about entertainment is offered on the internet via two
main channels: individual venues and combined websites.

Individual venues put a lot of effort and resources into building a
beautiful, fast and above all modern website that bundles their information
with nice graphics, animations and gimmicks. There are also companies that
bundle the information from different websites. Because the bundled
information often comes from the individual websites, it is usually
incomplete. Individual organizations tend to think it is obvious what the
address of their venue is, that their ticket price is always fixed at
\EURdig$5.-$ or that you need a membership to attend the events, and they
usually put this information in a disclaimer or on a separate page.

Combined websites want to bundle this information: for every event they want
all the details. This proves to be a hard task because these websites do not
have the resources and time to combine the different sources into a good and
complete overview of an event. Because of this, there are not many websites
that bundle entertainment information in such a way that the entire database
is complete and consistent. Hyperleap\footnote{\url{http://hyperleap.nl}}
tries to achieve this goal.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Hyperleap, active since 1995, specializes in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}: a
combination of factual information and subjective information about a certain
category. In the case of Hyperleap the category is the entertainment
industry, which encompasses all facets from cinemas and theaters to concerts
and conferences. The factual information includes things such as the date,
time, host or location. The subjective information can be reviews, previews,
photos or background information.

Hyperleap manages the largest database containing \textit{infotainment} about
the entertainment industry. The database contains over $10.000$ events per
week on average and the venue database contains over $54.000$ venues
delivering the entertainment. All subjective information is gathered from
different sources or created by Hyperleap itself and is quality checked, and
it is therefore very reliable. Hyperleap is the only one of its kind with
information of such high quality. The \textit{infotainment} is presented via
several websites, each specialized per genre or category.

\section{Information flow}
As said before, Hyperleap is the only one of its kind with such high quality
data. This is because a lot of time and resources are spent to cross-compare,
match and check the data that enters the database. To achieve this, the data
is inserted into the database in several steps, described in
Figure~\ref{informationflow}.

\begin{figure}[H]
    \caption{Information flow Hyperleap database}
    \label{informationflow}
    \centering
    \scalebox{0.7}{
        \digraph[]{graph111}{
            rankdir=TB;
            node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1]
            edge [weight=5.]
            i0 [label="Website"]
            i1 [label="Email"]
            i2 [label="Fax"]
            i3 [label="RSS/Atom"]
            p1 [label="Crawler: Preprocessing"]
            p2 [label="Temporum: Postprocessing"]
            o1 [label="Database: Insertion"]
            o2 [label="TheAgenda"]
            o3 [label="BiosAgenda"]
            o4 [label="..."]
            p1, p2, o1 [width=5];
            i0 -> p1
            i1 -> p1
            i2 -> p1
            i3 -> p1
            p1 -> p2
            p2 -> o1
            o1 -> o2
            o1 -> o3
            o1 -> o4
        }
    }
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There are
several criteria the information has to comply with before it can enter the
database. Hyperleap requires at minimum the following fields to be filled in
before any event can enter the database:
\begin{itemize}
    \item[What]
        The \textit{What:} field states the content of the event. Content is
        a very broad notion; in practice it can contain the concert tour,
        theater show name, movie title, festival title and much more.
    \item[Where]
        The \textit{Where:} field is the location of the event. This is often
        omitted because the organization thinks it is obvious. This field can
        also include different sublocations, for example when a pop concert
        venue has its own building but organizes a festival in some park in
        the summer. This data is often assumed to be trivial and inherent,
        but in practice this is not the case: in this example, the name of
        the park alone is often not enough for an outsider.
    \item[When]
        The \textit{When:} field is the time and date of the event. Hyperleap
        wants to have at minimum the date, start time and end time. In
        practice, end times are often omitted because they are not fixed or
        because the organization thinks they are obvious.
\end{itemize}

Venues often present incomplete data entries that do not comply with the
requirements explained above; a minimal completeness check is sketched below.
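
The sketch assumes events are simple dictionaries; the field names and
structure are hypothetical and only serve to show the idea, not Hyperleap's
actual schema.

\begin{verbatim}
# Hypothetical sketch: check whether an event entry meets the
# minimum What/Where/When requirements before it may enter the
# database. The field names are illustrative only.
REQUIRED_FIELDS = ["what", "where", "date", "start_time", "end_time"]

def is_complete(entry):
    """Return True when all required fields are present and non-empty."""
    return all(entry.get(field) for field in REQUIRED_FIELDS)

entry = {"what": "Foobar CD presentation", "where": "Small salon",
         "date": "2015-05-20", "start_time": "18:00", "end_time": "23:00"}
print(is_complete(entry))  # True
\end{verbatim}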

The first step in the information flow is the source. Hyperleap processes
different sources and every source has different characteristics. Sources can
be modern sources like websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as sources
vary in content structure, they also vary in reliability. For example, the
following entry is an example of a very well structured, probably generated
and thus probably also reliable event description. The entry could originate,
for example, from an RSS feed.

\begin{flushleft}
    \texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD
    in combination with a show. Location: small salon.}
\end{flushleft}

An example of a terrible entry could be the following text, which could
originate from a flyer or a social media post.

\begin{flushleft}
    \texttt{\textit{Foobar} playing to celebrate their CD release in the park
    tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard
for people to grasp at first, let alone for a machine. When someone wants the
full information he has to tap into different resources, which might not
always be available.
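
To illustrate why the first entry, in contrast, is machine friendly, the
following sketch extracts the date, times, content and location from it with
a single regular expression; the pattern is an assumption tailored to this
one example, not a general extraction method.

\begin{verbatim}
import re

entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their new CD "
         "in combination with a show. Location: small salon.")

# One pattern suffices because the entry is generated and therefore
# rigidly structured (illustrative pattern only).
pattern = (r"(?P<date>\d{4}-\d{2}-\d{2}), "
           r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "
           r"(?P<what>.*)\. Location: (?P<where>.*)\.")

match = re.match(pattern, entry)
if match:
    print(match.groupdict())
\end{verbatim}

No such pattern exists for the second entry: ``tomorrow evening'' and ``the
park'' only acquire a date, time and location through outside knowledge.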

\paragraph{Crawler}
The second step is the crawling of the websites, which currently happens in a
couple of ways. Some sites are very structured and a programmer creates a
program that can visit the website systematically and extract the
information. Some other sources are very hard to crawl programmatically and
are visited by employees instead. The programmed crawlers are always created
specifically for one or a couple of sources, and when a source changes, for
example its structure, the programmer has to adapt the crawler, which is
costly. Information from the different crawlers then goes to the
\textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from the
different sources; this data has to be post-processed to be suitable for the
actual database. The processing encompasses several possible tasks. The first
task is to check the validity of the entry. This is a very shallow test that
checks that the crawler is not malfunctioning and that there is no nonsense
in the data. Most of the data is not directly checked for validity: the data
is skimmed for strange things, but not every datapoint is checked. The second
task is matching the entry to several objects. For example, an entry has to
be matched to a certain venue when its source is a ticket vendor who sells
tickets for multiple venues. Another example is an event that is a pop
concert and part of a big tour. Many of these tasks are done by, or with the
aid of, a computer program. Almost no data goes straight through the
\textit{Temporum}, and this property makes the \textit{Temporum} a safety net
for all the incoming data.

\paragraph{Database \& Publication}
When the data has been post-processed it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen in the future. The database also contains
information about the venues. The database is linked to several categorical
websites that offer the information to users and accompany it with the other
part of the \textit{infotainment}, namely the subjective information, which
is usually presented in the form of trivia, photos, interviews, reviews,
previews and much more.

\section{Goal \& Research question}
The second step in the information flow is crawling the sources and applying
the preprocessing. This is an expensive task because it requires a programmer
to be hired, since currently all crawlers are programmed for one, or a couple
of, specific sources. Because a big group of sources changes often, this is
very expensive and has a long feedback loop. When a source changes, it is
first preprocessed in the old way, sent to the \textit{Temporum} and checked
and matched by a human. The human then notices the error in the data and
contacts the programmer, who has to reprogram the specific crawler for the
new structure. This feedback loop, shown in Figure~\ref{feedbackloop}, can
take days and can be the reason for gaps or faulty information in the
database.
\begin{figure}[H]
    \caption{Feedback loop for malfunctioning crawlers}
    \label{feedbackloop}
    \centering
    \scalebox{0.5}{
        \digraph[]{graph112}{
            rankdir=LR;
            node [shape="rectangle"]
            source [label="Source"]
            crawler [label="Crawler"]
            temporum [label="Temporum"]
            user [label="User"]
            programmer [label="Programmer"]
            database [label="Database"]
            source -> crawler
            crawler -> temporum
            temporum -> user
            user -> database
            user -> programmer [constraint=false,style="dotted"]
            user -> crawler [constraint=false,style="dashed"]
            programmer -> crawler [constraint=false,style="dotted"]
        }
    }
\end{figure}

The goal of the project is to relieve the programmer from repairing crawlers
all the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{feedbackloop} this means replacing the dotted arrows by the
dashed arrow.

For this project an application has been developed that provides an interface
to a crawler system that is able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface also provides the user with point-and-click
interfaces to create, modify, test and remove crawlers. The Hyperleap back
end can, via this interface, generate XML feeds that contain the crawled
data. For editing the structure and contents of the program a programmer is
in theory also not necessary, because all the things someone might want to
change are located in a single file that is human readable. In practice this
means that one person, not necessarily a programmer, can be instructed to
change the structure, which can also greatly reduce programmer intervention
time.

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field; in these fields the raw
data describing the event is stored. Further there are several auxiliary
fields that store the link to the full article, the publishing date, some
media in the form of an image or video URL and a \textit{Globally Unique
Identifier} (GUID)\footnote{A GUID is a unique identifier that in most cases
is the permalink of the article. A permalink is a link that will always point
to the article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}; this listing shows a partly truncated RSS feed of a
well known venue in the Netherlands.

\begin{listing}
    \caption{An example of a partly truncated RSS feed of a well known Dutch
    venue}
    \label{examplerss}
    \xmlcode{exrss.xml}
\end{listing}

RSS feeds are mostly used by news sites to publish their articles. The feed
then only contains the headlines, and if the user reading the feed is
interested he can click the so-called deeplink to be sent to the full website
containing the article. Users often use programs that bundle user specified
RSS feeds into big summary feeds that can be used to sift through a lot of
news feeds. The RSS feed reader uses the GUID to skip entries that are
already present in its internal database.
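
A minimal sketch of such a reader, assuming a standard RSS~2.0 layout with
\texttt{title}, \texttt{link} and \texttt{guid} fields per item, could look
as follows; the feed URL is a placeholder.

\begin{verbatim}
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/feed.rss"  # placeholder URL
seen_guids = set()  # stands in for the reader's internal database

with urllib.request.urlopen(FEED_URL) as response:
    root = ET.fromstring(response.read())

# In RSS 2.0 every entry is an <item> element inside <channel>.
for item in root.iter("item"):
    guid = item.findtext("guid")
    if guid in seen_guids:
        continue  # entry was seen before: skip it
    seen_guids.add(guid)
    print(item.findtext("title"), "-", item.findtext("link"))
\end{verbatim}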

Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the RSS feeds are generally
very consistent in their structure. In the entertainment industry venues
often use a CMS for their website so that users with no programming or
website background are able to post news items and event information.

\section{Why RSS?}
\label{sec:whyrss}
The first proposal described formats like HTML, fax/email, RSS, XML and some
more. Because of the scope of the project and the time planned for it, we had
to drop some of the input formats, because they all require different
techniques. For example, when the input source is in HTML format, most
probably a website, a great deal of the information extraction can be
automated using the structural information that is characteristic for HTML.
For fax/email however there is almost no structural information, and most of
the automation techniques require natural language processing. We chose RSS
feeds because RSS feeds lack inherent structural information but are still
very structured. This structure exists because, as said above, RSS feeds are
generated and therefore almost always look the same. Also, in RSS feeds most
venues use particular structural identifiers, namely characters: they
separate fields with vertical bars, commas, whitespace and other non-text
characters. These field separators and keywords can be hard for a computer
to detect, but people are very good at detecting them. With one look they
can identify the characters and keywords and build a pattern in their head.
Another reason we chose RSS is temporal consistency: RSS feeds are almost
always generated, and because of that the structure of the entries is very
unlikely to change. Basically, an RSS feed only changes structure when the
CMS that generates it changes its generation algorithm. This property is
useful because the crawlers then do not have to be retrained very often.
Because of the lack of inherent structure, entirely new strategies had to be
applied.
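
As an illustration of how telling these separators are, the hypothetical
title field below can be cut into fields by splitting on them; this sketch
only demonstrates the idea and is not how the crawlers in this project work.

\begin{verbatim}
import re

# Hypothetical RSS title in the style many venues use.
title = "2015-05-20 | Foobar | CD presentation | small salon"

# Split on vertical bars, commas or slashes surrounded by whitespace.
fields = re.split(r"\s*[|,/]\s*", title)
print(fields)  # ['2015-05-20', 'Foobar', 'CD presentation', 'small salon']
\end{verbatim}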

\section{Directed Acyclic Graphs}
\paragraph{Normal graphs}
A graph ($G$) is a mathematical structure that describes relations between
nodes with edges. A standard graph is defined as the ordered pair $G=(V,E)$.
In this ordered pair $V$ is the set of nodes and $E$ is the set of undirected
edges, where every undirected edge is a tuple of two nodes. The graph in
Figure~\ref{graphexample} is specified as:
$$G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3), (n2, n3)\})$$

\begin{figure}[H]
    \caption{Example Graph}
    \label{graphexample}
    \centering
    \scalebox{0.7}{
        \digraph[]{graphexample}{
            rankdir=LR
            n1 -> n2 [dir="none"]
            n2 -> n3 [dir="none"]
            n2 -> n3 [dir="none"]
        }
    }
\end{figure}

\paragraph{Directed graphs}
Directed graphs are a special kind of graph. A directed graph $G$ is also
defined by the ordered pair $G=(V,E)$. $V$ still defines the nodes and $E$
still the edges, but the difference is that the edges are now ordered instead
of unordered tuples. Adding this property gives the edges a direction: every
edge has a specific start and end, and the edges are therefore called
directed edges. A directed graph looks the same as the graph in
Figure~\ref{graphexample} but is visualized with arrows instead of plain
lines. The arrows go specifically from one node to the other and not the
other way around. However, bidirectional connections can occur. For example,
the graph shown in Figure~\ref{dgexample} is directed and has a bidirectional
connection:
$$G=(\{n1, n2\}, \{(n1, n2), (n2, n1)\})$$

\begin{figure}[H]
    \caption{Example directed graph}
    \label{dgexample}
    \centering
    \scalebox{0.7}{
        \digraph[]{dgexample}{
            rankdir=LR
            n1 -> n2
            n2 -> n1
        }
    }
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with a restriction on $E$: cycles are not
allowed. Figure~\ref{dagexample} shows two graphs: the graph over the nodes
$n01, n02, n03$ contains a cycle while the graph over $n11, \ldots, n14$ does
not, so only the latter is a valid DAG. A cycle is a sequence of edges in
which a node is visited more than once. Adding the property of acyclicity to
graphs lowers the computational complexity of checking whether a node
sequence is present in the graph to $\mathcal{O}(L)$, where $L$ is the length
of the sequence.

\begin{figure}[H]
    \caption{Example DAG}
    \label{dagexample}
    \centering
    \scalebox{0.7}{
        \digraph[]{dagexample}{
            rankdir=LR
            n01 -> n02
            n02 -> n03
            n03 -> n01
            n11 -> n12
            n12 -> n13
            n12 -> n14
        }
    }
\end{figure}
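
To make the complexity claim concrete, the sketch below checks in
$\mathcal{O}(L)$ time whether a given node sequence is a path in the acyclic
graph of Figure~\ref{dagexample}; the adjacency-set representation is an
assumption chosen for illustration.

\begin{verbatim}
# The acyclic graph of Figure "Example DAG" as adjacency sets.
graph = {"n11": {"n12"}, "n12": {"n13", "n14"},
         "n13": set(), "n14": set()}

def has_path(graph, sequence):
    """Check that consecutive nodes in the sequence are connected.

    Each of the L-1 edge lookups is a constant-time set membership
    test, so the check runs in O(L) for a sequence of length L.
    """
    return all(b in graph.get(a, set())
               for a, b in zip(sequence, sequence[1:]))

print(has_path(graph, ["n11", "n12", "n14"]))  # True
print(has_path(graph, ["n11", "n13"]))         # False
\end{verbatim}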

\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the ordered
triple $G=(V,E,F)$. $V$ is the same as in directed graphs. $E$ is also almost
the same, except that the ordered pair of nodes describing an edge is now a
triple consisting of a start node, an end node and a label. In a standard
DAWG the label is always one character. $F$ describes the set of final nodes;
final nodes are nodes where a sequence may end even if edges still lead out
of them. In the example graph in Figure~\ref{exampledawg} the final nodes are
visualized with a double circle as node shape. In this example the marking is
purely cosmetic, because $n6$ is a final node anyway as no edges lead out of
it. But for example in $G=(\{n1, n2, n3\}, \{(n1, n2, a), (n2, n3, b)\},
\{n2, n3\})$ there is a distinct use for the final node marking: this graph
accepts the words \textit{a} and \textit{ab}, and to express this both $n2$
and $n3$ are final. Because of the property of labeled edges, data can be
stored in a DAWG: when traversing a DAWG and saving all the edge labels, one
can construct words. Due to the properties of graphs, a DAWG can store big
sets of words in a small amount of storage because it can re-use edges for
transitions. For example, the graph in Figure~\ref{exampledawg} describes a
dictionary that holds the words \textit{abd}, \textit{bad} and \textit{bae}.
Testing whether a word is present in the DAWG falls, just as for a DAG, in
the computational complexity class $\mathcal{O}(L)$, meaning that the time
grows linearly with the length of the word.

\begin{figure}[H]
    \caption{Example DAWG}
    \label{exampledawg}
    \centering
    \scalebox{0.7}{
        \digraph[]{graph21}{
            rankdir=LR;
            n1,n2,n3,n4,n5 [shape="circle"];
            n6 [shape="doublecircle"];
            n1 -> n2 [label="a"];
            n2 -> n3 [label="b"];
            n3 -> n6 [label="d"];
            n1 -> n4 [label="b"];
            n4 -> n5 [label="a"];
            n5 -> n6 [label="d"];
            n5 -> n6 [label="e"];
        }
    }
\end{figure}
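
A minimal sketch of the DAWG from Figure~\ref{exampledawg}, with the
transition table written out by hand, shows how a word test walks one edge
per character and therefore runs in $\mathcal{O}(L)$:

\begin{verbatim}
# The DAWG of Figure "Example DAWG": every edge carries one character.
edges = {
    ("n1", "a"): "n2", ("n2", "b"): "n3", ("n3", "d"): "n6",
    ("n1", "b"): "n4", ("n4", "a"): "n5",
    ("n5", "d"): "n6", ("n5", "e"): "n6",
}
final = {"n6"}  # the set F of final nodes

def accepts(word, start="n1"):
    """Follow one edge per character; O(L) for a word of length L."""
    node = start
    for char in word:
        node = edges.get((node, char))
        if node is None:
            return False  # no outgoing edge with this label
    return node in final

for word in ["abd", "bad", "bae", "ab"]:
    print(word, accepts(word))  # abd, bad and bae: True; ab: False
\end{verbatim}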