\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in their town theater?

In the early days of the internet, when it was only starting to become
accessible to most people, information about entertainment was still obtained
almost exclusively from flyers, books, posters and radio/tv advertisements.
People had to look hard for the information and could easily miss a show
simply because it never crossed their path.
Today the internet is used by almost everyone in western society on a daily
basis, and one would expect that missing an event is impossible given the
amount of information people receive every day. The opposite is true.

Nowadays information about entertainment is offered on the internet via two
main channels: individual venues and information bundling websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and, most of all, modern website that presents their information with
nice graphics, animations and gimmicks. There are also companies that bundle
the information from different websites. These information bundling websites
often use the individual venue websites as the source of their information,
and therefore the information is frequently incomplete.
Individual organizations tend to assume, for example, that the address of
their venue is obvious, that their ticket price is always fixed at
\EURdig$5.-$ or that a membership is required to attend the events. They
usually put such general information in a disclaimer or on a separate page,
and information bundling websites frequently miss out on these types of
information.

Combining the information from the different data sources turns out to be a
hard task for such information bundling websites, because they do not have
the resources and time reserved for it and therefore often serve incomplete
information as well. Because of the complexity of complete information there
are not many websites trying to bundle entertainment information into a
complete and consistent database.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Hyperleap, active since 1995, is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}: it
denotes a combination of factual information and subjective information
(entertainment) within a certain category. In the case of Hyperleap the
category is the entertainment industry, which encompasses all facets of
entertainment, ranging from cinemas, theaters and concerts to swimming pools,
bridge matches and conferences. Within the entertainment industry, factual
information includes, but is not limited to, information such as starting
time, location and host or venue. Subjective information includes, but is not
limited to, things such as reviews, previews, photos and background
information or trivia.

Hyperleap manages the largest database containing \textit{infotainment} about
the entertainment industry. The database contains over $10{,}000$ categorized
events per week on average, and the venue database contains over $54{,}000$
venues delivering the entertainment, ranging from theaters and music venues
to petting zoos and fast food restaurants. All the subjective information is
obtained or created by Hyperleap, and all factual information is gathered
from different sources and quality checked, which makes it very reliable.
Hyperleap is the only company of its kind with information of such high
quality. The \textit{infotainment} is presented via several websites
specialized per genre or category, and some sites attract over $500{,}000$
visitors per month.

\section{Information flow}
Hyperleap is the only company of its kind with such high quality data because
it spends a lot of time and resources on quality checking, cross comparing
and consistency checking before the data enters the database. To achieve this
the data is inserted into the database in several stages, visualized in
Figure~\ref{informationflow} as an information flow diagram in which nodes
are processing or publishing steps and arrows denote information flow.

\begin{figure}[H]
\caption{Information flow Hyperleap database}
\label{informationflow}
\centering
\scalebox{0.7}{
\digraph[]{graph111}{
rankdir=TB;
node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1]
edge [weight=5.]
i0 [label="Website"]
i1 [label="Email"]
i2 [label="Fax"]
i3 [label="RSS/Atom"]
p1 [label="Preprocessing",width=5]
p2 [label="Temporum: Postprocessing",width=5]
o1 [label="Database: Insertion",width=5]
o2 [label="TheAgenda"]
o3 [label="BiosAgenda"]
o4 [label="..."]
i0 -> p1
i1 -> p1
i2 -> p1
i3 -> p1
p1 -> p2
p2 -> o1
o1 -> o2
o1 -> o3
o1 -> o4
}
}
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There are
several criteria the information and the source have to comply with before
any of it can enter the database. For example, the source has to be reliable,
consistent and free by licence, whereas the event entries have to contain at
least the following fields:
\begin{itemize}
\item[What]
The \textit{What} field describes the content of the event. Content is a
very broad notion; in practice it can be a concert tour name, theater show
title, movie title, festival title and many more.
\item[Where]
The \textit{Where} field is the location of the event. This is often
omitted because the organization thinks it is obvious. This field can also
include different sublocations, for example when a pop concert venue has
its own building but organizes a festival in some park in the summer. This
data is often assumed to be trivial and inherent, but in practice this is
not the case: in this example, the name of the park alone is often not
enough for an outsider.
\item[When]
The \textit{When} field is the time and date of the event. Hyperleap wants
to have at minimum the date, start time and end time. In practice, end
times are often omitted because they are not fixed or because the
organization thinks they are obvious.
\end{itemize}

Venues often present incomplete data entries that do not comply with the
requirements explained above.
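
The three required fields can be captured in a small record type with a
completeness check. This is an illustrative Python sketch, not Hyperleap's
actual data model; the class and field names are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    """Hypothetical event record with the minimal required fields."""
    what: Optional[str] = None   # e.g. concert tour name or movie title
    where: Optional[str] = None  # venue, possibly including a sublocation
    when: Optional[str] = None   # date plus start and end time

    def is_complete(self) -> bool:
        # An entry qualifies only when all three fields are present.
        return all(v is not None for v in (self.what, self.where, self.when))

complete = Event("Foobar CD release", "Small salon", "2015-05-20 18:00-23:00")
incomplete = Event(what="Foobar CD release")  # venue and time omitted by the source
```

A bundling website would reject or flag the second entry until the missing
fields are filled in from another source.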

The first step in the information flow is the source. Hyperleap processes
different sources and every source has different characteristics. Sources can
be modern, like websites or social media, but even today a lot of information
arrives at Hyperleap via flyers, fax or email. As sources vary in content
structure they also vary in reliability. For example, the following entry is
very well structured and probably generated, and thus probably also reliable.
Such an entry could originate, for example, from an RSS feed.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD in
combination with a show. Location: small salon.}
\end{flushleft}

An example of a poor entry could be the following text, which could originate
from a flyer or a social media post.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the park
tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard to
grasp at first, even for people, let alone for a machine. To get the full
information one has to tap into different resources, which might not always
be available.

\paragraph{Crawler}
The second step is the crawling of the websites, which currently happens in a
couple of ways. Some sites are very structured, and a programmer creates a
program that can visit the website systematically and extract the
information. Some other sources are very hard to crawl programmatically and
are visited by employees. The programmed crawlers are always specifically
created for one or a couple of sources, and when a source changes, for
example its structure, the programmer has to adapt the crawler, which is
costly. Information from the different crawlers then goes to the
\textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from the
different sources; this data has to be post processed before it is suitable
for the actual database. This processing encompasses several possible tasks.
The first task is to check the validity of the entry. This is a very shallow
test to check that the crawler is not malfunctioning and that there is no
nonsense in the data. Most of the data is not directly checked for validity:
the data is skimmed for strange things, but not every datapoint is checked.
The second step is matching the entry to several objects. For example, the
entry has to be matched to a certain venue when its source is a ticket vendor
who sells tickets for multiple venues. Another example is an event that is a
pop concert and part of a big tour. Many of these tasks are done by a
computer program alone or with the aid of one. Almost no data goes straight
through the \textit{Temporum}, and this property makes the \textit{Temporum}
a safety net for all the incoming data.

\paragraph{Database \& Publication}
When the data has been post processed it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen in the future, as well as information about
the venues. The database is linked to several categorical websites that offer
the information to users and accompany it with the other part of the
\textit{infotainment}, namely the subjective information, which is usually
presented in the form of trivia, photos, interviews, reviews, previews and
much more.

\section{Goal \& Research question}
The second step in the information flow is crawling the sources and applying
the preprocessing. This is an expensive task because it requires a programmer
to be hired, since currently all crawlers are programmed for one, or a couple
of, specific sources. Because a big group of sources changes often, this is
very expensive and has a long feedback loop. When a source changes, it is
first preprocessed in the old way, sent to the \textit{Temporum} and checked
and matched by a human. The human then notices the error in the data and
contacts the programmer, who has to reprogram the specific crawler for the
new structure. This feedback loop, shown in Figure~\ref{feedbackloop}, can
take days and can be the reason for gaps or faulty information in the
database.
\begin{figure}[H]
\caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\centering
\scalebox{0.5}{
\digraph[]{graph112}{
rankdir=LR;
node [shape="rectangle"]
source [label="Source"]
crawler [label="Crawler"]
temporum [label="Temporum"]
user [label="User"]
programmer [label="Programmer"]
database [label="Database"]
source -> crawler
crawler -> temporum
temporum -> user
user -> database
user -> programmer [constraint=false,style="dotted"]
user -> crawler [constraint=false,style="dashed"]
programmer -> crawler [constraint=false,style="dotted"]
}
}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers
all the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{feedbackloop} this means replacing the dotted arrows with the
dashed arrow.

For this project an application has been developed that provides an interface
to a crawler system that is able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface provides the user with point and click
interfaces to create, modify, test and remove crawlers. The Hyperleap back
end can, via this interface, generate XML feeds that contain the crawled
data. For editing the structure and contents of the program a programmer is
in theory not necessary either, because everything someone wants to change is
located in a single human readable file. In practice this means that one
person, not necessarily a programmer, can be instructed to change the
structure, which can also greatly reduce programmer intervention time.

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field, which store the raw data
describing the event. Further, there are several auxiliary fields that store
the link to the full article, the publication date, possibly some media in
the form of an image or video URL, and a \textit{Globally Unique
Identifier}~(GUID)\footnote{A GUID is a unique identifier that in most cases
is the permalink of the article. A permalink is a link that will always point
to the article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}, which shows a partly truncated RSS feed of a well
known venue in the Netherlands.

\begin{listing}
\caption{An example of a partly truncated RSS feed of a well known Dutch
venue}
\label{examplerss}
\xmlcode{exrss.xml}
\end{listing}

RSS feeds are mostly used by news sites to publish their articles. The feed
then contains only the headlines, and a user reading the feed who is
interested can click the so called deeplink to be sent to the full website
containing the article. Users often use programs that bundle user specified
RSS feeds into big summary feeds, which makes it possible to sift through a
lot of news feeds. The RSS feed reader uses the GUID to skip entries that are
already present in its internal database.
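
Reading such a feed and deduplicating on the GUID can be sketched in a few
lines of Python with the standard library. The feed fragment below is
hand-written for illustration; it is not Hyperleap's or any real venue's feed.

```python
import xml.etree.ElementTree as ET

# A minimal, hand-written RSS 2.0 fragment (illustrative only).
RSS = """<rss version="2.0"><channel>
  <title>Example venue</title>
  <item>
    <title>2015-05-20, 18:00-23:00 - Foobar presenting their new CD</title>
    <description>CD presentation combined with a show.</description>
    <guid>http://example.org/events/foobar-cd</guid>
  </item>
</channel></rss>"""

def new_items(feed_xml, seen_guids):
    """Yield (title, description) for items whose GUID was not seen before."""
    for item in ET.fromstring(feed_xml).iter("item"):
        guid = item.findtext("guid")
        if guid not in seen_guids:
            seen_guids.add(guid)
            yield item.findtext("title"), item.findtext("description")

seen = set()
first_pass = list(new_items(RSS, seen))   # one new item
second_pass = list(new_items(RSS, seen))  # GUID already known, so nothing
```

On a second fetch of the same feed the GUID set filters out everything
already processed, which is exactly how feed readers avoid duplicates.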

Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System~(CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the RSS feeds are generally
very structured and consistent in their structure. In the entertainment
industry, venues often use a CMS for their website so that users with no
programming or website background are able to post news items and event
information.

\section{Why RSS?}
\label{sec:whyrss}
The first proposal described formats like HTML, fax/email, RSS, XML and some
more. Because of the scope of the project and the time planned for it, we had
to remove some of the input formats, because they all require different
techniques. For example, when the input source is in HTML format, most
probably a website, a great deal of the information extraction can be
automated using the structural information that is characteristic for HTML.
For fax/email, however, there is almost no structural information and most of
the automation techniques require natural language processing. We chose RSS
feeds because RSS feeds lack inherent structural information about the event
fields but are still very structured. This structure exists because, as said
above, the RSS feeds are generated and therefore almost always look the same.
Also, in RSS feeds most venues use particular structural identifiers that are
characters: they separate fields with vertical bars, commas, whitespace and
other non-text characters. These field separators and keywords can be hard
for a computer to detect, but people are very good at detecting them. With
one look they can identify the characters and keywords and build a pattern in
their head.
Another reason we chose RSS is temporal consistency: RSS feeds are almost
always generated, and because of that the structure of the entries is very
unlikely to change. Basically, the RSS feeds only change structure when the
CMS that generates them changes its generation algorithm. This property is
useful because the crawlers then do not have to be retrained very often.
Because the structure is not inherent, however, entirely new strategies had
to be applied.
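
The kind of separator-based structure described above can be illustrated with
a small sketch: once a person has spotted the pattern in a generated title,
it can be expressed as a regular expression. The title and the pattern below
are illustrative assumptions, not an actual venue's feed format.

```python
import re

# A hypothetical generated feed title; many venues separate the
# date/time, event name and location with dashes or vertical bars.
title = "2015-05-20, 18:00-23:00 - Foobar - Small salon"

# One possible pattern for this particular structure.
pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}), "
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "
    r"(?P<what>.+?) - (?P<where>.+)"
)

m = pattern.match(title)
fields = m.groupdict() if m else {}
```

Because the feed is generated, the same pattern keeps matching every entry
until the CMS changes its generation algorithm.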

\section{Directed Acyclic Graphs}
\paragraph{Normal graphs}
A graph~($G$) is a mathematical structure that describes relations between
nodes with edges. A standard graph is defined as the ordered pair $G=(V,E)$,
where $V$ is the set of nodes and $E$ is the set of undirected edges, every
undirected edge being a tuple of two nodes. The graph in
Figure~\ref{graphexample} is specified as:
$$G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3)\})$$

\begin{figure}[H]
\caption{Example Graph}
\label{graphexample}
\centering
\scalebox{0.7}{
\digraph[]{graphexample}{
rankdir=LR
n1 -> n2 [dir="none"]
n2 -> n3 [dir="none"]
}
}
\end{figure}

\paragraph{Directed graphs}
Directed graphs are a special kind of graph. A directed graph $G$ is also
defined by the ordered pair $G=(V,E)$. $V$ still defines the nodes and $E$
still the edges, but the difference is that the edges are now ordered tuples
instead of unordered ones. Adding this property gives the edges a direction:
every edge has a specific start and end and is therefore called a directed
edge. A directed graph would look the same as the graph in
Figure~\ref{graphexample} but visualized with arrows instead of plain lines.
The arrows go specifically from one node to the other and not the other way
around. However, bidirectional connections can occur. For example, the graph
shown in Figure~\ref{dgexample} is directed and has a bidirectional
connection.
$$G=(\{n1, n2\}, \{(n1, n2), (n2, n1)\})$$

\begin{figure}[H]
\caption{Example directed graph}
\label{dgexample}
\centering
\scalebox{0.7}{
\digraph[]{dgexample}{
rankdir=LR
n1 -> n2
n2 -> n1
}
}
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs~(DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are
not allowed. A cycle is a sequence of edges in which a node is visited more
than once. Figure~\ref{dagexample} shows two graphs: one contains the cycle
$n01, n02, n03, n01$ and the other does not, so only the cycle-free graph is
a valid DAG. Adding the property of acyclicity to graphs lowers the
computational complexity of checking whether a node sequence is present in
the graph to $\mathcal{O}(L)$, where $L$ is the length of the sequence.

\begin{figure}[H]
\caption{Example DAG}
\label{dagexample}
\centering
\scalebox{0.7}{
\digraph[]{dagexample}{
rankdir=LR
n01 -> n02
n02 -> n03
n03 -> n01
n11 -> n12
n12 -> n13
n12 -> n14
}
}
\end{figure}
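
Checking a node sequence against a graph in $\mathcal{O}(L)$ can be sketched
as follows, using the acyclic component of Figure~\ref{dagexample}
($n11 \ldots n14$) stored as an adjacency mapping. This is an illustrative
Python sketch, not code from the project.

```python
# The acyclic graph from the figure as an adjacency mapping:
# each node maps to the set of nodes its edges point to.
dag = {"n11": {"n12"}, "n12": {"n13", "n14"}, "n13": set(), "n14": set()}

def is_path(graph, sequence):
    """Check whether every consecutive pair in `sequence` is an edge.

    One dictionary/set lookup per step, so the cost grows linearly
    with the length L of the sequence.
    """
    return all(b in graph.get(a, set()) for a, b in zip(sequence, sequence[1:]))

ok = is_path(dag, ["n11", "n12", "n14"])   # follows existing edges
bad = is_path(dag, ["n11", "n13"])         # no direct edge n11 -> n13
```

Each step only inspects the out-edges of the current node, which is what
keeps the check linear in the sequence length.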

\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph~(DAWG). A DAWG can be defined by the ordered
triple $G=(V,E,F)$. $V$ is the same as in directed graphs. $E$ is almost the
same, except that the ordered pair of nodes that describes an edge is now a
triple consisting of a start node, an end node and a label. In a standard
DAWG the label is always one character. $F$ describes the set of final nodes:
nodes that can be the end of a sequence even if another arrow leads out of
them. In the example graph in Figure~\ref{exampledawg} the final node is
visualized with a double circle as node shape. In that example the marking is
purely cosmetic, because $n6$ is a final node anyway as no arrows lead out of
it. But for example in $G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3)\},
\{n2, n3\})$ there is a distinct use for the final node marking: graph $G$
accepts the words \textit{a, ab}, and to express this both $n2$ and $n3$ are
final. Because of the labeled edges, data can be stored in a DAWG: by
traversing a DAWG and saving all the edge labels one can construct words. A
DAWG can store big sets of words in a small amount of storage because it can
re-use edges for transitions. For example, the graph in
Figure~\ref{exampledawg} describes a dictionary that holds the words
\textit{abd, bad, bae}. Testing whether a word is present in the DAWG falls,
just as for a DAG, in the computational complexity class $\mathcal{O}(L)$,
meaning that the time grows linearly with the length of the word.

\begin{figure}[H]
\caption{Example DAWG}
\label{exampledawg}
\centering
\scalebox{0.7}{
\digraph[]{graph21}{
rankdir=LR;
node [shape="circle"];
n6 [shape="doublecircle"];
n1 -> n2 [label="a"];
n2 -> n3 [label="b"];
n3 -> n6 [label="d"];
n1 -> n4 [label="b"];
n4 -> n5 [label="a"];
n5 -> n6 [label="d"];
n5 -> n6 [label="e"];
}
}
\end{figure}
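
The DAWG of Figure~\ref{exampledawg} can be encoded directly as a mapping
from (node, label) pairs to successor nodes, with membership tested by
following one labeled edge per character. This is an illustrative Python
sketch of the definition above, not the project's implementation.

```python
# Edges of the example DAWG: (start node, label) -> end node.
# Note how n6 is shared, re-using the "d" suffix of abd and bad.
edges = {
    ("n1", "a"): "n2", ("n2", "b"): "n3", ("n3", "d"): "n6",
    ("n1", "b"): "n4", ("n4", "a"): "n5",
    ("n5", "d"): "n6", ("n5", "e"): "n6",
}
final = {"n6"}  # the set F of final nodes

def accepts(word, start="n1"):
    """Follow one labeled edge per character: O(L) in the word length."""
    node = start
    for ch in word:
        node = edges.get((node, ch))
        if node is None:        # no outgoing edge with this label
            return False
    return node in final        # must end in a final node

accepted = [w for w in ("abd", "bad", "bae", "abe", "ab") if accepts(w)]
```

Only \textit{abd}, \textit{bad} and \textit{bae} are accepted: \textit{abe}
fails because $n3$ has no outgoing \texttt{e} edge, and \textit{ab} fails
because $n3$ is not a final node.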