\section{Introduction}
What do people do when they want to grab a movie? Attend a concert? Find out
which shows play in the local theater?

In the early days of the internet, access to the web was not available to most
people. Information about leisure activities was almost exclusively obtained
from flyers, posters and other written media and from radio and TV
advertisements. People had to put effort into finding information and it was
easy to miss a show simply because you never crossed paths with it. Today the
internet is used daily by almost everyone in Western society and one would
think that missing an event has become impossible, given the enormous amount
of information available online. For leisure activities the opposite is true:
complete and reliable information about events is still hard to find.

Nowadays, information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and above all modern website that presents their information with nice
graphics, animations and other gimmicks. Information bundling websites are run
by companies that try to provide an overview of multiple venues, and they
often use the individual venue websites as their source of information.
Individual venues assume, for example, that it is obvious what the address of
their venue is, that their ticket price is always fixed at \EURdig$5.-$ or
that a membership is needed to attend the events. They usually put this
non-specific information in a disclaimer or on a separate page. Because the
information is provided in such an unstructured way, information bundling
websites have a hard time gathering complete information. The event data can
be crawled using automated crawlers, but the miscellaneous information usually
has to be gathered by hand.

Combining the information from the different data sources turns out to be a
complicated task for information bundling websites, because the companies
behind them do not have the resources and time reserved for it and therefore
often serve incomplete information. Because of the complexity of obtaining
complete information, there are not many companies trying to bundle
entertainment information into a complete and consistent database and website.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information and offers it via various
information bundling websites.
\newpage
\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that was founded at a time when the internet
was not yet widespread. Hyperleap, active since 1995, specializes in
producing, publishing and maintaining \textit{infotainment}.
\textit{Infotainment} is a portmanteau of \textit{information} and
\textit{entertainment}. It represents a combination of factual information,
the \textit{information} part, and non-factual or subjective information, the
\textit{entertainment} part, within a certain category or field. In the case
of Hyperleap the category is the leisure industry, which encompasses all
facets of entertainment, ranging from cinemas, theaters and concerts to
swimming pools, bridge competitions and conferences. Within the leisure
industry, factual information includes, but is not limited to, starting time,
location, host or venue and duration. Subjective information includes, but is
not limited to, reviews, previews, photos, background information and trivia.

Hyperleap claims to manage the largest database containing
\textit{infotainment} about the leisure industry focused on the Netherlands
and surrounding regions. The database contains over $10{,}000$ categorized
events per week on average, and the venue database contains over $54{,}000$
venues hosting leisure activities, ranging from theaters and music venues to
petting zoos and fast food restaurants. All subjective information is obtained
or created by Hyperleap itself, and all factual information is gathered from
different sources and quality checked, making it very reliable. Hyperleap is
the only company of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites specialized per genre
or category, some of which attract over $500{,}000$ visitors each month.

\section{Information flow}
Hyperleap maintains this high data quality by investing a lot of time and
resources in quality checking, cross comparing and consistency checking, which
greatly lowers the chances of incomplete or wrong data. To achieve this, the
data goes through several stages before it enters the database. These stages
are visualized in Figure~\ref{informationflow} as an information flow diagram.
In this diagram the nodes are processing steps and the arrows denote
information transfer or flow.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{informationflow.pdf}
\caption{Information flow Hyperleap database\label{informationflow}}
\end{figure}

\subsection{Sources}
A source is a service, location or medium in which information about events is
stored or published. A source can come in different formats such as HTML,
email, flyer, RSS and so on. All information gathered from a source has to be
quality checked before it is even considered for automated crawling. There are
several criteria the source has to comply with before an automated crawler can
be made: for example, the source has to be reliable, consistent and free by
licence. Event information from a source must contain at least the
\textit{What, Where} and \textit{When} information.

The \textit{What} information describes the content of the event. Content is a
very broad term, but in practice it can be the concert tour name, theater show
title, movie title, festival title and much more.

The \textit{Where} information is the location of the event. The location is
often omitted because the organization presenting the information thinks it is
obvious. This information can also include different sub locations, for
example when a pop venue has its own building but organizes a festival in some
park during the summer. This data is often assumed to be trivial and inherent,
but in practice it is not: in this example, for an outsider the name of the
park alone is often not enough.

The \textit{When} field is the time and date of the event. Hyperleap wants to
have at minimum the date, start time and end time. In practice, end times are
often omitted because they are not fixed or because the organization thinks
they are obvious.
\strut\\

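To make these minimum requirements concrete, the following sketch models an
event entry with the mandatory \textit{What}, \textit{Where} and \textit{When}
fields. The class and all names in it are illustrative assumptions, not part
of the Hyperleap system.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass
from datetime import date, time
from typing import Optional

@dataclass
class EventEntry:
    """Minimal event entry; all names are illustrative."""
    what: str                   # e.g. tour name or movie title
    where: str                  # venue, possibly with a sub location
    on_date: date               # the When part: date, start and end
    start: time
    end: Optional[time] = None  # end times are often omitted

    def is_complete(self) -> bool:
        # Hyperleap wants the end time as well
        return self.end is not None
\end{lstlisting}
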
Venues often present incomplete data entries that do not comply with the
requirements explained above. Within the information flow, categorizing and
grading the source is the first step. Hyperleap processes different sources
and source types and every source has different characteristics. Sources can
be modern ones such as websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as source
types vary in content structure, sources also vary in reliability. The
following entry, for example, is a very well structured and probably
generated, and thus probably also reliable, event description. The entry could
originate, for example, from the title of an entry in an RSS feed. It has a
clear structure and almost all required information is directly available from
the entry.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
new CD in combination with a show. Location: small salon.}
\end{flushleft}

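For such a generated entry, a single regular expression can already extract
all fields. The pattern below is only a sketch for this one entry format; a
real crawler has to cope with variations.

\begin{lstlisting}[language=Python]
import re

# Pattern for the single example entry format shown above.
PATTERN = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}), '
    r'(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - '
    r'(?P<what>.+)\. Location: (?P<where>.+)\.')

entry = ('2015-05-20, 18:00-23:00 - Foobar presenting their '
         'new CD in combination with a show. Location: small salon.')

match = PATTERN.match(entry)
if match:
    print(match.groupdict())
    # {'date': '2015-05-20', 'start': '18:00', 'end': '23:00',
    #  'what': 'Foobar presenting their new CD in combination
    #           with a show', 'where': 'small salon'}
\end{lstlisting}
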
A low quality item could, for example, be the following text, which could
originate from a flyer or a social media post. This example lacks a precise
date, time and location and is therefore hard to grasp at first for people,
let alone for a machine. Someone who wants the full information has to tap
into different resources, which might not always be available.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the %
park tomorrow evening.}
\end{flushleft}

Information of such low quality is often not suitable for automated crawling.
In Figure~\ref{informationflow} this manual route is shown by the arrow going
straight from the source to the database. Non digital source types and very
sudden changes, such as surprise concerts or cancellations, are also crawled
manually.

\subsection{Crawler}
When the source has been determined and classified, the next step is
periodically crawling the source with an automated crawler. As said before,
sources have to be structured and reliable; when this is the case, a
programmer creates a program that visits the website systematically and
automatically to extract all new information. The programmed crawlers are
usually created specifically for one or more sources and when a source
changes, usually a change in structure, the programmer has to adapt the
crawler. Since writing and adapting crawlers requires a programmer, the
process is costly. Automatically crawled information is not inserted into the
database directly because it is not reliable enough: in case of a change in
the source, malformed data can pass through. As a safety net and final check,
the information first goes to the \textit{Temporum} before it is entered into
the database.

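The sketch below illustrates the general shape of such a source-specific
crawler. Every name in it is hypothetical; the source-specific
\texttt{extract} function is the part a programmer has to rewrite whenever the
source changes.

\begin{lstlisting}[language=Python]
import urllib.request

def send_to_temporum(entry: dict) -> None:
    # Hypothetical hand-off: extracted data goes to the Temporum
    # for a manual check, never straight into the database.
    print('queued for manual checking:', entry)

def crawl(source_url: str, extract) -> None:
    """One periodic crawler run; `extract` is source specific."""
    with urllib.request.urlopen(source_url) as response:
        raw = response.read().decode('utf-8', errors='replace')
    for entry in extract(raw):
        send_to_temporum(entry)
\end{lstlisting}
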
\subsection{Temporum}
The \textit{Temporum} is a big bin that contains the raw data extracted from
different sources by the automated crawlers. Some of the information in the
\textit{Temporum} might not be suitable for the final database and therefore
has to be post-processed. The post-processing encompasses several steps.

The first step is to check the validity of the event entries from a certain
source. Validity checking is useful to detect outdated automated crawlers
before their data can leak into the database. Crawlers become outdated when a
source changes and the crawler can no longer crawl the website using the
original method. Validity is checked at random on a selection of event
entries.

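As an illustration of this step, the sketch below draws a random sample of
entries for a manual check. The sampling fraction is an assumption for the
example, not a Hyperleap figure.

\begin{lstlisting}[language=Python]
import random

def sample_for_checking(entries: list, fraction: float = 0.05) -> list:
    """Pick a random sample of event entries for a validity check.

    Checking a small random sample per source is enough to notice
    an outdated crawler before much malformed data leaks through.
    """
    if not entries:
        return []
    k = max(1, int(len(entries) * fraction))
    return random.sample(entries, k)
\end{lstlisting}
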
An event entry usually describes a single occurrence of an event. In many
cases there is parent information that the event entry is part of: in the case
of a concert tour, for example, the parent information is the tour and the
event entry is one particular performance. The second step in post-processing
is matching the event entries to possible parent information. This parent
information can be a venue, a tour, a showing, a tournament and much more.

Both post-processing tasks are done by people with the aid of automated
functionality. Within these two steps, malformed data can be spotted very
quickly; the \textit{Temporum} thus acts as a safety net that keeps the
probability of malformed data leaking into the database as low as possible.

\subsection{Database \& Publication}
Post-processed data that leaves the \textit{Temporum} enters the final
database. This database contains all information about all events that
happened in the past and all events that will happen in the future. The
database also contains the parent information, such as information about
venues. Several categorical websites use the database to offer the information
to users and accompany it with the second part of \textit{infotainment},
namely the subjective information. This \textit{entertainment} part is usually
presented in the form of trivia, photos, interviews, reviews, previews and
much more.

\section{Goal \& Research question}
Maintaining the automated crawlers and the infrastructure that provides the
\textit{Temporum} and its matching aids are the parts of the data flow that
require the most resources. Both require a programmer and are therefore
costly. The automated crawlers require a programmer because they are scripts
or programs that are specifically designed for a particular website. Changing
such a script or program requires knowledge about the source, the programming
framework and the \textit{Temporum}. In practice both tasks mean changing
code.

A large group of sources change in structure frequently, so the task of
reprogramming crawlers has to be repeated often. Malfunctioning crawlers are
detected in the \textit{Temporum} and not at an earlier stage. This late
detection elongates the feedback loop, because communication between the
programmers and the \textit{Temporum} workers is not always tight. In the case
of a malfunction the source is crawled as usual, and the malformed data will
most likely get processed into rubbish that is sent to the \textit{Temporum}.
There the error is detected after a while and the programmers have to be
contacted. Finally the crawler is adapted to the new structure and produces
good data again. This feedback loop, shown in Figure~\ref{feedbackloop}, can
take days and can be the reason for gaps and faulty information in the
database. The figure shows information flow with arrows; the solid and dotted
lines form the current feedback loop.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{feedbackloop.pdf}
\caption{Feedback loop for malfunctioning crawlers\label{feedbackloop}}
\end{figure}

The specific goal of this project is to relieve the programmer from spending a
lot of time repairing crawlers and to make the task of adapting, editing and
removing crawlers feasible for someone without programming experience. In
practice this means shortening the feedback loop. The shortened feedback loop
is also shown in Figure~\ref{feedbackloop}: the dashed line shows the shorter
loop that relieves the programmer.

For this project a system has been developed that provides an interface to a
crawler generation system able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface is point-and-click, meaning that no computer
science background is needed to create, modify, test and remove crawlers. The
current Hyperleap backend system that handles the data can query XML feeds
that contain the crawled data.

The actual research question can then be formulated as:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds. Such feeds
publish their data in a restricted XML format\cite{Xml} consisting of entries.
Every entry usually represents an event and consists of standardized data
fields. The data fields we are interested in are the \textit{title} and the
\textit{description} fields, which store the raw data describing the event.
Besides these there are several auxiliary fields that, for example, store the
link to the full article, the publishing date, some media in the form of an
image or video URL, or a \textit{Globally Unique Identifier}
(GUID)\footnote{A GUID is a unique identifier that in most cases is the
permalink of the article; a permalink is a link that permanently points to the
article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}, which shows a partly truncated RSS feed of a well
known venue in the Netherlands. Every RSS feed contains a \texttt{channel}
field containing some metadata and a list of \texttt{item} fields. Every
\texttt{item} field has a fixed number of different fields. The most important
fields for RSS within the leisure industry are the \texttt{title} and the
\texttt{description} fields.

\lstinputlisting[language=xml,label={examplerss},caption={An example of a %
partly truncated RSS feed of a well known Dutch venue}]{exrss.xml}

RSS feeds are mostly used by news sites to publish their articles. An RSS feed
only contains the headlines of the entries. Readers who are interested can
click the so-called deeplink and are sent to the full website containing the
complete entry or article. Users often use programs that bundle user specified
RSS feeds into big combined feeds that make it possible to sift through a lot
of news feeds. Such an RSS feed reader uses the unique GUID to skip entries
that are already present in its internal database.

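A minimal sketch of this reading behaviour is shown below for a plain RSS~2.0
feed (Atom uses different element names). It keeps the set of seen GUIDs in
memory; a real reader would persist it between runs.

\begin{lstlisting}[language=Python]
import urllib.request
import xml.etree.ElementTree as ET

seen_guids = set()   # a reader's internal database, simplified

def read_feed(url: str) -> list:
    """Return only the feed items not seen in an earlier run."""
    with urllib.request.urlopen(url) as response:
        channel = ET.parse(response).getroot().find('channel')
    new_items = []
    for item in channel.iter('item'):
        guid = item.findtext('guid') or item.findtext('link') or ''
        if guid in seen_guids:
            continue             # already processed earlier
        seen_guids.add(guid)
        new_items.append({
            'title': item.findtext('title', default=''),
            'description': item.findtext('description', default=''),
        })
    return new_items
\end{lstlisting}
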
Generating RSS feeds by hand is a tedious task, but almost all RSS feeds are
generated by the Content Management System (CMS) on which the website runs.
With this automatic method the RSS feeds are generated for the content
published on the website. Because the process is automatic, the generated RSS
feeds are generally very consistent in structure. In the entertainment
industry venues often use a CMS for their website so that users without a
programming or website background can post news items and event information;
such websites thus often have RSS feeds.

\section{Why RSS?}
\label{sec:whyrss}
There are lots of different source formats such as HTML, fax/email, RSS and
XML. Because of the limited scope of the project and the time planned for it,
we had to exclude some of the input formats, since they all require different
techniques and approaches. For example, when the input source is in HTML
format, most probably a website, a great deal of the information extraction
can be automated using the structural information that is characteristic of
HTML. For fax/email, however, there is almost no structural information and
most automation techniques require natural language processing and possibly
OCR. We chose RSS feeds because the relevant fields of an RSS feed lack
inherent structural markup, yet their content is still very structured. This
structure exists because, as said above, RSS feeds are generated and therefore
almost always look the same. Also, in RSS feeds most venues use particular
characters as structural identifiers: they separate fields with vertical bars,
commas, whitespace and other non-text characters. These field separators and
keywords can be hard for a computer to detect, but people are very good at
detecting them: with one look they can identify the characters and keywords
and build a pattern in their head.

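The two titles below, both made up for illustration, show the kind of
separator-based structure meant here; splitting on the separator characters
already yields the fields.

\begin{lstlisting}[language=Python]
import re

# Made-up titles in the style many venue feeds use.
titles = [
    '2015-05-20 | Foobar | CD release show | small salon',
    '2015-05-21, Quux, open air concert, city park',
]

for title in titles:
    # Split on vertical bars or commas with surrounding whitespace.
    print(re.split(r'\s*[|,]\s*', title))
# ['2015-05-20', 'Foobar', 'CD release show', 'small salon']
# ['2015-05-21', 'Quux', 'open air concert', 'city park']
\end{lstlisting}
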
Another reason we chose RSS is its temporal consistency: because RSS feeds are
almost always generated, the structure of the entries is very unlikely to
change. Basically, an RSS feed only changes structure when the CMS that
generates it changes its generation algorithm. This property is useful because
the crawlers then do not have to be retrained very often. To detect the
underlying structures, a technique is used that exploits subword matching with
graphs.

\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs (DGs) are mathematical structures that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges defined by $E\subseteq
V\times V$. An edge $e\in E$ is defined as $(v_1, v_2)$ where $v_1,v_2\in V$
and is shown in figures as an arrow from node $v_1$ to node $v_2$. Multiple
connections between two nodes are possible if the directions differ. For
example, the graph visualized in Figure~\ref{graphexample} can be described
mathematically as:
$$G=(\{n_1, n_2, n_3, n_4\},
\{(n_1, n_2), (n_2, n_1), (n_2, n_3), (n_3, n_4), (n_1, n_4)\})$$

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{graphexample.pdf}
\caption{Example DG\label{graphexample}}
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are not
allowed. Figure~\ref{dagexample} shows two graphs: the bottom graph contains a
cycle and the top graph does not, so only the top graph is a valid DAG. A
cycle can be defined as follows:

If $(v_1,v_2)\in E$ we write $v_1\rightarrow v_2$. We write
$v_1\xrightarrow{+}v_n$ when $v_1\rightarrow v_2\rightarrow\ldots\rightarrow
v_{n-1}\rightarrow v_n$, meaning that there is a path of length at least $1$
from $v_1$ to $v_n$. A graph is acyclic when $\nexists v\in V:
v\xrightarrow{+}v$; in words, no node can be reached again while traveling the
graph. Adding the property of acyclicity to graphs lowers the computational
complexity of checking path existence in the graph to $\mathcal{O}(L)$ where
$L$ is the length of the path.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dagexample.pdf}
\caption{Example DAG\label{dagexample}}
\end{figure}

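As a small illustration of such a linear check, the sketch below stores the
edge set of the example graph in Figure~\ref{graphexample} and verifies a
given node sequence edge by edge, one set lookup per step, so
$\mathcal{O}(L)$ for a path of length $L$.

\begin{lstlisting}[language=Python]
# Edge set of the example graph in Figure "graphexample".
edges = {('n1', 'n2'), ('n2', 'n1'), ('n2', 'n3'),
         ('n3', 'n4'), ('n1', 'n4')}

def path_exists(path: list) -> bool:
    """Check a given node sequence edge by edge: O(L)."""
    return all((a, b) in edges
               for a, b in zip(path, path[1:]))

print(path_exists(['n1', 'n2', 'n3', 'n4']))  # True
print(path_exists(['n1', 'n3']))              # False, no such edge
\end{lstlisting}
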
\newpage
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is, as in directed graphs, the set of nodes. $E$ is the
set of edges described by $E\subseteq V\times V\times A$ where $A$ is the
alphabet used for edge labels. In words this means that an edge is a labeled
arrow between two nodes; for example, the edge $(v_1, v_2, a)$ can be
visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$
contains all the characters used in natural language, but in theory $A$ can
contain anything. $v_0$ describes the initial node, visualized in figures as a
node with an incoming arrow. $F$ describes the set of final nodes: nodes that
can be the end of an accepted sequence even if there are arrows leading out.
In the example graph in Figure~\ref{exampledawg} the only final node, $n_6$,
is visualized with a double circle as node shape. In this example the marking
is purely cosmetic, because $n_6$ is a final node anyway as no arrows lead out
of it. This does not need to be the case: in, for example, $G=(\{n_1, n_2,
n_3\}, n_1, \{(n_1, n_2, a), (n_2, n_3, b)\}, \{n_2, n_3\})$, which accepts
both $a$ and $ab$, the final node marking has a distinct use. Because of the
labeled edges, data can be stored in a DAWG: traversing a DAWG while saving
all the edge labels constructs words. Using graph minimisation, big sets of
words can be stored using a small amount of storage because edges can be
re-used to specify transitions. For example, the graph in
Figure~\ref{exampledawg} accepts exactly the words $\{abd, bad, bae\}$.
Testing whether a word is present in the DAWG uses the same technique as
testing whether a node path is present in a normal DAG and therefore also
falls in the computational complexity class $\mathcal{O}(L)$: it grows
linearly with the length of the word.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dawgexample.pdf}
\caption{Example DAWG\label{exampledawg}}
\end{figure}

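A DAWG with labeled edges maps naturally onto a transition table. The sketch
below encodes a six-node DAWG accepting exactly $\{abd, bad, bae\}$; the
concrete node names are an assumption, since they need not match the figure
exactly.

\begin{lstlisting}[language=Python]
# Transition table of a DAWG accepting exactly {abd, bad, bae}.
# Node names are assumptions; n6 is the single final node.
edges = {
    ('n1', 'a'): 'n2', ('n1', 'b'): 'n3',
    ('n2', 'b'): 'n4', ('n3', 'a'): 'n5',
    ('n4', 'd'): 'n6',
    ('n5', 'd'): 'n6', ('n5', 'e'): 'n6',
}
initial, final = 'n1', {'n6'}

def accepts(word: str) -> bool:
    """Walk the labeled edges: linear in the word length."""
    node = initial
    for char in word:
        if (node, char) not in edges:
            return False
        node = edges[(node, char)]
    return node in final

for w in ['abd', 'bad', 'bae', 'ab']:
    print(w, accepts(w))   # True for the first three, False for 'ab'
\end{lstlisting}
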
\section{Structure}
The following chapters describe the system that has been created and the
methods used. Chapter~2 presents the requirements and the design of the
program, followed in Chapter~3 by the underlying methods used for the actual
matching. Finally, Chapter~4 concludes with the results, discussion and future
research.