\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in the local theater?

In the early days of the internet, access to the web was not available to
most people. Information about leisure activities was almost exclusively
obtained from flyers, posters and other printed media, and from radio and TV
advertisements. People had to put effort into searching for information, and
it was easy to miss a show simply because you never crossed paths with it.
Today the internet is used daily by almost everyone in western society, and
one would expect that missing an event has become impossible given the
enormous amount of information one can receive every day via the internet.
For leisure activities the opposite is true: complete and reliable
information about events is still hard to find.

Nowadays, information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building beautiful,
fast and, above all, modern websites that present their information with nice
graphics, animations and other gimmicks. Information bundling websites are
run by companies that try to provide an overview of multiple venues, and they
often use the individual venue websites as their source of information.
Individual venues assume, for example, that the address of their venue is
obvious, that their ticket price is always fixed at \EURdig$5.-$, or that
visitors know a membership is needed to attend the events. Venues usually put
such general information in a disclaimer or on a separate page. Because this
information is provided in a less structured way, information bundling
websites have a hard time compiling complete information: the event data can
be collected with automated crawlers, but the miscellaneous information
usually has to be gathered by hand.

Combining the information from the different data sources turns out to be a
complicated task for information bundling websites, since the companies
behind them rarely have the resources and time reserved for it and therefore
often serve incomplete information. Because of the complexity of obtaining
complete information, few companies try to bundle entertainment information
into a complete and consistent database and website.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information and offers it via various
information bundling websites.
\newpage
\section{Hyperleap \& Infotainment}
Hyperleap is an internet company founded before the internet was widespread.
Active since 1995, Hyperleap specializes in producing, publishing and
maintaining \textit{infotainment}. \textit{Infotainment} is a combination of
the words \textit{information} and \textit{entertainment}: it combines
factual information, the \textit{information} part, with subjective
information, the \textit{entertainment} part, within a certain category or
field. In the case of Hyperleap the category is the leisure industry, which
encompasses all facets of entertainment, ranging from cinemas, theaters and
concerts to swimming pools, bridge competitions and conferences. Within this
industry, factual information includes, but is not limited to, starting time,
location, host or venue and duration. Subjective information includes, but is
not limited to, reviews, previews, photos, background information and trivia.

Hyperleap claims to manage the largest database of \textit{infotainment}
about the leisure industry focused on the Netherlands and surrounding
regions. The database contains on average over $10.000$ categorized events
per week, and the venue database contains over $54.000$ venues offering
leisure activities, ranging from theaters and music venues to petting zoos
and fast food restaurants. All subjective information is obtained or created
by Hyperleap, and all factual information is gathered from different sources
and quality checked, and is therefore very reliable. Hyperleap is the only
company of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites, each specialized per
genre or category, and some sites attract over $500.000$ visitors each month.

\section{Information flow}
Hyperleap maintains this high data quality by investing a lot of time and
resources in quality checking, cross comparing and consistency checking,
which greatly lowers the chance of incomplete or wrong data. To achieve this,
the data goes through several stages before it enters the database. These
stages are visualized in Figure~\ref{informationflow} as an information flow
diagram, in which the nodes are processing steps and the arrows denote
information transfer or flow.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{informationflow.pdf}
\caption{Information flow of the Hyperleap database}
\label{informationflow}
\end{figure}

\subsection{Sources}
A source is a service, location or medium in which information about events
is stored or published. Sources come in different shapes, such as HTML,
email, flyers, RSS and so on. All information gathered from a source has to
be quality checked before it is even considered for automated crawling. There
are several criteria a source has to meet before an automated crawler can be
made for it: for example, the source has to be reliable, consistent and free
by licence. Event information from a source must contain at least the
\textit{What}, \textit{Where} and \textit{When} information.

The \textit{What} information describes the content of the event. Content is
a very broad notion, but in practice it can be the concert tour name, theater
show title, movie title, festival title and many more.

The \textit{Where} information is the location of the event. The location is
often omitted because the organization behind the source considers it
obvious. This information can also include different sub locations: for
example, a pop concert venue may have its own building but organize a
festival in a park during the summer. Such data is often assumed to be
trivial and commonly known, but in practice it is not; in this example, the
name of the park alone is often not enough for an outsider.

The \textit{When} information is the time and date of the event. Hyperleap
wants at minimum the date, start time and end time. End times, for example,
are often omitted from this field because they are not fixed or because the
organization considers them obvious.
\strut\\

Venues often present incomplete data entries that do not meet the
requirements explained above. Within the information flow, categorizing and
grading the source is the first step. Hyperleap processes many different
sources and source types, and every source has different characteristics.
Sources can be modern, such as websites or social media, but even today a lot
of information arrives at Hyperleap via flyers, fax or email. Just as source
types vary in content structure, sources also vary in reliability. For
example, the following entry is a very well structured, probably machine
generated, and thus probably also reliable event description; it could
originate from the title of an entry in an RSS feed. It has a clear
structure, and almost all required information is directly available from the
entry.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
new CD in combination with a show. Location: small salon.}
\end{flushleft}

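As an illustration, such a well structured entry can be parsed with a simple
pattern. The sketch below is hypothetical and is not Hyperleap's actual
crawler logic; the field layout is taken from the example entry above.

```python
import re

# Hypothetical parse of the structured example entry above; the pattern
# is illustrative only, not Hyperleap's actual crawler.
entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their "
         "new CD in combination with a show. Location: small salon.")

m = re.match(
    r"(?P<date>\d{4}-\d{2}-\d{2}), "                   # When: date
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "  # When: times
    r"(?P<what>.*?)\. Location: (?P<where>.*)\.$",     # What and Where
    entry)

print(m.group("date"), m.group("start"), m.group("end"))  # When
print(m.group("what"))                                    # What
print(m.group("where"))                                   # Where
```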
An example of a low quality entry is the following text, which could
originate from a flyer or a social media post. It lacks a precise date, time
and location and is therefore hard to interpret at first glance even for
people, let alone for a machine. To obtain the full information one has to
consult additional resources, which may not always be available.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the %
park tomorrow evening.}
\end{flushleft}

Information of such low quality is often not suitable for automated crawling.
In Figure~\ref{informationflow} this manual route is shown by the arrow going
straight from the source to the database. Non digital source types and very
sudden changes, such as surprise concerts or cancellations, are also
processed manually.

\subsection{Crawler}
Once a source has been determined and classified, the next step is to crawl
it periodically with an automated crawler. As said before, sources have to be
structured and reliable; when this is the case, a programmer creates a
program that visits the source systematically and automatically to extract
all new information. These crawlers are usually created specifically for one
or a few sources, and when a source changes, usually in structure, the
programmer has to adapt the crawler. Since writing and adapting crawlers
requires a programmer, the process is costly. Automatically crawled
information is not inserted into the database directly because it is not
reliable enough: when a source changes, malformed data can pass through. As a
safety net and final check, the information first goes to the
\textit{Temporum} before it is entered into the database.

\subsection{Temporum}
The \textit{Temporum} is a big bin that contains the raw data extracted from
the different sources by the automated crawlers. Not all information in the
\textit{Temporum} is suitable for the final database, so it has to be post
processed. The post processing encompasses several steps.

The first step is to check the validity of the event entries from a certain
source. Validity checking is useful to detect faulty automated crawlers
before their data can leak into the database. It is performed at random on a
sample of event entries.

An event entry usually describes a single occurrence of an event. In many
cases there is parent information that the event entry belongs to: in the
case of a concert tour, for example, the parent information is the tour and
the event entry is a single performance. The second step in post processing
is matching the event entries to their parent information, which can be a
venue, a tour, a showing, a tournament and much more.

Both post processing tasks are done by people with the aid of automated
tooling. Within these two steps malformed data can be spotted very quickly,
so the \textit{Temporum} acts as a safety net that keeps the probability of
malformed data leaking into the database as low as possible.

\subsection{Database \& Publication}
Post processed data that leaves the \textit{Temporum} enters the final
database. This database contains all information about events that happened
in the past and events that will happen in the future, as well as the parent
information, such as details about venues. Several categorical websites use
the database to offer the information to users and accompany it with the
second part of \textit{infotainment}, namely the subjective information. This
\textit{entertainment} part is usually presented in the form of trivia,
photos, interviews, reviews, previews and much more.

\section{Goal \& Research question}
Maintaining the automated crawlers and the infrastructure that provides the
\textit{Temporum} and its matching aids are the parts of the data flow that
require the most resources. Both require a programmer and are therefore
costly. The automated crawlers require a programmer because they are website
specific scripts or programs; changing such a script or program requires
knowledge of the source, the programming framework and the \textit{Temporum}.
In practice both tasks mean changing code.

A large group of sources changes structure frequently, so the task of
reprogramming crawlers has to be repeated often. Moreover, malfunctioning
crawlers are detected in the \textit{Temporum} and not at an earlier stage.
This late detection elongates the feedback loop, because communication
between the programmers and the \textit{Temporum} workers is not always
tight. In the case of a malfunction the source is first crawled; the
malformed data is then most likely processed into rubbish that is sent to the
\textit{Temporum}, where the error is detected after a while and the
programmers have to be contacted; finally the crawler is adapted to the new
structure and produces good data again. This feedback loop, shown in
Figure~\ref{feedbackloop}, can take days and can be the cause of gaps and
faulty information in the database. In the figure, arrows denote information
flow; the solid and dotted lines form the current feedback loop.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{feedbackloop.pdf}
\caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\end{figure}

The specific goal of this project is to relieve the programmer of spending a
lot of time repairing crawlers, and to make the task of adapting, editing and
removing crawlers feasible for someone without programming experience. In
practice this means shortening the feedback loop. The shortened feedback loop
is also shown in Figure~\ref{feedbackloop}: the dashed line shows the shorter
loop that relieves the programmer.

For this project a system has been developed that provides an interface to a
crawler generation system able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface is point and click, meaning that no computer
science background is needed to create, modify, test and remove crawlers. The
current Hyperleap backend system that handles the data can query XML feeds
containing the crawled data.

The actual research question can then be formulated as:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds that
publish their data in a restricted XML format\cite{Xml} consisting of
entries. Every entry usually represents an event and consists of standardized
data fields. The fields we are interested in are the \textit{title} and
\textit{description} fields, which store the raw data describing the event.
Besides these there are several auxiliary fields that, for example, store the
link to the full article, the publication date, some media in the form of an
image or video URL, or a \textit{Globally Unique Identifier}
(GUID)\footnote{A GUID is a unique identifier that in most cases is the
permalink of the article, a link that will always point to the article.}. An
example of an RSS feed can be found in Listing~\ref{examplerss}, which shows
a partly truncated RSS feed of a well known venue in the Netherlands. Every
RSS feed contains a \texttt{channel} field, and within that field there is
some metadata and a list of \texttt{item} fields. Every \texttt{item} field
has a fixed number of fields; the most important ones for RSS within the
leisure industry are the \texttt{title} and \texttt{description} fields.

\lstinputlisting[language=xml,label={examplerss},caption={An example of a %
partly truncated RSS feed of a well known Dutch venue}]{exrss.xml}

RSS feeds are mostly used by news sites to publish their articles. An RSS
feed only contains the headlines of the entries; if readers are interested,
they can click the so called deeplink and are sent to the full website
containing the complete entry or article. Users often use programs that
bundle user specified RSS feeds into one big combined feed that can be used
to sift through many news feeds at once. Such RSS feed readers use the unique
GUID to skip entries that are already present in their internal database.

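A minimal sketch of this mechanism, using Python's standard library on a
hand-written feed fragment; the venue, item text and GUID are invented for
the example.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS fragment; the venue, item and GUID are invented.
RSS = """<rss version="2.0"><channel>
<title>Example venue</title>
<item>
  <title>2015-05-20, 18:00-23:00 - Foobar presenting their new CD</title>
  <description>In combination with a show. Location: small salon.</description>
  <guid>http://example.org/agenda/1</guid>
</item>
</channel></rss>"""

seen = set()  # GUIDs of entries already in the reader's internal database
for item in ET.fromstring(RSS).iter("item"):
    guid = item.findtext("guid")
    if guid in seen:          # entry was processed before: skip it
        continue
    seen.add(guid)
    print(item.findtext("title"))
    print(item.findtext("description"))
```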
Generating RSS feeds by hand is a tedious task, but almost all RSS feeds are
generated by the Content Management System (CMS) on which the website runs.
With this automatic method the RSS feeds are generated for the content
published on the website, and because the process is automatic the feeds are
generally very consistent in structure. In the entertainment industry, venues
often use a CMS for their website so that users without a programming or
website background can post news items and event information; such venues
should therefore often have RSS feeds.

\section{Why RSS?}
\label{sec:whyrss}
There are many different source formats, such as HTML, fax/email, RSS and
XML. Because of the limited scope of the project and the time planned for it,
we had to exclude some input formats, as they all require different
techniques and approaches. For example, when the input source is in HTML
format, most probably a website, a great deal of the information extraction
can be automated using the structural information that is characteristic of
HTML. For fax/email, on the other hand, there is almost no structural
information, and most automation techniques require natural language
processing. We chose RSS feeds because, although their entries contain no
inherent structural markup, their contents are still very structured. This
structure exists because, as said above, RSS feeds are generated and
therefore almost always look the same. Also, in RSS feeds most venues use
particular structural identifiers in the form of characters: they separate
fields with vertical bars, commas, whitespace and other non text characters.
Such field separators and keywords can be hard for a computer to detect, but
people are very good at detecting them; with one look they can identify the
characters and keywords and build a pattern in their head. Another reason we
chose RSS is temporal consistency: because RSS feeds are almost always
generated, the structure of the entries is very unlikely to change.
Basically, an RSS feed only changes structure when the CMS that generates it
changes its generation algorithm. This property is useful because the
crawlers then do not have to be retrained very often. To detect the
underlying structures, a technique is used that exploits subword matching
with graphs.

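As a sketch of such character-based field separators, consider a feed title
that uses vertical bars between fields; the title format here is invented for
illustration.

```python
# Hypothetical feed title using vertical bars as field separators.
title = "2015-05-20 | Foobar | Small salon"

# Splitting on the separator recovers the When, What and Where fields.
fields = [f.strip() for f in title.split("|")]
when, what, where = fields
print(when, what, where)
```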
\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs (DGs) are mathematical structures that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges, defined by $E\subseteq
V\times V$. An edge $e\in E$ is defined as $(v_1, v_2)$, where $v_1,v_2\in
V$, and is shown in a figure as an arrow from node $v_1$ to node $v_2$. Two
connections between the same pair of nodes are possible if their directions
differ. For example, the graph visualized in Figure~\ref{graphexample} can be
mathematically described as:
$$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\})$$

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{graphexample.pdf}
\caption{Example DG}
\label{graphexample}
\end{figure}

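The edge-set definition above can be mirrored directly in code; the sketch
below stores the example graph $G$ as a set of ordered pairs.

```python
# The example graph G as a set of ordered pairs (directed edges).
V = {"n1", "n2", "n3", "n4"}
E = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"), ("n3", "n4"), ("n1", "n4")}

def has_edge(v1, v2):
    """An arrow from v1 to v2 exists iff the ordered pair is in E."""
    return (v1, v2) in E

print(has_edge("n1", "n2"))  # True
print(has_edge("n2", "n1"))  # True: the opposite direction is its own edge
print(has_edge("n4", "n1"))  # False
```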
\paragraph{Directed acyclic graphs}
Directed acyclic graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are
not allowed. Figure~\ref{dagexample} shows two graphs; the bottom graph
contains a cycle and the top graph does not, so only the top graph is a valid
DAG. A cycle can be defined as follows:

Write $v_1\xrightarrow{+}v_n$ when there is a path $v_1\rightarrow v_2
\rightarrow \ldots \rightarrow v_{n-1} \rightarrow v_n$, that is, a
connection of length at least $1$ between $v_1$ and $v_n$. A graph is acyclic
when $\nexists v\in V: v\xrightarrow{+}v$; in words, no node can be reached
again by traveling the graph from that node. Adding acyclicity lowers the
computational complexity of checking whether a given path exists in the graph
to $\mathcal{O}(L)$, where $L$ is the length of the path.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dagexample.pdf}
\caption{Example DAG}
\label{dagexample}
\end{figure}

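Checking whether a given node path exists can be sketched as follows; the
adjacency list is a small invented DAG, not the one in the figure.

```python
# Successor sets of a small invented DAG.
succ = {"n1": {"n2", "n4"}, "n2": {"n3"}, "n3": {"n4"}, "n4": set()}

def path_exists(path):
    # One set lookup per step, so the check is O(L) in the path length L.
    return all(b in succ.get(a, set()) for a, b in zip(path, path[1:]))

print(path_exists(["n1", "n2", "n3", "n4"]))  # True
print(path_exists(["n2", "n4"]))              # False: no edge n2 -> n4
```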
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is, as in directed graphs, the set of nodes. $E$ is the
set of labeled edges, described by $E\subseteq V\times V\times A$ where $A$
is the alphabet used for the labels. In words this means that an edge is a
labeled arrow between two nodes; for example, the edge $(v_1, v_2, a)$ can be
visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$
contains all characters used in natural language, but in theory it can
contain anything. $F$ describes the set of final nodes: nodes that can be the
end of an accepted sequence even if there is an arrow leading out of them. In
the example graph in Figure~\ref{exampledawg} the final nodes are visualized
with a double circle as node shape. In this example the marking is purely
cosmetic for $n6$, because $n6$ has no outgoing arrows and is therefore final
anyway. This need not be the case: in $G=(\{n1, n2, n3\}, n1, \{(n1, n2, a),
(n2, n3, b)\}, \{n2, n3\})$, which accepts the words \textit{a} and
\textit{ab}, marking $n2$ as final is essential. Finally, $v_0$ describes the
initial node, visualized in figures as an incoming arrow without a source
node. Because the edges are labeled, data can be stored in a DAWG: traversing
the DAWG while collecting the edge labels constructs a word. Using graph
minimization, big sets of words can be stored in a small amount of storage
because edges can be reused across words. For example, the graph in
Figure~\ref{exampledawg} describes the language $L$ whose accepted words are
$w\in\{abd, bad, bae\}$. Testing whether a word is present in the DAWG uses
the same technique as testing whether a node path is present in a normal DAG
and therefore also has computational complexity $\mathcal{O}(L)$: it grows
linearly with the length of the word.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dawgexample.pdf}
\caption{Example DAWG}
\label{exampledawg}
\end{figure}