\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in the local theater?

When the internet was in its early days and started to become accessible to
most people, information about entertainment was still obtained almost
exclusively from flyers, books, posters and radio or TV advertisements. People
had to search hard for information and could easily miss a show simply because
they never crossed paths with it. Today the internet is used daily by almost
everyone in western society, and one would expect that missing an event would
be impossible given the enormous amount of information one can receive every
day using the internet. For information about leisure activities, however, the
opposite is true.

Nowadays information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building beautiful,
fast and, above all, modern websites that present their information with nice
graphics, animations and gimmicks. There are also companies that bundle the
information from different websites to provide an overview. Information
bundling websites often use multiple individual venue websites as the source
for their information, and therefore the information is often incomplete. This
is because individual venues tend to assume, for example, that the address of
their venue is obvious, that their ticket price is always fixed at
\EURdig$5.-$ or that a membership is required to attend the events. Individual
organizations usually put this non-specific information in a disclaimer or on
a separate page, and information bundling websites frequently miss out on this
type of information: they can crawl the individual events, but gathering the
miscellaneous information is usually done by hand.

Combining the information from the different data sources turns out to be a
hard task for such information bundling websites, because they do not have the
resources and time reserved for it and therefore often serve incomplete
information. Because of the complexity of obtaining complete information there
are not many websites trying to bundle entertainment information into a
complete and consistent database. Hyperleap\footnote{\url{http://hyperleap.nl}}
tries to achieve the goal of serving complete and consistent information.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed before the internet was
widespread. Hyperleap, active since 1995, is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}. It
represents a combination of factual information and subjective information
(entertainment) within a certain category or field. In the case of Hyperleap
the category is the leisure industry, which encompasses all facets of
entertainment, from cinemas, theaters and concerts to swimming pools, bridge
competitions and conferences. Within the leisure industry, factual
information includes, but is not limited to, information such as starting
time, location, host or venue and duration. Subjective information includes,
but is not limited to, reviews, previews, photos, background information and
trivia.

Hyperleap manages the largest database containing \textit{infotainment} about
the leisure industry focused on the Netherlands and surrounding regions. The
database contains on average over $10.000$ categorized events per week, and
their venue database contains over $54.000$ venues delivering leisure
activities, ranging from theaters and music venues to petting zoos and fast
food restaurants. All the subjective information is obtained or created by
Hyperleap, and all factual information is gathered from different sources and
quality checked, and is therefore very reliable. Hyperleap is the only company
of its kind with information of such high quality. The \textit{infotainment}
is presented via several websites, specialized per genre or category, and some
sites attract over $500.000$ visitors per month.

\section{Information flow}
The reason why Hyperleap is the only company of its kind with such high
quality data is that Hyperleap spends a lot of time and resources on quality
checking, cross comparing and consistency checking before the data enters the
database. To achieve this, the data goes through several different stages
before it enters the database. These stages are visualized in
Figure~\ref{informationflow} as an information flow diagram. In this diagram
the nodes are processing steps and the arrows denote information transfer or
flow.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{informationflow.eps}
\strut\\
\strut\\
\caption{Information flow Hyperleap database}
\label{informationflow}
\end{figure}

\newpage
\subsection*{Sources}
The information that enters the database has to be quality checked, and this
checking starts at the source of the data. There are several criteria the
information from a source has to comply with before it can enter the database.
The prerequisites for a source are, for example, that the source has to be
reliable, consistent and freely usable by licence. Event information from a
source must contain at least the following fields of information.\\
\textbf{What:}\\
The \textit{What} field describes the content of the event; content is a very
broad notion. In practice it can describe the concert tour name, theater show
title, movie title, festival title and much more.\\
\textbf{Where:}\\
The \textit{Where} field is the location of the event. The location is often
omitted because the organization behind the source presenting the information
thinks it is obvious. This field can also include different sub-locations, for
example when a pop concert venue has its own building but organizes a festival
in some park during the summer. This data is often assumed to be trivial and
inherent, but in practice this is not the case: in this example the name of
the park alone is often not enough for an outsider.\\
\textbf{When:}\\
The \textit{When} field is the time and date of the event. Hyperleap wants to
have at minimum the date, start time and end time. In practice end times are
often omitted because they are not fixed or because the organization thinks
they are obvious.\\
\strut\\
Venues often present incomplete data entries that do not comply with the
requirements explained above. Within the information flow, categorizing and
grading the source is the first step. Hyperleap processes different sources
and source types, and every source has different characteristics. Sources can
be modern sources like websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as sources
vary in content structure, they also vary in reliability. For example, the
following entry is an example of a very well structured, probably generated,
and thus probably also reliable event description. The entry could originate,
for example, from the title of an entry in an RSS feed. The example has a
clear structure and almost all required information is directly available
from the entry.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
new CD in combination with a show. Location: small salon.}
\end{flushleft}

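The structure of such a well formed entry can be recognized mechanically. The
following sketch assumes a hypothetical entry format of the shape
\texttt{date, start-end - description}; the pattern and field names are
illustrative and not part of Hyperleap's actual system.

```python
import re

# Hypothetical pattern for entries of the form
# "YYYY-MM-DD, HH:MM-HH:MM - <description>".
ENTRY_RE = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}),\s*"
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2})\s*-\s*"
    r"(?P<what>.+)"
)

def parse_entry(text):
    """Split a well structured feed title into When/What fields."""
    m = ENTRY_RE.match(text)
    if m is None:
        return None  # entry does not follow the expected structure
    return m.groupdict()

entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their "
         "new CD in combination with a show. Location: small salon.")
fields = parse_entry(entry)
```

A hand-written pattern like this only works for one source layout, which is
exactly the maintenance problem described later in this chapter.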
An example of a terrible entry could be the following text, which could
originate from a flyer or a social media post. This example lacks a precise
date, time and location and is therefore hard for people to grasp at first,
let alone for a machine. When someone wants to get the full information they
have to tap into different resources, which might not always be available.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the %
park tomorrow evening.}
\end{flushleft}

\subsection*{Crawler}
When the source has been determined and classified, the next step is to
periodically crawl the source. At the moment the crawling happens using two
main methods.\\
\textbf{Manual crawling:}\\
Manual crawling is basically letting an employee access the source and put the
information directly into the database. This often happens with non-digital
sources and with very sudden events or event changes, such as surprise
concerts or event cancellations.\\
\textbf{Automatic crawling:}\\
Some sites are very structured, and a programmer can create a program that
visits the website systematically and automatically to extract all the new
information. Not all digital sources are suitable to be crawled automatically,
and those will still need manual crawling. The programmed crawlers are always
specifically created for one or a couple of sources, and when a source
changes, for example its structure, the programmer has to adapt the crawler,
which is costly. Information from all the crawlers first goes to the
\textit{Temporum}.

\subsection*{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from
different sources, which has to be post-processed to be suitable for the
actual database. This post-processing encompasses several possible tasks. The
first task is to check the validity of the entry. This is not done for every
entry; instead, random samples are taken and tested, to verify that the
crawler is not malfunctioning and that there is no nonsense in the data.

The second step is matching the entry to several objects. Objects in the
broadest sense can basically be anything: in the case of a theater show
performance the matched object can be the theater show itself, describing the
show for all locations. In the case of a concert the object can be a tour the
band is doing. When the source is a ticket vendor, the object is the venue
where the event takes place, since the source itself is not a venue. Many of
these tasks are done by a human or with the aid of a computer program. The
\textit{Temporum} acts as a safety net because no data goes straight through
to the database: it is first checked, matched and validated.

\subsection*{Database \& Publication}
When the data has been post-processed it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen in the future. The database also contains
information about the venues. The database is linked to several categorical
websites that offer the information to users and accompany it with the other
part of the \textit{infotainment}, namely the subjective information, which
is usually presented in the form of trivia, photos, interviews, reviews,
previews and much more.

\section{Goal \& Research question}
Crawling the sources and applying the preprocessing is the most expensive task
in the entire information flow because it requires a programmer to perform
these tasks. Programmers are needed because all the crawlers are scripts or
programs that were created specifically for a website; changing a crawler or
preprocessing job means changing the code. A big group of sources changes
often, which makes this task very expensive and gives it a long feedback loop.
When a source changes, it is first preprocessed in the old way, sent to the
\textit{Temporum}, and checked and matched by a human. The human then notices
the error in the data and has to contact the programmer. The programmer then
has to adapt the specific crawler or job to the new structure. This feedback
loop, shown in Figure~\ref{feedbackloop}, can take days and can be the reason
for gaps and faulty information in the database. In the figure the dotted
arrows denote the current feedback loop for crawlers.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{feedbackloop.eps}
\strut\\\strut\\
\caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\end{figure}
\strut\\
The goal of this project is specifically to relieve the programmer of
repairing crawlers all the time and to make the task of adapting, editing and
removing crawlers doable for someone without programming experience. In terms
of Figure~\ref{feedbackloop}, this means that the project will replace the
part of the feedback loop drawn with dotted arrows by the shorter loop drawn
with dashed arrows.

For this project an application has been developed that provides an interface
to a crawler generation system able to crawl RSS\cite{Rss} and
Atom\cite{Atom} publishing feeds. The interface offers the user point and
click interaction, meaning that no computer science background is needed to
create, modify, test and remove crawlers. The current Hyperleap backend system
that handles the data can, via a query, generate XML feeds that contain the
crawled data. The structure of the application is very modular and generic,
and therefore it is easy to change things in the program without having to
know a lot about the programming language used. To achieve this, all visual
elements such as buttons are defined in human readable text files. In practice
this means that one person, not necessarily a programmer, can be instructed to
change the structure, which can greatly reduce programmer intervention time.

The actual problem statement then becomes:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain %
crawlers for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field; in these fields the raw
data describing the event is stored. Furthermore there are several auxiliary
fields that store the link to the full article, the publishing date, some
media in the form of an image or video URL, and a \textit{Globally Unique
Identifier} (GUID)\footnote{A GUID is a unique identifier that in most cases
is the permalink of the article. A permalink is a link that will always point
to the article}. An example of an RSS feed can be found in
Listing~\ref{examplerss}; this listing shows a partly truncated RSS feed of a
well known venue in the Netherlands. As the listing shows, every RSS feed
contains a \texttt{channel} tag, and within that tag there is some metadata
and a list of \texttt{item} tags. Every \texttt{item} tag has a fixed number
of different fields. The most important fields for RSS within the leisure
industry are the \texttt{title} and the \texttt{description} field.

\lstinputlisting[language=xml,label={examplerss},caption={An example of a %
partly truncated RSS feed of a well known Dutch venue}]{exrss.xml}

RSS feeds are mostly used by news sites to publish their articles. An RSS feed
only contains the headlines of the entries. If the user reading the feed is
interested, they can click the so-called deeplink and will be sent to the full
website containing the full entry or article. Users often use programs that
bundle user specified RSS feeds into one big combined feed that can be used to
sift through a lot of news feeds. The RSS feed reader uses the GUID to skip
entries that are already present in its internal database.

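The GUID-based skipping can be sketched as follows; this is a minimal
illustration of how a reader might do it, with the \texttt{seen\_guids} set
standing in for the reader's internal database.

```python
# Minimal sketch of GUID-based deduplication in a feed reader.
def new_entries(entries, seen_guids):
    """Return only the entries whose GUID has not been seen before,
    recording the new GUIDs as a side effect."""
    fresh = []
    for entry in entries:
        guid = entry["guid"]
        if guid not in seen_guids:
            seen_guids.add(guid)
            fresh.append(entry)
    return fresh

seen = set()
batch1 = [{"guid": "e1", "title": "Foobar CD show"},
          {"guid": "e2", "title": "Foobar in the park"}]
batch2 = [{"guid": "e2", "title": "Foobar in the park"},
          {"guid": "e3", "title": "Foobar encore"}]
first = new_entries(batch1, seen)   # both entries are new
second = new_entries(batch2, seen)  # only the entry with GUID "e3" is new
```

The same idea lets a crawler re-visit a feed periodically without producing
duplicate events.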
Generating RSS feeds by hand is a tedious task, but almost all RSS feeds are
generated by the Content Management System (CMS) on which the website runs.
With this automatic method the RSS feeds are generated from the content
published on the website. Because the process is automatic, the RSS feeds are
generally very consistent in their structure. In the entertainment industry
venues often use a CMS for their website, so that users with no programming or
website background are able to post news items and event information, and
thus these websites often have RSS feeds.

\section{Why RSS?}
\label{sec:whyrss}
We described a lot of different source formats, such as HTML, fax/email, RSS
and XML. Because of the limited scope of the project and the time planned for
it, we had to drop some of the input formats, because they all require
different techniques and approaches to tackle. For example, when the input
source is in HTML format, most probably a website, a great deal of the
information extraction can be automated using the structural information that
is characteristic of HTML. For fax/email, however, there is almost no
structural information, and most of the automation techniques require natural
language processing. We chose RSS feeds because RSS feeds lack inherent
structural information about the event fields, but are still very structured.
This structure exists because, as said above, the RSS feeds are generated and
therefore almost always look the same. Also, in RSS feeds most venues use
particular structural identifiers that are characters: they separate fields
with vertical bars, commas, whitespace and other non-text characters. These
field separators and keywords can be hard for a computer to detect, but people
are very good at detecting them. With one look they can identify the
characters and keywords and build a pattern in their head. Another reason we
chose RSS is temporal consistency: RSS feeds are almost always generated, and
because of that the structure of the entries is very unlikely to change.
Basically, an RSS feed only changes structure when the CMS that generates it
changes its generation algorithm. This property is useful because the crawlers
then do not have to be retrained very often. To detect the structures we
propose a technique using subword matching with graphs.

\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs (DGs) are a mathematical structure that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges, defined by $E\subseteq
V\times V$. An edge $e\in E$ can be written as $(v_1, v_2)$ where $v_1,v_2\in
V$, and can be interpreted in a figure as an arrow between node $v_1$ and node
$v_2$. Multiple connections between two nodes are possible. For example the
graph visualized in Figure~\ref{graphexample} can be mathematically described
as:
$$G=(\{n_1, n_2, n_3, n_4\}, \{(n_1, n_2), (n_2, n_1), (n_2, n_3), (n_3, n_4), (n_1, n_4)\})$$

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{graphexample.eps}
\strut\\\strut\\
\caption{Example DG}
\label{graphexample}
\end{figure}

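In code, such a graph can be represented directly as an edge set. The sketch
below mirrors the definition $G=(V,E)$; the node names follow the example
above, and the \texttt{successors} helper is only illustrative.

```python
# The example graph G=(V,E) as literal Python sets.
V = {"n1", "n2", "n3", "n4"}
E = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"), ("n3", "n4"), ("n1", "n4")}

def successors(v, edges):
    """All nodes reachable from v via a single edge."""
    return {w for (u, w) in edges if u == v}
```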
\paragraph{Directed acyclic graphs}
Directed acyclic graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with a restriction on $E$, namely that cycles
are not allowed. Figure~\ref{dagexample} shows two graphs: the bottom graph
contains a cycle and the top graph does not. Only the top graph is a valid
DAG. A cycle can be defined as follows:

If $e\in E$ is written as $(v_1,v_2)\in E$ or $v_1\rightarrow v_2$, then
$v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots \rightarrow v_2$,
meaning that there is a path of length at least $1$ between $v_1$ and $v_2$.
In an acyclic graph the following holds: $\nexists v_1\in V:
v_1\xrightarrow{+}v_1$. In words this means that no node can be reached again
while traveling the graph. Adding the property of acyclicity to graphs lowers
the computational complexity of testing path existence in the graph to
$\mathcal{O}(L)$, where $L$ is the length of the path.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dagexample.eps}
\strut\\\strut\\
\caption{Example DAG}
\label{dagexample}
\end{figure}

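The $\mathcal{O}(L)$ check alluded to above can be sketched as testing whether
a given sequence of nodes forms a path, which takes one edge lookup per step.
The edge set below is an illustrative stand-in, not the graph from the figure.

```python
def is_path(nodes, edges):
    """True if every pair of consecutive nodes in the sequence is an edge;
    this takes len(nodes) - 1 lookups, i.e. O(L) in the path length."""
    return all((a, b) in edges for a, b in zip(nodes, nodes[1:]))

# A small acyclic edge set used only for illustration.
E = {("n1", "n2"), ("n2", "n3"), ("n1", "n3")}
```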
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is the same as in directed graphs, namely the set of
nodes. $E$ is the set of edges, described by $E\subseteq V\times V\times A$
where $A$ is the alphabet used for edge labels. In words this means that an
edge is a labeled arrow between two nodes; for example the edge $(v_1, v_2,
a)$ can be visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the
alphabet $A$ contains all the characters used in natural language, but in
theory the alphabet $A$ can contain anything. $F$ describes the set of final
nodes; final nodes are nodes that can be the end of a sequence even if there
is an arrow leading out. In the example graph in Figure~\ref{exampledawg} the
final nodes are visualized with a double circle as node shape. In this example
the marking is purely cosmetic, because $n_6$ is a final node anyway as there
are no arrows leading out. But this does not need to be the case; for example
in $G=(\{n_1, n_2, n_3\}, n_1, \{(n_1, n_2, a), (n_2, n_3, b)\}, \{n_2,
n_3\})$ there is a distinct use for the final node marking. Graph $G$ accepts
the words \textit{a} and \textit{ab}, and to simplify the graph both node
$n_2$ and $n_3$ are final. Finally, $v_0$ describes the initial node, which is
visualized in figures as an incoming arrow. Because of the property of labeled
edges, data can be stored in a DAWG. When traversing a DAWG and saving all the
edge labels, one can construct words. Using graph minimization, big sets of
words can be stored using a small amount of storage because edges can be
re-used to specify transitions. For example the graph in
Figure~\ref{exampledawg} describes the language of all accepted words
$w\in\{abd, bad, bae\}$. Testing whether a word is present in the DAWG uses
the same technique as testing whether a node path is present in a normal DAG,
and therefore also falls in the computational complexity class
$\mathcal{O}(L)$: it grows linearly with the length of the word.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dawgexample.eps}
\strut\\\strut\\
\caption{Example DAWG}
\label{exampledawg}
\end{figure}
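
A membership test on such a DAWG can be sketched as follows. The node names
and edge layout below are illustrative and need not match the figure, but the
graph accepts exactly the words $\{abd, bad, bae\}$, and the lookup takes one
labeled-edge step per character, matching the $\mathcal{O}(L)$ claim above.

```python
# A DAWG for {abd, bad, bae}, with edges stored as (node, label) -> node.
edges = {
    ("q0", "a"): "qa", ("q0", "b"): "qb",
    ("qa", "b"): "qab", ("qb", "a"): "qba",
    ("qab", "d"): "qf",
    ("qba", "d"): "qf", ("qba", "e"): "qf",
}
final = {"qf"}  # the set F of final nodes

def accepts(word, start, edges, final):
    """Follow one labeled edge per character; O(len(word)) steps."""
    node = start
    for ch in word:
        node = edges.get((node, ch))
        if node is None:
            return False  # no outgoing edge with this label
    return node in final
```

Note how the edge $(q_3\,{=}\,\texttt{qba}, \texttt{qf}, d)$ and the edge from
\texttt{qab} share the target node, the edge re-use that makes minimized DAWGs
compact.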