\section{Introduction}
What do people do when they want to see a movie, attend a concert or find out
which shows are playing in their local theater?

When the internet was young and had just become accessible to the general
public, information about entertainment was still obtained almost exclusively
from flyers, books, posters and radio or TV advertisements. People had to look
hard for information and could easily miss a show simply because they never
crossed paths with it.
Today the internet is used daily by almost everyone in western society, and
one would expect that missing an event is impossible given the enormous
amount of information available online. For information about leisure
activities, however, the opposite is true.

Nowadays information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and above all modern website that presents their information with nice
graphics, animations and gimmicks. There are also companies that bundle the
information from different websites to provide an overview. These information
bundling websites often use multiple individual venue websites as the source
of their information, and as a result that information is frequently
incomplete. This is because individual venues tend to assume, for example,
that the address of their venue is obvious, that their ticket price is always
fixed at \EURdig$5.-$ or that a membership is required to attend the events.
Individual organizations usually put such non specific information in a
disclaimer or on a separate page, and information bundling websites often
miss it: while they can crawl the individual events, gathering the
miscellaneous information is usually done by hand.

Combining the information from different data sources turns out to be a hard
task for such information bundling websites, because they do not have the
resources and time reserved for it and therefore often serve incomplete
information. Because of the complexity of obtaining complete information,
there are few websites that try to bundle entertainment information into a
complete and consistent database.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve this goal of
serving complete and consistent information.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed before the internet was
widespread. Active since 1995, Hyperleap is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}. It
denotes the combination of factual information and subjective information
(entertainment) within a certain category or field. In the case of Hyperleap
the category is the leisure industry, which encompasses all facets of
entertainment, from cinemas, theaters and concerts to swimming pools, bridge
competitions and conferences. Within the leisure industry, factual
information includes, but is not limited to, starting time, location, host or
venue and duration. Subjective information includes, but is not limited to,
reviews, previews, photos, background information and trivia.

Hyperleap manages the largest database of \textit{infotainment} about the
leisure industry in the Netherlands and surrounding areas. The database
contains on average over $10.000$ categorized events per week, and the venue
database contains over $54.000$ venues offering leisure activities, ranging
from theaters and music venues to petting zoos and fast-food restaurants. All
subjective information is obtained or created by Hyperleap, and all factual
information is gathered from different sources and quality checked, making it
very reliable. Hyperleap is the only company of its kind with information of
such high quality. The \textit{infotainment} is presented via several
websites specialized per genre or category, some of which attract over
$500.000$ visitors per month.

\section{Information flow}
Hyperleap is the only company of its kind with such high quality data because
it spends a lot of time and resources on quality checking, cross comparing
and consistency checking before the data enters the database. To achieve
this, the data is inserted into the database in several stages, visualized in
Figure~\ref{informationflow} as an information flow diagram in which nodes
are processing or publishing steps and arrows represent information flow.

\begin{figure}[H]
	\label{informationflow}
	\centering
	\scalebox{0.7}{
		\digraph[]{graph111}{
			rankdir=TB;
			node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1];
			edge [weight=5.];
			i0 [label="Website"];
			i1 [label="Email"];
			i2 [label="Fax"];
			i3 [label="RSS/Atom"];
			p1 [label="Preprocessing"];
			p2 [label="Temporum: Postprocessing"];
			o1 [label="Database: Insertion"];
			o2 [label="TheAgenda"];
			o3 [label="BiosAgenda"];
			o4 [label="..."];
			node [width=5]; p1 p2 o1;
			i0 -> p1;
			i1 -> p1;
			i2 -> p1;
			i3 -> p1;
			p1 -> p2;
			p2 -> o1;
			o1 -> o2;
			o1 -> o3;
			o1 -> o4;
		}
	}
	\caption{Information flow Hyperleap database}
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There are
several criteria that the information and its source have to comply with
before either can enter the database. For example, the source has to be
reliable, consistent and free by licence, whereas the event entries have to
contain at least the following fields:
\begin{itemize}
	\item[What]
		The \textit{What} field describes the content of the event, and
		content is a very broad notion. In practice it can be the name of a
		concert tour, the title of a theater show, movie or festival, and
		much more.
	\item[Where]
		The \textit{Where} field is the location of the event. The location
		is often omitted because the organization behind the source
		presenting the information thinks it is obvious. This field can also
		include different sublocations, for example when a pop concert venue
		has its own building but organizes a festival in a park during the
		summer. Such data is often assumed to be trivial and inherent, but
		in practice it is not: in this example, the name of the park alone
		is often not enough for an outsider.
	\item[When]
		The \textit{When} field is the time and date of the event. Hyperleap
		wants to have at minimum the date, start time and end time. In
		practice, end times are often omitted because they are not fixed or
		because the organization thinks they are obvious.
\end{itemize}

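The minimal field requirements above can be expressed as a small validation
check. The following Python sketch is illustrative only; the field names are
hypothetical and not Hyperleap's actual schema.

```python
# Hypothetical sketch of the minimal What/Where/When check described above.
# The field names are illustrative, not Hyperleap's actual schema.
REQUIRED_FIELDS = ("what", "where", "when")

def missing_fields(entry):
    """Return the required fields that are absent or empty in an entry."""
    return [field for field in REQUIRED_FIELDS if not entry.get(field)]

# An entry lacking its location, like many real source entries:
entry = {"what": "Foobar CD presentation", "when": "2015-05-20 18:00-23:00"}
print(missing_fields(entry))  # ['where']
```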
Venues often present incomplete data entries that do not comply with the
requirements explained above.

The first step in the information flow is the source. Hyperleap processes
different sources, and every source has different characteristics. Sources
can be modern, such as websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as sources
vary in content structure, they also vary in reliability. For example, the
following entry is a very well structured and probably generated, and thus
probably also reliable, event description. Such an entry could originate from
an RSS feed.

\begin{flushleft}
	\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD
	in combination with a show. Location: small salon.}
\end{flushleft}

An example of a terrible entry is the following text, which could originate
from a flyer or a social media post.

\begin{flushleft}
	\texttt{\textit{Foobar} playing to celebrate their CD release in the park
	tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard to
grasp at first, even for people, let alone for a machine. To obtain the full
information one has to tap into different resources, which might not always
be available.

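Under the assumption that a generated feed always follows the date, time,
description layout of the well structured entry above, extracting the fields
is straightforward; a minimal sketch in Python:

```python
import re

# Sketch: field extraction for the well structured entry shown above.
# The pattern assumes that exact layout and would fail on the flyer text.
entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their new CD in "
         "combination with a show. Location: small salon.")

pattern = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2}), "                   # 2015-05-20
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "  # 18:00-23:00
    r"(?P<description>.+)")                            # free text

match = pattern.match(entry)
print(match.group("date"), match.group("start"), match.group("end"))
```

The free-form flyer text has no such fixed pattern, which is exactly why it
is hard for a machine to process.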
\paragraph{Crawler}
When the source has been determined and classified, the next step is to crawl
the source periodically. At the moment the crawling happens using a couple of
different methods. Some sites are very structured, and a programmer creates a
program that can visit the website systematically and extract the
information. Other sources are very hard to crawl programmatically and are
visited by employees instead. The programmed crawlers are always created
specifically for one source or a couple of sources, and when a source
changes, for example its structure, the programmer has to adapt the crawler,
which is costly. The information from the different crawlers then goes to the
\textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from
different sources; this data has to be post processed to be suitable for the
actual database. The post processing encompasses several possible tasks. The
first task is to check the validity of the entry. This is not done for every
entry: random samples are taken and tested to verify that the crawler is not
malfunctioning and that there is no nonsense in the data.
The second task is matching the entry to one or more objects. Objects in the
broadest sense can be almost anything: in the case of a theater show
performance the matched object can be the theater show itself, describing the
show for all locations, and in the case of a concert it can be the tour the
band is doing. When the source is a ticket vendor, the object is the venue
where the event takes place, since the source itself is not a venue.
Many of these tasks are done by a human or with the aid of a computer
program. The \textit{Temporum} acts as a safety net because no data goes
straight through to the database; it is first checked, matched and validated.

\paragraph{Database \& Publication}
When the data has been post processed it is entered into the final database.
The database contains all events that happened in the past and all events
that are going to happen in the future, as well as information about the
venues. The database is linked to several categorical websites that offer
the information to users and accompany it with the other part of the
\textit{infotainment}, namely the subjective information, which is usually
presented in the form of trivia, photos, interviews, reviews, previews and
much more.

\section{Goal \& Research question}
Crawling the sources and applying the preprocessing is the most expensive
task in the entire information flow because it requires a programmer to be
hired. Currently only programmers can create crawlers and design
preprocessing jobs, because all current programs are designed to do one
specific thing, most of the time for one source. Because a big group of
sources changes often, this task becomes very expensive and has a long
feedback loop. When a source changes, it is first preprocessed in the old
way, sent to the \textit{Temporum}, and checked and matched by a human. The
human then notices the error in the data and contacts the programmer, who has
to adapt the specific crawler to the new structure. This feedback loop, shown
in Figure~\ref{feedbackloop}, can take days and can be the reason for gaps
and faulty information in the database.
\begin{figure}[H]
	\label{feedbackloop}
	\centering
	\scalebox{0.5}{
		\digraph[]{graph112}{
			rankdir=LR;
			node [shape="rectangle"];
			source [label="Source"];
			crawler [label="Crawler"];
			temporum [label="Temporum"];
			user [label="User"];
			programmer [label="Programmer"];
			database [label="Database"];
			source -> crawler;
			crawler -> temporum;
			temporum -> user;
			user -> database;
			user -> programmer [constraint=false,style="dotted"];
			user -> crawler [constraint=false,style="dashed"];
			programmer -> crawler [constraint=false,style="dotted"];
		}
	}
	\caption{Feedback loop for malfunctioning crawlers}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers
all the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{feedbackloop}, this means replacing the dotted arrows with the
dashed arrow.

For this project an application has been developed that provides an interface
to a crawler generation system able to crawl RSS\cite{Rss} and
Atom\cite{Atom} publishing feeds. The interface is point and click, meaning
that no computer science background is needed to create, modify, test and
remove crawlers. The current Hyperleap backend system that handles the data
can, via an interface, generate XML feeds that contain the crawled data. The
structure of the application is very modular and generic, and it is therefore
easy to change things in the program without knowing a lot about the
programming language used. For example, all visual elements such as buttons
are defined in a human readable text file. In practice this means that one
person, not necessarily a programmer, can be instructed to change the
structure, which can greatly reduce programmer intervention time.

The actual problem statement then becomes:
\begin{center}
	\textit{Is it possible to shorten the feedback loop for repairing and
	adding crawlers by making a system that can create, add and maintain
	crawlers for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field; these store the raw data
describing the event. Furthermore there are several auxiliary fields that
store the link to the full article, the publishing date, possibly some media
in the form of an image or video URL, and a \textit{Globally Unique
Identifier} (GUID)\footnote{A GUID is a unique identifier that in most cases
is the permalink of the article. A permalink is a link that will always point
to the article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}; this listing shows a partly truncated RSS feed of a
well known venue in the Netherlands. As the listing makes clear, an RSS feed
is plain XML. Every RSS feed contains a \texttt{channel} tag, and within that
tag there is some metadata and a list of \texttt{item} tags. Every
\texttt{item} tag has a fixed number of different fields. The most important
fields for RSS within the leisure industry are the \texttt{title} and the
\texttt{description} field.

\begin{listing}
	\caption{An example of a partly truncated RSS feed of a well known Dutch
	venue}
	\label{examplerss}
	\xmlcode{exrss.xml}
\end{listing}

RSS feeds are mostly used by news sites to publish their articles. The feed
then only contains the headlines, and a user reading the feed who is
interested can click the so called deeplink to be sent to the full website
containing the article. Users often use programs that bundle user specified
RSS feeds into big summary feeds that can be used to sift through a lot of
news feeds. The RSS feed reader uses the GUID to skip entries that are
already present in its internal database.

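Because an RSS feed is plain XML, the fields described above can be read with
any standard XML parser. A minimal sketch in Python; the feed below is made
up for illustration, but real feeds use the same \texttt{channel}/\texttt{item}
layout:

```python
import xml.etree.ElementTree as ET

# Sketch: reading the main and auxiliary RSS fields with a standard XML
# parser. This feed is fabricated; real feeds follow the same layout.
FEED = """<rss version="2.0"><channel>
<title>Example venue</title>
<item>
  <title>2015-05-20, 18:00-23:00 - Foobar presenting their new CD</title>
  <description>CD presentation with a show. Location: small salon.</description>
  <guid>http://venue.example/events/1234</guid>
</item>
</channel></rss>"""

channel = ET.fromstring(FEED).find("channel")
for item in channel.iter("item"):
    # The GUID lets a feed reader skip entries it has already stored.
    print(item.findtext("title"), item.findtext("guid"))
```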
Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the RSS feeds are generally
very consistent in their structure. In the entertainment industry venues
often use a CMS for their website so that users with no programming or
website background are able to post news items and event information.

\section{Why RSS?}
\label{sec:whyrss}
We described a lot of different source formats, such as HTML, fax/email, RSS
and XML. Because of the limited scope of the project and the time planned for
it, we had to drop some of the input formats, because they all require
different techniques. For example, when the input source is in HTML format,
most probably a website, a great deal of the information extraction can be
automated using the structural information that is characteristic of HTML.
For fax/email, however, there is almost no structural information, and most
of the automation techniques require natural language processing. We chose
RSS feeds because they lack inherent structural information about the event
fields but are still very structured. This structure exists because, as said
above, RSS feeds are generated and therefore almost always look the same.
Moreover, in RSS feeds most venues use particular structural identifiers that
are characters: they separate fields with vertical bars, commas, whitespace
and other non text characters. These field separators and keywords can be
hard for a computer to detect, but people are very good at detecting them.
With one look they can identify the characters and keywords and build a
pattern in their head. Another reason we chose RSS is its temporal
consistency: RSS feeds are almost always generated, and because of that the
structure of the entries is very unlikely to change. Basically, an RSS feed
only changes structure when the CMS that generates it changes its generation
algorithm. This property is useful because the crawlers then do not have to
be retrained very often. Because the structure is not inherent, however,
entirely new strategies had to be applied.

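The character-based field separators mentioned above can be illustrated with
a short Python sketch; the title layout used here is hypothetical, and in
practice the layout differs per venue and CMS:

```python
import re

# Hypothetical title using vertical bars as field separators, as many
# venue CMSes do; the exact layout differs per source.
title = "2015-05-20 | Foobar | CD release show | Small salon"

# Split on the separator character, absorbing surrounding whitespace.
fields = re.split(r"\s*\|\s*", title)
print(fields)  # ['2015-05-20', 'Foobar', 'CD release show', 'Small salon']
```

A person spots this pattern at a glance; the project's aim is to let that
person express it without writing code such as the above.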
\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
A directed graph is a mathematical structure that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges defined by $E\subseteq
V\times V$. An edge $e\in E$ can be written as $(v_1, v_2)$ where $v_1,v_2\in
V$ and can be interpreted in a figure as an arrow between node $v_1$ and node
$v_2$. Multiple connections between two nodes are possible. For example the
graph visualized in Figure~\ref{graphexample} can be mathematically described
as:
$$G=(\{n_1, n_2, n_3, n_4\}, \{(n_1, n_2), (n_2, n_1), (n_2, n_3),
(n_3, n_4), (n_1, n_4)\})$$

\begin{figure}[H]
	\label{graphexample}
	\centering
	\scalebox{0.7}{
		\digraph[]{graphexample}{
			rankdir=LR;
			n1->n2;
			n2->n1;
			n2->n3;
			n3->n4;
			n1->n4;
		}
	}
	\caption{Example DG}
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are
not allowed. Figure~\ref{dagexample} shows two graphs: the top graph contains
a cycle and the bottom graph does not, so only the bottom graph is a valid
DAG. A cycle can be defined as follows:

Say $e\in E$ can be written as $(v_1,v_2)\in E$ or $v_1\rightarrow v_2$, then
$v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots \rightarrow
v_2$, meaning that there is a path of length at least $1$ between $v_1$ and
$v_2$. In an acyclic graph the following holds: $\nexists v_1\in V:
v_1\xrightarrow{+}v_1$. In words this means that no node can be reached again
while traveling the graph. Adding the property of acyclicity to graphs lowers
the computational complexity of testing path existence in the graph to
$\mathcal{O}(L)$, where $L$ is the length of the path.

\begin{figure}[H]
	\label{dagexample}
	\centering
	\scalebox{0.7}{
		\digraph[]{dagexample}{
			rankdir=LR;
			n01 -> n02;
			n02 -> n03;
			n03 -> n01;
			n11 -> n12;
			n12 -> n13;
			n12 -> n14;
		}
	}
	\caption{Example DAG}
\end{figure}

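A minimal sketch of such a linear path test in Python, representing the
acyclic graph from Figure~\ref{dagexample} as adjacency sets (node names are
taken from the figure):

```python
# Sketch: testing whether a given node sequence is a path in a DAG
# represented as adjacency sets. The check is linear in the path length.
dag = {
    "n11": {"n12"},
    "n12": {"n13", "n14"},
    "n13": set(),
    "n14": set(),
}

def is_path(graph, path):
    """Return True when every consecutive pair in path is an edge."""
    return all(b in graph.get(a, set()) for a, b in zip(path, path[1:]))

print(is_path(dag, ["n11", "n12", "n14"]))  # True
print(is_path(dag, ["n11", "n13"]))         # False: no direct edge
```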
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is, as in directed graphs, the set of nodes. $E$ is the
set of edges described by $E\subseteq V\times V\times A$, where $A$ is the
alphabet used for edge labels. In words this means that an edge is a labeled
arrow between two nodes; for example the edge $(v_1, v_2, a)$ can be
visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$
contains all the characters used in natural language, but in theory the
alphabet can contain anything. $F$ describes the set of final nodes; final
nodes are nodes that can be the end of a sequence even if there is an arrow
leading out of them. In the example graph in Figure~\ref{exampledawg} the
final nodes are visualized with a double circle as node shape. In this
example that is purely cosmetic, because $n6$ is a final node anyway since no
arrows lead out of it. This does not need to be the case: for example, in
$G=(\{n_1, n_2, n_3\}, n_1, \{(n_1, n_2, a), (n_2, n_3, b)\}, \{n_2, n_3\})$
there is a distinct use for the final node marking. This graph accepts the
words \textit{a} and \textit{ab}, and to keep the graph simple both $n_2$ and
$n_3$ are final. Finally, $v_0$ describes the initial node, visualized in
figures as a node with an incoming arrow. Because of the labeled edges, data
can be stored in a DAWG: by traversing a DAWG and concatenating the edge
labels one constructs words. Using graph minimization, big sets of words can
be stored using a small amount of storage, because edges can be reused to
specify transitions. For example, the graph in Figure~\ref{exampledawg}
describes the language $L$ of all accepted words $w\in\{abd, bad, bae\}$.
Testing whether a word is present in the DAWG uses the same technique as
testing whether a node path is present in a normal DAG and therefore also
falls in the computational complexity class $\mathcal{O}(L)$, meaning that it
grows linearly with the length of the word.

\begin{figure}[H]
	\label{exampledawg}
	\centering
	\scalebox{0.7}{
		\digraph[]{graph21}{
			rankdir=LR;
			node [shape="circle"]; n1 n2 n3 n4 n5;
			n6 [shape="doublecircle"];
			n [style=invis];
			n -> n1;
			n1 -> n2 [label="a"];
			n2 -> n3 [label="b"];
			n3 -> n6 [label="d"];
			n1 -> n4 [label="b"];
			n4 -> n5 [label="a"];
			n5 -> n6 [label="d"];
			n5 -> n6 [label="e"];
		}
	}
	\caption{Example DAWG}
\end{figure}
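
The DAWG of Figure~\ref{exampledawg} and the linear-time word lookup
described above can be sketched as follows in Python; the transition table
mirrors the labeled edges of the figure:

```python
# Sketch of the DAWG in Figure exampledawg: transitions map a
# (node, label) pair to the next node; n6 is the only final node.
transitions = {
    ("n1", "a"): "n2", ("n2", "b"): "n3", ("n3", "d"): "n6",
    ("n1", "b"): "n4", ("n4", "a"): "n5",
    ("n5", "d"): "n6", ("n5", "e"): "n6",
}
final = {"n6"}

def accepts(word, start="n1"):
    """Follow one labeled edge per character; linear in the word length."""
    node = start
    for char in word:
        node = transitions.get((node, char))
        if node is None:
            return False
    return node in final

print([w for w in ("abd", "bad", "bae", "ab") if accepts(w)])
```

Because lookup follows exactly one edge per character, the running time is
$\mathcal{O}(L)$ in the word length, matching the complexity stated above.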