\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows play in their town theater?

When the internet was in its early days and only beginning to be accessible to
the general public, information about entertainment was still obtained almost
exclusively from flyers, books, posters and radio/TV advertisements. People
had to search hard for the information and could easily miss a show just
because it never crossed their path.
Today the internet is used by almost everyone in western society on a daily
basis and one would expect that missing an event is impossible because of the
loads of information received every day. The opposite is true.

Nowadays information about entertainment is offered via two main channels on
the internet, namely individual venues and information bundling websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and, most of all, modern website that bundles their information with nice
graphics, animations and gimmicks. There are also companies that bundle the
information from different websites. These information bundling websites often
have the individual venue websites as the source for their information and
therefore their information is usually incomplete.
Individual organizations tend to think, for example, that it is obvious what
the address of their venue is, that their ticket price is always fixed at
\EURdig$5.-$ and that you need a membership to attend the events. Individual
organizations usually put this non-specific information in a disclaimer or on
a separate page, and information bundling websites therefore frequently miss
out on these types of information.

Combining the information from the different data sources turns out to be a
hard task for such information bundling websites. It is hard because
information bundling websites do not have the resources and time reserved for
these tasks and therefore often serve incomplete information as well. Because
of the complexity of gathering complete information there are not many
websites trying to bundle entertainment information into a complete and
consistent database. Hyperleap\footnote{\url{http://hyperleap.nl}} tries to
achieve the goal of serving complete and consistent information.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Hyperleap, active since 1995, is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}. It
stands for a combination of factual information and subjectual information
(entertainment) within a certain category. In the case of Hyperleap the
category is the entertainment industry, which encompasses all facets of
entertainment, ranging from cinemas, theaters and concerts to swimming pools,
bridge matches and conferences. Within the entertainment industry factual
information includes, but is not limited to, information such as starting
time, location, and host or venue. Subjectual information includes, but is not
limited to, things such as reviews, previews, photos and background
information or trivia.

Hyperleap manages the largest database containing \textit{infotainment} about
the entertainment industry. The database contains over $10.000$ categorized
events per week on average and the venue database contains over $54.000$
venues delivering the entertainment, ranging from theaters and music venues to
petting zoos and fast food restaurants. All subjectual information is obtained
or created by Hyperleap and all factual information is gathered from different
sources and quality checked, and is therefore very reliable. Hyperleap is the
only company of its kind that has such high quality information. The
\textit{infotainment} is presented via several websites specialized per genre
or category and some sites attract over $500.000$ visitors per month.

\section{Extracting data from plain text}

\section{Information flow}
The reason why Hyperleap is the only one of its kind with such high quality
data is that Hyperleap spends a lot of time and resources on quality checking,
cross comparing and consistency checking before the data enters the database.
To achieve this the data is inserted into the database in several different
stages, visualized in Figure~\ref{informationflow} as an information flow
diagram with nodes as processing or publishing steps and arrows as information
flow.

\begin{figure}[H]
\label{informationflow}
\centering
\scalebox{0.7}{
\digraph[]{graph111}{
rankdir=TB;
node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1];
edge [weight=5.];
i0 [label="Website"];
i1 [label="Email"];
i2 [label="Fax"];
i3 [label="RSS/Atom"];
p1 [label="Preprocessing"];
p2 [label="Temporum: Postprocessing"];
o1 [label="Database: Insertion"];
o2 [label="TheAgenda"];
o3 [label="BiosAgenda"];
o4 [label="..."];
node [width=5]; p1 p2 o1;
i0 -> p1;
i1 -> p1;
i2 -> p1;
i3 -> p1;
p1 -> p2;
p2 -> o1;
o1 -> o2;
o1 -> o3;
o1 -> o4;
}
}
\caption{Information flow in the Hyperleap database}
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There are
several criteria the information and the source have to comply with before any
of it can enter the database. For example the source has to be reliable,
consistent and free of licensing restrictions, whereas the event entries have
to have at least the following fields:
\begin{itemize}
\item[What]
The \textit{What} field describes the content; content is a very broad
notion. In practice it can describe the concert tour name, theater show
title, movie title, festival title and many more.
\item[Where]
The \textit{Where} field is the location of the event. The location is often
omitted because the organization behind the source presenting the information
thinks it is obvious. This field can also include different sublocations.
For example a pop concert venue may have its own building, but in the summer
organize a festival in some park. This data is often assumed to be trivial
and inherent, but in practice this is not the case. In this example the name
of the park alone is often not enough for an outsider.
\item[When]
The \textit{When} field is the time and date of the event. Hyperleap wants
to have at minimum the date, start time and end time. In practice end times
are often omitted because they are not fixed or the organization thinks they
are obvious.
\end{itemize}

Venues often present incomplete data entries that do not comply with the
requirements explained above.

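To make these minimal requirements concrete, the following sketch models such
an event entry in Python. It is an illustration only, not Hyperleap's actual
data model; the field names are chosen for this example.

\begin{verbatim}
from dataclasses import dataclass
from typing import Optional

@dataclass
class EventEntry:
    """Minimal event entry: the What, Where and When fields."""
    what: str                        # e.g. concert tour or movie title
    where: str                       # venue, possibly with a sublocation
    date: str                        # e.g. "2015-05-20"
    start_time: str                  # e.g. "18:00"
    end_time: Optional[str] = None   # often omitted by the source

def is_complete(entry: EventEntry) -> bool:
    # An entry complies when all minimal fields are present.
    return all([entry.what, entry.where, entry.date,
                entry.start_time, entry.end_time])
\end{verbatim}
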
The first step in the information flow is the source. Hyperleap processes
different sources and every source has different characteristics. Sources can
be modern sources like websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. As sources vary in
content structure they also vary in reliability. For example the following
entry is an example of a very well structured, probably generated, and thus
probably also reliable event description. The entry could originate for
example from an RSS feed.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD in
combination with a show. Location: small salon.}
\end{flushleft}

An example of a very poorly structured entry is the following text, which
could originate from a flyer or a social media post.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the park
tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard for
people to grasp at first, let alone for a machine. When someone wants to get
the full information he has to tap into different sources, which might not
always be available.

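The difference in machine readability can be illustrated with a small sketch.
Assuming the well structured entry layout shown above, a single hypothetical
regular expression suffices to extract all required fields, whereas no such
fixed pattern exists for the flyer text.

\begin{verbatim}
import re

entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their new "
         "CD in combination with a show. Location: small salon.")

# Hypothetical pattern for this particular entry layout:
# date, start/end time, description and location.
pattern = (r"(?P<date>\d{4}-\d{2}-\d{2}), "
           r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "
           r"(?P<what>.+) Location: (?P<where>.+)\.")

match = re.match(pattern, entry)
if match:
    print(match.groupdict())
# {'date': '2015-05-20', 'start': '18:00', 'end': '23:00', ...}
\end{verbatim}
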
\paragraph{Crawler}
When the source has been determined and classified the next step is
periodically crawling the source. At the moment the crawling happens using a
couple of different methods. Some sites are very structured and a programmer
creates a program that can visit the website systematically and extract the
information. Other sources are very hard to crawl programmatically and are
visited by employees instead. The programmed crawlers are always specifically
created for one or a couple of sources and when a source changes, for example
its structure, the programmer has to adapt the crawler, which is costly.
Information from the different crawlers then goes to the \textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from the
different sources, which has to be post processed to be suitable for the
actual database. This post processing encompasses several possible tasks. The
first task is to check the validity of the entry. This is not done for every
entry, but random samples are taken and these entries are tested to verify
that the crawler is not malfunctioning and that there is no nonsense in the
data.
The second step is matching the entry to several objects. Objects in the
broadest sense can basically be anything; in the case of a theater show
performance the matched object can be the theater show itself, describing the
show for all locations. In the case of a concert the object can be the tour
the band is doing. When the source is a ticket vendor the object is the venue
where the event takes place, since the source itself is not a venue.
Many of these tasks are done by a human or with the aid of a computer program.
The \textit{Temporum} acts as a safety net because no data goes straight
through to the database: it is first checked, matched and validated.

\paragraph{Database \& Publication}
When the data has been post processed it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen in the future. The database also contains
information about the venues. The database is linked to several categorical
websites that offer the information to users and accompany it with the other
part of the \textit{infotainment}, namely the subjectual information, which is
usually presented in the form of trivia, photos, interviews, reviews, previews
and much more.

\section{Goal \& Research question}
Crawling the sources and applying the preprocessing is the most expensive task
in the entire information flow because it requires a programmer to be hired.
Currently only programmers can create crawlers and design preprocessing jobs
because all current programs are specifically designed to do one thing, most
of the time for one source. Because a big group of sources often changes, this
task becomes very expensive and has a long feedback loop. When a source
changes, it is first preprocessed in the old way, sent to the
\textit{Temporum} and checked and matched by a human. The human then notices
the error in the data and contacts the programmer. The programmer then has to
adapt the specific crawler to the new structure. This feedback loop, shown in
Figure~\ref{feedbackloop}, can take days and can be the reason for gaps and
faulty information in the database.
\begin{figure}[H]
\label{feedbackloop}
\centering
\scalebox{0.5}{
\digraph[]{graph112}{
rankdir=LR;
node [shape="rectangle"];
source [label="Source"];
crawler [label="Crawler"];
temporum [label="Temporum"];
user [label="User"];
programmer [label="Programmer"];
database [label="Database"];
source -> crawler;
crawler -> temporum;
temporum -> user;
user -> database;
user -> programmer [constraint=false,style="dotted"];
user -> crawler [constraint=false,style="dashed"];
programmer -> crawler [constraint=false,style="dotted"];
}
}
\caption{Feedback loop for malfunctioning crawlers}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers all
the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{feedbackloop} this means replacing the dotted arrows with the
dashed arrow.

For this project an application has been developed that provides an interface
to a crawler generation system that is able to crawl RSS\cite{Rss} and
Atom\cite{Atom} publishing feeds. The interface offers point-and-click
interaction, meaning that no computer science background is needed to create,
modify, test and remove crawlers. The current Hyperleap backend system that
handles the data can, via an interface, generate XML feeds that contain the
crawled data. The structure of the application is very modular and generic and
therefore it is easy to change things in the program without having to know a
lot about the programming language used. This is because, for example, all
visual elements like buttons are defined in a human readable text file. In
practice this means that one person, not necessarily a programmer, can be
instructed to change the structure, which can greatly reduce programmer
intervention time.

The actual problem statement then becomes:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and adding
crawlers by making a system that can create, add and maintain crawlers for
RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field. In these fields the raw
data that describes the event is stored. Furthermore there are several
auxiliary fields that store the link to the full article, the publishing date,
possibly some media in the form of an image or video URL, and a
\textit{Globally Unique Identifier} (GUID)\footnote{A GUID is a unique
identifier that in most cases is the permalink of the article. A permalink is
a link that will always point to the article}. An example of an RSS feed can
be found in Listing~\ref{examplerss}; this listing shows a partly truncated
RSS feed of a well known venue in the Netherlands. As visible in the listing,
the XML structure is very clear. Every RSS feed contains a \texttt{channel}
tag and within that tag there is some metadata and a list of \texttt{item}
tags. Every \texttt{item} tag has a fixed number of different fields. The most
important fields for RSS within the leisure industry are the \texttt{title}
and the \texttt{description} field.

\begin{listing}
\caption{An example of a partly truncated RSS feed of a well known Dutch
venue}
\label{examplerss}
\xmlcode{exrss.xml}
\end{listing}

RSS feeds are mostly used by news sites to publish their articles; the feed
then only contains the headlines, and if the user reading the feed is
interested he can click the so-called deeplink and is then sent to the full
website containing the article. Users often use programs that bundle user
specified RSS feeds into big summary feeds that can be used to sift through a
lot of news feeds. The RSS feed reader uses the GUID to skip entries that are
already present in its internal database.

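As a sketch of this mechanism, the following Python fragment reads a feed
using only the standard library and uses the GUID, falling back on the link,
to skip entries it has already seen. The in-memory set stands in for the
reader's internal database; this is an illustration, not the implementation of
any particular feed reader.

\begin{verbatim}
import urllib.request
import xml.etree.ElementTree as ET

seen_guids = set()  # stands in for the reader's internal database

def fetch_new_items(feed_url):
    """Yield (title, description) of items not seen before."""
    with urllib.request.urlopen(feed_url) as response:
        tree = ET.parse(response)
    for item in tree.getroot().iter("item"):
        guid = item.findtext("guid") or item.findtext("link")
        if guid in seen_guids:
            continue  # entry already processed, skip it
        seen_guids.add(guid)
        yield item.findtext("title"), item.findtext("description")
\end{verbatim}
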
Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic the RSS feeds are generally
very consistent in their structure. In the entertainment industry venues often
use a CMS for their website to allow users with no programming or web
development background to post news items and event information.

\section{Why RSS?}
\label{sec:whyrss}
We described a lot of different source formats like HTML, fax/email, RSS and
XML. Because of the limited scope of the project and the time planned for it
we had to remove some of the input formats, because they all require different
techniques. For example, when the input source is in HTML format, most
probably a website, a great deal of the information extraction can be
automated using the structural information which is characteristic for HTML.
For fax/email however there is almost no structural information and most of
the automation techniques require natural language processing. We chose RSS
feeds because they lack the inherent structural information of HTML but are
still very structured. This structure exists because, as said above, RSS feeds
are generated and therefore almost always look the same. Also, in RSS feeds
most venues use particular structural identifiers in the form of characters:
they separate fields with vertical bars, commas, whitespace and other non-text
characters. These field separators and keywords can be hard for a computer to
detect, but people are very good at detecting them. With one look they can
identify the characters and keywords and build a pattern in their head.
Another reason we chose RSS is its temporal consistency: RSS feeds are almost
always generated and because of that the structure of the entries is very
unlikely to change. Basically an RSS feed only changes structure when the CMS
that generates it changes the generation algorithm. This property is useful
because the crawlers then do not have to be retrained very often. Because the
structure is not inherent, entirely new strategies had to be applied.

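As an illustration of such separators, consider the hypothetical title field
below: a human instantly spots the vertical bar as the field separator, and
once it is known, a crawler can split on it just as easily.

\begin{verbatim}
# Hypothetical title field as a venue might publish it.
title = "Foobar | CD release show | 20-05-2015 | 18:00"

# Once the separator is identified, labeling the parts is trivial.
fields = [part.strip() for part in title.split("|")]
print(fields)  # ['Foobar', 'CD release show', '20-05-2015', '18:00']
\end{verbatim}
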
\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs are a mathematical structure that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges defined by $E\subseteq
V\times V$. An edge $e\in E$ can be written as $(v_1, v_2)$ where $v_1,v_2\in
V$ and can be interpreted in a figure as an arrow from node $v_1$ to node
$v_2$. Multiple connections between two nodes are possible. For example the
graph visualized in Figure~\ref{graphexample} can be mathematically described
as:
$$G=(\{n_1, n_2, n_3, n_4\}, \{(n_1, n_2), (n_2, n_1), (n_2, n_3), (n_3, n_4),
(n_1, n_4)\})$$

\begin{figure}[H]
\label{graphexample}
\scalebox{0.7}{
\digraph[]{graphexample}{
rankdir=LR;
n1->n2;
n2->n1;
n2->n3;
n3->n4;
n1->n4;
}
}
\caption{Example DG}
\end{figure}

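The tuple definition translates directly into data structures. A minimal
Python sketch of the graph from Figure~\ref{graphexample}, representing $E$ as
a set of node pairs:

\begin{verbatim}
# The example graph as a node set and an edge set.
V = {"n1", "n2", "n3", "n4"}
E = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"),
     ("n3", "n4"), ("n1", "n4")}

def has_edge(v1, v2):
    """An arrow from v1 to v2 exists iff (v1, v2) is in E."""
    return (v1, v2) in E

print(has_edge("n1", "n2"))  # True
print(has_edge("n4", "n1"))  # False
\end{verbatim}
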
\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$ but with a restriction on $E$, namely that cycles
are not allowed. Figure~\ref{dagexample} shows two graphs; the first contains
a cycle and the second does not, so only the second is a valid DAG. A cycle
can be defined as follows:

Say $e\in E$ can be written as $(v_1,v_2)\in E$ or $v_1\rightarrow v_2$, then
$v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots \rightarrow v_2$,
meaning that there is a path of one or more edges between $v_1$ and $v_2$. In
a non-cyclic graph the following holds: $\nexists v\in V:
v\xrightarrow{+}v$. In words this means that no node can be reached again
while traveling the graph. Adding the property of acyclicity to graphs lowers
the computational complexity of testing path existence in the graph to
$\mathcal{O}(L)$ where $L$ is the length of the path.

\begin{figure}[H]
\label{dagexample}
\centering
\scalebox{0.7}{
\digraph[]{dagexample}{
rankdir=LR;
n01 -> n02;
n02 -> n03;
n03 -> n01;
n11 -> n12;
n12 -> n13;
n12 -> n14;
}
}
\caption{Example DAG}
\end{figure}

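The $\mathcal{O}(L)$ claim can be made concrete: checking a given node path
requires one edge lookup per step, so the cost grows linearly with the path
length. A minimal sketch, using the acyclic graph from
Figure~\ref{dagexample}:

\begin{verbatim}
def path_exists(E, path):
    """One edge lookup per step: O(L) in the path length L."""
    return all((a, b) in E for a, b in zip(path, path[1:]))

# The acyclic graph from the example figure.
E = {("n11", "n12"), ("n12", "n13"), ("n12", "n14")}
print(path_exists(E, ["n11", "n12", "n14"]))  # True
print(path_exists(E, ["n13", "n12"]))         # False
\end{verbatim}
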
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is the same as in directed graphs, namely the set of
nodes. $E$ is the set of edges described by $E\subseteq V\times V\times A$
where $A$ is the alphabet used as edge labels. In words this means that an
edge is a labeled arrow between two nodes and for example the edge
$(v_1, v_2, a)$ can be visualized as: $v_1\xrightarrow{a}v_2$. In a standard
DAWG the alphabet $A$ contains all the characters used in natural language,
but in theory the alphabet $A$ can contain anything. $F$ describes the set of
final nodes; final nodes are nodes that can be the end of a sequence even if
there is an arrow leading out. In the example graph in
Figure~\ref{exampledawg} the final nodes are visualized with a double circle
as node shape. In this example the marking is purely cosmetic because $n_6$ is
a final node anyway, as there are no arrows leading out. But this does not
need to be the case; for example in $G=(\{n_1, n_2, n_3\}, n_1, \{(n_1, n_2,
a), (n_2, n_3, b)\}, \{n_2, n_3\})$ there is a distinct use for the final node
marking. Graph $G$ accepts the words \textit{a} and \textit{ab}, and to
simplify the graph the nodes $n_2$ and $n_3$ are final. Finally $v_0$
describes the initial node; this is visualized in figures as an incoming
arrow. Because of the property of labeled edges, data can be stored in a DAWG.
When traversing a DAWG and saving all the edge labels one can construct words.
Using graph minimization big sets of words can be stored using a small amount
of storage because edges can be re-used to specify transitions. For example
the graph in Figure~\ref{exampledawg} describes the language $L$ where all
accepted words $w$ satisfy $w\in\{abd, bad, bae\}$. Testing if a word is
present in the DAWG uses the same technique as testing if a node path is
present in a normal DAG and therefore also falls in the computational
complexity class $\mathcal{O}(L)$, meaning that the cost grows linearly with
the length of the word.

\begin{figure}[H]
\label{exampledawg}
\centering
\scalebox{0.7}{
\digraph[]{graph21}{
rankdir=LR;
node [shape="circle"]; n1 n2 n3 n4 n5;
n6 [shape="doublecircle"];
n [style=invis];
n -> n1;
n1 -> n2 [label="a"];
n2 -> n3 [label="b"];
n3 -> n6 [label="d"];
n1 -> n4 [label="b"];
n4 -> n5 [label="a"];
n5 -> n6 [label="d"];
n5 -> n6 [label="e"];
}
}
\caption{Example DAWG}
\end{figure}
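
To show how a DAWG stores and tests words, the sketch below encodes the
example DAWG of Figure~\ref{exampledawg} as a transition table. It assumes, as
holds in this example, at most one outgoing edge per node and label.

\begin{verbatim}
# The example DAWG as a transition table plus final-node set.
edges = {("n1", "a"): "n2", ("n2", "b"): "n3", ("n3", "d"): "n6",
         ("n1", "b"): "n4", ("n4", "a"): "n5",
         ("n5", "d"): "n6", ("n5", "e"): "n6"}
initial, final = "n1", {"n6"}

def accepts(word):
    """Follow one labeled edge per character: O(L) in the word
    length; reject on a missing edge or a non-final end node."""
    node = initial
    for char in word:
        if (node, char) not in edges:
            return False
        node = edges[(node, char)]
    return node in final

print([w for w in ["abd", "bad", "bae", "ab"] if accepts(w)])
# prints ['abd', 'bad', 'bae']
\end{verbatim}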