\section{Introduction}
What do people do when they want to grab a movie? Attend a concert? Find out
which shows play in their town theater?

In the early days of the internet, when it first became accessible to most
people, information about entertainment was still obtained almost exclusively
from flyers, books, posters and radio/tv advertisements. People had to look
hard for the information and could easily miss a show simply because it never
crossed their path.
Today the internet is used on a daily basis by almost everyone in Western
society, and one would expect that missing an event is impossible given the
load of information received every day. The opposite is true.

Nowadays information about entertainment is offered via two main channels on
the internet: individual venues and information bundling websites.

Individual venues put a lot of effort and resources into building beautiful,
fast and, above all, modern websites that present their information with nice
graphics, animations and gimmicks. There are also companies that bundle the
information from different websites. These information bundling websites often
use the individual venue website as the source of their information, and their
information is therefore often incomplete.
Individual organizations tend to assume, for example, that it is obvious what
the address of their venue is, that their ticket price is always fixed at
\EURdig$5.-$ or that a membership is required to attend the events. They
usually put this non-specific information in a disclaimer or on a separate
page, and information bundling websites frequently miss out on these types of
information.

Combining the information from the different data sources turns out to be a
hard task for such information bundling websites, because they do not have the
resources and time reserved for these tasks and therefore often serve
incomplete information as well. Because of the complexity of offering complete
information, not many websites try to bundle entertainment information into a
complete and consistent database.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information.

\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that already existed when the internet was
not yet widespread. Hyperleap, active since 1995, is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
portmanteau of \textit{information} and \textit{entertainment}. It denotes a
combination of factual information and subjectual information (entertainment)
within a certain category. In the case of Hyperleap the category is the
entertainment industry, which encompasses all facets of entertainment, from
cinemas, theaters and concerts to swimming pools, bridge matches and
conferences. Within the entertainment industry, factual information includes,
but is not limited to, starting time, location and host or venue. Subjectual
information includes, but is not limited to, reviews, previews, photos and
background information or trivia.

Hyperleap manages the largest database containing \textit{infotainment} about
the entertainment industry. The database contains over $10.000$ categorized
events per week on average, and the venue database contains over $54.000$
venues delivering the entertainment, ranging from theaters and music venues to
petting zoos and fast food restaurants. All subjectual information is obtained
or created by Hyperleap, and all factual information is gathered from
different sources and quality checked, and is therefore very reliable.
Hyperleap is the only company of its kind with information of such high
quality. The \textit{infotainment} is presented via several websites
specialized per genre or category, and some sites attract over $500.000$
visitors per month.
64
65 \section{Information flow}
66 The reason why Hyperleap is the only in its kind with the high quality data is
67 because Hyperleap spends a lot of time and resources on quality checking, cross
68 comparing and consistency checking before the data enters the database. To
69 achieve this the data is inserted in the database in several different stages
70 that are visualized in Figure~\ref{informationflow} as an information flow
71 diagram using nodes as processing or publishing steps and arrows are
72 information flow.

\begin{figure}[H]
\caption{Information flow Hyperleap database}
\label{informationflow}
\centering
\scalebox{0.7}{
\digraph[]{graph111}{
rankdir=TB;
node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1];
edge [weight=5.];
i0 [label="Website"];
i1 [label="Email"];
i2 [label="Fax"];
i3 [label="RSS/Atom"];
p1 [label="Preprocessing"];
p2 [label="Temporum: Postprocessing"];
o1 [label="Database: Insertion"];
o2 [label="TheAgenda"];
o3 [label="BiosAgenda"];
o4 [label="..."];
node [width=5]; p1 p2 o1;
i0 -> p1;
i1 -> p1;
i2 -> p1;
i3 -> p1;
p1 -> p2;
p2 -> o1;
o1 -> o2;
o1 -> o3;
o1 -> o4;
}
}
\end{figure}

\paragraph{Sources}
The information that enters the database has to be quality checked. There are
several criteria the information and the source have to comply with before
either can enter the database. For example, the source has to be reliable,
consistent and free by licence, whereas the event entries have to contain at
least the following fields:
\begin{itemize}
\item[What]
The \textit{What} field describes the content of the event; content is a very
broad notion. In practice it can be the concert tour name, theater show
title, movie title, festival title and many more.
\item[Where]
The \textit{Where} field is the location of the event. The location is often
omitted because the organization behind the source presenting the information
thinks it is obvious. This field can also include different sublocations, for
example when a pop concert venue has its own building but organizes a
festival in some park during the summer. This data is often assumed to be
trivial and inherent, but in practice this is not the case: in this example,
the name of the park alone is often not enough for an outsider.
\item[When]
The \textit{When} field is the time and date of the event. Hyperleap wants to
have at minimum the date, start time and end time. In practice, end times are
often omitted because they are not fixed or the organization thinks they are
obvious.
\end{itemize}

Venues often present incomplete data entries that do not comply with the
requirements explained above.
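
The completeness requirement above can be sketched as a small check. The
following Python fragment is illustrative only; the field names and the
dictionary representation are assumptions for this sketch, not Hyperleap's
actual schema.

```python
# Hypothetical completeness check for incoming event entries.
# The field names ("what", "where", "when") mirror the criteria above;
# they are illustrative, not Hyperleap's actual schema.

REQUIRED_FIELDS = ("what", "where", "when")

def missing_fields(entry):
    """Return the required fields that are absent or empty in an entry."""
    return [f for f in REQUIRED_FIELDS if not entry.get(f)]

entry = {"what": "Foobar CD release show",
         "when": "2015-05-20 18:00-23:00"}
print(missing_fields(entry))  # the "where" field was omitted
```

An entry with a non-empty value for each of the three fields would yield an
empty list and could pass on to the next stage.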

The first step in the information flow is the source. Hyperleap processes
different sources, and every source has different characteristics. Sources
can be modern, like websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as sources
vary in content structure, they also vary in reliability. For example, the
following entry is very well structured and probably generated, and thus
probably also reliable. Such an entry could originate, for example, from an
RSS feed.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their new CD in
combination with a show. Location: small salon.}
\end{flushleft}

An example of a terrible entry could be the following text, which could
originate from a flyer or a social media post.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the park
tomorrow evening.}
\end{flushleft}

This example lacks a precise date, time and location and is therefore hard
for people to grasp at first, let alone for a machine. Someone who wants the
full information has to tap into different resources, which might not always
be available.

\paragraph{Crawler}
When the source has been determined and classified, the next step is to
periodically crawl the source. At the moment the crawling happens using a
couple of different methods. Some sites are very structured and a programmer
creates a program that can visit the website systematically and extract the
information. Other sources are very hard to crawl programmatically and are
visited by employees instead. The programmed crawlers are always specifically
created for one or a couple of sources, and when a source changes, for
example its structure, the programmer has to adapt the crawler, which is
costly. Information from the different crawlers then goes to the
\textit{Temporum}.

\paragraph{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from the
different sources, which has to be post-processed to be suitable for the
actual database. This post-processing encompasses several possible tasks. The
first task is to check the validity of the entries. This is not done for
every entry; instead, random samples are taken and tested, to verify that the
crawler is not malfunctioning and that there is no nonsense in the data.
The second step is matching the entry to several objects. Objects in the
broadest sense can basically be anything. In the case of a theater show
performance, the matched object can be the theater show itself, describing
the show for all locations. In the case of a concert, the object can be the
tour the band is doing. When the source is a ticket vendor, the object is the
venue where the event takes place, since the source itself is not a venue.
Many of these tasks are done by a human or with the aid of a computer
program. The \textit{Temporum} acts as a safety net: no data goes straight
through to the database; it is first checked, matched and validated.

\paragraph{Database \& Publication}
When the data has been post-processed it is entered into the final database.
The database contains all the events that happened in the past and all the
events that are going to happen in the future. The database also contains
information about the venues. The database is linked to several categorical
websites that offer the information to users and accompany it with the other
part of the \textit{infotainment}, namely the subjectual information, which
is usually presented in the form of trivia, photos, interviews, reviews,
previews and much more.

\section{Goal \& Research question}
Crawling the sources and applying the preprocessing is the most expensive
task in the entire information flow because it requires a programmer to be
hired. Currently only programmers can create crawlers and design
preprocessing jobs, because all current programs are specifically designed to
do one thing, most of the time for one source. Because sources change often,
this task becomes very expensive and has a long feedback loop. When a source
changes, it is first preprocessed in the old way, sent to the
\textit{Temporum} and checked and matched by a human. The human then notices
the error in the data and contacts the programmer. The programmer then has to
adapt the specific crawler to the new structure. This feedback loop, shown in
Figure~\ref{feedbackloop}, can take days and can be the reason for gaps and
faulty information in the database.
\begin{figure}[H]
\caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\centering
\scalebox{0.5}{
\digraph[]{graph112}{
rankdir=LR;
node [shape="rectangle"];
source [label="Source"];
crawler [label="Crawler"];
temporum [label="Temporum"];
user [label="User"];
programmer [label="Programmer"];
database [label="Database"];
source -> crawler;
crawler -> temporum;
temporum -> user;
user -> database;
user -> programmer [constraint=false,style="dotted"];
user -> crawler [constraint=false,style="dashed"];
programmer -> crawler [constraint=false,style="dotted"];
}
}
\end{figure}

The goal of the project is to relieve the programmer of repairing crawlers
all the time and to make the task of adapting, editing and removing crawlers
doable for someone without programming experience. In terms of
Figure~\ref{feedbackloop}, this means replacing the dotted arrows by the
dashed arrow.

For this project an application has been developed that provides an interface
to a crawler generation system able to crawl RSS\cite{Rss} and
Atom\cite{Atom} publishing feeds. The interface provides the user with point
and click interfaces, meaning that no computer science background is needed
to create, modify, test and remove crawlers. The current Hyperleap backend
system that handles the data can, via an interface, generate XML feeds that
contain the crawled data. The structure of the application is very modular
and generic, and it is therefore easy to change things in the program without
having to know a lot about the programming language used. This is because,
for example, all visual elements such as buttons are defined in a human
readable text file. In practice this means that one person, not necessarily a
programmer, can be instructed to change the structure, which can also greatly
reduce programmer intervention time.

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML
format\cite{Xml} that are used to publish events. Every event or entry
consists of several standardized fields. The main data fields are the
\textit{title} and the \textit{description} field. These fields store the raw
data that describes the event. Furthermore there are several auxiliary fields
that store the link to the full article, the publishing date, some media in
the form of an image or video URL, and a \textit{Globally Unique
Identifier} (GUID)\footnote{A GUID is a unique identifier that in most cases
is the permalink of the article. A permalink is a link that will always point
to the article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}; this listing shows a partly truncated RSS feed of a
well known venue in the Netherlands.

\begin{listing}
\caption{An example of a partly truncated RSS feed of a well known Dutch
venue}
\label{examplerss}
\xmlcode{exrss.xml}
\end{listing}

RSS feeds are mostly used by news sites to publish their articles. The feed
then only contains the headlines, and a user reading the feed who is
interested can click the so-called deeplink to be sent to the full website
containing the article. Users often use programs that bundle user specified
RSS feeds into big summary feeds that can be used to sift through a lot of
news feeds. The RSS feed reader uses the GUID to skip entries that are
already present in its internal database.
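
The GUID-based deduplication described above can be sketched in a few lines
using Python's standard library XML parser. The feed below is a made-up,
minimal RSS 2.0 fragment for illustration; real feeds carry more fields.

```python
import xml.etree.ElementTree as ET

# A made-up, minimal RSS 2.0 fragment; real feeds carry more fields.
FEED = """<rss version="2.0"><channel>
  <item><title>Foobar - CD presentation</title>
    <guid>http://example.org/events/1</guid></item>
  <item><title>Foobar - park festival</title>
    <guid>http://example.org/events/2</guid></item>
</channel></rss>"""

def new_items(feed_xml, seen_guids):
    """Yield (guid, title) for items whose GUID is not yet known."""
    root = ET.fromstring(feed_xml)
    for item in root.iter("item"):
        guid = item.findtext("guid")
        if guid not in seen_guids:
            seen_guids.add(guid)
            yield guid, item.findtext("title")

seen = {"http://example.org/events/1"}  # already in the internal database
for guid, title in new_items(FEED, seen):
    print(title)  # only the second item is new
```

A feed reader would persist the set of seen GUIDs between polls, so that each
entry is processed only once.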

Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
generated by the Content Management System (CMS) the website runs on. With
this automatic method the RSS feeds are generated from the content published
on the website. Because the process is automatic, the RSS feeds are generally
very structured and consistent in their structure. In the entertainment
industry venues often use a CMS for their website to allow users with no
programming or website background to post news items and event information.

\section{Why RSS?}
\label{sec:whyrss}
The first proposal described formats like HTML, fax/email, RSS and XML.
Because of the scope of the project and the time planned for it, we had to
drop some of these input formats, as they all require different techniques.
For example, when the input source is in HTML format, most probably a
website, a great deal of the information extraction can be automated using
the structural information that is characteristic of HTML. For fax/email,
however, there is almost no structural information, and most automation
techniques require natural language processing. We chose RSS feeds because
RSS feeds lack inherent structural information but are still very structured.
This structure exists because, as said above, the RSS feeds are generated and
therefore almost always look the same. Also, in RSS feeds most venues use
particular structural identifiers that are characters: they separate fields
with vertical bars, commas, whitespace and other non-text characters. These
field separators and keywords can be hard for a computer to detect, but
people are very good at detecting them. With one look they can identify the
characters and keywords and build a pattern in their head.
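
As a minimal sketch of such separator-based splitting, the fragment below
cuts an RSS item title on a few common separator characters. The separator
set and the example title are assumptions for illustration; a real crawler
would have to learn the separators per feed.

```python
import re

# Split an RSS item title on common field separators: vertical bars,
# commas, and dashes surrounded by whitespace. Requiring whitespace
# around the dash keeps dates like "2015-05-20" intact.
SEPARATORS = re.compile(r"\s*(?:\||,|\s-\s)\s*")

title = "2015-05-20, 18:00-23:00 - Foobar | small salon"
print(SEPARATORS.split(title))
# -> ['2015-05-20', '18:00-23:00', 'Foobar', 'small salon']
```

Recognizing which separator set a feed uses is exactly the pattern a human
spots at a glance; the point of the project is to let the user mark it
instead of hard-coding it per source.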
Another reason we chose RSS is its temporal consistency: RSS feeds are almost
always generated, and because of that the structure of the entries is very
unlikely to change. Basically, an RSS feed only changes structure when the
CMS that generates it changes its generation algorithm. This property is
useful because the crawlers then do not have to be retrained very often.
Because of the absence of inherent structure, entirely new strategies had to
be applied.

\section{Directed Acyclic Graphs}
\paragraph{Normal graphs}
A graph ($G$) is a mathematical structure that describes relations between
nodes using edges. A standard graph is defined as the ordered pair $G=(V,E)$.
In this ordered pair $V$ is the set of nodes and $E$ is the set of undirected
edges, where every undirected edge is a tuple of two nodes. For example, the
graph in Figure~\ref{graphexample} is specified as:
$$G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3), (n2, n3)\})$$

\begin{figure}[H]
\caption{Example Graph}
\label{graphexample}
\centering
\scalebox{0.7}{
\digraph[]{graphexample}{
rankdir=LR;
n1 -> n2 [dir="none"];
n2 -> n3 [dir="none"];
n2 -> n3 [dir="none"];
}
}
\end{figure}

\paragraph{Directed graphs}
Directed graphs are a special kind of graph. A directed graph $G$ is also
defined by the ordered pair $G=(V,E)$. $V$ still defines the nodes and $E$
still the edges, but the inherent difference is that the edges are ordered
tuples instead of unordered ones. This property gives the edges a direction:
every edge has a specific start and end node, and the edges are therefore
called directed edges. A directed graph looks the same as the graph in
Figure~\ref{graphexample}, but visualized with arrows instead of plain lines.
The arrows go specifically from one node to the other and not the other way
around. However, bidirectional connections can occur. For example, the graph
shown in Figure~\ref{dgexample} is directed and has a bidirectional
connection:
$$G=(\{n1, n2\}, \{(n1, n2), (n2, n1)\})$$

\begin{figure}[H]
\caption{Example directed graph}
\label{dgexample}
\centering
\scalebox{0.7}{
\digraph[]{dgexample}{
rankdir=LR;
n1 -> n2;
n2 -> n1;
}
}
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with a restriction on $E$, namely that cycles
are not allowed. A cycle is a sequence of edges in which a node is visited
more than once. Figure~\ref{dagexample} shows two graphs: the first contains
a cycle and the second does not, so only the second graph is a valid DAG.
Adding the property of acyclicity to graphs lowers the computational
complexity of checking whether a node sequence is present in the graph to
$\mathcal{O}(L)$, where $L$ is the length of the sequence.
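
The $\mathcal{O}(L)$ check can be illustrated as follows: given adjacency
sets, a sequence is a path exactly when every consecutive pair of nodes is
connected by an edge, so one edge lookup per step suffices. The sketch below
uses the acyclic component of Figure~\ref{dagexample}; the representation is
an illustration, not the project's actual data structure.

```python
# Checking whether a node sequence is a path in a graph, in O(L) time
# for a sequence of length L, given adjacency sets. The graph is the
# acyclic component of the example DAG figure.
edges = {"n11": {"n12"}, "n12": {"n13", "n14"},
         "n13": set(), "n14": set()}

def is_path(seq):
    """True iff every consecutive pair in seq is an edge of the graph."""
    return all(b in edges.get(a, set())
               for a, b in zip(seq, seq[1:]))

print(is_path(["n11", "n12", "n14"]))  # True
print(is_path(["n11", "n13"]))         # False: no direct edge
```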

\begin{figure}[H]
\caption{Example DAG}
\label{dagexample}
\centering
\scalebox{0.7}{
\digraph[]{dagexample}{
rankdir=LR;
n01 -> n02;
n02 -> n03;
n03 -> n01;
n11 -> n12;
n12 -> n13;
n12 -> n14;
}
}
\end{figure}

\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the ordered
triple $G=(V,E,F)$.
$V$ is the same as in directed graphs. $E$ is almost the same, except that
the ordered pair of nodes that describes an edge is now a triple consisting
of a start node, an end node and a label. In a standard DAWG the label is
always one character. $F$ describes the set of final nodes; final nodes are
nodes that can be the end of a sequence even if another arrow leads out of
them. In the example graph in Figure~\ref{exampledawg} the final nodes are
visualized with a double circle as node shape. In this example the marking is
purely cosmetic, because $n6$ is a final node anyway as no arrows lead out of
it. But, for example, in $G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3)\},
\{n2, n3\})$ there is a distinct use for the final node marking: graph $G$
accepts the words \textit{a, ab}, and to simplify the graph, nodes $n2$ and
$n3$ are final. Because of the labeled edges, data can be stored in a DAWG:
traversing a DAWG while saving all the edge labels constructs words. Because
of the properties of graphs, a DAWG can store big sets of words in a small
amount of storage, since it can re-use edges to specify transitions. For
example, the graph in Figure~\ref{exampledawg} can describe a dictionary that
holds the words \textit{abd, bad, bae}. Testing whether a word is present in
the DAWG falls, just as for a DAG, in the computational complexity class
$\mathcal{O}(L)$, meaning that it grows linearly with the length of the word.
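
The example DAWG can be encoded as a transition table, which makes the
$\mathcal{O}(L)$ membership test concrete: one edge lookup per character. The
encoding below is a minimal sketch, not the project's actual implementation.

```python
# The example DAWG encoded as a transition table:
# edges[(node, label)] -> next node. Final states accept a word.
edges = {("n1", "a"): "n2", ("n2", "b"): "n3", ("n3", "d"): "n6",
         ("n1", "b"): "n4", ("n4", "a"): "n5",
         ("n5", "d"): "n6", ("n5", "e"): "n6"}
final = {"n6"}

def accepts(word, start="n1"):
    """Follow one edge per character: O(L) in the word length L."""
    node = start
    for ch in word:
        node = edges.get((node, ch))
        if node is None:
            return False
    return node in final

print([w for w in ("abd", "bad", "bae", "ab") if accepts(w)])
# -> ['abd', 'bad', 'bae']; "ab" ends in n3, which is not final
```

Note how the three words share the final \texttt{d}/\texttt{e} edges into
$n6$; this edge re-use is what lets a DAWG store large word sets compactly.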

\begin{figure}[H]
\caption{Example DAWG}
\label{exampledawg}
\centering
\scalebox{0.7}{
\digraph[]{graph21}{
rankdir=LR;
node [shape="circle"]; n1 n2 n3 n4 n5;
n6 [shape="doublecircle"];
n1 -> n2 [label="a"];
n2 -> n3 [label="b"];
n3 -> n6 [label="d"];
n1 -> n4 [label="b"];
n4 -> n5 [label="a"];
n5 -> n6 [label="d"];
n5 -> n6 [label="e"];
}
}
\end{figure}