\section{Introduction}
What do people do when they want to grab a movie? Attend a concert? Find out
which shows play in the local theater?

In the early days of the internet, access to the web was not available to most
people. Information about leisure activities was almost exclusively obtained
from flyers, posters and other written media and from radio and TV
advertisements. People had to put effort into finding information and it was
easy to miss a show simply because you never crossed paths with it. Today the
internet is used daily by almost everyone in Western society and one would
think that missing an event has become impossible, given the enormous amount
of information available online. For leisure activities the opposite is true:
complete and reliable information about events is still hard to find.

Nowadays, information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building a beautiful,
fast and above all modern website that presents their information with nice
graphics, animations and other gimmicks. Information bundling websites are run
by companies that try to provide an overview of multiple venues, and they
often use the individual venue websites as their source of information.
Individual venues assume, for example, that it is obvious what the address of
their venue is, that their ticket price is always fixed at \EURdig$5.-$ or
that a membership is needed to attend the events. They usually put this
non-specific information in a disclaimer or on a separate page. Because the
information is provided in such an unstructured way, information bundling
websites have a hard time gathering complete information. The event data can
be crawled using automated crawlers, but the miscellaneous information usually
has to be gathered by hand.

Combining the information from the different data sources turns out to be a
complicated task for information bundling websites, because the companies
behind them do not have the resources and time reserved for it and therefore
often serve incomplete information. Because of the complexity of obtaining
complete information, there are not many companies trying to bundle
entertainment information into a complete and consistent database and website.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information and offers it via various
information bundling websites.
\newpage
\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that was founded at a time when the internet
was not yet widespread. Hyperleap, active since 1995, specializes in
producing, publishing and maintaining \textit{infotainment}.
\textit{Infotainment} is a portmanteau of \textit{information} and
\textit{entertainment}. It represents a combination of factual information,
the \textit{information} part, and non-factual or subjective information, the
\textit{entertainment} part, within a certain category or field. In the case
of Hyperleap the category is the leisure industry, which encompasses all
facets of entertainment, ranging from cinemas, theaters and concerts to
swimming pools, bridge competitions and conferences. Within the leisure
industry, factual information includes, but is not limited to, starting time,
location, host or venue and duration. Subjective information includes, but is
not limited to, reviews, previews, photos, background information and trivia.

Hyperleap claims to manage the largest database containing
\textit{infotainment} about the leisure industry focused on the Netherlands
and surrounding regions. The database contains over $10{,}000$ categorized
events per week on average, and the venue database contains over $54{,}000$
venues hosting leisure activities, ranging from theaters and music venues to
petting zoos and fast food restaurants. All subjective information is obtained
or created by Hyperleap itself, and all factual information is gathered from
different sources and quality checked, making it very reliable. Hyperleap is
the only company of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites specialized per genre
or category, some of which attract over $500{,}000$ visitors each month.

\section{Information flow}
Hyperleap maintains this high data quality by investing a lot of time and
resources in quality checking, cross comparing and consistency checking, which
greatly lowers the chances of incomplete or wrong data. To achieve this, the
data goes through several stages before it enters the database. These stages
are visualized in Figure~\ref{informationflow} as an information flow diagram.
In this diagram the nodes are processing steps and the arrows denote
information transfer or flow.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{informationflow.pdf}
\caption{Information flow Hyperleap database\label{informationflow}}
\end{figure}

\subsection{Sources}
A source is a service, location or medium in which information about events is
stored or published. A source can come in different formats such as HTML,
email, flyer, RSS and so on. All information gathered from a source has to be
quality checked before it is even considered for automated crawling. There are
several criteria the source has to comply with before an automated crawler can
be made: for example, the source has to be reliable, consistent and free by
licence. Event information from a source must contain at least the
\textit{What, Where} and \textit{When} information.

The \textit{What} information describes the content of the event. Content is a
very broad term, but in practice it can be the concert tour name, theater show
title, movie title, festival title and much more.

The \textit{Where} information is the location of the event. The location is
often omitted because the organization presenting the information thinks it is
obvious. This information can also include different sub locations, for
example when a pop venue has its own building but organizes a festival in some
park during the summer. This data is often assumed to be trivial and inherent,
but in practice it is not: in this example, for an outsider the name of the
park alone is often not enough.

The \textit{When} field is the time and date of the event. Hyperleap wants to
have at minimum the date, start time and end time. In practice, end times are
often omitted because they are not fixed or because the organization thinks
they are obvious.
\strut\\

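To make these minimum requirements concrete, the following sketch models an
event entry with the mandatory \textit{What}, \textit{Where} and \textit{When}
fields. The class and all names in it are illustrative assumptions, not part
of the Hyperleap system.

\begin{lstlisting}[language=Python]
from dataclasses import dataclass
from datetime import date, time
from typing import Optional

@dataclass
class EventEntry:
    """Minimal event entry; all names are illustrative."""
    what: str                   # e.g. tour name or movie title
    where: str                  # venue, possibly with a sub location
    on_date: date               # the When part: date, start and end
    start: time
    end: Optional[time] = None  # end times are often omitted

    def is_complete(self) -> bool:
        # Hyperleap wants the end time as well
        return self.end is not None
\end{lstlisting}
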
Venues often present incomplete data entries that do not comply with the
requirements explained above. Within the information flow, categorizing and
grading the source is the first step. Hyperleap processes different sources
and source types and every source has different characteristics. Sources can
be modern ones such as websites or social media, but even today a lot of
information arrives at Hyperleap via flyers, fax or email. Just as source
types vary in content structure, sources also vary in reliability. The
following entry, for example, is a very well structured and probably
generated, and thus probably also reliable, event description. The entry could
originate, for example, from the title of an entry in an RSS feed. It has a
clear structure and almost all required information is directly available from
the entry.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
new CD in combination with a show. Location: small salon.}
\end{flushleft}

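For such a generated entry, a single regular expression can already extract
all fields. The pattern below is only a sketch for this one entry format; a
real crawler has to cope with variations.

\begin{lstlisting}[language=Python]
import re

# Pattern for the single example entry format shown above.
PATTERN = re.compile(
    r'(?P<date>\d{4}-\d{2}-\d{2}), '
    r'(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - '
    r'(?P<what>.+)\. Location: (?P<where>.+)\.')

entry = ('2015-05-20, 18:00-23:00 - Foobar presenting their '
         'new CD in combination with a show. Location: small salon.')

match = PATTERN.match(entry)
if match:
    print(match.groupdict())
    # {'date': '2015-05-20', 'start': '18:00', 'end': '23:00',
    #  'what': 'Foobar presenting their new CD in combination
    #           with a show', 'where': 'small salon'}
\end{lstlisting}
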
A low quality item could, for example, be the following text, which could
originate from a flyer or a social media post. This example lacks a precise
date, time and location and is therefore hard to grasp at first for people,
let alone for a machine. Someone who wants the full information has to tap
into different resources, which might not always be available.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the %
park tomorrow evening.}
\end{flushleft}

Information of such low quality is often not suitable for automated crawling.
In Figure~\ref{informationflow} this manual route is shown by the arrow going
straight from the source to the database. Non digital source types and very
sudden changes, such as surprise concerts or cancellations, are also crawled
manually.

\subsection{Crawler}
When the source has been determined and classified, the next step is
periodically crawling the source with an automated crawler. As said before,
sources have to be structured and reliable; when this is the case, a
programmer creates a program that visits the website systematically and
automatically to extract all new information. The programmed crawlers are
usually created specifically for one or more sources and when a source
changes, usually a change in structure, the programmer has to adapt the
crawler. Since writing and adapting crawlers requires a programmer, the
process is costly. Automatically crawled information is not inserted into the
database directly because it is not reliable enough: in case of a change in
the source, malformed data can pass through. As a safety net and final check,
the information first goes to the \textit{Temporum} before it is entered into
the database.

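The sketch below illustrates the general shape of such a source-specific
crawler. Every name in it is hypothetical; the source-specific
\texttt{extract} function is the part a programmer has to rewrite whenever the
source changes.

\begin{lstlisting}[language=Python]
import urllib.request

def send_to_temporum(entry: dict) -> None:
    # Hypothetical hand-off: extracted data goes to the Temporum
    # for a manual check, never straight into the database.
    print('queued for manual checking:', entry)

def crawl(source_url: str, extract) -> None:
    """One periodic crawler run; `extract` is source specific."""
    with urllib.request.urlopen(source_url) as response:
        raw = response.read().decode('utf-8', errors='replace')
    for entry in extract(raw):
        send_to_temporum(entry)
\end{lstlisting}
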
\subsection{Temporum}
The \textit{Temporum} is a big bin that contains the raw data extracted from
different sources by the automated crawlers. Some of the information in the
\textit{Temporum} might not be suitable for the final database and therefore
has to be post-processed. The post-processing encompasses several steps.

The first step is to check the validity of the event entries from a certain
source. Validity checking is useful to detect outdated automated crawlers
before their data can leak into the database. Crawlers become outdated when a
source changes and the crawler can no longer crawl the website using the
original method. Validity is checked at random on a selection of event
entries.

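As an illustration of this step, the sketch below draws a random sample of
entries for a manual check. The sampling fraction is an assumption for the
example, not a Hyperleap figure.

\begin{lstlisting}[language=Python]
import random

def sample_for_checking(entries: list, fraction: float = 0.05) -> list:
    """Pick a random sample of event entries for a validity check.

    Checking a small random sample per source is enough to notice
    an outdated crawler before much malformed data leaks through.
    """
    if not entries:
        return []
    k = max(1, int(len(entries) * fraction))
    return random.sample(entries, k)
\end{lstlisting}
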
An event entry usually describes a single occurrence of an event. In many
cases there is parent information that the event entry is part of: in the case
of a concert tour, for example, the parent information is the tour and the
event entry is one particular performance. The second step in post-processing
is matching the event entries to possible parent information. This parent
information can be a venue, a tour, a showing, a tournament and much more.

Both post-processing tasks are done by people with the aid of automated
functionality. Within these two steps, malformed data can be spotted very
quickly; the \textit{Temporum} thus acts as a safety net that keeps the
probability of malformed data leaking into the database as low as possible.

\subsection{Database \& Publication}
Post-processed data that leaves the \textit{Temporum} enters the final
database. This database contains all information about all events that
happened in the past and all events that will happen in the future. The
database also contains the parent information, such as information about
venues. Several categorical websites use the database to offer the information
to users and accompany it with the second part of \textit{infotainment},
namely the subjective information. This \textit{entertainment} part is usually
presented in the form of trivia, photos, interviews, reviews, previews and
much more.

\section{Goal \& Research question}
Maintaining the automated crawlers and the infrastructure that provides the
\textit{Temporum} and its matching aids are the parts of the data flow that
require the most resources. Both require a programmer and are therefore
costly. The automated crawlers require a programmer because they are scripts
or programs that are specifically designed for a particular website. Changing
such a script or program requires knowledge about the source, the programming
framework and the \textit{Temporum}. In practice both tasks mean changing
code.

A large group of sources change in structure frequently, so the task of
reprogramming crawlers has to be repeated often. Malfunctioning crawlers are
detected in the \textit{Temporum} and not at an earlier stage. This late
detection elongates the feedback loop, because communication between the
programmers and the \textit{Temporum} workers is not always tight. In the case
of a malfunction the source is crawled as usual, and the malformed data will
most likely get processed into rubbish that is sent to the \textit{Temporum}.
There the error is detected after a while and the programmers have to be
contacted. Finally the crawler is adapted to the new structure and produces
good data again. This feedback loop, shown in Figure~\ref{feedbackloop}, can
take days and can be the reason for gaps and faulty information in the
database. The figure shows information flow with arrows; the solid and dotted
lines form the current feedback loop.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{feedbackloop.pdf}
\caption{Feedback loop for malfunctioning crawlers\label{feedbackloop}}
\end{figure}

The specific goal of this project is to relieve the programmer from spending a
lot of time repairing crawlers and to make the task of adapting, editing and
removing crawlers feasible for someone without programming experience. In
practice this means shortening the feedback loop. The shortened feedback loop
is also shown in Figure~\ref{feedbackloop}: the dashed line shows the shorter
loop that relieves the programmer.

For this project a system has been developed that provides an interface to a
crawler generation system able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface is point-and-click, meaning that no computer
science background is needed to create, modify, test and remove crawlers. The
current Hyperleap backend system that handles the data can query XML feeds
that contain the crawled data.

The actual research question can then be formulated as:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds. Such feeds
publish their data in a restricted XML format\cite{Xml} consisting of entries.
Every entry usually represents an event and consists of standardized data
fields. The data fields we are interested in are the \textit{title} and the
\textit{description} fields, which store the raw data describing the event.
Besides these there are several auxiliary fields that, for example, store the
link to the full article, the publishing date, some media in the form of an
image or video URL, or a \textit{Globally Unique Identifier}
(GUID)\footnote{A GUID is a unique identifier that in most cases is the
permalink of the article; a permalink is a link that permanently points to the
article.}. An example of an RSS feed can be found in
Listing~\ref{examplerss}, which shows a partly truncated RSS feed of a well
known venue in the Netherlands. Every RSS feed contains a \texttt{channel}
field containing some metadata and a list of \texttt{item} fields. Every
\texttt{item} field has a fixed number of different fields. The most important
fields for RSS within the leisure industry are the \texttt{title} and the
\texttt{description} fields.

\lstinputlisting[language=xml,label={examplerss},caption={An example of a %
partly truncated RSS feed of a well known Dutch venue}]{exrss.xml}

RSS feeds are mostly used by news sites to publish their articles. An RSS feed
only contains the headlines of the entries. Readers who are interested can
click the so-called deeplink and are sent to the full website containing the
complete entry or article. Users often use programs that bundle user specified
RSS feeds into big combined feeds that make it possible to sift through a lot
of news feeds. Such an RSS feed reader uses the unique GUID to skip entries
that are already present in its internal database.

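A minimal sketch of this reading behaviour is shown below for a plain RSS~2.0
feed (Atom uses different element names). It keeps the set of seen GUIDs in
memory; a real reader would persist it between runs.

\begin{lstlisting}[language=Python]
import urllib.request
import xml.etree.ElementTree as ET

seen_guids = set()   # a reader's internal database, simplified

def read_feed(url: str) -> list:
    """Return only the feed items not seen in an earlier run."""
    with urllib.request.urlopen(url) as response:
        channel = ET.parse(response).getroot().find('channel')
    new_items = []
    for item in channel.iter('item'):
        guid = item.findtext('guid') or item.findtext('link') or ''
        if guid in seen_guids:
            continue             # already processed earlier
        seen_guids.add(guid)
        new_items.append({
            'title': item.findtext('title', default=''),
            'description': item.findtext('description', default=''),
        })
    return new_items
\end{lstlisting}
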
Generating RSS feeds by hand is a tedious task, but almost all RSS feeds are
generated by the Content Management System (CMS) on which the website runs.
With this automatic method the RSS feeds are generated for the content
published on the website. Because the process is automatic, the generated RSS
feeds are generally very consistent in structure. In the entertainment
industry venues often use a CMS for their website so that users without a
programming or website background can post news items and event information;
such websites thus often have RSS feeds.

\section{Why RSS?}
\label{sec:whyrss}
There are lots of different source formats such as HTML, fax/email, RSS and
XML. Because of the limited scope of the project and the time planned for it,
we had to exclude some of the input formats, since they all require different
techniques and approaches. For example, when the input source is in HTML
format, most probably a website, a great deal of the information extraction
can be automated using the structural information that is characteristic of
HTML. For fax/email, however, there is almost no structural information and
most automation techniques require natural language processing and possibly
OCR. We chose RSS feeds because the relevant fields of an RSS feed lack
inherent structural markup, yet their content is still very structured. This
structure exists because, as said above, RSS feeds are generated and therefore
almost always look the same. Also, in RSS feeds most venues use particular
characters as structural identifiers: they separate fields with vertical bars,
commas, whitespace and other non-text characters. These field separators and
keywords can be hard for a computer to detect, but people are very good at
detecting them: with one look they can identify the characters and keywords
and build a pattern in their head.

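The two titles below, both made up for illustration, show the kind of
separator-based structure meant here; splitting on the separator characters
already yields the fields.

\begin{lstlisting}[language=Python]
import re

# Made-up titles in the style many venue feeds use.
titles = [
    '2015-05-20 | Foobar | CD release show | small salon',
    '2015-05-21, Quux, open air concert, city park',
]

for title in titles:
    # Split on vertical bars or commas with surrounding whitespace.
    print(re.split(r'\s*[|,]\s*', title))
# ['2015-05-20', 'Foobar', 'CD release show', 'small salon']
# ['2015-05-21', 'Quux', 'open air concert', 'city park']
\end{lstlisting}
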
Another reason we chose RSS is its temporal consistency: because RSS feeds are
almost always generated, the structure of the entries is very unlikely to
change. Basically, an RSS feed only changes structure when the CMS that
generates it changes its generation algorithm. This property is useful because
the crawlers then do not have to be retrained very often. To detect the
underlying structures, a technique is used that exploits subword matching with
graphs.

\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs (DGs) are mathematical structures that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges defined by $E\subseteq
V\times V$. An edge $e\in E$ is defined as $(v_1, v_2)$ where $v_1,v_2\in V$
and is shown in figures as an arrow from node $v_1$ to node $v_2$. Multiple
connections between two nodes are possible if the directions differ. For
example, the graph visualized in Figure~\ref{graphexample} can be described
mathematically as:
$$G=(\{n_1, n_2, n_3, n_4\},
\{(n_1, n_2), (n_2, n_1), (n_2, n_3), (n_3, n_4), (n_1, n_4)\})$$

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{graphexample.pdf}
\caption{Example DG\label{graphexample}}
\end{figure}

\paragraph{Directed acyclic graphs}
Directed Acyclic Graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are not
allowed. Figure~\ref{dagexample} shows two graphs: the bottom graph contains a
cycle and the top graph does not, so only the top graph is a valid DAG. A
cycle can be defined as follows:

If $(v_1,v_2)\in E$ we write $v_1\rightarrow v_2$. We write
$v_1\xrightarrow{+}v_n$ when $v_1\rightarrow v_2\rightarrow\ldots\rightarrow
v_{n-1}\rightarrow v_n$, meaning that there is a path of length at least $1$
from $v_1$ to $v_n$. A graph is acyclic when $\nexists v\in V:
v\xrightarrow{+}v$; in words, no node can be reached again while traveling the
graph. Adding the property of acyclicity to graphs lowers the computational
complexity of checking path existence in the graph to $\mathcal{O}(L)$ where
$L$ is the length of the path.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dagexample.pdf}
\caption{Example DAG\label{dagexample}}
\end{figure}

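As a small illustration of such a linear check, the sketch below stores the
edge set of the example graph in Figure~\ref{graphexample} and verifies a
given node sequence edge by edge, one set lookup per step, so
$\mathcal{O}(L)$ for a path of length $L$.

\begin{lstlisting}[language=Python]
# Edge set of the example graph in Figure "graphexample".
edges = {('n1', 'n2'), ('n2', 'n1'), ('n2', 'n3'),
         ('n3', 'n4'), ('n1', 'n4')}

def path_exists(path: list) -> bool:
    """Check a given node sequence edge by edge: O(L)."""
    return all((a, b) in edges
               for a, b in zip(path, path[1:]))

print(path_exists(['n1', 'n2', 'n3', 'n4']))  # True
print(path_exists(['n1', 'n3']))              # False, no such edge
\end{lstlisting}
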
\newpage
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is, as in directed graphs, the set of nodes. $E$ is the
set of edges described by $E\subseteq V\times V\times A$ where $A$ is the
alphabet used for edge labels. In words this means that an edge is a labeled
arrow between two nodes; for example, the edge $(v_1, v_2, a)$ can be
visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$
contains all the characters used in natural language, but in theory $A$ can
contain anything. $v_0$ describes the initial node, visualized in figures as a
node with an incoming arrow. $F$ describes the set of final nodes: nodes that
can be the end of an accepted sequence even if there are arrows leading out.
In the example graph in Figure~\ref{exampledawg} the only final node, $n_6$,
is visualized with a double circle as node shape. In this example the marking
is purely cosmetic, because $n_6$ is a final node anyway as no arrows lead out
of it. This does not need to be the case: in, for example, $G=(\{n_1, n_2,
n_3\}, n_1, \{(n_1, n_2, a), (n_2, n_3, b)\}, \{n_2, n_3\})$, which accepts
both $a$ and $ab$, the final node marking has a distinct use. Because of the
labeled edges, data can be stored in a DAWG: traversing a DAWG while saving
all the edge labels constructs words. Using graph minimisation, big sets of
words can be stored using a small amount of storage because edges can be
re-used to specify transitions. For example, the graph in
Figure~\ref{exampledawg} accepts exactly the words $\{abd, bad, bae\}$.
Testing whether a word is present in the DAWG uses the same technique as
testing whether a node path is present in a normal DAG and therefore also
falls in the computational complexity class $\mathcal{O}(L)$: it grows
linearly with the length of the word.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dawgexample.pdf}
\caption{Example DAWG\label{exampledawg}}
\end{figure}

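A DAWG with labeled edges maps naturally onto a transition table. The sketch
below encodes a six-node DAWG accepting exactly $\{abd, bad, bae\}$; the
concrete node names are an assumption, since they need not match the figure
exactly.

\begin{lstlisting}[language=Python]
# Transition table of a DAWG accepting exactly {abd, bad, bae}.
# Node names are assumptions; n6 is the single final node.
edges = {
    ('n1', 'a'): 'n2', ('n1', 'b'): 'n3',
    ('n2', 'b'): 'n4', ('n3', 'a'): 'n5',
    ('n4', 'd'): 'n6',
    ('n5', 'd'): 'n6', ('n5', 'e'): 'n6',
}
initial, final = 'n1', {'n6'}

def accepts(word: str) -> bool:
    """Walk the labeled edges: linear in the word length."""
    node = initial
    for char in word:
        if (node, char) not in edges:
            return False
        node = edges[(node, char)]
    return node in final

for w in ['abd', 'bad', 'bae', 'ab']:
    print(w, accepts(w))   # True for the first three, False for 'ab'
\end{lstlisting}
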
\section{Structure}
The following chapters describe the system that has been created and the
methods used. Chapter~2 presents the requirements and the design of the
program, followed in Chapter~3 by the underlying methods used for the actual
matching. Finally, Chapter~4 concludes with the results, discussion and future
research.