\section{Introduction}
What do people do when they want to see a movie? Attend a concert? Find out
which shows are playing in the local theater?

In the early days of the internet, access to the web was not available to
most people. Information about leisure activities was almost exclusively
obtained from flyers, posters and other printed media, and from radio and TV
advertisements. People had to put effort into searching for information, and
it was easy to miss a show simply because you never crossed paths with it.
Today the internet is used daily by almost everyone in western society, and
one would expect that missing an event has become impossible given the
enormous amount of information one can receive every day via the internet.
For leisure activities the opposite is true: complete and reliable
information about events is still hard to find.

Nowadays, information about entertainment is offered on the internet via two
main channels: individual venue websites and information bundling websites.

Individual venues put a lot of effort and resources into building beautiful,
fast and, above all, modern websites that present their information with nice
graphics, animations and other gimmicks. Information bundling websites are
run by companies that try to provide an overview of multiple venues, and they
often use the individual venue websites as their source of information.
Individual venues assume, for example, that the address of their venue is
obvious, that their ticket price is always fixed at \EURdig$5.-$, or that
visitors know a membership is needed to attend the events. Venues usually put
such general information in a disclaimer or on a separate page. Because this
information is provided in a less structured way, information bundling
websites have a hard time compiling complete information: the event data can
be collected with automated crawlers, but the miscellaneous information
usually has to be gathered by hand.

Combining the information from the different data sources turns out to be a
complicated task for information bundling websites, since the companies
behind them rarely have the resources and time reserved for it and therefore
often serve incomplete information. Because of the complexity of obtaining
complete information, few companies try to bundle entertainment information
into a complete and consistent database and website.
Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
serving complete and consistent information and offers it via various
information bundling websites.
\newpage
\section{Hyperleap \& Infotainment}
Hyperleap is an internet company founded before the internet was widespread.
Active since 1995, Hyperleap specializes in producing, publishing and
maintaining \textit{infotainment}. \textit{Infotainment} is a combination of
the words \textit{information} and \textit{entertainment}: it combines
factual information, the \textit{information} part, with subjective
information, the \textit{entertainment} part, within a certain category or
field. In the case of Hyperleap the category is the leisure industry, which
encompasses all facets of entertainment, ranging from cinemas, theaters and
concerts to swimming pools, bridge competitions and conferences. Within this
industry, factual information includes, but is not limited to, starting time,
location, host or venue and duration. Subjective information includes, but is
not limited to, reviews, previews, photos, background information and trivia.

Hyperleap claims to manage the largest database of \textit{infotainment}
about the leisure industry focused on the Netherlands and surrounding
regions. The database contains on average over $10.000$ categorized events
per week, and the venue database contains over $54.000$ venues offering
leisure activities, ranging from theaters and music venues to petting zoos
and fast food restaurants. All subjective information is obtained or created
by Hyperleap, and all factual information is gathered from different sources
and quality checked, and is therefore very reliable. Hyperleap is the only
company of its kind with information of such high quality. The
\textit{infotainment} is presented via several websites, each specialized per
genre or category, and some sites attract over $500.000$ visitors each month.

\section{Information flow}
Hyperleap maintains this high data quality by investing a lot of time and
resources in quality checking, cross comparing and consistency checking,
which greatly lowers the chance of incomplete or wrong data. To achieve this,
the data goes through several stages before it enters the database. These
stages are visualized in Figure~\ref{informationflow} as an information flow
diagram, in which the nodes are processing steps and the arrows denote
information transfer or flow.

\begin{figure}[H]
\centering
\includegraphics[width=\linewidth]{informationflow.pdf}
\caption{Information flow of the Hyperleap database}
\label{informationflow}
\end{figure}

\subsection{Sources}
A source is a service, location or medium in which information about events
is stored or published. Sources come in different shapes, such as HTML,
email, flyers, RSS and so on. All information gathered from a source has to
be quality checked before it is even considered for automated crawling. There
are several criteria a source has to meet before an automated crawler can be
made for it: for example, the source has to be reliable, consistent and free
by licence. Event information from a source must contain at least the
\textit{What}, \textit{Where} and \textit{When} information.

The \textit{What} information describes the content of the event. Content is
a very broad notion, but in practice it can be the concert tour name, theater
show title, movie title, festival title and many more.

The \textit{Where} information is the location of the event. The location is
often omitted because the organization behind the source considers it
obvious. This information can also include different sub locations: for
example, a pop concert venue may have its own building but organize a
festival in a park during the summer. Such data is often assumed to be
trivial and commonly known, but in practice it is not; in this example, the
name of the park alone is often not enough for an outsider.

The \textit{When} information is the time and date of the event. Hyperleap
wants at minimum the date, start time and end time. End times, for example,
are often omitted from this field because they are not fixed or because the
organization considers them obvious.
\strut\\

Venues often present incomplete data entries that do not meet the
requirements explained above. Within the information flow, categorizing and
grading the source is the first step. Hyperleap processes many different
sources and source types, and every source has different characteristics.
Sources can be modern, such as websites or social media, but even today a lot
of information arrives at Hyperleap via flyers, fax or email. Just as source
types vary in content structure, sources also vary in reliability. For
example, the following entry is a very well structured, probably machine
generated, and thus probably also reliable event description; it could
originate from the title of an entry in an RSS feed. It has a clear
structure, and almost all required information is directly available from the
entry.

\begin{flushleft}
\texttt{2015-05-20, 18:00-23:00 - \textit{Foobar} presenting their %
new CD in combination with a show. Location: small salon.}
\end{flushleft}

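As an illustration, such a well structured entry can be parsed with a simple
pattern. The sketch below is hypothetical and is not Hyperleap's actual
crawler logic; the field layout is taken from the example entry above.

```python
import re

# Hypothetical parse of the structured example entry above; the pattern
# is illustrative only, not Hyperleap's actual crawler.
entry = ("2015-05-20, 18:00-23:00 - Foobar presenting their "
         "new CD in combination with a show. Location: small salon.")

m = re.match(
    r"(?P<date>\d{4}-\d{2}-\d{2}), "                   # When: date
    r"(?P<start>\d{2}:\d{2})-(?P<end>\d{2}:\d{2}) - "  # When: times
    r"(?P<what>.*?)\. Location: (?P<where>.*)\.$",     # What and Where
    entry)

print(m.group("date"), m.group("start"), m.group("end"))  # When
print(m.group("what"))                                    # What
print(m.group("where"))                                   # Where
```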
An example of a low quality entry is the following text, which could
originate from a flyer or a social media post. It lacks a precise date, time
and location and is therefore hard to interpret at first glance even for
people, let alone for a machine. To obtain the full information one has to
consult additional resources, which may not always be available.

\begin{flushleft}
\texttt{\textit{Foobar} playing to celebrate their CD release in the %
park tomorrow evening.}
\end{flushleft}

Information of such low quality is often not suitable for automated crawling.
In Figure~\ref{informationflow} this manual route is shown by the arrow going
straight from the source to the database. Non digital source types and very
sudden changes, such as surprise concerts or cancellations, are also
processed manually.

\subsection{Crawler}
Once a source has been determined and classified, the next step is to crawl
it periodically with an automated crawler. As said before, sources have to be
structured and reliable; when this is the case, a programmer creates a
program that visits the source systematically and automatically to extract
all new information. These crawlers are usually created specifically for one
or a few sources, and when a source changes, usually in structure, the
programmer has to adapt the crawler. Since writing and adapting crawlers
requires a programmer, the process is costly. Automatically crawled
information is not inserted into the database directly because it is not
reliable enough: when a source changes, malformed data can pass through. As a
safety net and final check, the information first goes to the
\textit{Temporum} before it is entered into the database.

\subsection{Temporum}
The \textit{Temporum} is a big bin that contains the raw data extracted from
the different sources by the automated crawlers. Not all information in the
\textit{Temporum} is suitable for the final database, so it has to be post
processed. The post processing encompasses several steps.

The first step is to check the validity of the event entries from a certain
source. Validity checking is useful to detect faulty automated crawlers
before their data can leak into the database. It is performed at random on a
sample of event entries.

An event entry usually describes a single occurrence of an event. In many
cases there is parent information that the event entry belongs to: in the
case of a concert tour, for example, the parent information is the tour and
the event entry is a single performance. The second step in post processing
is matching the event entries to their parent information, which can be a
venue, a tour, a showing, a tournament and much more.

Both post processing tasks are done by people with the aid of automated
tooling. Within these two steps malformed data can be spotted very quickly,
so the \textit{Temporum} acts as a safety net that keeps the probability of
malformed data leaking into the database as low as possible.

\subsection{Database \& Publication}
Post processed data that leaves the \textit{Temporum} enters the final
database. This database contains all information about events that happened
in the past and events that will happen in the future, as well as the parent
information, such as details about venues. Several categorical websites use
the database to offer the information to users and accompany it with the
second part of \textit{infotainment}, namely the subjective information. This
\textit{entertainment} part is usually presented in the form of trivia,
photos, interviews, reviews, previews and much more.

\section{Goal \& Research question}
Maintaining the automated crawlers and the infrastructure that provides the
\textit{Temporum} and its matching aids are the parts of the data flow that
require the most resources. Both require a programmer and are therefore
costly. The automated crawlers require a programmer because they are website
specific scripts or programs; changing such a script or program requires
knowledge of the source, the programming framework and the \textit{Temporum}.
In practice both tasks mean changing code.

A large group of sources changes structure frequently, so the task of
reprogramming crawlers has to be repeated often. Moreover, malfunctioning
crawlers are detected in the \textit{Temporum} and not at an earlier stage.
This late detection elongates the feedback loop, because communication
between the programmers and the \textit{Temporum} workers is not always
tight. In the case of a malfunction the source is first crawled; the
malformed data is then most likely processed into rubbish that is sent to the
\textit{Temporum}, where the error is detected after a while and the
programmers have to be contacted; finally the crawler is adapted to the new
structure and produces good data again. This feedback loop, shown in
Figure~\ref{feedbackloop}, can take days and can be the cause of gaps and
faulty information in the database. In the figure, arrows denote information
flow; the solid and dotted lines form the current feedback loop.
\begin{figure}[H]
\centering
\includegraphics[width=0.8\linewidth]{feedbackloop.pdf}
\caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\end{figure}

The specific goal of this project is to relieve the programmer of spending a
lot of time repairing crawlers, and to make the task of adapting, editing and
removing crawlers feasible for someone without programming experience. In
practice this means shortening the feedback loop. The shortened feedback loop
is also shown in Figure~\ref{feedbackloop}: the dashed line shows the shorter
loop that relieves the programmer.

For this project a system has been developed that provides an interface to a
crawler generation system able to crawl RSS\cite{Rss} and Atom\cite{Atom}
publishing feeds. The interface is point and click, meaning that no computer
science background is needed to create, modify, test and remove crawlers. The
current Hyperleap backend system that handles the data can query XML feeds
containing the crawled data.

The actual research question can then be formulated as:
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

\section{RSS/Atom}
RSS/Atom feeds, from now on called RSS feeds, are publishing feeds that
publish their data in a restricted XML format\cite{Xml} consisting of
entries. Every entry usually represents an event and consists of standardized
data fields. The fields we are interested in are the \textit{title} and
\textit{description} fields, which store the raw data describing the event.
Besides these there are several auxiliary fields that, for example, store the
link to the full article, the publication date, some media in the form of an
image or video URL, or a \textit{Globally Unique Identifier}
(GUID)\footnote{A GUID is a unique identifier that in most cases is the
permalink of the article, a link that will always point to the article.}. An
example of an RSS feed can be found in Listing~\ref{examplerss}, which shows
a partly truncated RSS feed of a well known venue in the Netherlands. Every
RSS feed contains a \texttt{channel} field, and within that field there is
some metadata and a list of \texttt{item} fields. Every \texttt{item} field
has a fixed number of fields; the most important ones for RSS within the
leisure industry are the \texttt{title} and \texttt{description} fields.

\lstinputlisting[language=xml,label={examplerss},caption={An example of a %
partly truncated RSS feed of a well known Dutch venue}]{exrss.xml}

RSS feeds are mostly used by news sites to publish their articles. An RSS
feed only contains the headlines of the entries; if readers are interested,
they can click the so called deeplink and are sent to the full website
containing the complete entry or article. Users often use programs that
bundle user specified RSS feeds into one big combined feed that can be used
to sift through many news feeds at once. Such RSS feed readers use the unique
GUID to skip entries that are already present in their internal database.

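A minimal sketch of this mechanism, using Python's standard library on a
hand-written feed fragment; the venue, item text and GUID are invented for
the example.

```python
import xml.etree.ElementTree as ET

# Hand-written RSS fragment; the venue, item and GUID are invented.
RSS = """<rss version="2.0"><channel>
<title>Example venue</title>
<item>
  <title>2015-05-20, 18:00-23:00 - Foobar presenting their new CD</title>
  <description>In combination with a show. Location: small salon.</description>
  <guid>http://example.org/agenda/1</guid>
</item>
</channel></rss>"""

seen = set()  # GUIDs of entries already in the reader's internal database
for item in ET.fromstring(RSS).iter("item"):
    guid = item.findtext("guid")
    if guid in seen:          # entry was processed before: skip it
        continue
    seen.add(guid)
    print(item.findtext("title"))
    print(item.findtext("description"))
```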
Generating RSS feeds by hand is a tedious task, but almost all RSS feeds are
generated by the Content Management System (CMS) on which the website runs.
With this automatic method the RSS feeds are generated for the content
published on the website, and because the process is automatic the feeds are
generally very consistent in structure. In the entertainment industry, venues
often use a CMS for their website so that users without a programming or
website background can post news items and event information; such venues
should therefore often have RSS feeds.

\section{Why RSS?}
\label{sec:whyrss}
There are many different source formats, such as HTML, fax/email, RSS and
XML. Because of the limited scope of the project and the time planned for it,
we had to exclude some input formats, as they all require different
techniques and approaches. For example, when the input source is in HTML
format, most probably a website, a great deal of the information extraction
can be automated using the structural information that is characteristic of
HTML. For fax/email, on the other hand, there is almost no structural
information, and most automation techniques require natural language
processing. We chose RSS feeds because, although their entries contain no
inherent structural markup, their contents are still very structured. This
structure exists because, as said above, RSS feeds are generated and
therefore almost always look the same. Also, in RSS feeds most venues use
particular structural identifiers in the form of characters: they separate
fields with vertical bars, commas, whitespace and other non text characters.
Such field separators and keywords can be hard for a computer to detect, but
people are very good at detecting them; with one look they can identify the
characters and keywords and build a pattern in their head. Another reason we
chose RSS is temporal consistency: because RSS feeds are almost always
generated, the structure of the entries is very unlikely to change.
Basically, an RSS feed only changes structure when the CMS that generates it
changes its generation algorithm. This property is useful because the
crawlers then do not have to be retrained very often. To detect the
underlying structures, a technique is used that exploits subword matching
with graphs.

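As a sketch of such character-based field separators, consider a feed title
that uses vertical bars between fields; the title format here is invented for
illustration.

```python
# Hypothetical feed title using vertical bars as field separators.
title = "2015-05-20 | Foobar | Small salon"

# Splitting on the separator recovers the When, What and Where fields.
fields = [f.strip() for f in title.split("|")]
when, what, where = fields
print(when, what, where)
```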
\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
Directed graphs (DGs) are mathematical structures that can describe relations
between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
is a set of named nodes and $E$ is the set of edges, defined by $E\subseteq
V\times V$. An edge $e\in E$ is defined as $(v_1, v_2)$, where $v_1,v_2\in
V$, and is shown in a figure as an arrow from node $v_1$ to node $v_2$. Two
connections between the same pair of nodes are possible if their directions
differ. For example, the graph visualized in Figure~\ref{graphexample} can be
mathematically described as:
$$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\})$$

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{graphexample.pdf}
\caption{Example DG}
\label{graphexample}
\end{figure}

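The edge-set definition above can be mirrored directly in code; the sketch
below stores the example graph $G$ as a set of ordered pairs.

```python
# The example graph G as a set of ordered pairs (directed edges).
V = {"n1", "n2", "n3", "n4"}
E = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"), ("n3", "n4"), ("n1", "n4")}

def has_edge(v1, v2):
    """An arrow from v1 to v2 exists iff the ordered pair is in E."""
    return (v1, v2) in E

print(has_edge("n1", "n2"))  # True
print(has_edge("n2", "n1"))  # True: the opposite direction is its own edge
print(has_edge("n4", "n1"))  # False
```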
\paragraph{Directed acyclic graphs}
Directed acyclic graphs (DAGs) are a special kind of directed graph. DAGs are
also defined as $G=(V,E)$, but with the restriction on $E$ that cycles are
not allowed. Figure~\ref{dagexample} shows two graphs; the bottom graph
contains a cycle and the top graph does not, so only the top graph is a valid
DAG. A cycle can be defined as follows:

Write $v_1\xrightarrow{+}v_n$ when there is a path $v_1\rightarrow v_2
\rightarrow \ldots \rightarrow v_{n-1} \rightarrow v_n$, that is, a
connection of length at least $1$ between $v_1$ and $v_n$. A graph is acyclic
when $\nexists v\in V: v\xrightarrow{+}v$; in words, no node can be reached
again by traveling the graph from that node. Adding acyclicity lowers the
computational complexity of checking whether a given path exists in the graph
to $\mathcal{O}(L)$, where $L$ is the length of the path.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dagexample.pdf}
\caption{Example DAG}
\label{dagexample}
\end{figure}

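Checking whether a given node path exists can be sketched as follows; the
adjacency list is a small invented DAG, not the one in the figure.

```python
# Successor sets of a small invented DAG.
succ = {"n1": {"n2", "n4"}, "n2": {"n3"}, "n3": {"n4"}, "n4": set()}

def path_exists(path):
    # One set lookup per step, so the check is O(L) in the path length L.
    return all(b in succ.get(a, set()) for a, b in zip(path, path[1:]))

print(path_exists(["n1", "n2", "n3", "n4"]))  # True
print(path_exists(["n2", "n4"]))              # False: no edge n2 -> n4
```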
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called a
Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
$G=(V,v_0,E,F)$. $V$ is, as in directed graphs, the set of nodes. $E$ is the
set of labeled edges, described by $E\subseteq V\times V\times A$ where $A$
is the alphabet used for the labels. In words this means that an edge is a
labeled arrow between two nodes; for example, the edge $(v_1, v_2, a)$ can be
visualized as $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$
contains all characters used in natural language, but in theory it can
contain anything. $F$ describes the set of final nodes: nodes that can be the
end of an accepted sequence even if there is an arrow leading out of them. In
the example graph in Figure~\ref{exampledawg} the final nodes are visualized
with a double circle as node shape. In this example the marking is purely
cosmetic for $n6$, because $n6$ has no outgoing arrows and is therefore final
anyway. This need not be the case: in $G=(\{n1, n2, n3\}, n1, \{(n1, n2, a),
(n2, n3, b)\}, \{n2, n3\})$, which accepts the words \textit{a} and
\textit{ab}, marking $n2$ as final is essential. Finally, $v_0$ describes the
initial node, visualized in figures as an incoming arrow without a source
node. Because the edges are labeled, data can be stored in a DAWG: traversing
the DAWG while collecting the edge labels constructs a word. Using graph
minimization, big sets of words can be stored in a small amount of storage
because edges can be reused across words. For example, the graph in
Figure~\ref{exampledawg} describes the language $L$ whose accepted words are
$w\in\{abd, bad, bae\}$. Testing whether a word is present in the DAWG uses
the same technique as testing whether a node path is present in a normal DAG
and therefore also has computational complexity $\mathcal{O}(L)$: it grows
linearly with the length of the word.

\begin{figure}[H]
\centering
\includegraphics[scale=0.7]{dawgexample.pdf}
\caption{Example DAWG}
\label{exampledawg}
\end{figure}