\section{Conclusion}
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and adding
crawlers by making a system that can create, add and maintain crawlers for
RSS feeds?}
\end{center}

The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with such a system. The
system we have built is tested and provides what is necessary for a user with
no particular programming skills to generate crawlers, and thus the number of
interventions where a programmer is needed is greatly reduced.
Although we have solved the problem we stated, the results are not purely
positive. For a problem to be solved, the problem must be present.

Although the research question is answered, the underlying goal of the project
is not achieved completely. The application is an intuitive system that allows
users to manage crawlers for one specific domain, RSS feeds, and by doing that
it does shorten the feedback loop, but only within that specific domain. In the
testing phase on real-world data we stumbled on a problem: the lack of RSS
feeds and the misuse of RSS feeds lead to a domain that is significantly
smaller than first theorized.

The lack of RSS feeds is a problem because a lot of entertainment venues have
no RSS feed available to the public. They either use different techniques or
do not offer a feed at all. This shrinks the domain considerably. Take pop
music venues as an example: in a certain province of the Netherlands we can
find about $25$ venues that have a website, and only $3$ of them have an RSS
feed. Extrapolating this information, combined with information from other
regions, we can speculate that less than $10\%$ of the venues use RSS feeds.

The second problem is the misuse of RSS feeds. RSS feeds are very structured
because of their limited set of possible fields. We found that a lot of venues
using an RSS feed are not content with these limitations and try to bypass
them by misusing the protocol. A common misuse is to use the publication date
as the date of the actual event. When loading such an RSS feed into a general
RSS feed reader the outcome is very strange, because a lot of events will have
a publication date in the future, which messes up the order of publication.
The misplacement of key information leads to a lack of key information in the
expected fields and thereby to lower overall extraction performance.

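As an illustration of this misuse, consider the following hypothetical feed
item. The venue, URL and dates are invented for this example and are not taken
from the feeds we studied. If the feed is fetched in January 2015, the
\texttt{pubDate} field lies two months in the future, because it holds the
date of the concert rather than the date on which the item was published.

\begin{verbatim}
<item>
  <title>Example Band live</title>
  <link>http://venue.example/agenda/example-band</link>
  <description>Doors open at 19:30.</description>
  <pubDate>Sat, 14 Mar 2015 20:30:00 +0100</pubDate>
</item>
\end{verbatim}
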
The second most common misuse is to use HTML-formatted text in the text
fields. The algorithm is designed to detect and extract information via
patterns in plain text, and its performance on HTML is very bad compared to
plain text. A text field with HTML is almost useless for gathering
information. Via a small study on available RSS feeds we found that about
$50\%$ of the RSS feeds misuse the protocol in such a way that extraction of
data is almost impossible. This reduces the domain of good RSS feeds to less
than $5\%$ of the venues.

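This estimate can be made explicit with a rough calculation. The symbols
$p_{\mathrm{rss}}$ (the fraction of venues that offer an RSS feed) and
$p_{\mathrm{misuse}}$ (the fraction of feeds that misuse the protocol) are
introduced here only for illustration, and the calculation assumes that both
fractions carry over from our small samples to venues in general:
\[
  p_{\mathrm{usable}}
    = p_{\mathrm{rss}} \, (1 - p_{\mathrm{misuse}})
    \approx p_{\mathrm{rss}} \times 0.50
    < 0.10 \times 0.50
    = 0.05 .
\]
For the example province alone the first fraction is about $3/25 = 0.12$; the
bound of $0.10$ also reflects the information from the other regions.
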
\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the isolated
chunks. The application is built only to extract, and not to process, the
labeled chunks of text. When such processing is added and the resulting
information is combined with the data, at least two things improve. First, a
higher level of performance can be reached because semantic knowledge acts as
an extra constraint while matching the data. Second, errors in the crawlers
can be detected more quickly: a match that is correct at the higher level can
still contain wrong information at the lower chunk level, and applying
matching techniques to the chunks afterwards can generate feedback that is
also useful for the top level of data extraction.

% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted to an RSS feed and
still be processed by the application. In this way, with an extra intermediate
step, the extraction techniques can still be used. Such HTML sources most
likely have to be generated, because there has to be a very consistent
structure in the data. Websites with such a consistent structure are usually
generated by a CMS. This will enlarge the domain for the application
significantly, since almost all websites use a CMS to publish their data. When
conversion from HTML to RSS feeds is not possible, but one has a technique to
extract patterns in a way similar to this application, it is also possible to
embed that technique in the current application. Due to the modularity of the
application, extending it is very easy.