\section{Conclusion}
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with our system. The
system we have built is tested and provides the necessary tools for a user
with no particular programming skills to generate crawlers, and thus the number
of interventions where a programmer is needed is greatly reduced. Although we
have solved the problem we stated, the results are not strictly positive: for a
problem to be solved, the problem must be present.

Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
allows users to manage crawlers for one specific domain, RSS feeds. By
doing that it does shorten the feedback loop, but only for RSS feeds. In the
testing phase on real-world data we stumbled on a problem: lack of RSS
feeds and misuse of RSS feeds lead to a domain that is significantly smaller
than first theorized, and therefore the application solves only a very small
portion of the problem.

Lack of RSS feeds is a problem because a lot of entertainment venues have no
RSS feed available to the public. Venues either use different techniques to
publish their data or do not publish their data at all via a structured source
besides their website. This shrinks the domain considerably. Take pop music
venues as an example: in a certain province of the Netherlands we can find
about $25$ venues that have a website, and only $3$ of them have an RSS feed.
Extrapolating this information, combined with information from other regions,
we estimate that fewer than $10\%$ of the venues even have an RSS feed.

The second problem is misuse of RSS feeds. RSS feeds are very structured due to
their limitations on possible fields. We found that a lot of venues that
use an RSS feed seem not to be content with these limitations and try to bypass
them by misusing the protocol. A common misuse is to put the date of the actual
event in the publication date field. When such an RSS feed is loaded into a
general RSS feed reader the outcome is very strange, because a lot of events
will have a publishing date in the future, which messes up the order of the
items in the reader. The misplacement of key information leads to a lack of key
information in the expected fields and thereby to lower overall extraction
performance.

The second most common misuse is to use HTML-formatted text in the RSS
feed's text fields. Our algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is very bad compared to
plain text. A text field with HTML is almost useless to gather information
from. Via a small study on available RSS feeds we found that about $50\%$ of
the RSS feeds misuse the protocol in such a way that extraction of data is
almost impossible. This reduces the domain of good RSS feeds to less than $5\%$
of the venues.
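As a rough check of these figures, the estimate can be written out as follows.
The sketch treats the share of venues with a feed and the share of misused
feeds as independent, which is an assumption made here for illustration; the
$10\%$ and $50\%$ values are the estimates stated above.
\[
  \frac{3}{25} = 12\% \quad \mbox{(single-province sample)}, \qquad
  10\% \times (1 - 50\%) = 5\% \quad \mbox{(upper bound on usable feeds)}.
\]
The single-province sample on its own lies slightly above the $10\%$ bound;
that bound follows only after combining the sample with the information from
other regions.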

\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the isolated
chunks. It is built only to extract, not to process, the labeled
chunks of text. If we combined the information about the global
structure with information about the structure within a marked area, we would
increase performance in two ways. First, a higher level of performance is
reached due to the structural information of marked areas, which serves as
extra knowledge and thus as an extra constraint while matching the data in
marked areas. Second, error detection happens more quickly. Faster error
detection is possible because a match that is correct at the global level can
still contain wrong information at the lower, marked-field level. Applying
matching techniques on the marked fields afterwards can
generate feedback that could also be useful for the global level of data
extraction.

% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted into a tidy RSS feed
and still be processed by this application. In this way, with an extra
intermediate step, the extraction techniques can still be used. Such HTML
sources most likely have to be generated from a source by the venue, because
there has to be a very consistent structure in the data. Websites with such a
consistent structure are usually generated from a CMS. This would enlarge the
domain for the application significantly, since almost all websites use a CMS
to publish their data.

% Re-use user interface
The interface of the program could also be re-used. When conversion between
HTML and RSS feeds is not possible, but one has a technique to extract patterns
in a similar way to this application, that technique can also be embedded in
the current application. Due to the modularity of the application, extending it
with other matching techniques is very easy.