\section{Conclusion}
\begin{center}
\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with our system. The
system we have built is tested and provides the necessary tools for a user
with no particular programming skills to generate crawlers, and thus the number
of interventions where a programmer is needed is greatly reduced. Although we
have solved the problem we stated, the results are not strictly positive. When
the problem space is small, the interest in solving the problem is also small;
in practice this means that there is not much data to apply the solution to.

Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
allows users to manage crawlers for one specific domain, RSS feeds. By
doing that it does shorten the feedback loop, but only for RSS feeds. In the
testing phase on real-world data we stumbled upon a problem: lack of RSS
feeds and misuse of RSS feeds lead to a domain that is significantly smaller
than first theorized, and therefore the application solves only a very small
portion of the problem.

Lack of RSS feeds is a problem because a lot of entertainment venues have no
RSS feeds available for the public. Venues either use different techniques to
publish their data or do not publish their data at all via a structured source
besides their website. This shrinks the domain quite a lot. Take pop music
venues as an example. In a certain province of the Netherlands we can find
about $25$ venues that have a website and only $3$ have an RSS feed.
Extrapolating this information, combined with information from other regions,
we estimate that less than $10\%$ of the venues even have an RSS feed.

The second problem is misuse of RSS feeds. RSS feeds are very structured due to
their limitations on possible fields. We found that a lot of venues that
use an RSS feed seem not to be content with the limitations and try to bypass
them by misusing the protocol. A common misuse is to put the date of the actual
event in the publication date field. When loading such an RSS feed into a
general RSS feed reader the outcome is very strange, because a lot of events
will have a publishing date in the future, which messes up the order in the
reader. The misplacement of key information leads to a lack of key information
in the expected fields and thereby to lower overall extraction performance.

The second most common misuse is to use HTML-formatted text in the text fields
of the RSS feed. The algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is very bad compared to
plain text. A text field with HTML is almost useless for gathering information
because it usually includes all kinds of information in modalities other
than text. Via a small study on a selection of RSS feeds ($N=10$) we found that
about $50\%$ of the RSS feeds misuse the protocol in such a way that extraction
of data is almost impossible. This reduces the domain of good RSS feeds to less
than $5\%$ of the venues.
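As a rough check of this figure, the two shares can be multiplied. This is
only a back-of-the-envelope sketch: it assumes that the small sample of feeds
is representative and that misuse does not depend on the region, and the
symbols $p_{\mathrm{feed}}$ (share of venues with an RSS feed),
$p_{\mathrm{usable}}$ (share of feeds that do not misuse the protocol) and
$p_{\mathrm{good}}$ are introduced here for illustration only. With
$p_{\mathrm{feed}} < 0.10$ and $p_{\mathrm{usable}}$ taken as $0.50$,
\[
  p_{\mathrm{good}} = p_{\mathrm{feed}} \cdot p_{\mathrm{usable}}
  < 0.10 \cdot 0.50 = 0.05 .
\]
Note that the single province mentioned above gives $3/25 = 12\%$ on its own,
so the $10\%$ bound relies on the additional information from other regions.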

\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the extracted
data fields. The application is built only to extract the labeled data fields,
not to process the text they contain. When we combine the information about the
global structure with information about the structure within a marked area, we
increase performance in two ways. First, higher levels of performance are
reached due to the structural information of marked areas: this extra knowledge
acts as an extra constraint while matching the data in marked areas. Second,
error detection happens more quickly. Faster error detection is possible
because a match that is correct at the global level can still contain wrong
information at the lower, marked-field level. Applying matching techniques to
the marked fields afterwards can generate feedback that could also be useful
for the global level of data extraction.

% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted into a tidy RSS feed
and still be processed by this application. In this way, with an extra
intermediate step, the extraction techniques can still be used. Such HTML
sources most likely have to be generated from a source by the venue, because
there has to be a very consistent structure in the data. Websites with such a
consistent structure are usually generated by a CMS. This will enlarge the
domain for the application significantly, since almost all websites use a CMS
to publish their data.

% Re-use user interface
The interface of the program could also be re-used. When conversion between
HTML and RSS feeds is not possible but one has a technique to extract patterns
in a similar way to this application, it is also possible to embed that
technique in the current application. Due to the modularity of the application,
extending it with other matching techniques is very easy.