\section{Conclusion}
\begin{center}
    \textit{Is it possible to shorten the feedback loop for repairing and adding
        crawlers by making a system that can create, add and maintain crawlers for
    RSS feeds?}
\end{center}

The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with such a system. The
system we have built is tested and provides what a user with no particular
programming skills needs to generate crawlers, and thus the number of
interventions where a programmer is needed is greatly reduced.
Although we have solved the problem we stated, the results are not purely
positive. For a problem to be solved, the problem must be present.

Although the research question is answered, the underlying goal of the project
is not achieved completely. The application is an intuitive system that allows
users to manage crawlers for one specific domain, RSS feeds, and by doing that
it does shorten the feedback loop, but only within that specific domain. In the
testing phase on real-world data we ran into a problem: lack of RSS feeds and
misuse of RSS feeds lead to a domain that is significantly smaller than first
theorized.

Lack of RSS feeds is a problem because a lot of entertainment venues have no
RSS feed available to the public. They either use different techniques or do
not offer a feed at all. This shrinks the domain quite a lot. Take pop music
venues as an example: in a certain province of the Netherlands we can find
about $25$ venues that have a website, and only $3$ of them have an RSS feed.
Extrapolating this information, combined with information from other regions,
we can speculate that less than $10\%$ of the venues use RSS feeds.

The second problem is misuse of RSS feeds. Because of the limited set of
possible fields, RSS feeds are very structured. We found that a lot of venues
using an RSS feed are not content with these limitations and try to bypass them
by misusing the protocol. A common misuse is to use the publication date as the
date of the actual event. When loading such an RSS feed into a general RSS feed
reader the outcome is very strange, because a lot of events will have a
publishing date in the future, which messes up the order of publication. The
misplacement of key information leads to a lack of key information in the
expected fields and thereby to lower overall extraction performance.

The second most common misuse is to use HTML-formatted text in the text fields.
The algorithm is designed to detect and extract information via patterns in
plain text, and its performance on HTML is very bad compared to plain text. A
text field with HTML is almost useless to gather information from. Via a small
study on available RSS feeds we found that about $50\%$ of the RSS feeds misuse
the protocol in such a way that extraction of data is almost impossible. This
reduces the domain of good RSS feeds to less than $5\%$ of the venues.

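As a rough sketch of how these two estimates combine, assume that offering an
RSS feed and misusing it are independent, take the speculated $10\%$ as an
upper bound on the fraction of venues with a feed, $p_{\mathrm{rss}}$, and take
the observed $50\%$ as the misuse rate, $p_{\mathrm{misuse}}$. The symbols
$p_{\mathrm{rss}}$, $p_{\mathrm{misuse}}$ and $p_{\mathrm{usable}}$ are
introduced here only for illustration.
\[
    p_{\mathrm{usable}} = p_{\mathrm{rss}} \, (1 - p_{\mathrm{misuse}})
    < 0.10 \times (1 - 0.50) = 0.05
\]
Using the single province sample alone ($3/25 = 0.12$) the same calculation
gives $0.12 \times 0.50 = 0.06$, so the estimate is sensitive to the assumed
share of venues that offer a feed.
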
\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the isolated
chunks. The application is built only to extract, and not to process, the
labeled chunks of text. When such processing is added and the information it
yields is combined with the data, at least two things improve. First, a higher
level of performance can be reached, because semantic knowledge acts as an
extra constraint while matching the data. Second, errors in the crawlers can be
detected more quickly: a match that is correct at the higher level can still
contain wrong information at the lower chunk level, and applying matching
techniques to the chunks afterwards can generate feedback that is also useful
for the top level of data extraction.

% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted to an RSS feed and
still get processed by the application. In this way, with an extra intermediate
step, the extraction techniques can still be used. Such HTML sources most
likely have to be generated, because there has to be a very consistent
structure in the data. Websites with such a consistent structure are usually
generated by a CMS. This would enlarge the domain for the application
significantly, since almost all websites use a CMS to publish their data. When
conversion between HTML and RSS feeds is not possible, but one has a technique
to extract patterns in a way similar to this application, it is also possible
to embed that technique in the current application. Due to the modularity of
the application, extending it is very easy.