thesis2/4.discussion.tex

\section{Conclusion}
\begin{center}
    \textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

The short answer to the problem statement posed in the introduction is yes: we
can shorten the loop for repairing and adding crawlers with our system. The
system we have built has been tested and provides the tools that a user with no
particular programming skills needs to generate crawlers, which greatly reduces
the number of interventions that require a programmer. Although we have solved
the problem as stated, the results are not strictly positive. For a problem to
be solved, the problem must be present in the first place.

Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
allows users to manage crawlers for one specific domain, RSS feeds. It
therefore shortens the feedback loop, but only for RSS feeds. While testing on
real-world data we ran into a problem: a lack of RSS feeds and the misuse of
RSS feeds make the domain significantly smaller than first theorized, so the
application solves only a very small portion of the problem.

The lack of RSS feeds is a problem because many entertainment venues offer no
RSS feed to the public. Venues either use different techniques to publish their
data or do not publish their data via any structured source besides their
website. This shrinks the domain considerably. Take pop music venues as an
example: in a certain province of the Netherlands we found about $25$ venues
that have a website, of which only $3$ have an RSS feed. Extrapolating from
this and combining it with information from other regions, we can safely say
that less than $10\%$ of the venues have an RSS feed.

The second problem is the misuse of RSS feeds. RSS feeds are very structured
because the set of available fields is limited. We found that many venues that
use an RSS feed seem not to be content with these limitations and try to bypass
them by misusing the protocol. A common misuse is to put the date of the actual
event in the publication date field. When such a feed is loaded into a general
RSS feed reader the outcome is very strange: many events have a publication
date in the future, which messes up the ordering in the reader. The
misplacement of key information means that it is missing from the expected
fields, which lowers the overall extraction performance.

The second most common misuse is to use HTML-formatted text in the text fields
of the RSS feed. Our algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is very poor compared
to plain text. A text field containing HTML is almost useless for gathering
information. In a small study of available RSS feeds we found that about
$50\%$ of the feeds misuse the protocol in such a way that extracting data is
almost impossible. This reduces the domain of good RSS feeds to less than
$5\%$ of the venues.
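As a rough consistency check on this figure, the two estimates above can be
combined. The sketch assumes, which the small study does not establish, that
misuse occurs independently of which venues offer a feed; the symbols
$p_{\mathrm{feed}}$, $p_{\mathrm{misuse}}$ and $p_{\mathrm{usable}}$ are
introduced here only for illustration. Taking $p_{\mathrm{feed}} < 0.10$ and
$p_{\mathrm{misuse}} = 0.50$:
\[
    p_{\mathrm{usable}} = p_{\mathrm{feed}} \, (1 - p_{\mathrm{misuse}})
        < 0.10 \cdot (1 - 0.50) = 0.05 .
\]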
\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the isolated
chunks: it is built only to extract the labeled chunks of text, not to process
them. Combining information about the global structure with information about
the structure within a marked area would increase performance in two ways.
First, the structural information of the marked areas provides extra knowledge
that can serve as an additional constraint while matching the data in those
areas, which leads to higher performance. Second, errors would be detected
more quickly, because a match that is correct at the global level can still
contain wrong information at the level of the marked fields. Applying matching
techniques to the marked fields afterwards can generate feedback that could
also be useful for the global level of data extraction.

% combine RSS HTML
Another possible improvement is to combine HTML and RSS. Some specifically
structured HTML sources could be converted into a tidy RSS feed and then be
processed by this application. In this way, with an extra intermediate step,
the extraction techniques can still be used. Such HTML sources most likely
have to be generated by the venue from an underlying data source, because the
data must have a very consistent structure. Websites with such a consistent
structure are usually generated by a CMS. This would enlarge the domain of the
application significantly, since almost all websites use a CMS to publish
their data.

% Re-use user interface
The interface of the program could also be reused. When conversion from HTML
to RSS feeds is not possible, but a technique is available that extracts
patterns in a way similar to this application, that technique can be embedded
in the current application. Due to the modularity of the application,
extending it with other matching techniques is very easy.