\section{Conclusion}
\begin{center}
        \textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}

The short answer to the problem statement posed in the introduction is yes: we
can shorten the feedback loop for repairing and adding crawlers with our
system. The system we have built has been tested and provides the tools a user
without particular programming skills needs to generate crawlers, which greatly
reduces the number of interventions that require a programmer. Although we have
solved the stated problem, the results are not strictly positive. If the
problem space is small, the interest in solving the problem is small as well;
in practice this means that there is not much data to apply the solution to.

Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
allows users to manage crawlers for one specific domain, RSS feeds. It
therefore shortens the feedback loop, but only for RSS feeds. While testing on
real-world data we ran into a problem: the lack of RSS feeds and the misuse of
RSS feeds lead to a domain that is significantly smaller than first theorized,
and the application therefore solves only a very small portion of the problem.

The lack of RSS feeds is a problem because many entertainment venues have no
RSS feed available to the public. Venues either use different techniques to
publish their data or do not publish their data via any structured source
besides their website. This shrinks the domain considerably. Take pop music
venues as an example: in a certain province of the Netherlands we found about
$25$ venues that have a website, of which only $3$ have an RSS feed.
Extrapolating from this, combined with information from other regions, we
estimate that less than $10\%$ of the venues have an RSS feed at all.

The second problem is the misuse of RSS feeds. RSS feeds are very structured
because of the limited set of available fields. We found that many venues that
use an RSS feed are apparently not content with these limitations and try to
bypass them by misusing the protocol. A common misuse is to put the date of the
actual event in the publication date field. When such a feed is loaded into a
general RSS reader the outcome is very strange: many events have a publication
date in the future, which messes up the ordering in the reader. Misplaced key
information also means that this information is missing from the expected
fields, which lowers overall extraction performance.

The second most common misuse is to put HTML-formatted text in the text fields
of the RSS feed. The algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is very poor in
comparison. A text field containing HTML is almost useless for gathering
information because it usually includes all kinds of information in modalities
other than text. In a small study on a selection of RSS feeds ($N=10$) we found
that about $50\%$ of the feeds misuse the protocol in such a way that
extraction of data is almost impossible. This reduces the domain of usable RSS
feeds to less than $5\%$ of the venues.
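
As a rough check, the two estimates above can be combined into a single bound.
This is only a sketch: it assumes that the sample of ten feeds is
representative of all venue feeds, and the symbols $r$ and $u$ are introduced
here purely for illustration. Let $r$ be the fraction of venues that have an
RSS feed and $u$ the fraction of those feeds that are usable for extraction.
With $r < 0.10$ and $u \approx 0.5$,
\[
    r \cdot u < 0.10 \cdot 0.5 = 0.05 ,
\]
which corresponds to the figure of less than $5\%$ of the venues. Because $u$
is estimated from only ten feeds, this bound should be read as an indication
rather than a measurement.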

\section{Discussion \& Future Research}
\label{sec:discuss}
% low level stuff
The application we created does not apply any techniques to the extracted
data fields. It is built only to extract the labeled data fields, not to
process the text they contain. If we combined the information about the global
structure with information about the structure within a marked area,
performance would increase in two ways. First, higher levels of performance are
reached because the structural information of marked areas acts as extra
knowledge, and thus as an extra constraint, while matching the data in those
areas. Second, error detection happens more quickly. Faster error detection is
possible because a match that is correct at the global level can still contain
wrong information at the lower level of a marked field. Applying matching
techniques to the marked fields afterwards can generate feedback that could
also be useful for the global level of data extraction.

% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted into a tidy RSS feed
and then still be processed by this application. In this way, with an extra
intermediate step, the extraction techniques can still be used. Such HTML
sources most likely have to be generated from an underlying data source by the
venue, because the data must have a very consistent structure. Websites with
such a consistent structure are usually generated by a CMS. This would enlarge
the domain of the application significantly, since almost all websites use a
CMS to publish their data.

% Re-use user interface
The interface of the program could also be re-used. When conversion between
HTML and RSS feeds is not possible but one has a technique that extracts
patterns in a way similar to this application, that technique can be embedded
in the current application. Because the application is modular, extending it
with other matching techniques is very easy.