character and we mark the last node as final. From then on all words are added
using a four-step approach described below. Pseudocode for this algorithm can
be found in Listing~\ref{pseudodawg} as the function
\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be found
in Listing~\ref{dawg.py}.
\begin{enumerate}
\item
	Say we add the word $w$ to the graph. Step one is finding the
	common prefix of $w$ that is already present in the graph.
\end{enumerate}

Figure~\ref{dawg1} illustrates the algorithm by incrementally adding the words
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}.
\begin{itemize}
	\item No words added yet\\
		Initially we begin with the null graph. This graph is shown in
		the figure as SG0. This DAWG does not yet accept any words.
	\item Adding \texttt{abcd}\\
		Adding the first entry \texttt{abcd} is trivial: we can just
		create a single path for it. The common prefix we find in
		Step 1 is empty, and merging a suffix
		back into the graph is also not possible since there are no
		nodes except for the first node. The result of adding the
		first word is visible in subgraph SG1.
	\item Adding \texttt{aecd}\\
		For the second entry we have to do some extra work. The common
		prefix found in Step 1 is \texttt{a}, which is already present
		in the graph. This leaves us in Step 2 with the remaining part
		\texttt{ecd}, which shares a common suffix \texttt{cd} with
		the word already in the graph, and we can merge these nodes.
		In this way we can reuse the transition from $q_3$ to $q_4$.
		This leaves us with subgraph SG2.
	\item Adding \texttt{aecf}\\
		We now add the last entry, the word \texttt{aecf}. When we do
		this without checking for confluence nodes we introduce an
		unwanted extra word. In Step 1 we find the common prefix
		\texttt{aec}, but the node it ends in is a confluence node: it
		is also reachable via \texttt{abc}. Appending the new suffix
		\texttt{f} directly to this node would therefore also add the
		unwanted word \texttt{abcf}, so the confluence node has to be
		cloned before the suffix is added.
\end{itemize}
\begin{figure}[H]
	\centering
	\includegraphics[width=0.8\linewidth]{inccons.eps}
	\strut\\\strut\\
	\caption{Incrementally constructing a DAWG}
	\label{dawg1}
\end{figure}
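To make the construction steps more concrete, the following sketch (our own
illustration, not the implementation from Listing~\ref{dawg.py}) walks the
existing graph for the longest prefix of a new word and appends the remaining
characters as a fresh path. It only covers the prefix search and the plain
insertion of the suffix; confluence node checking and the merging of common
suffixes are deliberately left out.

\begin{verbatim}
class Node:
    """A node in the word graph; transitions maps a character to a node."""
    def __init__(self):
        self.transitions = {}
        self.final = False

def common_prefix(root, word):
    """Step 1: follow existing transitions as far as possible and return
    the last node reached together with the remaining part of the word."""
    node = root
    for i, char in enumerate(word):
        if char not in node.transitions:
            return node, word[i:]
        node = node.transitions[char]
    return node, ''

def add_suffix(node, suffix):
    """Append the remaining characters as a fresh path and mark the last
    node as final. No confluence checking or suffix merging is done."""
    for char in suffix:
        node.transitions[char] = Node()
        node = node.transitions[char]
    node.final = True

def accepts(root, word):
    """Check whether the graph accepts the given word."""
    node = root
    for char in word:
        if char not in node.transitions:
            return False
        node = node.transitions[char]
    return node.final

root = Node()
for w in ['abcd', 'aecd', 'aecf']:
    last, rest = common_prefix(root, w)
    add_suffix(last, rest)

print(accepts(root, 'aecd'))  # True
print(accepts(root, 'abcf'))  # False: no suffixes are shared in this sketch
\end{verbatim}

Because this sketch never merges suffixes, the resulting graph is a trie rather
than a minimal DAWG; the suffix merging and the confluence node check discussed
above are exactly what turn such a trie into the DAWG of Figure~\ref{dawg1}.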
\subsection{Application to pattern extraction}
The text data in combination with the user markings cannot be converted
automatically to a DAWG using the algorithm we described. This is because the
user markings are not necessarily a single character or word; currently a user
marking is an arbitrary sequence of characters. When we add a user marking, we
insert a kind of subgraph in the place of the node carrying the marking. By
doing this we can introduce non-determinism into the graph. Non-determinism
means that a single node has multiple outgoing edges with the same label; in
practice this means that a word can be present in the graph via multiple paths.
An example of non-determinism in one of our DAWGs is shown in
Figure~\ref{nddawg}. This figure represents a generated DAWG with the following
entries: \texttt{ab<1>c} and \texttt{a<1>bbc}.

In this graph the word \texttt{abdc} will be accepted and the user pattern
\texttt{<1>} will be filled with the subword \texttt{d}. However, if we try the
word \texttt{abdddbc} both paths can be chosen. In the first case the user
pattern \texttt{<1>} will be filled with \texttt{dddb} and in the second case
with \texttt{bddd}. In such a case we need to choose one of the paths,
hopefully the smartest one.
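To illustrate this ambiguity, the sketch below (our own illustration, not the
matching code of the application) writes the two entries as regular expressions
in which the marking \texttt{<1>} becomes a capture group, and lists every way
a word can fill the marking. The extra test word \texttt{abdbbc} is only chosen
to show both paths matching at once.

\begin{verbatim}
import re

# The two DAWG entries written as regular expressions;
# the user marking <1> becomes a capture group.
patterns = [
    re.compile(r'^ab(.+)c$'),   # entry ab<1>c
    re.compile(r'^a(.+)bbc$'),  # entry a<1>bbc
]

def fillings(word):
    """Return every possible filling of <1> for the given word."""
    return [m.group(1) for p in patterns for m in [p.match(word)] if m]

print(fillings('abdc'))    # ['d']         -> only one path accepts the word
print(fillings('abdbbc'))  # ['dbb', 'bd'] -> two paths, two different fillings
\end{verbatim}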
\subsection{Minimality and non-determinism}
The Myhill-Nerode theorem~\cite{Hopcroft1979} states that among all graphs
accepting the same language there is a unique graph with the least number of
states. Mihov~\cite{Mihov1998} has proven that the algorithm for generating
DAWGs produces such a minimal graph in its original form. Our program converts
the node-lists to DAWGs that can contain non-deterministic transitions, so one
could argue that the Myhill-Nerode theorem and Mihov's proof no longer apply.
However, both still hold: the graph itself is only non-deterministic when the
categories are expanded, and thus only during matching.
When choosing the smartest path during matching, the program has to choose
deterministically between possibly multiple paths, each of which may fill the
user patterns differently.
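One possible deterministic tie-breaking rule, given here purely as an
illustration (the application may use a different heuristic), is to prefer the
path whose markings capture the least text, so that as many literal characters
as possible are matched:

\begin{verbatim}
def choose(candidates):
    """Pick one filling deterministically: the one capturing the least text,
    e.g. from the list returned by fillings() in the previous sketch."""
    return min(candidates, key=len) if candidates else None

print(choose(['dbb', 'bd']))  # 'bd': the a<1>bbc path matches more literal text
\end{verbatim}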
\section{Conclusion}
\begin{center}
	\textit{Is it possible to shorten the feedback loop for repairing and %
adding crawlers by making a system that can create, add and maintain crawlers %
for RSS feeds?}
\end{center}
The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with our system. The
system we have built is tested and provides the necessary tools for a user with
no particular programming skills to generate crawlers, and thus the number of
interventions where a programmer is needed is greatly reduced. Although we have
solved the problem we stated, the results are not strictly positive. For a
problem to be solved, the problem must be present.
Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
allows users to manage crawlers for the specific domain of RSS feeds. By doing
that it does shorten the feedback loop, but only for RSS feeds. In the testing
phase on real world data we stumbled on a problem: lack of RSS feeds and misuse
of RSS feeds lead to a domain that is significantly smaller than first
theorized, and therefore the application solves only a small portion of the
original problem.
Lack of RSS feeds is a problem because a lot of entertainment venues have no
RSS feed available to the public. Venues either use different techniques to
publish their data or do not publish their data at all through a structured
source besides their website. This shrinks the domain quite a lot. Taking pop
music venues as an example: in a certain province of the Netherlands we can
find about $25$ venues that have a website, and only $3$ of them have an RSS
feed.
Extrapolating this information, combined with information from other regions,
we can safely say that less than $10\%$ of the venues even have an RSS feed.
The second problem is misuse of RSS feeds. RSS feeds are very structured due to
their limited set of possible fields. We found that a lot of venues that are
using an RSS feed are not content with these limitations and try to bypass them
by misusing the protocol. A common misuse is to use the publication date field
to hold the date of the actual event. When loading such an RSS feed into a
general RSS feed reader the outcome is very strange, because a lot of events
will have a publishing date in the future, which messes up the ordering of the
items. This misplacement of key information leads to a lack of key information
in the expected fields and thereby lowers the overall extraction performance.
The second most common misuse is to put HTML formatted text in the text fields
of the RSS feed. Our algorithm is designed to detect and extract information
via patterns in plain text, and its performance on HTML is very bad compared to
plain text. A text field with HTML is almost useless to gather information
from. Via a small study on available RSS feeds we found that about $50\%$ of
the RSS feeds misuse the protocol in such a way that extraction of data is
severely hampered.
% low level stuff
The application we created does not apply any techniques to the isolated
chunks. The application is built only to extract, not to process, the labeled
chunks of text. If we would combine the information about the global structure
with information about the structure inside a marked area, we could increase
performance in two ways. First, a higher level of extraction performance can be
reached because the structural information of the marked areas acts as an extra
constraint while matching the data. Second, errors can be detected more
quickly: a match that is correct at the global level can still contain wrong
information at the level of the marked fields, and applying matching techniques
to the marked fields afterwards can generate feedback that is also useful for
the global level of data extraction.
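As a purely illustrative example of such a post-processing step (the field name
and the date format below are assumptions, not part of the application), a
chunk that was marked as a date could be validated after the global match; a
failing validation is feedback that the crawler may need repair even though the
global pattern still matches:

\begin{verbatim}
from datetime import datetime

def validate_date_chunk(chunk, fmt='%d-%m-%Y'):
    """Return True when the extracted chunk really parses as a date."""
    try:
        datetime.strptime(chunk, fmt)
        return True
    except ValueError:
        return False

print(validate_date_chunk('01-06-2015'))  # True
print(validate_date_chunk('June 1st'))    # False: flag this crawler for repair
\end{verbatim}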
% combine RSS HTML
Another use or improvement could be combining the forces of HTML and RSS. Some
specifically structured HTML sources could be converted into a tidy RSS feed
and still be processed by this application. In this way, with an extra
intermediate step, the extraction techniques can still be used. The HTML pages
most likely have to be machine generated, because a very consistent structure
in the data is required. Websites with such a consistent structure are usually
generated from a CMS. This would enlarge the domain for the application
significantly, since almost all websites use a CMS to publish their data.
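A minimal sketch of such an intermediate step is given below; the HTML
structure, class name and field mapping are assumptions for the sake of the
example, not an interface of the application. It collects every
\texttt{<li class="event">} element from a consistently structured page and
emits it as an RSS item that the existing extraction pipeline could process.

\begin{verbatim}
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

class EventParser(HTMLParser):
    """Collects the text of every <li class="event"> element."""
    def __init__(self):
        super().__init__()
        self.in_event = False
        self.events = []

    def handle_starttag(self, tag, attrs):
        if tag == 'li' and ('class', 'event') in attrs:
            self.in_event = True
            self.events.append('')

    def handle_endtag(self, tag):
        if tag == 'li':
            self.in_event = False

    def handle_data(self, data):
        if self.in_event:
            self.events[-1] += data

def html_to_rss(html):
    """Convert a consistently structured HTML page into a tidy RSS feed."""
    parser = EventParser()
    parser.feed(html)
    rss = ET.Element('rss', version='2.0')
    channel = ET.SubElement(rss, 'channel')
    for text in parser.events:
        item = ET.SubElement(channel, 'item')
        ET.SubElement(item, 'title').text = text.strip()
    return ET.tostring(rss, encoding='unicode')

print(html_to_rss('<ul><li class="event">Band X - 01-06-2015</li></ul>'))
\end{verbatim}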

% Re-use user interface
The interface of the program could also be re-used. When conversion between
HTML and RSS feeds is not possible, but one has a technique to extract patterns
in a way similar to this application, it is also possible to embed that
technique in the current application. Due to the modularity of the application,
extending it with other matching techniques is very easy.