From: Mart Lubbers
Date: Wed, 11 Mar 2015 14:55:47 +0000 (+0100)
Subject: rcrc1
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=b92fcf769dd57ead3e683b9aff53a3240aaf3219;p=bsc-thesis1415.git

rcrc1
---

diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index 250ef7e..5168de1 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -106,7 +106,8 @@ which is a path graph. We just create a new node for every transition of
 character and we mark the last node as final. From then on all words are added
 using a four-step approach described below. Pseudocode for this algorithm can
 be found in Listing~\ref{pseudodawg} named as the function
-\texttt{generate\_dawg(words)}
+\texttt{generate\_dawg(words)}. A \textit{Python} implementation can be found
+in Listing~\ref{dawg.py}.
 
 \begin{enumerate}
 	\item Say we add word $w$ to the graph. Step one is finding the
@@ -168,10 +169,10 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 \texttt{abcd}, \texttt{aecd} and \texttt{aecf}.
 
 \begin{itemize}
-	\item Initial\\
+	\item No words added yet\\
 		Initially we begin with the null graph. This graph is shown in
 		the figure as SG0. This DAWG does not yet accept any words.
-	\item \texttt{abcd}\\
+	\item Adding \texttt{abcd}\\
 		Adding the first entry \texttt{abcd} is trivial because we can
 		just create a single path, which does not require any hard
 		work. This is because the common prefix we find in Step 1 is empty
@@ -179,7 +180,7 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 		back into the graph is also not possible since there are no
 		nodes except for the first node. The result of adding the
 		first word is visible in subgraph SG1.
-	\item \texttt{aecd}\\
+	\item Adding \texttt{aecd}\\
 		For the second entry we will have to do some extra
 		work. The common prefix found in Step 1 is \texttt{a} which we
 		find in the graph. This leaves us in Step 2 with the
@@ -190,7 +191,7 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 		a common suffix \texttt{cd} and we can merge these nodes. In
 		this way we can reuse the transition from $q_3$ to $q_4$. This
 		leaves us with subgraph SG2.
-	\item \texttt{aecf}\\
+	\item Adding \texttt{aecf}\\
 		We now add the last entry, which is the word \texttt{aecf}.
 		When we do this without the confluence node checking we
 		encounter an unwanted extra word. In Step 1 we find the common prefix
@@ -213,7 +214,7 @@
 \begin{figure}[H]
 	\label{dawg1}
 	\centering
-	\includegraphics[width=0.9\linewidth]{inccons.eps}
+	\includegraphics[width=0.8\linewidth]{inccons.eps}
 	\strut\\\strut\\
 	\caption{Incrementally constructing a DAWG}
 \end{figure}
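+
+To make the four steps concrete, the following strongly simplified
+\textit{Python} sketch mirrors the procedure for the special case of a
+sorted, duplicate-free list of plain words (in the style of
+Mihov~\cite{Mihov1998}). It is an illustration only and ignores the user
+markings discussed below; the implementation actually used by the program
+is the one in Listing~\ref{dawg.py}.
+\begin{lstlisting}[language=python]
+class Node:
+    def __init__(self):
+        self.edges = {}   # transition character -> Node
+        self.final = False
+
+    def signature(self):
+        # Nodes with equal signatures accept exactly the same
+        # suffixes and may therefore be merged (Step 4).
+        return (self.final,
+                tuple(sorted((c, id(n)) for c, n in self.edges.items())))
+
+def generate_dawg(words):
+    root, register, previous = Node(), {}, ''
+
+    def replace_or_register(node):
+        # Walk down the most recently added path and merge
+        # finished suffixes back into the graph.
+        char = max(node.edges)        # last added edge (input is sorted)
+        child = node.edges[char]
+        if child.edges:
+            replace_or_register(child)
+        sig = child.signature()
+        if sig in register:           # equal suffix already present:
+            node.edges[char] = register[sig]   # reuse it
+        else:
+            register[sig] = child
+
+    for word in words:
+        # Step 1: follow the common prefix of the new word.
+        prefix, node = 0, root
+        while (prefix < min(len(word), len(previous))
+                and word[prefix] == previous[prefix]):
+            node = node.edges[word[prefix]]
+            prefix += 1
+        # The suffix of the previous word is now finished: merge it.
+        if node.edges:
+            replace_or_register(node)
+        # Steps 2 and 3: append the remaining characters as a new path.
+        for char in word[prefix:]:
+            node.edges[char] = Node()
+            node = node.edges[char]
+        node.final = True
+        previous = word
+    if root.edges:
+        replace_or_register(root)
+    return root
+\end{lstlisting}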
@@ -221,16 +222,18 @@
 \subsection{Appliance on extraction of patterns}
 The text data in combination with the user markings can not be converted
 automatically to a DAWG using the algorithm we described. This is because the
-user markings are not necessarily a single character or word. User markings are
-basically one or more characters. When we add a user marking we insert a
-character that is not in the alphabet and later on we change the marking to a
-kind of subgraph. When this is applied it can be possible that non determinism
-is added to the graph. Non determinism is the fact that a single node has
-multiple edges with the same transition, in practise this means that a word can
-be present in the graph in multiple paths. This is shown in
-Figure~\ref{nddawg} with the following words: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+user markings are not necessarily a single character or word. Currently user
+markings are essentially arbitrary sequences of characters. When we add a user
+marking, we insert a kind of subgraph in the place of the node with the
+marking. By doing this we can introduce non-determinism into the graph.
+Non-determinism means that a single node has multiple edges with the same
+transition character; in practice this means that a word can be present in the
+graph via multiple paths. An example of non-determinism in one of our DAWGs is
+shown in Figure~\ref{nddawg}. This figure represents a generated DAWG with the
+following entries: \texttt{ab<1>c} and \texttt{a<1>bbc}.
+
 In this graph the word \texttt{abdc} will be accepted and the user pattern
-\texttt{<1>} will be filled with the word \texttt{d}. However if we try the
+\texttt{<1>} will be filled with the subword \texttt{d}. However, if we try the
 word \texttt{abdddbc} both paths can be chosen. In the first case the user
 pattern \texttt{<1>} will be filled with \texttt{dddb} and in the second case
 with \texttt{bddd}. In such a case we need to make the, hopefully, smartest
 choice.
@@ -247,13 +250,13 @@ choice.
 \subsection{Minimality and non-determinism}
 The Myhill-Nerode theorem~\cite{Hopcroft1979} states that of all graphs
 accepting the same language there is a single graph with the least
-amount of states. Mihov\cite{Mihov1998} has proven that the algorithm is
-minimal in its original form. Our program converts the node-lists to DAWGs that
-can possibly contain non deterministic transitions from nodes and therefore one
-can argue about Myhill-Nerodes theorem and Mihovs proof holding.. Due to the
-nature of the determinism this is not the case and both hold. In reality the
-graph itself is only non-deterministic when expanding the categories and thus
-only during matching.
+number of states. Mihov~\cite{Mihov1998} has proven that the algorithm for
+generating DAWGs is minimal in its original form. Our program converts the
+node-lists to DAWGs that can possibly contain non-deterministic transitions
+from nodes, and therefore one can question whether Myhill-Nerode's theorem and
+Mihov's proof still hold. They do: the graph itself only becomes
+non-deterministic when expanding the categories, and thus only during matching.
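+
+The following hypothetical \textit{Python} fragment illustrates the kind of
+choice the matcher faces; it is not part of the application. For brevity it
+uses the simplified entries \texttt{ab<1>c} and \texttt{a<1>bc} instead of
+the entries from Figure~\ref{nddawg}, and it expresses them as regular
+expressions rather than as a DAWG.
+\begin{lstlisting}[language=python]
+import re
+
+# The entries ab<1>c and a<1>bc, seen as patterns over a word.
+PATTERNS = [re.compile(r'^ab(?P<m>.+)c$'),   # ab<1>c
+            re.compile(r'^a(?P<m>.+)bc$')]   # a<1>bc
+
+def candidate_fills(word):
+    # Every pattern that accepts the word yields one possible
+    # fill for the user marking <1>.
+    fills = []
+    for pattern in PATTERNS:
+        match = pattern.match(word)
+        if match:
+            fills.append(match.group('m'))
+    return fills
+
+print(candidate_fills('abdc'))    # ['d']: only one path accepts
+print(candidate_fills('abdbc'))   # ['db', 'bd']: two paths, two fills
+\end{lstlisting}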
 
 Choosing the smartest path during matching, the program has to choose
 deterministically between possibly multiple paths with possibly multiple
diff --git a/thesis2/4.discussion.tex b/thesis2/4.discussion.tex
index 7ab6b21..86b616c 100644
--- a/thesis2/4.discussion.tex
+++ b/thesis2/4.discussion.tex
@@ -1,48 +1,50 @@
 \section{Conclusion}
 \begin{center}
-	\textit{Is it possible to shorten the feedback loop for repairing and adding
-	crawlers by making a system that can create, add and maintain crawlers for
-	RSS feeds}
+	\textit{Is it possible to shorten the feedback loop for repairing and %
+adding crawlers by making a system that can create, add and maintain crawlers %
+for RSS feeds}
 \end{center}
 
 The short answer to the problem statement made in the introduction is yes. We
-can shorten the loop for repairing and adding crawlers which such a system. The
-system we have built is tested and can provide the necessary for a user with no
-particular programming skills to generate crawlers and thus the number of
-interventions where a programmer is needed is greatly reduced.
-Although we have solved the problem we stated the results are not purely
-positive. For a problem to be solved the problem must be present.
+can shorten the loop for repairing and adding crawlers with our system. The
+system we have built is tested and can provide the necessary tools for a user
+with no particular programming skills to generate crawlers and thus the number
+of interventions where a programmer is needed is greatly reduced. Although we
+have solved the problem we stated, the results are not strictly positive: for
+a solution to be valuable its problem must actually be present, and the
+problem turned out to be smaller than expected.
 
 Although the research question is answered the underlying goal of the project
-is not achieved completely. The application is a intuitive system that allows
-users to manage RSS crawlers and for the specific domain, RSS feeds, any by
-doing that it does shorten the feedback loop but only within the specific
-domain. In the testing phase on real world data we stumbled on a small problem.
-Lack of RSS feeds and misuse of RSS feeds leads to a domain that is
-significantly smaller then first theorized.
+has not been completely achieved. The application is an intuitive system that
+allows users to manage crawlers for the specific domain of RSS feeds. By doing
+that it does shorten the feedback loop, but only for RSS feeds. In the testing
+phase on real world data we stumbled on a small problem. Lack of RSS feeds and
+misuse of RSS feeds lead to a domain that is significantly smaller than first
+theorized, and therefore the application solves only a very small portion of
+the original problem.
 
-Lack of RSS feeds is a problem because a lot of entertaintment venues have no
-RSS feeds available for the public. They are either using different techniques
-or they just do not use it at all. This shrinks the domain quite a lot. Taking
-pop music venues as an example. In a certain province of the Netherlands we can
-find about $25$ venues that have a website and only $3$ have a RSS feed.
+Lack of RSS feeds is a problem because a lot of entertainment venues have no
+RSS feeds available for the public. Venues either use different techniques to
+publish their data or do not publish their data at all via a structured source
+besides their website. This shrinks the domain quite a lot. Take pop music
+venues as an example: in a certain province of the Netherlands we can find
+about $25$ venues that have a website and only $3$ of them have an RSS feed.
 Extrapolating this information combined with information from other regions we
-can speculate that less then $10\%$ of the venues use RSS feeds.
+can estimate that less than $10\%$ of the venues even have an RSS feed.
 
-The second problem is misuse of RSS feeds. RSS feeds are due to their
-limitations possible fields very structured. We found that a lot of venues
-using a RSS feed are not content with the limitations and try to bypass such
-limitation by misusing the protocol. A common misuse is to use publication date
-as the date of the actual event. When loadig such a RSS feed into a general RSS
-feed reader the outcome is very strange because a lot of events will have a
-publishing date in the future and therefore messing up the order of
-publication. The misplacement of key information leads to lack of key
-information in the expected fields and by that lower overall extraction
-performance.
+The second problem is misuse of RSS feeds. RSS feeds are very structured due
+to their limited set of possible fields. We found that a lot of venues that
+use an RSS feed seem not to be content with the limitations and try to bypass
+such limitations by misusing the protocol. A common misuse is to put the date
+of the actual event in the publication date field. When loading such an RSS
+feed into a general RSS feed reader the outcome is very strange: a lot of
+events will have a publishing date in the future, which messes up the ordering
+of the entries. The misplacement of key information leads to a lack of key
+information in the expected fields and thereby lower overall extraction
+performance.
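+
+A hypothetical check such as the following \textit{Python} fragment (it is not
+part of our application and uses the third-party \texttt{feedparser} library)
+already reveals this misuse: a feed in which entries are ``published'' in the
+future almost certainly stores event dates in the publication date field.
+\begin{lstlisting}[language=python]
+import time
+import feedparser  # third party: pip install feedparser
+
+def future_dated_entries(feed_url):
+    # Return titles of entries whose publication date lies in the
+    # future, a strong hint that pubDate holds the event date.
+    feed = feedparser.parse(feed_url)
+    now = time.time()
+    suspicious = []
+    for entry in feed.entries:
+        published = getattr(entry, 'published_parsed', None)
+        if published is not None and time.mktime(published) > now:
+            suspicious.append(entry.get('title', '<untitled>'))
+    return suspicious
+\end{lstlisting}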
 
-The second most occuring common misuse is to use HTML formatted text in the
-text fields. The algorithm is designed to detect and extract information via
-patterns in plain text and the performance on HTML is very bad compared to
+The second most common misuse is the use of HTML-formatted text in the RSS
+feed's text fields. Our algorithm is designed to detect and extract information
+via patterns in plain text and the performance on HTML is very bad compared to
 plain text. A text field with HTML is almost useless to gather information
 from. Via a small study on available RSS feeds we found that about $50\%$ of
 the RSS feeds misuse the protocol in such a way that extraction of data is
@@ -54,25 +56,32 @@ of the venues.
 
 % low level stuff
 The application we created does not apply any techniques to the isolated
 chunks. The application is built only to extract and not to process the labeled
-chunks of text. When this processing does get combined information is added to
-the data at least two things get better. A higher level of performance can be
-reached due to semantic knowledge as extra constraint while matching the data.
-Also quicker error detection in the crawlers is possible because when the match
-is correct at a higher level it can still contain wrong information at the
-lower chunk level, applying matching techniques on the chunks afterwards can
-generate feedback that could also be usefull for the top level of data
-extraction.
+chunks of text. If we were to combine the information about the global
+structure with information about the structure within a marked area, we would
+increase performance in two ways. First, a higher level of performance is
+reached because the structural information of the marked areas serves as an
+extra constraint while matching the data. Second, errors are detected more
+quickly: a match that is correct at the global level can still contain wrong
+information at the lower marked-field level, and applying matching techniques
+on the marked fields afterwards can generate feedback that is also useful for
+the global level of data extraction.
 
 % combine RSS HTML
 Another use or improvement could be combining the forces of HTML and RSS. Some
-specifically structured HTML sources could be converted to a tidy RSS feed and
-still get proccesed by the application. In this way, with an extra intermediate
-step, the extraction techniques can still be used. HTML sources most likely
-have to be generated because there has to be a very consistent structure in the
-data. Websites with such great structure are usually generated from a CMS.
-This will enlarge the domain for the application significantly since almost all
-websites use CMS to publish their data. When conversion between HTML and RSS
-feeds is not possible but one has a technique to extract patterns in a similar
-way then this application it is also possible to embed it in the current
-application. Due to the modularity of the application extending the
-application is very easy.
+specifically structured HTML sources could be converted into a tidy RSS feed
+and still get processed by this application. In this way, with an extra
+intermediate step, the extraction techniques can still be used. The HTML
+sources most likely have to be generated by the venue from some structured
+source because there has to be a very consistent structure in the data.
+Websites with such great structure are usually generated from a CMS. This will
+enlarge the domain for the application significantly since almost all websites
+use a CMS to publish their data.
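+
+A minimal sketch of such an intermediate step is given below. It is purely
+hypothetical: the CSS class \texttt{event}, the \texttt{h2} title and the
+\texttt{span} date are assumptions about one particular site, and the
+third-party \texttt{BeautifulSoup} library is used for the HTML parsing.
+\begin{lstlisting}[language=python]
+from xml.sax.saxutils import escape
+from bs4 import BeautifulSoup  # third party: pip install beautifulsoup4
+
+# Assumed page structure per event (will differ per venue):
+# <div class="event"><h2>title</h2><span class="date">date</span></div>
+def html_to_rss(html, channel_title='Converted feed'):
+    soup = BeautifulSoup(html, 'html.parser')
+    items = []
+    for event in soup.select('div.event'):
+        title = event.find('h2').get_text(strip=True)
+        date = event.find('span', class_='date').get_text(strip=True)
+        items.append('<item><title>%s</title><description>%s</description>'
+                     '</item>' % (escape(title), escape(date)))
+    return ('<?xml version="1.0"?><rss version="2.0"><channel>'
+            '<title>%s</title>%s</channel></rss>'
+            % (escape(channel_title), ''.join(items)))
+\end{lstlisting}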
+
+% Re-use user interface
+The interface of the program could also be re-used. When conversion between
+HTML and RSS feeds is not possible, but one has a technique that extracts
+patterns in a similar way to this application, it is also possible to embed
+that technique in the current application. Due to the modularity of the
+application, extending it with other matching techniques is very easy.
diff --git a/thesis2/5.appendices.tex b/thesis2/5.appendices.tex
index 0e72139..6720c4c 100644
--- a/thesis2/5.appendices.tex
+++ b/thesis2/5.appendices.tex
@@ -1,4 +1,7 @@
-\section{Schemes}
-\subsection{scheme.xsd}
+\section{scheme.xsd}
 \lstinputlisting[language=XML,label={scheme.xsd},caption={XSD scheme for XML%
 output}]{scheme.xsd}
+
+\section{Algorithm}
+\lstinputlisting[language=python,label={dawg.py},caption={DAWG generation %
+algorithm in Python}]{dawg.py}
diff --git a/thesis2/dawg.py b/thesis2/dawg.py
new file mode 120000
index 0000000..adc42ce
--- /dev/null
+++ b/thesis2/dawg.py
@@ -0,0 +1 @@
+../program/dawg/dawg.py
\ No newline at end of file
diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex
index 8f426da..f718de3 100644
--- a/thesis2/thesis.tex
+++ b/thesis2/thesis.tex
@@ -16,6 +16,8 @@
 	numbers=left,
 	numberstyle=\tiny,
 	breaklines=true,
+	showspaces=false,
+	showstringspaces=false,
 	tabsize=2,
 }