From a429bff7c8cb119d7fa678386124bee7b8fca77d Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Tue, 10 Mar 2015 10:20:16 +0100
Subject: [PATCH] 1 and 2 checked

---
 thesis2/2.requirementsanddesign.tex | 149 ++++++++++++++++------------
 thesis2/Makefile                    |  22 ++--
 thesis2/todo                        |  13 ---
 3 files changed, 97 insertions(+), 87 deletions(-)
 delete mode 100644 thesis2/todo

diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
index 6bbb29c..bc680f0 100644
--- a/thesis2/2.requirementsanddesign.tex
+++ b/thesis2/2.requirementsanddesign.tex
@@ -44,8 +44,8 @@ due to the fact that it lies outside of the time available for the project.
 The less time available is partly because we chose to implement certain other
 requirements like an interactive intuitive user interface around the core of
 the pattern extraction program. All other requirements changed or kept the
-same. Below are all definitive requirements with on the first line the title and
-with a description underneath.
+same. Below are all definitive requirements, with the title on the first line
+and a description underneath.

 \begin{itemize}
 	\item[F7:] Be able to crawl RSS feeds.
@@ -117,6 +117,10 @@ with a description underneath.
 \end{itemize}

 \section{Application overview}
+The workflow of the application can be divided into several components or
+steps. The overview of the application is shown in Figure~\ref{appoverview}.
+The nodes are applications or processing steps and the arrows denote
+information flow or movement between nodes.
 \begin{figure}[H]
 	\label{appoverview}
 	\centering
@@ -140,79 +144,90 @@ backend to process it.

 \begin{figure}[H]
 	\label{frontendfront}
-	\includegraphics[scale=0.75,natheight=160,natwidth=657]{frontendfront.png}
+	\includegraphics[scale=0.75]{frontendfront.eps}
 	\caption{The landing page of the frontend}
 \end{figure}

 \subsubsection{Edit/Remove crawler}
 This component lets the user view the crawlers and remove the crawlers from the
-database. Doing one of these things with a crawler is as simple as selecting
-the crawler from the dropdown menu and selecting the operation from the
-other dropdown menu and pressing \textit{Submit}.
+crawler database. Performing either operation is as simple as selecting the
+crawler from one dropdown menu, selecting the operation from the other dropdown
+menu and pressing \textit{Submit}.
+
 Removing the crawler will remove the crawler completely from the crawler
 database and the crawler will be unrecoverable. Editing the crawler will open a
 similar screen as when adding the crawler. The details about that screen will
-be discussed in ~\ref{addcrawler}. The only difference is that the previous
-trained patterns are already made visible in the training interface and can
-thus be adapted to change the crawler for possible source changes for example.
+be discussed in Section~\ref{addcrawler}. The only difference is that the
+previously trained patterns are already visible in the training interface and
+can thus be adapted, for example to account for changes in the source.

 \subsubsection{Add new crawler}
 \label{addcrawler}
 The addition or generation of crawlers is the key feature of the program and it
 is the smartest part of whole system as it includes the graph optimization
-algorithm to recognize user specified patterns in the data. The user has to
-assign a name to a RSS feed in the boxes and when the user presses submit the
-RSS feed is downloaded and prepared to be shown in the interactive editor. The
-editor consists of two components. The top most component allows the user to
-enter several fields of data concerning the venue, these are things like:
-address, crawling frequency and website. Below there is a table containing the
-processed RSS feed entries and a row of buttons allowing the user to mark
-certain parts of the entries as certain types. The user has to select a piece
-of an entry and then press the appropriate category button. The text will
-become highlighted and by doing this for several entries the program will have
-enough information to crawl the feed as shown in Figure~\ref{addcrawl}
+algorithm to recognize user-specified patterns in the new data. First, the user
+must fill in the static form that is visible at the top of the page. This form
+contains general information about the venue together with some crawler
+specific values such as the crawling frequency. After that the user can mark
+certain parts of the entries in the table as belonging to a category. Marking
+text is as easy as selecting the text and pressing the corresponding button.
+The text visible in the table is a stripped down version of the original RSS
+feed's \texttt{title} and \texttt{summary} fields. When the text is marked it
+will be highlighted in the same color as the button text. The entire user
+interface with a few sample markings is shown in Figure~\ref{frontendfront}.
+After marking the categories the user can preview the data or submit it.
+Previewing will run the crawler on the RSS feed in memory so that the user can
+revise the patterns if necessary. Submitting will send the page to the backend
+to be processed. What happens internally after submitting is explained in
+detail in Figure~\ref{appinternals} and the accompanying text.

 \begin{figure}[H]
 	\label{frontendfront}
-	\includegraphics[width=0.7\linewidth,natheight=1298,natwidth=584]{crawlerpattern.png}
-	\caption{A pattern selection of three entries}
+	\centering
+	\includegraphics[width=\linewidth]{crawlerpattern.eps}
+	\caption{A view of the interface for specifying the pattern. Three %
patterns are already marked.}
 \end{figure}

 \subsubsection{Test crawler}
 The test crawler component is a very simple non interactive component that
 allows the user to verify if a crawler functions properly without having to
-need to access the database or the command line utilities. Via a dropdown menu
-the user selects the crawler and when submit is pressed the backend generates a
+access the database via the command line utilities. Via a dropdown menu the
+user selects the crawler and when submit is pressed the backend generates a
 results page that shows a small log of the crawler, a summary of the results
-and most importantly the results, in this way the user can see in a few gazes
-if the crawler functions properly. Humans are very fast in detecting patterns
-and therefore the error checking goes very fast. Because the log of the crawl
-operation is shown this page can also be used for diagnostic information about
-the backends crawling system. The logging is pretty in depth and also shows
-possible exceptions and is therefore also usable for the developers to diagnose
-problems.
+and most importantly the results themselves. In this way the user can see at a
+glance whether the crawler functions properly. Humans are very good at
+detecting patterns, so this visual error check is fast. Because the log of the
+crawl operation is shown, this page can also be used to obtain diagnostic
+information about the backend's crawling system.
+The logging is quite detailed and also shows possible exceptions, so it can
+also be used by the developers to diagnose problems.

 \subsection{Backend}
 \subsubsection{Program description}
 The backend consists of a main module and a set of libraries all written in
 \textit{Python}\cite{Python}. The main module can, and is, be embedded in an
-apache webserver\cite{apache} via the \textit{mod\_python} apache
-module\cite{Modpython}. The module \textit{mod\_python} allows the webserver to
-execute Python code in the webserver. We chose Python because of the rich set
-of standard libraries and solid cross platform capabilities. We chose Python 2
-because it is still the default Python version on all major operating systems
-and stays supported until at least the year 2020 meaning that the program can
-function safe at least 5 full years. The application consists of a main Python
-module that is embedded in the webserver. Finally there are some libraries and
-there is a standalone program that does the periodic crawling.
+Apache HTTP server\cite{apache} via the \textit{mod\_python} Apache
+module\cite{Modpython}. The \textit{mod\_python} module lets the webserver
+handle HTTP requests with \textit{Python} code, which allows us to integrate
+neatly with the \textit{Python} libraries. We chose \textit{Python} because of
+the rich set of standard libraries and solid cross-platform capabilities. We
+specifically chose \textit{Python}~2 because it is still the default
+\textit{Python} version on all major operating systems and will remain
+supported until at least the year 2020. This means that the program can
+function for at least 5 full years. The application consists of a main
+\textit{Python} module that is embedded in the HTTP server. Besides that there
+are some libraries and a standalone program that does the periodic crawling.

 \subsubsection{Main module}
 The main module is the program that deals with the requests, controls the
-fronted, converts the data to patterns and sends it to the crawler. The
-module serves the frontend in a modular fashion. For example the buttons and
-colors can be easily edited by a non programmer by just changing some values in
-a text file. In this way even when conventions change the program can still
-function without intervention of a programmer that needs to adapt the source.
+frontend, converts the data to patterns and sends the patterns to the crawler.
+The module serves the frontend in a modular fashion. For example the buttons
+and colors can easily be edited by a non-programmer by changing the appropriate
+values in a text file. In this way, even when conventions change, the program
+can still function without the intervention of a programmer who needs to adapt
+the source.

 \subsubsection{Libraries}
 The libraries are called by the main program and take care of all the hard
@@ -222,31 +237,33 @@ the crawled data to XML and much more.

 \subsubsection{Standalone crawler}
 The crawler is a program that is used by the main module and technically is
-part of the libraries. The thing the crawler stands out is the fact that it
-also can run on its own. The crawler has to be runned periodically by a server
-to really crawl the websites. The main module communicates with the crawler
-when it needs XML data, when a new crawler is added or when data is edited. The
-crawler also offers a command line interface that has the same functionality as
-the web interface of the control center.
+part of the libraries.
+The crawler stands out in that it can also run on its own. The crawler has to
+be run periodically by a server to actually crawl the websites. The main module
+communicates with the crawler when it is queried for XML data, when a new
+crawler is added or when data is edited. The crawler also offers a command line
+interface that has the same functionality as the web interface of the control
+center.

 The crawler saves all the data in a database. The database is a simple
 dictionary where all the entries are hashed so that the crawler knows which
-ones are already present in the database and which ones are new so that it
-does not have to process all the old entries when they appear in the feed. The
-RSS's GUID could also have been used but since it is an optional value in the
-feed not every feed uses the GUID and therefore it is not reliable to use it.
-The crawler also has a function to export the database to XML format. The XML
-format is specified in an XSD\cite{Xsd} file for minimal ambiguity.
+ones are already present in the database and which ones are new. In this way
+the crawler does not have to process all the old entries when they appear in
+the feed. The GUID of an RSS entry could also have been used, but since it is
+an optional value not every feed provides it and it is therefore not reliable.
+The crawler also has a function to export the database to XML format. The XML
+output format is specified in an XSD\cite{Xsd} file for minimal ambiguity.

 \subsubsection{XML \& XSD}
 XML is a file format that can describe data structures. XML can be accompanied
 by an XSD file that describes the format. An XSD file is in fact just another
-XML file that describes the format of a class of XML files. Because almost all
-programming languages have an XML parser built in it is a very versatile format
-that makes the importing to the database very easy. The most used languages
-also include XSD validation to detect XML errors, validity and completeness of
-XML files. This makes interfacing with the database and possible future
-programs very easy. The XSD scheme used for this programs output can be found
-in the appendices in Listing~\ref{scheme.xsd}. The XML output can be queried
-via a http interface that calls the crawler backend to crunch the latest
-crawled data into XML.
+XML file that describes the format of a class of XML files. Almost all
+programming languages have an XML parser built in and therefore it is a very
+versatile format that makes the eventual import to the database very easy. The
+most used languages also include XSD validation to detect XML errors and to
+check the validity and completeness of XML files. This makes interfacing with
+the database and possible future programs even easier. The XSD schema used for
+this program's output can be found in the appendices in
+Listing~\ref{scheme.xsd}. The XML output can be queried via the HTTP interface
+that calls the crawler backend to crunch the latest crawled data into XML. It
+can also be acquired directly from the crawler's command line interface.
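The hash-based deduplication described for the standalone crawler can be illustrated with a small sketch. This is a minimal example and not the thesis code: it assumes the feed is fetched with the feedparser library, that the database is a plain dictionary keyed on the hash, and all function names here are illustrative.

import hashlib

import feedparser


def entry_hash(entry):
    # Hash the stripped-down fields that the user also sees in the interface.
    text = (entry.get('title', '') + entry.get('summary', '')).encode('utf-8')
    return hashlib.sha1(text).hexdigest()


def crawl(url, database):
    # database is a plain dict mapping entry hashes to stored entries.
    new_entries = []
    for entry in feedparser.parse(url).entries:
        key = entry_hash(entry)
        if key not in database:  # skip entries that were crawled before
            database[key] = entry
            new_entries.append(entry)
    return new_entries

Hashing the title and summary fields rather than relying on the GUID matches the reasoning given above: the GUID is optional and therefore not reliable across feeds.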
diff --git a/thesis2/Makefile b/thesis2/Makefile
index 38fb5c3..00cf6f9 100644
--- a/thesis2/Makefile
+++ b/thesis2/Makefile
@@ -1,12 +1,15 @@
 SHELL:=/bin/bash
 VERSION:=1.0RC1
 SOURCES:=$(shell ls *.tex) scheme.xsd exrss.xml
-GRAPHS:=$(addsuffix .eps,$(basename $(shell ls *.dot)))
+GRAPHS:=$(addsuffix .eps,$(basename $(shell ls *.{dot,png})))

-.PHONY: clobber graphs
+.PHONY: clobber graphs release

 all: graphs thesis.pdf

+%.eps: %.png
+	convert $< $@
+
 %.eps: %.dot
 	dot -Teps < $< > $@

@@ -14,15 +17,18 @@ all: graphs thesis.pdf
 	dvipdfm $<

 %.dvi: $(SOURCES)
-	latex -shell-escape thesis.tex
-	bibtex thesis.aux
-	latex -shell-escape thesis.tex
-	latex -shell-escape thesis.tex
+	latex $(basename $@).tex
+	bibtex $(basename $@).aux
+	latex $(basename $@).tex
+	latex $(basename $@).tex

 graphs: $(GRAPHS)

 clean:
-	@$(RM) -v *.{eps,aux,bbl,blg,dvi,log,out,toc}
+	$(RM) -v *.{eps,aux,bbl,blg,dvi,log,out,toc}

 clobber: clean
-	@$(RM) -v *.pdf
+	$(RM) -v *.pdf
+
+release: all clean
+	mv thesis.pdf thesis_$(VERSION).pdf
diff --git a/thesis2/todo b/thesis2/todo
deleted file mode 100644
index 6322c13..0000000
--- a/thesis2/todo
+++ /dev/null
@@ -1,13 +0,0 @@
-Algemeen:
-- Formele taal + natuurlijke
-
-Specifiek:
-- Minimaliseert dit algorithme?
-- Referenties toevoegen en andere (huidige) toepassingen van algorithme
-
-Algoritme:
-- Hoe match je de groepen, wat is beste vulling? non-determinisme
-- Harde paden voor categorien?
-- Specifieke(re) binnengroepsmatching
-
-Pythoncode toevoegen algoritme
--
2.20.1
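Returning to the XML \& XSD subsection above: because the output format is fixed in an XSD file, a consumer of the XML could validate it before importing it into a database. The following is a minimal sketch, assuming the lxml library and the scheme.xsd file from the appendix; it is not part of the actual program and the file names are placeholders.

from lxml import etree


def load_validated(xml_path, xsd_path='scheme.xsd'):
    # Parse the XSD once and use it to check the exported XML document.
    schema = etree.XMLSchema(etree.parse(xsd_path))
    document = etree.parse(xml_path)
    if not schema.validate(document):
        # error_log lists every validity or completeness violation found
        raise ValueError(str(schema.error_log))
    return document

Validating against the schema at the boundary is what makes interfacing with the database and possible future programs easier, as argued in that subsection.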