From: Mart Lubbers
Date: Thu, 9 Jul 2015 07:14:04 +0000 (+0200)
Subject: final
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=HEAD;p=bsc-thesis1415.git

final
---

diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex
index e28a1e4..c34f0dd 100644
--- a/thesis2/1.introduction.tex
+++ b/thesis2/1.introduction.tex
@@ -37,9 +37,9 @@ the resources and time reserved for these tasks and therefore often serve
incomplete information. Because of the complexity of getting complete
information there are not many companies trying to bundle entertainment
information into a complete and consistent database and website.
-Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve goal of serving
-complete and consistent information and offers it via various information
-bundling websites.
+Hyperleap\footnote{\url{http://hyperleap.nl}} tries to achieve the goal of
+serving complete and consistent information and offers it via various
+information bundling websites.
\newpage
\section{Hyperleap \& Infotainment}
Hyperleap is an internet company that was founded in the time that internet was
@@ -47,14 +47,14 @@ not widespread. Hyperleap, active since 1995, is specialized in producing,
publishing and maintaining \textit{infotainment}. \textit{Infotainment} is a
combination of the words \textit{information} and \textit{entertainment}. It
represents a combination of factual information, the \textit{information} part,
-and subjectual information, the \textit{entertainment} part, within a certain
-category or field. In the case of Hyperleap the category is the leisure
-industry, leisure industry encompasses all facets of entertainment ranging from
-cinemas, theaters, concerts to swimming pools, bridge competitions and
-conferences. Within the entertainment industry factual information includes,
-but is not limited to, starting time, location, host or venue and duration.
-Subjectual information includes, but is not limited to, reviews, previews,
-photos, background information and trivia.
+and non-factual or subjectual information, the
+\textit{entertainment} part, within a certain category or field. In the case of
+Hyperleap the category is the leisure industry, which encompasses
+all facets of entertainment ranging from cinemas, theaters, concerts to
+swimming pools, bridge competitions and conferences. Within the entertainment
+industry factual information includes, but is not limited to, starting time,
+location, host or venue and duration. Subjectual information includes, but is
+not limited to, reviews, previews, photos, background information and trivia.

Hyperleap claims to manage the largest database containing \textit{infotainment}
about the leisure industry focussed on the Netherlands and surrounding regions.
@@ -79,10 +79,9 @@ the nodes are processing steps and the arrows denote information transfer or
flow.
\begin{figure}[H]
-\label{informationflow}
 \centering
 \includegraphics[width=\linewidth]{informationflow.pdf}
- \caption{Information flow Hyperleap database}
+ \caption{Information flow Hyperleap database\label{informationflow}}
\end{figure}

\subsection{Sources}
A source is a service, location or medium in which information about events is
stored or published. A source can have different source shapes such as HTML,
email, flyer, RSS and so on. All information gathered from a source has to be
quality checked before it is even considered for automated crawling.
There are
-several criteria information from the source has to comply to before an
-automated crawler can be made. The prerequisites for a source are for example
-the fact that the source has to be reliable, consistent and free by licence.
-Event information from a source must have at least the \textit{What, Where} and
-\textit{When} information.
+several criteria to which the source has to comply before an automated crawler
+can be made. The prerequisites for a source are for example the fact that the
+source has to be reliable, consistent and free by licence. Event information
+from a source must have at least the \textit{What, Where} and \textit{When}
+information.

The \textit{What} information is the information that describes the content,
content is a very broad definition but in practice it can be describing the
@@ -102,12 +101,12 @@ concert tour name, theater show title, movie title, festival title and many
more.

The \textit{Where} information is the location of the event. The location is
-often omitted because the organization behind source presenting the information
-thinks it is obvious. This information can also include different sub
-locations. For example when a pop concert venue has their own building but in
-the summer they organize a festival in some park. This data is often assumed to
-be trivial and inherent but in practice this is not the case. In this example
-for an outsider only the name of the park is often not enough.
+often omitted because the organization presenting the information thinks it is
+obvious. This information can also include different sub locations. For example
+when a pop concert venue has their own building but in the summer they organize
+a festival in some park. This data is often assumed to be trivial and inherent
+but in practice this is not the case. In this example, for an outsider, the
+name of the park alone is often not enough.

The \textit{When} field is the time and date of the event. Hyperleap wants to
have at minimum the date, start time and end time. In the field end times for
@@ -167,15 +166,16 @@ entered in the database.

\subsection{Temporum}
The \textit{Temporum} is a big bin that contains raw data extracted from
-different sources using automated crawlers. All the information in the
+different sources using automated crawlers. Some of the information in the
\textit{Temporum} might not be suitable for the final database and therefore
has to be post processed. The post-processing encompasses several different
steps.

The first step is to check the validity of the event entries from a certain
-source. Validity checking is useful to detect faulty automated crawlers
-before the data can leak into the database. Validity checking happens at random
-on certain event entries.
+source. Validity checking is useful to detect outdated automated crawlers
+before the data can leak into the database. Crawlers become outdated when a
+source changes and the crawler cannot crawl the website using the original
+method. Validity checking happens at random on certain event entries.

An event entry usually contains one occurrence of an event. In a lot of cases
there is parent information that the event entry is part of. For example in the
@@ -206,11 +206,11 @@ Maintaining the automated crawlers and the infrastructure that provides the
data flow are the tasks that require the most resources. Both of these parts
require a programmer to execute and therefore are costly.
In the case of the
automated crawlers it requires a programmer because the crawlers are scripts or programs
-created are website-specific. Changing such a script or program requires
-knowledge about the source, the programming framework and about the
-\textit{Temporum}. In practice both of the tasks mean changing code.
+that are specifically designed for a particular website. Changing such a script
+or program requires knowledge about the source, the programming framework and
+about the \textit{Temporum}. In practice both of the tasks mean changing code.

-A large group of sources often changes in structure. Because of such changes
+A large group of sources often change in structure. Because of such changes
the task of reprogramming crawlers has to be repeated a lot. The detection of
malfunctioning crawlers happens in the \textit{Temporum} and not in an earlier
stage. Late detection elongates the feedback loop because there is not always a
@@ -224,10 +224,9 @@ loop, shown in Figure~\ref{feedbackloop}, can take days and can be the reason
for gaps and faulty information in the database. The figure shows information
flow with arrows. The solid and dotted lines form the current feedback loop.
\begin{figure}[H]
-\label{feedbackloop}
 \centering
 \includegraphics[width=0.8\linewidth]{feedbackloop.pdf}
- \caption{Feedback loop for malfunctioning crawlers}
+ \caption{Feedback loop for malfunctioning crawlers\label{feedbackloop}}
\end{figure}

The specific goal of this project is to relieve the programmer of spending a
@@ -302,21 +301,21 @@ HTML format, most probably a website, then there can be a great deal of
information extraction automated using the structural information which is
characteristic of HTML. For fax/email however there is almost no structural
information and most of the automation techniques require natural language
-processing. We chose RSS feeds because RSS feeds lack inherent structural
-information but are still very structured. This structure is because, as said
-above, the RSS feeds are generated and therefore almost always look the same.
-Also, in RSS feeds most venues use particular structural identifiers that are
-characters. They separate fields with vertical bars, commas, whitespace and
-more non text characters. These field separators and keywords can be hard for a
-computer to detect but people are very good in detecting these. With one look
-they can identify the characters and keywords and build a pattern in their
-head. Another reason we chose RSS is their temporal consistency, RSS feeds are
-almost always generated and because of that the structure of the entries is
-very unlikely to change. Basically the RSS feeds only change structure when the
-CMS that generates it changes the generation algorithm. This property is useful
-because the crawlers then do not have to be retrained very often. To detect
-the underlying structures a technique is used that exploits subword matching
-with graphs.
+processing and possibly OCR. We chose RSS feeds because RSS feeds lack inherent
+structural information but are still very structured. This structure is
+because, as said above, the RSS feeds are generated and therefore almost always
+look the same. Also, in RSS feeds most venues use particular characters as
+structural identifiers. They separate fields with vertical bars, commas,
+whitespace and other non-text characters. These field separators and
+keywords can be hard for a computer to detect but people are very good at
+detecting these.
With one look they can identify the characters and keywords
+and build a pattern in their head. Another reason we chose RSS is its
+temporal consistency: RSS feeds are almost always generated and because of that
+the structure of the entries is very unlikely to change. Basically the RSS
+feeds only change structure when the CMS that generates them changes the
+generation algorithm. This property is useful because the crawlers then do not
+have to be retrained very often. To detect the underlying structures a
+technique is used that exploits subword matching with graphs.

\section{Directed Acyclic Graphs}
\paragraph{Directed graphs}
@@ -331,10 +330,9 @@ described as:
$$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\})$$

\begin{figure}[H]
- \label{graphexample}
 \centering
 \includegraphics[scale=0.7]{graphexample.pdf}
- \caption{Example DG}
+ \caption{Example DG\label{graphexample}}
\end{figure}

\paragraph{Directed acyclic graphs}
@@ -354,12 +352,12 @@ cyclicity to graphs lowers the computational complexity of path existence in
the graph to $\mathcal{O}(L)$ where $L$ is the length of the path.

\begin{figure}[H]
- \label{dagexample}
 \centering
 \includegraphics[scale=0.7]{dagexample.pdf}
- \caption{Example DAG}
+ \caption{Example DAG\label{dagexample}}
\end{figure}

+\newpage
\paragraph{Directed Acyclic Word Graphs}
The type of graph used in the project is a special kind of DAG called Directed
Acyclic Word Graphs (DAWGs). A DAWG can be defined by the tuple $G=(V,v_0,E,F)$.
@@ -376,22 +374,27 @@ with a double circle as node shape. In this example it is purely cosmetic
because $n6$ is a final node anyway, as there are no arrows leading out.
But this does not need to be the case, for example in $G=(\{n1, n2, n3\},
\{(n1, n2), (n2, n3)\}, \{n2, n3\})$ there is a distinct use for the final node
-marking. Graph $G$ accepts the words \textit{a,ab} and to simplify the graph
-node $n2$ and $n3$ are final. Finally $v_0$ describes the initial node, this is
-visualized in figures as an incoming arrow. Because of the property of labeled
-edges, data can be stored in a DAWG. When traversing a DAWG and saving all the
-edge labels one can construct words. Using graph minimisation big sets of
-words can be stored using a small amount of storage because edges can be
-re-used to specify transitions. For example the graph in
-Figure~\ref{exampledawg} can describe the language $L$ where all words $w$ that
-are accepted $w\in\{abd, bad, bae\}$. Testing if a word is present in the DAWG
-is the same technique as testing if a node path is present in a normal DAG and
-therefore also falls in the computational complexity class of $\mathcal{O}(L)$.
-This means that it grows linearly with the length of the word.
+marking. The only final node in the example is $n_6$, marked with a
+double circle. $v_0$ describes the initial node; this is visualized in figures
+as an incoming arrow. Because of the property of labeled edges, data can be
+stored in a DAWG. When traversing a DAWG and saving all the edge labels one can
+construct words. Using graph minimisation big sets of words can be stored using
+a small amount of storage because edges can be re-used to specify transitions.
+For example the graph in Figure~\ref{exampledawg} describes the language $L$
+of accepted words $w\in\{abd, bad, bae\}$.
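As a small illustration of this, the DAWG of Figure~\ref{exampledawg} can be
encoded as a plain transition table and queried in Python (the language of the
backend). The node numbering and the dictionary representation below are chosen
purely for this illustration and are not the representation used in the thesis
implementation.

    # Transition table of a minimal DAWG accepting exactly {"abd", "bad", "bae"}.
    # States are numbered 0..5; state 5 is the only final state.
    transitions = {
        0: {'a': 1, 'b': 2},
        1: {'b': 3},
        2: {'a': 4},
        3: {'d': 5},
        4: {'d': 5, 'e': 5},
    }
    final_states = {5}

    def accepts(word):
        """Follow one labeled edge per character; runs in O(len(word))."""
        state = 0
        for char in word:
            state = transitions.get(state, {}).get(char)
            if state is None:
                return False
        return state in final_states

    assert accepts("abd") and accepts("bad") and accepts("bae")
    assert not accepts("abe") and not accepts("ab")

Note how the two edges leaving state 4 are shared by \texttt{bad} and
\texttt{bae}; this re-use of edges is what keeps the storage small.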
Testing if a word is present in the DAWG is the same technique as testing if a
+node path is present in a normal DAG and therefore also falls in the
+computational complexity class of $\mathcal{O}(L)$. This means that it grows
+linearly with the length of the word.

\begin{figure}[H]
- \label{exampledawg}
 \centering
 \includegraphics[scale=0.7]{dawgexample.pdf}
- \caption{Example DAWG}
+ \caption{Example DAWG\label{exampledawg}}
\end{figure}
+
+\section{Structure}
+The following chapters will describe the system that has been created and the
+methods used. Chapter~2 shows the requirements and design for the program,
+followed in Chapter~3 by the underlying methods used for the actual matching.
+Finally Chapter~4 concludes with the results, discussion and future research.
diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
index 680a18b..9f31b8d 100644
--- a/thesis2/2.requirementsanddesign.tex
+++ b/thesis2/2.requirementsanddesign.tex
@@ -31,11 +31,11 @@ explanation is also provided.
 \end{itemize}
 \item[I2:] Apply low level matching techniques on isolated data.
 \item[I3:] Insert data in the database.
- \item[I4:] The system should have an user interface to train crawlers that is
-  usable someone without a particular computer science background.
- \item[I5:] The system should be able to report to the user or
-  maintainer when a source has been changed too much for
-  successful crawling.
+ \item[I4:] The system should have a user interface to train crawlers
+  that is usable by someone without a particular computer science
+  background.
+ \item[I5:] The system should be able to report to the employee when a
+  source has been changed too much for successful crawling.
 \end{itemize}

\subsubsection{Definitive functional requirements}
@@ -51,21 +51,21 @@ definitive requirements.
  requirements I1a-I1d. We limited the source types to crawl to
  strict RSS because of the time constraints of the project. Most
  sources require an entirely different strategy and therefore we
-  could not easily combine them. an explanation why we chose RSS
+  could not easily combine them. An explanation why we chose RSS
  feeds can be found in Section~\ref{sec:whyrss}.

 \item[F2:] Export the data to a strict XML feed.

  This requirement is an adapted version of requirement I3, this
-  is also done to limit the scope. We chose to no interact
+  is also done to limit the scope. We chose to not interact
  directly with the database or the \textit{Temporum}. The
  application however is able to output XML data that is
  formatted following a strict XSD scheme so that it is easy to
  import the data in the database or \textit{Temporum} in an
  indirect way.
 \item[F3:] The system should have a user interface to create crawlers
-  that is usable someone without a particular computer science
-  background. science people.
+  that is usable by someone without a particular computer science
+  background.

  This requirement is formed from I4. Initially the user
  interface for adding and training crawlers was done via a
@@ -124,10 +124,9 @@ steps.
The overview of the application is visible in Figure~\ref{appoverview}. The
nodes are applications or processing steps and the arrows denote information
flow or movement between nodes.
\begin{figure}[H]
- \label{appoverview}
 \centering
 \includegraphics[width=\linewidth]{appoverview.pdf}
- \caption{Overview of the application}
+ \caption{Overview of the application\label{appoverview}}
\end{figure}

\subsection{Frontend}
The frontend is a web interface that is connected to the backend system which
allows the user to interact with the backend. The frontend consists of a basic
graphical user interface which is shown in Figure~\ref{frontendfront}. As the
interface shows, there are three main components that the user can use. There
-is also an button for downloading the XML. The \textit{Get xml} button is a
-quick shortcut to make the backend to generate XML. The button for grabbing the
+is also a button for downloading the XML\@. The \textit{Get xml} button is a
+quick shortcut to make the backend generate XML\@. The button for grabbing the
XML data is located there only for diagnostic purposes. In the standard
workflow the XML button is not used. In the standard workflow the server
periodically calls the XML output option from the command line interface of the
backend to process it.

\begin{figure}[H]
- \label{frontendfront}
 \includegraphics[width=\linewidth]{frontendfront.pdf}
- \caption{The landing page of the frontend}
+ \caption{The landing page of the frontend\label{frontendfront}}
\end{figure}

-\subsubsection{Edit/Remove crawler}
+\subsubsection{Repair/Remove crawler}
This component lets the user view the crawlers and remove the crawlers from the
crawler database. Doing one of these things with a crawler is as simple as
selecting the crawler from the dropdown menu and selecting the operation from
@@ -169,7 +167,7 @@ The addition or generation of crawlers is the key feature of the program and it
is the intelligent part of the system since it includes the graph optimization
algorithm to recognize user specified patterns in the new data. First, the user
must fill in the static form that is visible on top of the page. This, for
-contains general information about the venue together with some crawler
+example, contains general information about the venue together with some crawler
specific values such as crawling frequency. After that the user can mark
certain points in the table as being of a category. Marking text is as easy as
selecting the text and pressing the corresponding button. The text visible in the
@@ -184,11 +182,10 @@ processed. The internals of what happens after submitting is explained in
detail in Figure~\ref{appinternals} together with the text.

\begin{figure}[H]
- \label{frontendfront}
 \centering
 \includegraphics[width=\linewidth]{crawlerpattern.pdf}
 \caption{A view of the interface for specifying the pattern. Two %
-entries are already marked.}
+entries are already marked.\label{crawlerpattern}}
\end{figure}

\subsubsection{Test crawler}
@@ -201,15 +198,15 @@ and most importantly the results itself. In this way the user can see in a few
glances if the crawler functions properly. Humans are very fast in detecting
patterns and therefore the error checking goes very fast. Because the log of
the crawl operation is shown this page can also be used for diagnostic
-information about the backends crawling system. The logging is pretty in depth
-and also shows possible exceptions and is therefore also usable for the
-developers to diagnose problems.
+information about the backend's crawling system. The logging is in depth and
+also shows possible exceptions and is therefore also usable for the developers
+to diagnose problems.
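To give an impression of what the submitted markings amount to before the
backend processes them, the sketch below parses one hypothetical marked-up row
with Python's standard \texttt{html.parser} module. The \texttt{class}
attribute stands in for the category here purely as an illustration; the real
frontend encodes the category in the colour of the \texttt{SPAN} element and
the actual exchange format may differ.

    from html.parser import HTMLParser

    class MarkingExtractor(HTMLParser):
        """Split a marked-up feed row into plain text plus (category, text) pairs."""

        def __init__(self):
            super().__init__()
            self.plain = []        # plain text of the row, markings removed
            self.markings = []     # (category, marked text) pairs
            self._category = None  # category of the SPAN we are currently inside

        def handle_starttag(self, tag, attrs):
            if tag == "span":
                self._category = dict(attrs).get("class")

        def handle_endtag(self, tag):
            if tag == "span":
                self._category = None

        def handle_data(self, data):
            self.plain.append(data)
            if self._category is not None:
                self.markings.append((self._category, data))

    row = '<span class="time">19:00</span>, 2014-11-12 - <span class="what">Foobar</span>'
    parser = MarkingExtractor()
    parser.feed(row)
    print("".join(parser.plain))  # 19:00, 2014-11-12 - Foobar
    print(parser.markings)        # [('time', '19:00'), ('what', 'Foobar')]

The resulting pair list is the kind of text-plus-markings structure that the
backend turns into node lists in Chapter~3.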
\subsection{Backend}
\subsubsection{Program description}
The backend consists of a main module and a set of libraries all written in
-\textit{Python}\cite{Python}. The main module can, and is, be embedded in an
-apache HTTP-server\cite{apache} via the \textit{mod\_python} apache
+\textit{Python}\cite{Python}. The main module is embedded in an apache
+HTTP-server\cite{apache} via the \textit{mod\_python} apache
module\cite{Modpython}. The module \textit{mod\_python} allows handling of
python code via HTTP and this allows us to integrate neatly with the
\textit{Python} libraries. We chose \textit{Python} because of the rich set of
@@ -264,8 +261,8 @@ languages have an XML parser built in and therefore it is a very versatile
format that makes the eventual import to the database very easy. The most used
languages also include XSD validation to detect XML errors and to check the
validity and completeness of XML files. This makes interfacing with the database and
-possible future programs even more easily. The XSD scheme used for this
-programs output can be found in the appendices in Listing~\ref{scheme.xsd}. The
+possible future programs even easier. The XSD scheme used for this
+program's output can be found in the appendices in Algorithm~\ref{scheme.xsd}. The
XML output can be queried via the HTTP interface that calls the crawler backend
-to crunch the latest crawled data into XML. It can also be acquired directly
+to crunch the latest crawled data into XML\@. It can also be acquired directly
from the crawler's command line interface.
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
index 9a458b0..e512b4c 100644
--- a/thesis2/3.methods.tex
+++ b/thesis2/3.methods.tex
@@ -7,10 +7,9 @@ between these steps. The Figure is a detailed explanation of the
\textit{Backend} node in Figure~\ref{appoverview}.

\begin{figure}[H]
- \label{appinternals}
 \centering
 \includegraphics[width=\linewidth]{backend.pdf}
- \caption{Main module internals}
+ \caption{Main module internals\label{appinternals}}
\end{figure}

\section{HTML data}
@@ -41,35 +40,33 @@ only requires rudimentary matching techniques that require very little
computational power. User markings are highlights of certain text elements.
The highlighting is done using \texttt{SPAN} elements and therefore all the
\texttt{SPAN} elements have to be found and extracted. To achieve this for
-every row all \texttt{SPAN} elements are extracted, again with simple matching
-techniques, to extract the color and afterwards to remove the element to
-retrieve the original plain text of the \textit{RSS} feed entry. When this step
-is done a data structure containing all the text of the entries together with
-the markings will go to the next step. All original data, namely the
-\textit{HTML} data per row, will be transferred to the final aggregating
-dictionary.
+every row all \texttt{SPAN} elements are extracted to get the color and
+afterwards removed to retrieve the original plain text of the
+\textit{RSS} feed entry. When this step is done a data structure containing all
+the text of the entries together with the markings will go to the next step.
+All original data, namely the \textit{HTML} data per row, will be transferred
+to the final aggregating dictionary.

\section{Node lists}
Every entry obtained from the previous step is processed into so-called
node-lists. A node-list can be seen as a path graph where every character and
marking has a node.
A path graph $G$ is defined as
-$G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, n_i\}$ and
-$E=\{(n_1, n_2), (n_2, n_3), \ldots\\ (n_{i-1}, n_{i})\}$. A path graph is basically
-a graph that is a single linear path of nodes where every node is connected to
-the next node except for the last one. The last node is the only final node.
-The transitions between two nodes is either a character or a marking. As an
-example we take the entry \texttt{19:00, 2014-11-12 - Foobar} and create the
-corresponding node-lists and it is shown in Figure~\ref{nodelistexample}.
-Characters are denoted with single quotes, spaces with an underscore and
-markers with angle brackets. Node-lists are the basic elements from which the
-DAWG will be generated. These node-lists will also be available in the final
-aggregating dictionary to ensure consistency of data and possibility of
-regenerating the data.
+$G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \ldots, n_{i-1}, n_i\}$ and
+$E=\{(n_1, n_2), (n_2, n_3), \ldots\\ (n_{i-1}, n_{i})\}$. A path graph is
+basically a graph that is a single linear path of nodes where every node is
+connected to the next node except for the last one. The last node is the only
+final node. The transition between two nodes is either a character or a
+marking. As an example we take the entry \texttt{19:00, 2014--11--12 {-}
+Foobar} and create the corresponding node-lists, shown in
+Figure~\ref{nodelistexample}. Characters are denoted with single quotes,
+spaces with an underscore and markers with angle brackets. Node-lists are the
+basic elements from which the DAWG will be generated. These node-lists will
+also be available in the final aggregating dictionary to ensure consistency of
+data and the possibility of regenerating the data.

\begin{figure}[H]
- \label{nodelistexample}
 \centering
 \includegraphics[width=\linewidth]{nodelistexample.pdf}
- \caption{Node list example}
+ \caption{Node list example\label{nodelistexample}}
\end{figure}

\section{DAWGs}
@@ -130,7 +127,7 @@ in Listing~\ref{dawg.py}.

\begin{algorithm}[H]
 \SetKwProg{Def}{def}{:}{end}
- \Def{generate\_dawg(words)}{
+ \Def{generate\_dawg(words)}{%
  register := $\emptyset$\;
  \While{there is another word}{%
   word := next word\;
@@ -158,12 +155,14 @@ word[length(commonprefix)\ldots length(word)]\;
   register.add(child)\;
  }
 }
- \caption{Generating DAWGs pseudocode}
- \label{pseudodawg}
+ \caption{Generating DAWGs pseudocode\label{pseudodawg}}
\end{algorithm}

\newpage\subsection{Example}
-We visualize this with an example shown in the {Subgraphs in
+The size of the graphs that are generated from real-world data in the leisure
+industry grows extremely fast. Therefore the example consists of short strings
+instead of real-life event information.
+The algorithm is visualized with an example shown in the {Subgraphs in
Figure}~\ref{dawg1} that builds a DAWG with the following entries:
\texttt{abcd}, \texttt{aecd} and \texttt{aecf}.

@@ -211,24 +210,23 @@ Figure}~\ref{dawg1} that builds a DAWG with the following entries:
 \end{itemize}

\begin{figure}[H]
- \label{dawg1}
 \centering
 \includegraphics[height=20em]{inccons.pdf}
- \caption{Incrementally constructing a DAWG}
+ \caption{Incrementally constructing a DAWG\label{dawg1}}
\end{figure}

\subsection{Application to the extraction of patterns}
The text data in combination with the user markings cannot be converted
automatically to a DAWG using the algorithm we described. This is because the
user markings are not necessarily a single character or word.
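Before turning to the treatment of markings, the plain construction can be made
concrete. The following is a loose Python sketch of the incremental
construction behind Algorithm~\ref{pseudodawg} for lexicographically sorted
input (in the style of Daciuk et al.); it is not the code of
Listing~\ref{dawg.py}, and the \texttt{Node} class and helper names are chosen
for this illustration only.

    class Node(object):
        def __init__(self):
            self.edges = {}    # label -> Node, kept in insertion order
            self.final = False

        def signature(self):
            # Nodes are equivalent when they agree on finality and have identical
            # outgoing labeled edges (pointing to the very same target objects).
            return (self.final,
                    tuple(sorted((c, id(n)) for c, n in self.edges.items())))

    def build_dawg(words):
        """Incrementally build a minimal DAWG from lexicographically sorted words."""
        root = Node()
        register = {}
        previous = ""

        def replace_or_register(node):
            label = list(node.edges)[-1]           # the most recently added edge
            child = node.edges[label]
            if child.edges:
                replace_or_register(child)
            key = child.signature()
            if key in register:
                node.edges[label] = register[key]  # re-use an equivalent node
            else:
                register[key] = child

        for word in words:
            assert word >= previous, "input must be sorted"
            # Longest prefix of the word that is already present in the graph.
            node, depth = root, 0
            while depth < len(word) and word[depth] in node.edges:
                node = node.edges[word[depth]]
                depth += 1
            if node.edges:
                replace_or_register(node)
            # Append the remaining suffix as a fresh chain of nodes.
            for char in word[depth:]:
                nxt = Node()
                node.edges[char] = nxt
                node = nxt
            node.final = True
            previous = word
        if root.edges:
            replace_or_register(root)
        return root

    dawg = build_dawg(["abcd", "aecd", "aecf"])

For the example entries \texttt{abcd}, \texttt{aecd} and \texttt{aecf} this
produces the minimal graph depicted in the last subgraph of Figure~\ref{dawg1}.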
Currently user
-markings are basically multiple random characters. When we add a user marking,
-we are inserting a kind of subgraph in the place of the node with the marking.
-By doing this we can introduce non determinism to the graph. Non determinism is
-the fact that a single node has multiple edges with the same transition, in
-practise this means it could happen that a word can be present in the graph in
-multiple paths. An example of non determinism in one of our DAWGs is shown in
-Figure~\ref{nddawg}. This figure represents a generated DAWG with the following
-entries: \texttt{ab<1>c}, \texttt{a<1>bbc}.
+markings are subgraphs that accept any word of any length. When we add a user
+marking, we are inserting a kind of subgraph in the place of the node with the
+marking. By doing this we can introduce non-determinism to the graph.
+Non-determinism is the fact that a single node has multiple edges with the same
+transition; in practice this means it could happen that a word can be present
+in the graph in multiple paths. An example of non-determinism in one of our
+DAWGs is shown in Figure~\ref{nddawg}. This figure represents a generated DAWG
+with the following entries: \texttt{ab<1>c}, \texttt{a<1>bc}.

In this graph the word \texttt{abdc} will be accepted and the user pattern
\texttt{<1>} will be filled with the subword \texttt{d}. However if we try the
@@ -241,10 +239,9 @@ it will give partial information when no match is possible, however it still
needs to report the error and the data should be handled with extra care.

\begin{figure}[H]
- \label{nddawg}
 \centering
 \includegraphics[width=\linewidth]{nddawg.pdf}
- \caption{Example non determinism}
+ \caption{Example non-determinism\label{nddawg}}
\end{figure}

\subsection{Minimality \& non-determinism}
@@ -284,6 +281,6 @@ results. There are several possibilities or heuristics to choose from.
  path that marks a location using the same words.
 \end{itemize}

-If we would know more about the categories the best heuristic automatically
-becomes the maximum path heuristic. When, as in our implementation, there is
-very little information both heuristics perform about the same.
+The more one knows about the contents of the categories, the better the Maximum
+field heuristic performs. When, as in our current implementation, the categories
+contain no such information, both heuristics perform about the same.
diff --git a/thesis2/4.discussion.tex b/thesis2/4.discussion.tex
index 6bce1a0..46f3de8 100644
--- a/thesis2/4.discussion.tex
+++ b/thesis2/4.discussion.tex
@@ -7,13 +7,13 @@ for RSS feeds}
The short answer to the problem statement made in the introduction is yes. We
can shorten the loop for repairing and adding crawlers with our system. The
-system we have built is tested and can provide the necessary tools for a user
-with no particular programming skills to generate crawlers and thus the number
-of interventions where a programmer is needed is greatly reduced. Although we
-have solved the problem we stated the results are not strictly positive. This
-is because a if the problem space is not large the interest of solving the
-problem is also not large, this basically means that there is not much data to
-apply the solution on.
+system can provide the necessary tools for a user with no particular
+programming skills to generate crawlers and thus the number of interventions
+where a programmer is needed is greatly reduced. Although we have solved the
+problem we stated, the results are not strictly positive.
This is because if
+the problem space is not large, the interest in solving the problem is also not
+large; this basically means that there is not much data to apply the solution
+to.

Although the research question is answered, the underlying goal of the project
has not been completely achieved. The application is an intuitive system that
diff --git a/thesis2/Makefile b/thesis2/Makefile
index 6f3a41a..8b23cea 100644
--- a/thesis2/Makefile
+++ b/thesis2/Makefile
@@ -23,6 +23,7 @@ all: thesis.pdf
	bibtex $(basename $@)
	$(LATEX) $(basename $@)
	$(LATEX) $(basename $@)
+	$(LATEX) $(basename $@)

%.fmt: preamble.tex
	$(LATEX) -ini -jobname="$(basename $@)" "&$(LATEX) $<\dump"
diff --git a/thesis2/abstract.tex b/thesis2/abstract.tex
index 9af70fe..5979d88 100644
--- a/thesis2/abstract.tex
+++ b/thesis2/abstract.tex
@@ -1,14 +1,14 @@
-When looking for an activity in a bar or trying to find a good movie to watch
-it often seems difficult to find complete and correct information about the
-event. Hyperleap tries to solve problem of bad information giving by bundling
-the information from various sources and invest in good quality checking.
+When looking for an activity in a bar or trying to find a good movie it often
+seems difficult to find complete and correct information about the event.
+Hyperleap tries to solve this problem of incomplete information by bundling the
+information from various sources and investing in good quality checking.
Currently information retrieval is performed using site-specific crawlers; when
-a crawler breaks the feedback loop for fixing the it contains of different
-steps and requires someone with a computer science background. A crawler
-generation system has been created that uses directed acyclic word graphs to
-assist solving the feedback loop problem. The system allows users with no
-particular computer science background to create, edit and test crawlers for
-\textit{RSS} feeds. In this way the feedback loop for broken crawlers is
-shortened, new sources can be incorporated in the database quicker and, most
-importantly, the information about the latest movie show, theater production or
-conference will reach the people looking for it as fast as possible.
+a crawler breaks, the feedback loop for fixing it consists of several steps and
+requires someone with a computer science background. A crawler generation
+system has been created that uses directed acyclic word graphs to assist in
+solving the feedback loop problem. The system allows users with no particular
+computer science background to create, edit and test crawlers for \textit{RSS}
+feeds. In this way the feedback loop for broken crawlers is shortened, new
+sources can be incorporated in the database more quickly and, most importantly,
+the information about the latest movie show, theater production or conference
+will reach the people looking for it as fast as possible.
diff --git a/thesis2/img/feedbackloop.dot b/thesis2/img/feedbackloop.dot index 7c3b3d2..8984ab6 100644 --- a/thesis2/img/feedbackloop.dot +++ b/thesis2/img/feedbackloop.dot @@ -4,7 +4,7 @@ digraph { source [label="Source"]; crawler [label="Crawler"]; temporum [label="Temporum"]; - employee [label="User"]; + employee [label="Employee"]; programmer [label="Programmer"]; database [label="Database"]; source -> crawler; diff --git a/thesis2/img/informationflow.dot b/thesis2/img/informationflow.dot index 9755125..0506242 100644 --- a/thesis2/img/informationflow.dot +++ b/thesis2/img/informationflow.dot @@ -18,6 +18,7 @@ digraph{ c1 [label="Crawling"]; t2 [label="Temporum"]; d1 [label="Database"]; + m1 [label="Manual"]; subgraph cluster_1 { node [shape="rectangle",fontsize=10,nodesep=0.7,ranksep=0.75,width=1]; @@ -28,7 +29,8 @@ digraph{ label="Publication"; } i2 -> c1 [ltail=cluster_0]; - i0 -> d1 [ltail=cluster_0]; + i0 -> m1 [ltail=cluster_0]; + m1 -> d1 [ltail=cluster_0]; c1 -> t2; t2 -> d1; d1 -> o2 [lhead=cluster_1];