From 5d27174fb4ecb6d35384b60df0430b718314c73d Mon Sep 17 00:00:00 2001 From: Mart Lubbers Date: Wed, 3 Dec 2014 11:31:15 +0100 Subject: [PATCH] thesis all comments processed --- thesis2/1.introduction.tex | 172 +++++++++++++++++-------------------- thesis2/thesis.tex | 1 + 2 files changed, 81 insertions(+), 92 deletions(-) diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex index 0ad4c5b..8b02c08 100644 --- a/thesis2/1.introduction.tex +++ b/thesis2/1.introduction.tex @@ -72,7 +72,6 @@ diagram using nodes as processing or publishing steps and arrows are information flow. \begin{figure}[H] - \caption{Information flow Hyperleap database} \label{informationflow} \centering \scalebox{0.7}{ @@ -102,6 +101,7 @@ information flow. o1 -> o4; } } + \caption{Information flow Hyperleap database} \end{figure} \paragraph{Sources} @@ -213,7 +213,6 @@ crawler to the new structure. This feedback loop, shown in Figure~\ref{feedbackloop} can take days and can be the reason for gaps and faulty information in the database. \begin{figure}[H] - \caption{Feedback loop for malfunctioning crawlers} \label{feedbackloop} \centering \scalebox{0.5}{ @@ -235,6 +234,7 @@ faulty information in the database. programmer -> crawler [constraint=false,style="dotted"]; } } + \caption{Feedback loop for malfunctioning crawlers} \end{figure} The goal of the project is to relieve the programmer of repairing crawlers all @@ -295,77 +295,53 @@ website background be able to post news items and event information. \section{Why RSS?} \label{sec:whyrss} -The first proposal described formats like HTML, fax/email, RSS, XML and some -more. Because of the scope of the project and the time planned for it we had to -remove some of the input formats because they all require different techniques. -For example when the input source is in HTML format, most probably a website, -then there can be a great deal of information extraction be automated using the -structural information which is characteristic for HTML. For fax/email however -there is almost no structural information and most of the automation techniques -require natural language processing. We chose RSS feeds because RSS feeds lack -inherent structural information but are still very structured. This structure -is because, as said above, the RSS feeds are generated and therefore almost -always look the same. Also, in RSS feeds most venues use particular structural -identifiers that are characters. They separate fields with vertical bars, -commas, whitespace and more non text characters. These field separators and -keywords can be hard for a computer to detect but people are very good in -detecting these. With one look they can identify the characters and keywords -and build a pattern in their head. -Another reason we chose RSS is their temporal consistency, RSS feeds are almost -always generated and because of that the structure of the entries is very -unlikely to change. Basically the RSS feeds only change structure when the CMS -that generates it changes the generation algorithm. This property is usefull -because the crawlers then do not have to be retrained very often. Because the -non inherent structure entirely new strategies had to be applied. +We described a lot of different source formats like HTML, fax/email, RSS and +XML. Because of the limited scope of the project and the time planned for it +we had to remove some of the input formats because they all require different +techniques. For example when the input source is in HTML format, most probably +a website, then there can be a great deal of information extraction be +automated using the structural information which is characteristic for HTML. +For fax/email however there is almost no structural information and most of the +automation techniques require natural language processing. We chose RSS feeds +because RSS feeds lack inherent structural information but are still very +structured. This structure is because, as said above, the RSS feeds are +generated and therefore almost always look the same. Also, in RSS feeds most +venues use particular structural identifiers that are characters. They separate +fields with vertical bars, commas, whitespace and more non text characters. +These field separators and keywords can be hard for a computer to detect but +people are very good in detecting these. With one look they can identify the +characters and keywords and build a pattern in their head. Another reason we +chose RSS is their temporal consistency, RSS feeds are almost always generated +and because of that the structure of the entries is very unlikely to change. +Basically the RSS feeds only change structure when the CMS that generates it +changes the generation algorithm. This property is usefull because the crawlers +then do not have to be retrained very often. Because the non inherent structure +entirely new strategies had to be applied. \section{Directed Acyclic Graphs} -\paragraph{Normal graphs} -A graph($G$) is a mathematical structure to describe relations between nodes -with edges. A standard graph is defined as the ordered pair: $G=(V,E)$. In -this ordered pair $V$ is the set of nodes and $E$ is set of undirected edges -where every undirected edge is a tuple of two nodes. -Figure~\ref{graphexample} is specified as: - $$G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3), (n2, n3)\})$$ +\paragraph{Directed graphs} +Directed graphs are a mathematical structure that can describe relations +between nodes. A directed graph $G$ is be defined as the tuple $(V,E)$ where +$V$ is a set of named nodes and $E$ is the set of edges defined by $E\subseteq +V\times V$. An edge $e\in E$ can be seen as $(v1, v2)$ where $v1,v2\in V$ and +can be interpreted as in a figure as an arrow between node $v1$ and node $v2$. +Multiple connections between two nodes are possible. For example the graph +visualized in Figure~\ref{graphexample} can be mathematically described as: +$$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\}$$ \begin{figure}[H] - \caption{Example Graph} \label{graphexample} - \centering \scalebox{0.7}{ \digraph[]{graphexample}{ rankdir=LR; - n1 -> n2 [dir="none"]; - n2 -> n3 [dir="none"]; - n2 -> n3 [dir="none"]; - } - } -\end{figure} - -\paragraph{Directed graphs} -Directed graphs are special kind of graphs. A directed graph $G$ is also -defined by the ordered pair: $G=(V,E)$. $V$ still defines the nodes and $E$ -still the edges but the inherent difference is that the edges are ordered -tuples in stead of not ordered. Adding this property gives the edges a -direction. Every edge has a specific start and end and are therefore called -directed edges. A directed graph would look the same as the graph in -Figure~\ref{graphexample} but then visualized with arrows instead of normal -lines. The arrows specifically go from one node to the other and not the other -way around. However bidirectional connection can occur. For example graph the -graph shown in Figure~\ref{dgexample} is directional with a bidirectional -connection. -$$G=(\{n1, n2\}, \{(n1, n2), (n2, n1)\}$$ - -\begin{figure}[H] - \caption{Example directed graph} - \label{dgexample} - \centering - \scalebox{0.7}{ - \digraph[]{dgexample}{ - rankdir=LR; - n1 -> n2; - n2 -> n1; + n1->n2; + n2->n1; + n2->n3; + n3->n4; + n1->n4; } } + \caption{Example DG} \end{figure} \paragraph{Directed acyclic graphs} @@ -373,13 +349,18 @@ Directed Acyclic Graphs(DAGs) are a special kind of directed graphs. DAGs are also defined as $G=(V,E)$ but with a restriction on $E$. Namely that cycles are not allowed. Figure~\ref{dagexample} shows two graphs. The bottom graph ontains a cycle and the right graph does not. Only the top graph is a valid -DAG. A cycle is defined by a sequence of edges where nodes are visited more -then once. Adding the property of non-cyclicity to graphs lowers the -computational complexity of checking if a node sequence is present in the -graph to $\mathcal{O}(L)$ where $L$ is the length of the sequence. +DAG. A cycle can be defined as follows: + +Say $e\in E$ can be seen as $(v_1,v_2)\in E$ or $v_1\rightarrow v2$ then +$v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots +\rightarrow v_2$ meaning that there is a path with a length bigger then $1$ +between $v_1$ and $v_2$. In a non cyclic graph the following holds $\nexists +v1: v1\xrightarrow{+}v1$. In words this means that a node can not be reached +again traveling the graph. Adding the property of non cyclicity to graphs +lowers the computational complexity of path existence in the graph to +$\mathcal{O}(L)$ where $L$ is the length of the path. \begin{figure}[H] - \caption{Example DAG} \label{dagexample} \centering \scalebox{0.7}{ @@ -393,35 +374,39 @@ graph to $\mathcal{O}(L)$ where $L$ is the length of the sequence. n12 -> n14; } } + \caption{Example DAG} \end{figure} \paragraph{Directed Acyclic Word Graphs} -The type of graph used in the project is a special kind of DAG called -Directed Acyclic Word Graphs(DAWGs). A DAWG can be defined by the ordered -triple $G=(V,E,F)$. -$V$ is the same as in directed graphs. $E$ is also the same besides the fact -that the ordered pair of nodes that describes the edge it now is a triple -consisting of a start node, an end node and a label. In a standard DAWG the -label is always one character. $F$ describes the set of final nodes, final -nodes are nodes that can be the end of a sequence even if another arrow leads -out. In the example graph in Figure~\ref{exampledawg} the final nodes are -visualized with a double circle as node shape. In this example it is purely -cosmetic because $n6$ is a final node anyways because there are no arrows -leading out. But for example in $G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3)\}, -\{n2, n3\})$ there is a distinct use for the final node marking. Graph $G$ -accepts the words \textit{a,ab} and to simplify the graph node $n2$ and $n3$ -are final. Because of the property of labeled edges data can be stored -in a DAWG. When traversing a DAWG and saving all the edgelabels one can -construct words. Because of properties of graphs a DAWG can store big sets of -words in a small amount of storage because it can re-use edges to specify -transitions. For example the graph in Figure~\ref{exampledawg} can describe a -dictionary that holds the words: \textit{abd, bad, bae}. Testing if a word is -present in the DAWG is, just as for a DAG, falls in the computational -complexity class of $\mathcal{O}(L)$ meaning that it grows linearly with the -length of the word. +The type of graph used in the project is a special kind of DAG called Directed +Acyclic Word Graphs(DAWGs). A DAWG can be defined by the tuple $G=(V,v_0,E,F)$. +$V$ is the same as in directed graphs, namely the set of nodes. $E$ is the set +of edges described by $E\subseteq V\times V\times L$ where $L\in A$ where $A$ +is the alphabet used as node labels. In words this means that an edge is a +labeled arrow between two nodes and for example the edge $(v_1, v_2, a)$ can be +visualized as: $v_1\xrightarrow{a}v_2$. In a standard DAWG the alphabet $A$ +contains all the characters used in natural language but in theory the alphabet +$A$ can contain anything. $F$ describes the set of final nodes, final nodes are +nodes that can be the end of a sequence even if there is an arrow leading out. +In the example graph in Figure~\ref{exampledawg} the final nodes are visualized +with a double circle as node shape. In this example it is purely cosmetic +because $n6$ is a final node anyways because there are no arrows leading out. +But this does not need to be the case, for example in $G=(\{n1, n2, n3\}, +\{(n1, n2), (n2, n3)\}, \{n2, n3\})$ there is a distinct use for the final node +marking. Graph $G$ accepts the words \textit{a,ab} and to simplify the graph +node $n2$ and $n3$ are final. Finally $v_0$ describes the initial node, this is +visualized in figures as an incoming arrow. Because of the property of labeled +edges, data can be stored in a DAWG. When traversing a DAWG and saving all the +edgelabels one can construct words. Using graph minimalization big sets of +words can be stored using a small amouth of storage because edges can be +re-used to specify transitions. For example the graph in +Figure~\ref{exampledawg} can describe the language $L$ where all words $w$ that +are accepted $w\in\{abd, bad, bae\}$. Testing if a word is present in the DAWG +is the same technique as testing if a node path is present in a normal DAG and +therefore also falls in the computational complexity class of $\mathcal{O}(L)$. +This means that it grows linearly with the length of the word. \begin{figure}[H] - \caption{Example DAWG} \label{exampledawg} \centering \scalebox{0.7}{ @@ -429,6 +414,8 @@ length of the word. rankdir=LR; node [shape="circle"]; n1 n2 n3 n4 n5; n6 [shape="doublecircle"]; + n [style=invis]; + n -> n1; n1 -> n2 [label="a"]; n2 -> n3 [label="b"]; n3 -> n6 [label="d"]; @@ -438,4 +425,5 @@ length of the word. n5 -> n6 [label="e"]; } } + \caption{Example DAWG} \end{figure} diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex index 7ab74fe..7e7039c 100644 --- a/thesis2/thesis.tex +++ b/thesis2/thesis.tex @@ -9,6 +9,7 @@ \usepackage[dvipdfmx,hidelinks]{hyperref} \usepackage{graphviz} \usepackage{amssymb} +\usepackage{amsmath} \usepackage{marvosym} \usepackage{lipsum} -- 2.20.1