information flow.
\begin{figure}[H]
- \caption{Information flow Hyperleap database}
\label{informationflow}
\centering
\scalebox{0.7}{
o1 -> o4;
}
}
+ \caption{Information flow Hyperleap database}
\end{figure}
\paragraph{Sources}
Figure~\ref{feedbackloop} can take days and can be the reason for gaps and
faulty information in the database.
\begin{figure}[H]
- \caption{Feedback loop for malfunctioning crawlers}
\label{feedbackloop}
\centering
\scalebox{0.5}{
programmer -> crawler [constraint=false,style="dotted"];
}
}
+ \caption{Feedback loop for malfunctioning crawlers}
\end{figure}
The goal of the project is to relieve the programmer of repairing crawlers all
\section{Why RSS?}
\label{sec:whyrss}
-The first proposal described formats like HTML, fax/email, RSS, XML and some
-more. Because of the scope of the project and the time planned for it we had to
-remove some of the input formats because they all require different techniques.
-For example when the input source is in HTML format, most probably a website,
-then there can be a great deal of information extraction be automated using the
-structural information which is characteristic for HTML. For fax/email however
-there is almost no structural information and most of the automation techniques
-require natural language processing. We chose RSS feeds because RSS feeds lack
-inherent structural information but are still very structured. This structure
-is because, as said above, the RSS feeds are generated and therefore almost
-always look the same. Also, in RSS feeds most venues use particular structural
-identifiers that are characters. They separate fields with vertical bars,
-commas, whitespace and more non text characters. These field separators and
-keywords can be hard for a computer to detect but people are very good in
-detecting these. With one look they can identify the characters and keywords
-and build a pattern in their head.
-Another reason we chose RSS is their temporal consistency, RSS feeds are almost
-always generated and because of that the structure of the entries is very
-unlikely to change. Basically the RSS feeds only change structure when the CMS
-that generates it changes the generation algorithm. This property is usefull
-because the crawlers then do not have to be retrained very often. Because the
-non inherent structure entirely new strategies had to be applied.
+We described a number of different source formats such as HTML, fax/email,
+RSS and XML. Because of the limited scope of the project and the time planned
+for it we had to drop some of these input formats, since they all require
+different techniques. For example, when the input source is in HTML format,
+most probably a website, a great deal of the information extraction can be
+automated using the structural information that is characteristic of HTML.
+For fax/email however there is almost no structural information and most of
+the automation techniques require natural language processing. We chose RSS
+feeds because, although they lack inherent structural information such as
+markup, the entries are still very structured. This structure exists because,
+as said above, the RSS feeds are generated and therefore almost always look
+the same. Also, in RSS feeds most venues use particular structural
+identifiers that consist of characters: they separate fields with vertical
+bars, commas, whitespace and other non-text characters. These field
+separators and keywords can be hard for a computer to detect, but people are
+very good at detecting them. With one look they can identify the separators
+and keywords and build a pattern in their head. Another reason we chose RSS
+is temporal consistency: RSS feeds are almost always generated and because of
+that the structure of the entries is very unlikely to change. Basically an
+RSS feed only changes structure when the CMS that generates it changes its
+generation algorithm. This property is useful because the crawlers then do
+not have to be retrained very often. Because the structure is not inherent,
+entirely new strategies had to be applied.
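+
+To give an idea of what such generated entries look like, the feed item below
+is a purely hypothetical example that uses vertical bars as field separators;
+it is not taken from an actual source.
+\begin{verbatim}
+<item>
+  <title>2014-04-19 | 20:00 | Main hall | Spring concert</title>
+</item>
+\end{verbatim}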
\section{Directed Acyclic Graphs}
-\paragraph{Normal graphs}
-A graph($G$) is a mathematical structure to describe relations between nodes
-with edges. A standard graph is defined as the ordered pair: $G=(V,E)$. In
-this ordered pair $V$ is the set of nodes and $E$ is set of undirected edges
-where every undirected edge is a tuple of two nodes.
-Figure~\ref{graphexample} is specified as:
- $$G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3), (n2, n3)\})$$
+\paragraph{Directed graphs}
+Directed graphs are a mathematical structure that can describe relations
+between nodes. A directed graph $G$ is defined as the tuple $(V,E)$ where $V$
+is a set of named nodes and $E$ is the set of edges with $E\subseteq V\times
+V$. An edge $e\in E$ is a pair $(v_1, v_2)$ with $v_1,v_2\in V$ and is
+visualized in a figure as an arrow from node $v_1$ to node $v_2$. Connections
+in both directions between two nodes are possible. For example the graph
+visualized in Figure~\ref{graphexample} can be mathematically described as:
+$$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\})$$
\begin{figure}[H]
- \caption{Example Graph}
\label{graphexample}
- \centering
\scalebox{0.7}{
\digraph[]{graphexample}{
rankdir=LR;
- n1 -> n2 [dir="none"];
- n2 -> n3 [dir="none"];
- n2 -> n3 [dir="none"];
- }
- }
-\end{figure}
-
-\paragraph{Directed graphs}
-Directed graphs are special kind of graphs. A directed graph $G$ is also
-defined by the ordered pair: $G=(V,E)$. $V$ still defines the nodes and $E$
-still the edges but the inherent difference is that the edges are ordered
-tuples in stead of not ordered. Adding this property gives the edges a
-direction. Every edge has a specific start and end and are therefore called
-directed edges. A directed graph would look the same as the graph in
-Figure~\ref{graphexample} but then visualized with arrows instead of normal
-lines. The arrows specifically go from one node to the other and not the other
-way around. However bidirectional connection can occur. For example graph the
-graph shown in Figure~\ref{dgexample} is directional with a bidirectional
-connection.
-$$G=(\{n1, n2\}, \{(n1, n2), (n2, n1)\}$$
-
-\begin{figure}[H]
- \caption{Example directed graph}
- \label{dgexample}
- \centering
- \scalebox{0.7}{
- \digraph[]{dgexample}{
- rankdir=LR;
- n1 -> n2;
- n2 -> n1;
+ n1->n2;
+ n2->n1;
+ n2->n3;
+ n3->n4;
+ n1->n4;
}
}
+ \caption{Example directed graph}
\end{figure}
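+
+As a minimal sketch of this definition, assuming Python purely for
+illustration (the names are made up and not part of the project code), the
+example graph can be stored directly as the pair $(V,E)$ and edge lookups
+follow the definition:
+\begin{verbatim}
+# The directed graph from the example, stored as G = (V, E).
+V = {"n1", "n2", "n3", "n4"}
+E = {("n1", "n2"), ("n2", "n1"), ("n2", "n3"),
+     ("n3", "n4"), ("n1", "n4")}
+
+def has_edge(v1, v2):
+    # An arrow from v1 to v2 exists iff the ordered pair is in E.
+    return (v1, v2) in E
+
+print(has_edge("n1", "n2"))  # True
+print(has_edge("n4", "n1"))  # False, edges are directed
+\end{verbatim}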
\paragraph{Directed acyclic graphs}
are also defined as $G=(V,E)$ but with a restriction on $E$, namely that
cycles are not allowed. Figure~\ref{dagexample} shows two graphs. The bottom
graph contains a cycle and the top graph does not. Only the top graph is a valid
-DAG. A cycle is defined by a sequence of edges where nodes are visited more
-then once. Adding the property of non-cyclicity to graphs lowers the
-computational complexity of checking if a node sequence is present in the
-graph to $\mathcal{O}(L)$ where $L$ is the length of the sequence.
+DAG. A cycle can be defined as follows:
+
+Say an edge $e\in E$ can be written as $(v_1,v_2)\in E$ or $v_1\rightarrow
+v_2$, then $v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots
+\rightarrow v_2$, meaning that there is a path of one or more edges between
+$v_1$ and $v_2$. In an acyclic graph the following holds: $\nexists v_1\in V:
+v_1\xrightarrow{+}v_1$. In words this means that no node can be reached again
+by traveling along the edges of the graph. Adding the property of acyclicity
+to graphs lowers the computational complexity of checking whether a given
+path exists in the graph to $\mathcal{O}(L)$ where $L$ is the length of the
+path.
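+
+As a minimal sketch of this check, again in Python purely for illustration, a
+candidate path can be verified edge by edge against the edge set, one lookup
+per edge and therefore $\mathcal{O}(L)$ in the length of the path:
+\begin{verbatim}
+# Check whether a given sequence of nodes is a path in the graph,
+# i.e. whether every consecutive pair of nodes is an edge in E.
+def path_exists(E, path):
+    return all((a, b) in E for a, b in zip(path, path[1:]))
+
+E = {("n1", "n2"), ("n2", "n3"), ("n3", "n4")}
+print(path_exists(E, ["n1", "n2", "n3"]))  # True
+print(path_exists(E, ["n1", "n3"]))        # False
+\end{verbatim}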
\begin{figure}[H]
- \caption{Example DAG}
\label{dagexample}
\centering
\scalebox{0.7}{
n12 -> n14;
}
}
+ \caption{Example DAG}
\end{figure}
\paragraph{Directed Acyclic Word Graphs}
-The type of graph used in the project is a special kind of DAG called
-Directed Acyclic Word Graphs(DAWGs). A DAWG can be defined by the ordered
-triple $G=(V,E,F)$.
-$V$ is the same as in directed graphs. $E$ is also the same besides the fact
-that the ordered pair of nodes that describes the edge it now is a triple
-consisting of a start node, an end node and a label. In a standard DAWG the
-label is always one character. $F$ describes the set of final nodes, final
-nodes are nodes that can be the end of a sequence even if another arrow leads
-out. In the example graph in Figure~\ref{exampledawg} the final nodes are
-visualized with a double circle as node shape. In this example it is purely
-cosmetic because $n6$ is a final node anyways because there are no arrows
-leading out. But for example in $G=(\{n1, n2, n3\}, \{(n1, n2), (n2, n3)\},
-\{n2, n3\})$ there is a distinct use for the final node marking. Graph $G$
-accepts the words \textit{a,ab} and to simplify the graph node $n2$ and $n3$
-are final. Because of the property of labeled edges data can be stored
-in a DAWG. When traversing a DAWG and saving all the edgelabels one can
-construct words. Because of properties of graphs a DAWG can store big sets of
-words in a small amount of storage because it can re-use edges to specify
-transitions. For example the graph in Figure~\ref{exampledawg} can describe a
-dictionary that holds the words: \textit{abd, bad, bae}. Testing if a word is
-present in the DAWG is, just as for a DAG, falls in the computational
-complexity class of $\mathcal{O}(L)$ meaning that it grows linearly with the
-length of the word.
+The type of graph used in the project is a special kind of DAG called a
+Directed Acyclic Word Graph (DAWG). A DAWG can be defined by the tuple
+$G=(V,v_0,E,F)$. $V$ is the same as in directed graphs, namely the set of
+nodes. $E$ is the set of edges described by $E\subseteq V\times V\times A$
+where $A$ is the alphabet used for the edge labels. In words this means that
+an edge is a labeled arrow between two nodes; for example the edge
+$(v_1, v_2, a)$ can be visualized as $v_1\xrightarrow{a}v_2$. In a standard
+DAWG the alphabet $A$ contains all the characters used in natural language,
+but in theory the alphabet $A$ can contain anything. $F$ describes the set of
+final nodes: final nodes are nodes that can be the end of a word even if
+there is an arrow leading out. In the example graph in
+Figure~\ref{exampledawg} the final nodes are visualized with a double circle
+as node shape. In this example the marking is purely cosmetic because $n6$ is
+a final node anyway, since there are no arrows leading out. But this does not
+need to be the case; for example in $G=(\{n1, n2, n3\}, n1, \{(n1, n2, a),
+(n2, n3, b)\}, \{n2, n3\})$ there is a distinct use for the final node
+marking. Graph $G$ accepts the words \textit{a} and \textit{ab} because both
+$n2$ and $n3$ are final, even though $n2$ has an outgoing edge. Finally,
+$v_0$ describes the initial node, which is visualized in figures as an
+incoming arrow. Because of the property of labeled edges, data can be stored
+in a DAWG: when traversing a DAWG and saving all the edge labels, one can
+construct words. Using graph minimization, big sets of words can be stored in
+a small amount of storage because edges can be re-used to specify
+transitions. For example the graph in Figure~\ref{exampledawg} describes the
+language that accepts exactly the words \textit{abd}, \textit{bad} and
+\textit{bae}. Testing whether a word is present in the DAWG uses the same
+technique as testing whether a node path is present in a normal DAG and
+therefore also falls in the computational complexity class $\mathcal{O}(L)$,
+meaning that the lookup time grows linearly with the length of the word.
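+
+A minimal sketch, again in Python purely for illustration, of how such a DAWG
+can be stored and how word lookup works; the node names are made up and do
+not exactly match the nodes in Figure~\ref{exampledawg}:
+\begin{verbatim}
+# A small DAWG accepting "abd", "bad" and "bae", stored as a
+# transition table: node -> {edge label -> next node}.
+edges = {
+    "q0": {"a": "q1", "b": "q2"},
+    "q1": {"b": "q3"},
+    "q2": {"a": "q4"},
+    "q3": {"d": "q5"},
+    "q4": {"d": "q5", "e": "q5"},
+}
+initial = "q0"
+final = {"q5"}
+
+def accepts(word):
+    # Follow one edge per character: O(L) in the word length.
+    node = initial
+    for char in word:
+        node = edges.get(node, {}).get(char)
+        if node is None:
+            return False
+    return node in final
+
+print(accepts("bad"))  # True
+print(accepts("ab"))   # False, the node reached is not final
+\end{verbatim}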
\begin{figure}[H]
- \caption{Example DAWG}
\label{exampledawg}
\centering
\scalebox{0.7}{
rankdir=LR;
node [shape="circle"]; n1 n2 n3 n4 n5;
n6 [shape="doublecircle"];
+ n [style=invis];
+ n -> n1;
n1 -> n2 [label="a"];
n2 -> n3 [label="b"];
n3 -> n6 [label="d"];
n5 -> n6 [label="e"];
}
}
+ \caption{Example DAWG}
\end{figure}