From 743357c5d88592d9546dbeef236784845b8e5309 Mon Sep 17 00:00:00 2001 From: Mart Lubbers Date: Mon, 18 May 2015 12:43:19 +0200 Subject: [PATCH] all fixes applied --- thesis2/1.introduction.tex | 93 +++++++++++++++++++------------------- 1 file changed, 47 insertions(+), 46 deletions(-) diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex index a23705d..cc4160f 100644 --- a/thesis2/1.introduction.tex +++ b/thesis2/1.introduction.tex @@ -173,7 +173,7 @@ has to be post processed. The post-processing encompasses several different steps. The first step is to check the validity of the event entries from a certain -source. Validity checking is useful to detect faulty automated crawlers +source. Validity checking is useful to detect faulty automated crawlers before the data can leak into the database. Validity checking happens at random on certain event entries. @@ -192,7 +192,7 @@ the probability of malformed data leaking into the database as low as possible. \subsection{Database \& Publication} Postprocessed data that leaves the \textit{Temporum} will enter the final database. This database contains all the information about all the events that -happened in the past and the events that will happen in the future. The +happened in the past and the events that will happen in the future. The database also contains the parent information such as information about venues. Several categorical websites use the database to offer the information to users and accompany it with the second part of \textit{infotainment} namely the @@ -243,7 +243,8 @@ interface to a crawler generation system that is able to crawl RSS\cite{Rss} and Atom\cite{Atom} publishing feeds. The interface provides the user with point and click interfaces meaning that no computer science background is needed to use the interface and to create, modify, test and remove crawlers. -The current Hyperleap backend system that handles the data can query XML feeds that contain the crawled data. +The current Hyperleap backend system that handles the data can query XML feeds +that contain the crawled data. The actual research question can then be formulated as: \begin{center} @@ -253,22 +254,23 @@ for RSS feeds} \end{center} \section{RSS/Atom} -RSS/Atom feeds, from now on called RSS feeds, are publishing feeds in the XML -format\cite{Xml} that are used to publish events. Every event or entry consists -of several standardized fields. The main data fields are the \textit{title} and -the \textit{description} field. In these fields the raw data is stored that -describes the event. Further there are several auxiliary fields that store the -link to the full article, store the publishing data, store some media in the -form of an image or video URL and store a \textit{Globally Unique -Identifier}(GUID)\footnote{A GUID is a unique identifier that in most cases is -the permalink of the article. A permalink is a link that will point to the -article}. An example of a RSS feed can be found in Listing~\ref{examplerss}, -this listing shows a partly truncated RSS feed of a well known venue in the -Netherlands. As visible in the listing the similarities with XML are very -clear. Every RSS feed contains a \texttt{channel} tag and within that tag there -is some metadata and a list op \texttt{item} tags. Every \texttt{item} tag has -a fixed number of different fields. The most important fields for RSS within -the leisure industry are the \texttt{title} and the \texttt{description} field. +RSS/Atom feeds, from now on called RSS feeds, are publishing feeds. Such feeds +publish their data in a restricted XML format\cite{Xml} consisting of entries. +Every entry usually represents an event and consists of standardized data +fields. The data fields we are interested in are the \textit{title} and the +\textit{description} fields. Those fields store the raw data which describes +the event. Besides the fields we are interested in there are there are several +auxiliary fields that for example store the link to the full article, store the +publishing data, store some media in the form of an image or video URL or store +a \textit{Globally Unique Identifier}(GUID)\footnote{A GUID is a unique +identifier that in most cases is the permalink of the article. A permalink is a +link that will point to the article}. An example of a RSS feed can be found in +Listing~\ref{examplerss}, this listing shows a, partly truncated, RSS feed of a +well known venue in the Netherlands. Every RSS feed contains a \texttt{channel} +field and within that field there is some metadata and a list op \texttt{item} +fields. Every \texttt{item} field has a fixed number of different fields. The +most important fields for RSS within the leisure industry are the +\texttt{title} and the \texttt{description} field. \lstinputlisting[language=xml,label={examplerss},caption={An example of a,% partly truncated RSS feed of a well known dutch venue}]{exrss.xml} @@ -278,8 +280,8 @@ only contains the headlines of the entries. If the user, who reads the feed, is interested it can click the so called deeplink and will be sent to the full website containing the full entry or article. Users often use programs that bundle user specified RSS feeds into big combined feeds that can be used to -sift through a lot of news feeds. The RSS feed reader uses the GUID to skip -feeds that are already present in its internal database. +sift through a lot of news feeds. The RSS feed reader uses the unique GUID to +skip feeds that are already present in its internal database. Generating RSS feeds by hand is a tedious task but almost all RSS feeds are generated by a Content Management Systems(CMS) on which the website runs. With @@ -292,9 +294,9 @@ should often have RSS feeds. \section{Why RSS?} \label{sec:whyrss} -We described a lot of different source formats like HTML, fax/email, RSS and -XML. Because of the limited scope of the project and the time planned for it -we had to remove some of the input formats because they all require different +There are lots of different source formats like HTML, fax/email, RSS and XML. +Because of the limited scope of the project and the time planned for it we had +to remove some of the input formats because they all require different techniques and approaches to tackle. For example when the input source is in HTML format, most probably a website, then there can be a great deal of information extraction be automated using the structural information which is a @@ -305,33 +307,33 @@ information but are still very structured. This structure is because, as said above, the RSS feeds are generated and therefore almost always look the same. Also, in RSS feeds most venues use particular structural identifiers that are characters. They separate fields with vertical bars, commas, whitespace and -more non text characters. These field separators and keywords can be hard for -a computer to detect but people are very good in detecting these. With one look +more non text characters. These field separators and keywords can be hard for a +computer to detect but people are very good in detecting these. With one look they can identify the characters and keywords and build a pattern in their head. Another reason we chose RSS is their temporal consistency, RSS feeds are almost always generated and because of that the structure of the entries is -very unlikely to change. Basically the RSS feeds only change structure when -the CMS that generates it changes the generation algorithm. This property is -useful because the crawlers then do not have to be retrained very often. -To detect the structures we propose a technique using subword matching using -graphs. +very unlikely to change. Basically the RSS feeds only change structure when the +CMS that generates it changes the generation algorithm. This property is useful +because the crawlers then do not have to be retrained very often. To detect +the underlying structures a technique is used that exploits subword matching +with graphs. \section{Directed Acyclic Graphs} \paragraph{Directed graphs} -Directed graphs(DG) are a mathematical structure that can describe relations +Directed graphs(DG) are mathematical structures that can describe relations between nodes. A directed graph $G$ is be defined as the tuple $(V,E)$ where $V$ is a set of named nodes and $E$ is the set of edges defined by $E\subseteq -V\times V$. An edge $e\in E$ can be seen as $(v1, v2)$ where $v1,v2\in V$ and -can be interpreted as in a figure as an arrow between node $v1$ and node $v2$. -Multiple connections between two nodes are possible. For example the graph -visualized in Figure~\ref{graphexample} can be mathematically described as: +V\times V$. An edge $e\in E$ is defined as $(v1, v2)$ where $v1,v2\in V$ and is +show in the figure as an arrow between node $v1$ and node $v2$. Multiple +connections between two nodes are possible if the directions differ. For +example the graph visualized in Figure~\ref{graphexample} can be mathematically +described as: $$G=(\{n1, n2, n3, n4\}, \{(n1, n2), (n2, n1), (n2, n3), (n3, n4), (n1, n4)\}$$ \begin{figure}[H] \label{graphexample} \centering \includegraphics[scale=0.7]{graphexample.pdf} - \strut\\\strut\\ \caption{Example DG} \end{figure} @@ -342,20 +344,19 @@ are not allowed. Figure~\ref{dagexample} shows two graphs. The bottom graph contains a cycle and the right graph does not. Only the top graph is a valid DAG. A cycle can be defined as follows: -Say $e\in E$ can be seen as $(v_1,v_2)\in E$ or $v_1\rightarrow v2$ then -$v_1\xrightarrow{+}v_2$ is defined as $v_1\rightarrow \ldots -\rightarrow v_2$ meaning that there is a path with a length bigger then $1$ -between $v_1$ and $v_2$. In a non cyclic graph the following holds $\nexists -v1: v1\xrightarrow{+}v1$. In words this means that a node can not be reached -again traveling the graph. Adding the property of non cyclicity to graphs -lowers the computational complexity of path existence in the graph to -$\mathcal{O}(L)$ where $L$ is the length of the path. +If $e\in E$ is defined as $(v_1,v_n)\in E$ or $v_1\rightarrow v_n$ then +$v_1\xrightarrow{+}v_n$ which means $v_1\rightarrow v_2 \rightarrow \ldots +\rightarrow v_{n-1} \rightarrow v_n$ meaning that there is a connection with a +length larger then $1$ between $v_1$ and $v_2$. In a non cyclic graph the +following holds $\nexists v\in V: v\xrightarrow{+}v$. In words this means that +a node can not be reached again traveling the graph. Adding the property of non +cyclicity to graphs lowers the computational complexity of path existence in +the graph to $\mathcal{O}(L)$ where $L$ is the length of the path. \begin{figure}[H] \label{dagexample} \centering \includegraphics[scale=0.7]{dagexample.pdf} - \strut\\\strut\\ \caption{Example DAG} \end{figure} -- 2.20.1