by definition a programmer, can be instructed to change the structure and this
can also greatly reduce programmer intervention time.
+\section{RSS/Atom}
+RSS/Atom feeds, from now on called RSS feeds, are feeds in the XML
+format\footnote{\url{http://www.w3.org/TR/REC-xml/}} that are used to publish
+events. Every event or entry consists of several standardized fields. The main
+data fields are the \textit{title} and the \textit{description} field; these
+fields store the raw data that describes the event. Furthermore there are
+several auxiliary fields that can store the link to the full article, the
+publication date, media in the form of an image or video URL, and a
+\textit{Globally Unique Identifier} (GUID).
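+To illustrate this structure, the sketch below extracts the main and auxiliary
+fields of every entry in an RSS~2.0 feed using only the Python standard
+library; the feed file name is purely illustrative.
+\begin{lstlisting}[language=Python]
+import xml.etree.ElementTree as ET
+
+# Parse a local copy of an RSS 2.0 feed (the file name is illustrative).
+tree = ET.parse('venue_feed.xml')
+
+for item in tree.getroot().iter('item'):
+    enclosure = item.find('enclosure')  # optional image or video
+    entry = {
+        'title': item.findtext('title'),              # main data field
+        'description': item.findtext('description'),  # main data field
+        'link': item.findtext('link'),                 # deeplink to the article
+        'pubDate': item.findtext('pubDate'),           # publication date
+        'guid': item.findtext('guid'),                 # globally unique identifier
+        'media': enclosure.get('url') if enclosure is not None else None,
+    }
+    print(entry)
+\end{lstlisting}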
+
+RSS feeds are mostly used by news sites to publish their articles. The feed
+then only contains the headlines; if the user reading the feed is interested,
+he can click the so-called deeplink and is sent to the full website containing
+the article. Users often use programs that bundle user-specified RSS feeds
+into one big summary feed that can be used to sift through a lot of news
+feeds. Such an RSS feed reader uses the GUID to skip entries that are already
+present in its internal database.
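+This deduplication can be illustrated with a small sketch that keeps a set of
+previously seen identifiers; the function name and the in-memory set are
+illustrative, since a real feed reader persists this information in its
+database.
+\begin{lstlisting}[language=Python]
+seen_guids = set()  # a real reader stores these in its internal database
+
+def is_new(entry):
+    """Report True only the first time a given GUID is encountered."""
+    if entry['guid'] in seen_guids:
+        return False
+    seen_guids.add(entry['guid'])
+    return True
+\end{lstlisting}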
+
+Generating RSS feeds by hand is a tedious task, but usually RSS feeds are
+generated by the Content Management System (CMS) the website runs on. With
+this automatic method the RSS feeds are generated from the content published
+on the website. Because the process is automatic, the feeds are generally very
+structured and consistent. In the entertainment industry venues often use a
+CMS for their website so that users with no programming or web development
+background are able to post news items and event information.
+
\section{Why RSS/Atom}
Information from venues comes in various different formats, each format with
its own positive and negative points. For this project we chose to focus on
-RSS/Atom feeds because they are in general already structured and consistent in
-their structure. For example websites change a lot in structure and layout and
-thus making it hard to keep crawlers up to date. RSS/Atom feeds generally only
-change structure because the website or content management system gets migrated
-or upgraded.
-
-RSS/Atom feeds in comparison to websites doesn't have a structural dimension in
-the data. Because of this we have to use different techniques of isolation of
-information than present techniques used for extracting information from
-websites. RSS/Atom feeds are basically two fields with plain text, however the
-text almost always has the same structure and keywords and therefore the
-information can be extracted learning from keywords and structure.
+RSS feeds. RSS feeds are, as explained earlier, very structured and consistent.
+The structure basically only changes when the CMS is changed or upgraded, so
+the patterns do not have to be retrained often.
+
+In comparison to websites, RSS feeds do not have a structural dimension in the
+data: all the information in an entry is basically contained in two fields of
+plain text. Therefore an entirely new strategy has to be applied to train the
+patterns.
\section{Scientific relevance and similar research}
Currently the techniques for conversion from unstructured data to structured
data are still limited. There is a great need for data mining of unstructured
data because the data within companies and on the internet is piling up and is
usually left to catch dust.
-The project is a side project of the past project done by Roelofs et
-al.\cite{Roelofs2009}. Roelofs et al. describes a technique to recognize date
-strings using an adapted Levensteins algorithm. This technique can be fitted in
-the current project because it works on a low level and the technique we
-describe works on a high level. The algorithm only works on already isolated
-data.
+The project is a follow-up of earlier work done by Roelofs et
+al.~\cite{Roelofs2008,Roelofs2009}. Roelofs et al. described techniques for
+recognizing patterns in website data and published an adapted Levenshtein
+distance algorithm for recognizing data in isolated text. These techniques for
+recognizing data in text can still be used to interpret the isolated parts
+extracted by this project.
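+For reference, the sketch below computes the plain Levenshtein edit distance
+between two strings with dynamic programming; it only illustrates the
+principle and is not the adapted algorithm of Roelofs et al.
+\begin{lstlisting}[language=Python]
+def levenshtein(a, b):
+    """Minimal number of edits needed to turn string a into string b."""
+    # dist[i][j] = edits needed to turn a[:i] into b[:j]
+    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
+    for i in range(len(a) + 1):
+        dist[i][0] = i
+    for j in range(len(b) + 1):
+        dist[0][j] = j
+    for i in range(1, len(a) + 1):
+        for j in range(1, len(b) + 1):
+            cost = 0 if a[i - 1] == b[j - 1] else 1
+            dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
+                             dist[i][j - 1] + 1,        # insertion
+                             dist[i - 1][j - 1] + cost) # substitution
+    return dist[len(a)][len(b)]
+\end{lstlisting}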
+
+\section{Directed Acyclic Graphs}
+A graph is a mathematical structure described by the ordered pair $G=(V,E)$,
+in which $V$ is the set of nodes and $E$ is a set of unordered pairs of nodes
+that we will call \textit{undirected edges}.
+
+Directed graphs are a special kind of graph. A directed graph is described by
+the ordered pair $G=(V,A)$, where $A$ is a set of ordered pairs of nodes that
+we will call \textit{directed edges}. Directed edges can be visualized as
+arrows in a graph, whereas undirected edges are just lines connecting the
+nodes without a particular direction.
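+As a small example, the directed graph with three nodes in which node~1 points
+to nodes 2 and~3, and node~2 points to node~3, is written as
+\[
+    G = (V, A), \qquad V = \{1, 2, 3\}, \qquad A = \{(1,2), (1,3), (2,3)\}.
+\]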
+
+Directed Acyclic Graphs (DAGs) are again a special kind of directed graph. A
+DAG does not contain cycles: no path through the graph visits the same node
+twice. DAGs have the nice property that checking whether a sequence is present
+in the graph only has a computational complexity of $\mathcal{O}(L)$, where
+$L$ is the length of the sequence.
+
+The type of graph used in the project is another special kind of DAG called a
+Directed Acyclic Word Graph (DAWG). This type of graph can be used to store
+large dictionaries of words. The nodes of a DAWG basically represent positions
+within a word and the edges describe the character that is added when
+traversing an edge. For example the graph in Figure~\ref{fig:2.1.1} describes
+a dictionary that holds the words \textit{abd}, \textit{bad} and \textit{bae}.
+Testing whether a word is present in the DAWG falls, just as for a DAG, in the
+computational complexity class $\mathcal{O}(L)$, meaning that the test grows
+linearly with the length of the word.
+
+\begin{figure}[H]
+ \caption{Example DAWG}
+ \label{fig:2.1.1}
+ \centering
+ \digraph[]{graph21}{
+ rankdir=LR;
+ 1,2,3,4,5 [shape="circle"];
+ 6 [shape="doublecircle"];
+ 1 -> 2 [label="a"];
+ 2 -> 3 [label="b"];
+ 3 -> 6 [label="d"];
+ 1 -> 4 [label="b"];
+ 4 -> 5 [label="a"];
+		5 -> 6 [label="d"];
+ 5 -> 6 [label="e"];
+ }
+\end{figure}
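+The $\mathcal{O}(L)$ lookup can be made concrete with a small sketch. Below,
+the example DAWG of Figure~\ref{fig:2.1.1} is stored as a transition table (a
+dictionary per node mapping a character to the next node) and a word is
+accepted when its characters lead from the initial node to the final node.
+This is an illustrative encoding, not the implementation used in the crawler.
+\begin{lstlisting}[language=Python]
+# Transition table of the example DAWG; node 6 is the only final node.
+edges = {
+    1: {'a': 2, 'b': 4},
+    2: {'b': 3},
+    3: {'d': 6},
+    4: {'a': 5},
+    5: {'d': 6, 'e': 6},
+    6: {},
+}
+final = {6}
+
+def accepts(word, start=1):
+    """Follow one edge per character: O(L) in the length of the word."""
+    node = start
+    for char in word:
+        if char not in edges[node]:
+            return False
+        node = edges[node][char]
+    return node in final
+
+# accepts('bad') -> True, accepts('ba') -> False
+\end{lstlisting}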
+++ /dev/null
-\section{Application overview and workflow}
-The program can be divided into two main components namely the \textit{Crawler
-application} and the \textit{Input application}. The components are strictly
-separated by task and by application. The crawler is an application dedicated
-to the sole task of periodically crawling the sources asynchronously. The input
-is a web interface to a set of tools that can create, edit, remove and test
-crawlers via simple point and click user interfaces that can be worked with by
-someone without a computer science background.
-
-
-\section{Input application}
-\subsection{Components}
-Add new crawler
-
-Editing or remove crawlers
-
-Test crawler
-
-Generate xml
-
-
-\section{Crawler application}
-\subsection{Interface}
-
-\subsection{Preprocessing}
-When the data is received by the crawler the data is embedded as POST data in a
-HTTP request. The POST data consists of several fields with information about
-the feed and a container that has the table with the user markers embedded.
-After that the entries are extracted and processed line by line.
-
-The line processing converts the raw string of html data from a table row to a
-string. The string is stripped of all the html tags and is accompanied by a
-list of marker items. The entries that don't contain any markers are left out
-in the next step of processing. All data, including entries without user
-markers, is stored in the object too for possible later reference, for example
-for editing the patterns.
-
-The last step is when the entries with markers are then processed to build
-node-lists. Node-lists are basically lists of words that, when concatenated,
-form the original entry. A word isn't a word in the linguistic sense. A word
-can be one letter or a category. The node-list is generated by putting all the
-separate characters one by one in the list and when a user marking is
-encountered, this marking is translated to the category code and that code is
-then added as a word. The nodelists are then sent to the actual algorithm to be
-converted to a graph representation.
-
-\subsection{Directed acyclic graphs(DAG)}
-Directed acyclic graphs are a special kind of graph that is used to store big
-sets of words and has a complexity of $\mathcal{O}(L)$, meaning the length of
-the word you are searching. Figure~\ref{fig:f21} shows a DAG that accepts the
-words \textit{aa} and \textit{ab}. Final states are marked with a double circle
-and intermediate states are marked with a single circle.
-
-\begin{figure}[H]
- \caption{Sample DAG}
- \label{fig:f21}
- \centering
- \digraph[]{graph21}{
- rankdir=LR;
- 1, 2 [shape="circle"];
- 3, 4 [shape="doublecircle"];
- 1 -> 2 [label="a"];
- 2 -> 3 [label="a"];
- 2 -> 4 [label="b"];
- }
-\end{figure}
-
-The first algorithm to generate DAG's was proposed by Hopcroft et
-al\cite{Hopcroft1971}. The algorithm they described wasn't incremental and had
-a complexity of $\mathcal{O}(N\log{N})$. \cite{Daciuk2000} et al. later
-extended the algorithm and created an incremental one without increasing the
-computational complexity. The non incremental algorithm from Daciuk et al. is
-used to convert the nodelists to a graph.
-
-For example constructing a graph that from the entry: \textit{a,bc} and
-\textit{a.bc} goes in the following steps:
-
-\begin{figure}[H]
- \caption{Sample DAG, first entry}
- \label{fig:f22}
- \centering
- \digraph[]{graph22}{
- rankdir=LR;
- 1,2,3,5 [shape="circle"];
- 5 [shape="doublecircle"];
- 1 -> 2 [label="a"];
- 2 -> 3 [label="."];
- 3 -> 4 [label="b"];
- 4 -> 5 [label="c"];
- }
-\end{figure}
-
-\begin{figure}[H]
- \caption{Sample DAG, second entry}
- \label{fig:f23}
- \centering
- \digraph[]{graph23}{
- rankdir=LR;
- 1,2,3,5,6 [shape="circle"];
- 5 [shape="doublecircle"];
- 1 -> 2 [label="a"];
- 2 -> 3 [label="."];
- 3 -> 4 [label="b"];
- 4 -> 5 [label="c"];
-
- 2 -> 6 [label=","];
- 6 -> 4 [label="b"];
- }
-\end{figure}
-
-\subsection{Defining categories}
-pass
-
-\subsection{Process}
-Proposal was written
-
-
-First html/mail/fax/rss, worst case rss
-
-
-After some research and determining the scope of the project we decided only to
-do RSS, this because RSS tends to force structure in the data because RSS feeds
-are often generated by the website and thus reliable and consistent. We found a
-couple of good RSS feeds.
-
-
-At first the general framework was designed and implemented, no method yet.
-
-
-Started with method for recognizing separators.
-
-
-Found research paper about algorithm that can create directed acyclic graphs
-from string, although it was designed to compress word lists it can be
-(mis)used to extract information.
-
-
-Implementation of DAG algorithm found and tied to the program.
-
-
-Command line program ready. Conversation with both supervisors, gui had to be
-made.
-
-Step by step gui created. Web interface as a control center for the crawlers.
-
-
-Gui optimized.
-
-
-Concluded that the program doesn't reach wide audience due to lack of well
-structured rss feeds.
+++ /dev/null
- \section{Algorithm}
- \section{Progress}
SHELL:=/bin/bash
+VERSION:=0.2
all: thesis
thesis:
- latex thesis.tex
- bibtex thesis.aux
- latex -shell-escape thesis.tex
- latex thesis.tex
- dvipdfm thesis.dvi
+ latex thesis.tex > log.txt
+ bibtex thesis.aux >> log.txt
+ latex -shell-escape thesis.tex >> log.txt
+ latex thesis.tex >> log.txt
+ dvipdfm thesis.dvi >> log.txt 2>&1
+ mv -v {thesis,mart_thesis_$(VERSION)}.pdf
clean:
- rm -vf *.{aux,bbl,blg,dvi,log,out,pdf,toc,dot,ps}
+ rm -vf *.{aux,bbl,blg,dvi,log,out,pdf,toc,dot,ps} log.txt
+++ /dev/null
-\begin{thebibliography}{1}
-
-\bibitem{Daciuk2000}
-Jan Daciuk, Stoyan Mihov, Bruce~W. Watson, and Richard~E. Watson.
-\newblock {Incremental Construction of Minimal Acyclic Finite-State Automata}.
-\newblock {\em Computational Linguistics}, 26(1):3--16, March 2000.
-
-\bibitem{Hopcroft1971}
-John Hopcroft.
-\newblock {An N log N algorithm for minimizing states in a finite automaton}.
-\newblock Technical report, 1971.
-
-\bibitem{Roelofs2009}
-Wouter Roelofs, Alessandro~Tadeo Paula, and Franc Grootjen.
-\newblock {Programming by Clicking}.
-\newblock In {\em Proceedings of the Dutch Information Retrieval Conference},
- pages 2--3, 2009.
-
-\end{thebibliography}
@inproceedings{Aitken2002,
author = {Aitken, James Stuart},
booktitle = {Proceedings of the 15th Eureopean Conference on Artificial Intelligence},
-file = {:home/mart/articles/2002/Aitken\_Learning Information Extraction Rules An Inductive Logic Programming approach.pdf:pdf},
+file = {:home/mart/articles/2002/2002\_Aitken\_Learning Information Extraction Rules An Inductive Logic Programming approach.pdf:pdf},
title = {{Learning Information Extraction Rules : An Inductive Logic Programming approach}},
year = {2002}
}
@article{Berry1986,
author = {Berry, Gerard and Sethi, Ravi},
doi = {10.1016/0304-3975(86)90088-5},
-file = {:home/mart/articles/1986/Berry, Sethi\_From regular expressions to deterministic automata.pdf:pdf},
+file = {:home/mart/articles/1986/1986\_Berry, Sethi\_From regular expressions to deterministic automata.pdf:pdf},
issn = {03043975},
journal = {Theoretical Computer Science},
month = jan,
arxivId = {arXiv:1010.5318v3},
author = {Berstel, Jean and Boasson, Luc and Carton, Olivier and Fagnot, Isabelle},
eprint = {arXiv:1010.5318v3},
-file = {:home/mart/articles/2010/Berstel et al.\_Minimization of automata.pdf:pdf},
+file = {:home/mart/articles/2010/2010\_Berstel et al.\_Minimization of automata.pdf:pdf},
journal = {Mathematics Subject Classification},
keywords = {finite automata,hopcroft,minimization,s algorithm},
title = {{Minimization of automata}},
}
@article{Bruggemann-klein1993,
author = {Br\"{u}ggemann-klein, Anne},
-file = {:home/mart/articles/1993/Br\"{u}ggemann-klein\_Regular expressions into finite automata.pdf:pdf},
+file = {:home/mart/articles/1993/1993\_Br\"{u}ggemann-klein\_Regular expressions into finite automata.pdf:pdf},
journal = {Theoretical Computer Science},
pages = {197--213},
title = {{Regular expressions into finite automata}},
@article{Daciuk2000,
author = {Daciuk, Jan and Mihov, Stoyan and Watson, Bruce W. and Watson, Richard E.},
doi = {10.1162/089120100561601},
-file = {:home/mart/articles/2000/Daciuk et al.\_Incremental Construction of Minimal Acyclic Finite-State Automata.pdf:pdf},
+file = {:home/mart/articles/2000/2000\_Daciuk et al.\_Incremental Construction of Minimal Acyclic Finite-State Automata.pdf:pdf},
issn = {0891-2017},
journal = {Computational Linguistics},
month = mar,
volume = {26},
year = {2000}
}
+@article{Grootjen2014,
+author = {Grootjen, Franc and Otworowska, Maria},
+file = {:home/mart/articles/2014/2014\_Grootjen, Otworowska\_Proceedings of the 26th Benelux Conference on Artificial Intelligence.pdf:pdf},
+title = {{Proceedings of the 26th Benelux Conference on Artificial Intelligence}},
+year = {2014}
+}
@techreport{Hopcroft1971,
author = {Hopcroft, John},
-file = {:home/mart/articles/1971/Hopcroft\_An N log N algorithm for minimizing states in a finite automaton.pdf:pdf},
+file = {:home/mart/articles/1971/1971\_Hopcroft\_An N log N algorithm for minimizing states in a finite automaton.pdf:pdf},
title = {{An N log N algorithm for minimizing states in a finite automaton}},
year = {1971}
}
@article{Hromkovic2001,
author = {Hromkovic, Juraj and Seibert, Sebastian and Wilke, Thomas},
-file = {:home/mart/articles/2001/Hromkovic, Seibert, Wilke\_Translating Regular Expressions into Small = - Free Nondeterministic Finite Automata.pdf:pdf},
+file = {:home/mart/articles/2001/2001\_Hromkovic, Seibert, Wilke\_Translating Regular Expressions into Small = - Free Nondeterministic Finite Automata.pdf:pdf},
journal = {Journal of Computer and System Sciences},
keywords = {finite automata,regular expressions},
pages = {565--588},
@article{Knuutila2001,
author = {Knuutila, Timo},
doi = {10.1016/S0304-3975(99)00150-4},
-file = {:home/mart/articles/2001/Knuutila\_Re-describing an algorithm by Hopcroft.pdf:pdf},
+file = {:home/mart/articles/2001/2001\_Knuutila\_Re-describing an algorithm by Hopcroft.pdf:pdf},
issn = {03043975},
journal = {Theoretical Computer Science},
keywords = {algorithms,finite automata,minimization},
volume = {250},
year = {2001}
}
+@article{Roelofs2008,
+author = {Roelofs, Wouter},
+file = {:home/mart/articles/2008/2008\_Roelofs\_HyperBot Building Web Robots for Non-Programmers.pdf:pdf},
+title = {{HyperBot Building Web Robots for Non-Programmers}},
+year = {2008}
+}
@inproceedings{Roelofs2009,
author = {Roelofs, Wouter and Paula, Alessandro Tadeo and Grootjen, Franc},
booktitle = {Proceedings of the Dutch Information Retrieval Conference},
-file = {:home/mart/articles/2009/Roelofs, Paula, Grootjen\_Programming by Clicking.pdf:pdf},
+file = {:home/mart/articles/2009/2009\_Roelofs, Paula, Grootjen\_Programming by Clicking.pdf:pdf},
keywords = {levenshtein matching,subtree matching,web crawler},
pages = {2--3},
title = {{Programming by Clicking}},
}
@article{Science1985,
author = {Science, Computer},
-file = {:home/mart/articles/1985/Science\_A. e h r e n f e u c h t •.pdf:pdf},
+file = {:home/mart/articles/1985/1985\_Science\_A. e h r e n f e u c h t •.pdf:pdf},
pages = {31--55},
title = {{A. e h r e n f e u c h t •}},
volume = {40},
}
@article{Watson,
author = {Watson, Bruce W.},
-file = {:home/mart/articles/Unknown/Watson, Science, Eindhoven\_Constructing minimal acyclic deterministic finite automata.pdf:pdf},
+file = {:home/mart/articles/Unknown/Unknown\_Watson, Science, Eindhoven\_Constructing minimal acyclic deterministic finite automata.pdf:pdf},
title = {{Constructing minimal acyclic deterministic finite automata}}
}
+++ /dev/null
-This is BibTeX, Version 0.99d (TeX Live 2015/dev/Debian)
-Capacity: max_strings=35307, hash_size=35307, hash_prime=30011
-The top-level auxiliary file: thesis.aux
-The style file: plain.bst
-Database file #1: thesis.bib
-Warning--empty institution in Hopcroft1971
-You've used 3 entries,
- 2118 wiz_defined-function locations,
- 516 strings with 4464 characters,
-and the built_in function-call counts, 993 in all, are:
-= -- 95
-> -- 48
-< -- 1
-+ -- 19
-- -- 16
-* -- 70
-:= -- 173
-add.period$ -- 9
-call.type$ -- 3
-change.case$ -- 18
-chr.to.int$ -- 0
-cite$ -- 4
-duplicate$ -- 38
-empty$ -- 75
-format.name$ -- 16
-if$ -- 202
-int.to.chr$ -- 0
-int.to.str$ -- 3
-missing$ -- 2
-newline$ -- 18
-num.names$ -- 6
-pop$ -- 18
-preamble$ -- 1
-purify$ -- 14
-quote$ -- 0
-skip$ -- 26
-stack$ -- 0
-substring$ -- 47
-swap$ -- 8
-text.length$ -- 1
-text.prefix$ -- 0
-top$ -- 0
-type$ -- 12
-warning$ -- 1
-while$ -- 11
-width$ -- 4
-write$ -- 34
-(There was 1 warning)
-%\documentclass{scrbook}
\documentclass{book}
-%\usepackage{scrhack}
+\usepackage[british]{babel}
+
\usepackage{graphicx} % Images
\usepackage{float} % Better placement float figures
\usepackage{listings} % Source code formatting
showspaces=false
}
+\newcommand{\cvartitle}{Non IT configurable adaptive data mining solution used
+in transforming raw data to structured data}
% Setup hyperlink formatting
\hypersetup{
- pdftitle={Non IT configurable adaptive data mining solution used in transforming raw data to structured data},
+ pdftitle={\cvartitle},
pdfauthor={Mart Lubbers},
pdfsubject={Artificial Intelligence},
}
% Describe the frontpage
-\author{Mart Lubbers\\s4109053}
-\title{Non IT configurable adaptive data mining solution used in transforming raw
-data to structured data}
-%\subtitle{
-% Bachelor's Thesis in Artificial Intelligence\\
-% Radboud University Nijmegen\\
-% \vspace{15mm}
-% \begin{tabular}{cp{5em}c}
-% Franc Grootjen && Alessandro Paula\\
-% RU && Hyperleap
-% \end{tabular}
-% }
+\author{
+ Mart Lubbers\\
+ s4109053\\
+ \strut\\
+ External supervisor: Alessandro Paula\\
+ Internal supervisor: Franc Grootjen\\
+}
+\title{\cvartitle}
\date{\today}
\begin{document}
\chapter{Introduction}
\input{1.introduction.tex}
-\chapter{Methods}
-\input{2.methods.tex}
+\chapter{Requirements and design}
+\input{2.requirementsanddesign.tex}
+
+\chapter{Algorithm}
+\input{3.methods.tex}
\chapter{Conclusion}
-\input{3.conclusion.tex}
+\input{4.conclusion.tex}
\section{Discussion}
-\input{4.discussion.tex}
+\input{5.discussion.tex}
\chapter{Appendices}
-\input{5.appendices.tex}
+\input{6.appendices.tex}
\bibliographystyle{plain}
\bibliography{thesis}