From 0f92dac691948ce99c2e9fe76ea82e2cfa933f47 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Sun, 9 Nov 2014 20:13:00 +0100
Subject: [PATCH] update

---
 thesis2/1.introduction.tex | 104 ++++++++++++++++++++-----
 thesis2/2.methods.tex      | 151 -------------------------------------
 thesis2/3.conclusion.tex   |   0
 thesis2/4.discussion.tex   |   0
 thesis2/5.appendices.tex   |   2 -
 thesis2/Makefile           |  14 ++--
 thesis2/thesis.bbl         |  19 -----
 thesis2/thesis.bib         |  34 ++++++---
 thesis2/thesis.blg         |  48 ------------
 thesis2/thesis.tex         |  41 +++++-----
 10 files changed, 138 insertions(+), 275 deletions(-)
 delete mode 100644 thesis2/2.methods.tex
 delete mode 100644 thesis2/3.conclusion.tex
 delete mode 100644 thesis2/4.discussion.tex
 delete mode 100644 thesis2/5.appendices.tex
 delete mode 100644 thesis2/thesis.bbl
 delete mode 100644 thesis2/thesis.blg

diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex
index 7e05d00..376efbc 100644
--- a/thesis2/1.introduction.tex
+++ b/thesis2/1.introduction.tex
@@ -168,21 +168,44 @@ single file that is human readable. In practice it means that one person, not
 by definition a programmer, can be instructed to change the structure and this
 can also greatly reduce programmer intervention time.
 
+\section{RSS/Atom}
+RSS/Atom feeds, from now on called RSS feeds, are feeds in the XML
+format\footnote{\url{http://www.w3.org/TR/REC-xml/}} that are used to publish
+events. Every event or entry consists of several standardized fields. The main
+data fields are the \textit{title} and the \textit{description} field; they
+hold the raw data that describes the event. Furthermore there are several
+auxiliary fields that store the link to the full article, the publication
+date, optional media in the form of an image or video URL and a
+\textit{Globally Unique Identifier} (GUID).
+
+RSS feeds are mostly used by news sites to publish their articles. The feed
+then only contains the headlines, and a user who is interested in an entry can
+follow the so-called deeplink to the full website containing the article.
+Users often use programs that bundle user-specified RSS feeds into one big
+summary feed, which makes it easy to sift through a large number of news
+sources. Such an RSS feed reader uses the GUID to skip entries that are
+already present in its internal database.
+
+Generating RSS feeds by hand is a tedious task, but usually they are generated
+by the Content Management System (CMS) the website runs on. With this automatic
+method the feeds are generated from the content that is published on the
+website. Because the process is automatic, the resulting feeds are generally
+very structured and consistent. In the entertainment industry venues often use
+a CMS for their website so that users without a programming or web-development
+background are able to post news items and event information.
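+
+As an illustration, a strongly simplified RSS entry for a venue event could
+look like the listing below. The values and the venue URL are fictitious and
+only the fields mentioned above are shown; real feeds usually contain more
+metadata.
+
+\begin{lstlisting}[language=XML]
+<item>
+  <title>Concert: The Example Band</title>
+  <description>The Example Band plays at 20:00 in the main hall.</description>
+  <link>http://venue.example.com/agenda/example-band</link>
+  <pubDate>Sun, 09 Nov 2014 20:00:00 +0100</pubDate>
+  <enclosure url="http://venue.example.com/img/band.jpg"
+             type="image/jpeg" length="12345"/>
+  <guid isPermaLink="false">agenda-2014-11-09-example-band</guid>
+</item>
+\end{lstlisting}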
+
 \section{Why RSS/Atom}
 Information from venues comes in various different format with for each format
 several positive and negative points. For this project we chose to focus on
-RSS/Atom feeds because they are in general already structured and consistent in
-their structure. For example websites change a lot in structure and layout and
-thus making it hard to keep crawlers up to date. RSS/Atom feeds generally only
-change structure because the website or content management system gets migrated
-or upgraded.
-
-RSS/Atom feeds in comparison to websites doesn't have a structural dimension in
-the data. Because of this we have to use different techniques of isolation of
-information than present techniques used for extracting information from
-websites. RSS/Atom feeds are basically two fields with plain text, however the
-text almost always has the same structure and keywords and therefore the
-information can be extracted learning from keywords and structure.
+RSS feeds. As explained earlier, RSS feeds are very structured and their
+structure basically only changes when the CMS is changed or upgraded. Because
+of this, patterns rarely have to be retrained.
+
+In comparison to websites, RSS feeds don't have a structural dimension in the
+data: all the information in an entry is basically two fields of plain text.
+Therefore an entirely new strategy has to be applied to train the patterns.
 
 \section{Scientific relevance and similar research}
 Currently the techniques for conversion from non structured data to structured
@@ -190,9 +213,54 @@ data are static and mainly only usable by computer science experts. There is a
 great need of data mining in non structured data because the data within
 companies and on the internet is piling up and are usually left to catch dust.
 
-The project is a side project of the past project done by Roelofs et
-al.\cite{Roelofs2009}. Roelofs et al. describes a technique to recognize date
-strings using an adapted Levensteins algorithm. This technique can be fitted in
-the current project because it works on a low level and the technique we
-describe works on a high level. The algorithm only works on already isolated
-data.
+The project is a follow-up of earlier work by Roelofs et
+al.\cite{Roelofs2008,Roelofs2009}. Roelofs et al. described techniques for
+recognizing patterns in website data and published an adapted Levenshtein
+distance algorithm to recognize data in isolated text. These techniques for
+recognizing data in text can still be used to interpret the isolated parts
+extracted in this project.
+
+\section{Directed Acyclic Graphs}
+A graph is a mathematical structure described by the ordered pair $G=(V,E)$,
+in which $V$ is the set of nodes and $E$ is a set of unordered pairs of nodes
+that we will call \textit{undirected edges}.
+
+A directed graph is a special kind of graph, described by the ordered pair
+$G=(V,A)$, where $A$ is a set of ordered pairs of nodes that we will call
+\textit{directed edges}. Directed edges can be visualized as arrows in a graph
+whereas undirected edges are just lines connecting the nodes with no
+particular direction.
+
+Directed Acyclic Graphs (DAGs) are again a special kind of directed graph:
+they contain no cycles, so no path starting from a node ever returns to that
+same node. DAGs have the nice property that checking whether a sequence is
+present in the graph only has a computational complexity of $\mathcal{O}(L)$,
+where $L$ is the length of the sequence.
+
+The type of graph used in the project is another special kind of DAG called
+the Directed Acyclic Word Graph (DAWG). This type of graph can be used to
+store large dictionaries of words. The nodes of a DAWG basically represent
+positions within a word and the edges describe the character that is added
+when an edge is traversed. For example the graph in Figure~\ref{fig:2.1.1}
+describes a dictionary that holds the words \textit{abd}, \textit{bad} and
+\textit{bae}. Testing if a word is present in the DAWG, just as for a DAG,
+falls in the computational complexity class of $\mathcal{O}(L)$, meaning that
+the lookup time grows linearly with the length of the word.
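+
+To illustrate how such a lookup works, the listing below gives a minimal
+sketch in Python. It assumes, purely for illustration, that the DAWG of
+Figure~\ref{fig:2.1.1} is stored as a dictionary mapping every node to its
+outgoing edges (character to next node) together with a set of accepting
+nodes; the actual data structures used in the project may differ.
+
+\begin{lstlisting}[language=Python]
+# Hypothetical encoding of the DAWG in Figure 2.1.1:
+# node -> {character: next node}, plus the set of accepting nodes.
+edges = {
+    1: {'a': 2, 'b': 4},
+    2: {'b': 3},
+    3: {'d': 6},
+    4: {'a': 5},
+    5: {'d': 6, 'e': 6},
+}
+accepting = {6}
+
+def contains(word, start=1):
+    """Follow one edge per character: O(L) in the length of the word."""
+    node = start
+    for char in word:
+        nxt = edges.get(node, {}).get(char)
+        if nxt is None:
+            return False
+        node = nxt
+    return node in accepting
+
+print(contains('bad'))  # True
+print(contains('bae'))  # True
+print(contains('ba'))   # False: node 5 is not an accepting state
+\end{lstlisting}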
+
+\begin{figure}[H]
+    \caption{Example DAWG}
+    \label{fig:2.1.1}
+    \centering
+    \digraph[]{graph21}{
+        rankdir=LR;
+        1,2,3,4,5 [shape="circle"];
+        6 [shape="doublecircle"];
+        1 -> 2 [label="a"];
+        2 -> 3 [label="b"];
+        3 -> 6 [label="d"];
+        1 -> 4 [label="b"];
+        4 -> 5 [label="a"];
+        5 -> 6 [label="d"];
+        5 -> 6 [label="e"];
+    }
+\end{figure}
diff --git a/thesis2/2.methods.tex b/thesis2/2.methods.tex
deleted file mode 100644
index db5e56c..0000000
--- a/thesis2/2.methods.tex
+++ /dev/null
@@ -1,151 +0,0 @@
-\section{Application overview and workflow}
-The program can be divided into two main components namely the \textit{Crawler
-application} and the \textit{Input application}. The components are strictly
-separated by task and by application. The crawler is an application dedicated
-to the sole task of periodically crawling the sources asynchronously. The input
-is a web interface to a set of tools that can create, edit, remove and test
-crawlers via simple point and click user interfaces that can be worked with by
-someone without a computer science background.
-
-
-\section{Input application}
-\subsection{Components}
-Add new crawler
-
-Editing or remove crawlers
-
-Test crawler
-
-Generate xml
-
-
-\section{Crawler application}
-\subsection{Interface}
-
-\subsection{Preprocessing}
-When the data is received by the crawler the data is embedded as POST data in a
-HTTP request. The POST data consists of several fields with information about
-the feed and a container that has the table with the user markers embedded.
-After that the entries are extracted and processed line by line.
-
-The line processing converts the raw string of html data from a table row to a
-string. The string is stripped of all the html tags and is accompanied by a
-list of marker items. The entries that don't contain any markers are left out
-in the next step of processing. All data, including entries without user
-markers, is stored in the object too for possible later reference, for example
-for editing the patterns.
-
-The last step is when the entries with markers are then processed to build
-node-lists. Node-lists are basically lists of words that, when concatenated,
-form the original entry. A word isn't a word in the linguistic sense. A word
-can be one letter or a category. The node-list is generated by putting all the
-separate characters one by one in the list and when a user marking is
-encountered, this marking is translated to the category code and that code is
-then added as a word. The nodelists are then sent to the actual algorithm to be
-converted to a graph representation.
-
-\subsection{Directed acyclic graphs(DAG)}
-Directed acyclic graphs are a special kind of graph that is used to store big
-sets of words and has a complexity of $\mathcal{O}(L)$, meaning the length of
-the word you are searching. Figure~\ref{fig:f21} shows a DAG that accepts the
-words \textit{aa} and \textit{ab}. Final states are marked with a double circle
-and intermediate states are marked with a single circle.
- -\begin{figure}[H] - \caption{Sample DAG} - \label{fig:f21} - \centering - \digraph[]{graph21}{ - rankdir=LR; - 1, 2 [shape="circle"]; - 3, 4 [shape="doublecircle"]; - 1 -> 2 [label="a"]; - 2 -> 3 [label="a"]; - 2 -> 4 [label="b"]; - } -\end{figure} - -The first algorithm to generate DAG's was proposed by Hopcroft et -al\cite{Hopcroft1971}. The algorithm they described wasn't incremental and had -a complexity of $\mathcal{O}(N\log{N})$. \cite{Daciuk2000} et al. later -extended the algorithm and created an incremental one without increasing the -computational complexity. The non incremental algorithm from Daciuk et al. is -used to convert the nodelists to a graph. - -For example constructing a graph that from the entry: \textit{a,bc} and -\textit{a.bc} goes in the following steps: - -\begin{figure}[H] - \caption{Sample DAG, first entry} - \label{fig:f22} - \centering - \digraph[]{graph22}{ - rankdir=LR; - 1,2,3,5 [shape="circle"]; - 5 [shape="doublecircle"]; - 1 -> 2 [label="a"]; - 2 -> 3 [label="."]; - 3 -> 4 [label="b"]; - 4 -> 5 [label="c"]; - } -\end{figure} - -\begin{figure}[H] - \caption{Sample DAG, second entry} - \label{fig:f23} - \centering - \digraph[]{graph23}{ - rankdir=LR; - 1,2,3,5,6 [shape="circle"]; - 5 [shape="doublecircle"]; - 1 -> 2 [label="a"]; - 2 -> 3 [label="."]; - 3 -> 4 [label="b"]; - 4 -> 5 [label="c"]; - - 2 -> 6 [label=","]; - 6 -> 4 [label="b"]; - } -\end{figure} - -\subsection{Defining categories} -pass - -\subsection{Process} -Proposal was written - - -First html/mail/fax/rss, worst case rss - - -After some research and determining the scope of the project we decided only to -do RSS, this because RSS tends to force structure in the data because RSS feeds -are often generated by the website and thus reliable and consistent. We found a -couple of good RSS feeds. - - -At first the general framework was designed and implemented, no method yet. - - -Started with method for recognizing separators. - - -Found research paper about algorithm that can create directed acyclic graphs -from string, although it was designed to compress word lists it can be -(mis)used to extract information. - - -Implementation of DAG algorithm found and tied to the program. - - -Command line program ready. Conversation with both supervisors, gui had to be -made. - -Step by step gui created. Web interface as a control center for the crawlers. - - -Gui optimized. - - -Concluded that the program doesn't reach wide audience due to lack of well -structured rss feeds. 
diff --git a/thesis2/3.conclusion.tex b/thesis2/3.conclusion.tex deleted file mode 100644 index e69de29..0000000 diff --git a/thesis2/4.discussion.tex b/thesis2/4.discussion.tex deleted file mode 100644 index e69de29..0000000 diff --git a/thesis2/5.appendices.tex b/thesis2/5.appendices.tex deleted file mode 100644 index ee8c8b9..0000000 --- a/thesis2/5.appendices.tex +++ /dev/null @@ -1,2 +0,0 @@ - \section{Algorithm} - \section{Progress} diff --git a/thesis2/Makefile b/thesis2/Makefile index b01de17..cd58e30 100644 --- a/thesis2/Makefile +++ b/thesis2/Makefile @@ -1,14 +1,16 @@ SHELL:=/bin/bash +VERSION:=0.2 all: thesis thesis: - latex thesis.tex - bibtex thesis.aux - latex -shell-escape thesis.tex - latex thesis.tex - dvipdfm thesis.dvi + latex thesis.tex > log.txt + bibtex thesis.aux >> log.txt + latex -shell-escape thesis.tex >> log.txt + latex thesis.tex >> log.txt + dvipdfm thesis.dvi >> log.txt 2>&1 + mv -v {thesis,mart_thesis_$(VERSION)}.pdf clean: - rm -vf *.{aux,bbl,blg,dvi,log,out,pdf,toc,dot,ps} + rm -vf *.{aux,bbl,blg,dvi,log,out,pdf,toc,dot,ps} log.txt diff --git a/thesis2/thesis.bbl b/thesis2/thesis.bbl deleted file mode 100644 index f45b3a8..0000000 --- a/thesis2/thesis.bbl +++ /dev/null @@ -1,19 +0,0 @@ -\begin{thebibliography}{1} - -\bibitem{Daciuk2000} -Jan Daciuk, Stoyan Mihov, Bruce~W. Watson, and Richard~E. Watson. -\newblock {Incremental Construction of Minimal Acyclic Finite-State Automata}. -\newblock {\em Computational Linguistics}, 26(1):3--16, March 2000. - -\bibitem{Hopcroft1971} -John Hopcroft. -\newblock {An N log N algorithm for minimizing states in a finite automaton}. -\newblock Technical report, 1971. - -\bibitem{Roelofs2009} -Wouter Roelofs, Alessandro~Tadeo Paula, and Franc Grootjen. -\newblock {Programming by Clicking}. -\newblock In {\em Proceedings of the Dutch Information Retrieval Conference}, - pages 2--3, 2009. 
- -\end{thebibliography} diff --git a/thesis2/thesis.bib b/thesis2/thesis.bib index 5f1655d..4685c39 100644 --- a/thesis2/thesis.bib +++ b/thesis2/thesis.bib @@ -1,14 +1,14 @@ @inproceedings{Aitken2002, author = {Aitken, James Stuart}, booktitle = {Proceedings of the 15th Eureopean Conference on Artificial Intelligence}, -file = {:home/mart/articles/2002/Aitken\_Learning Information Extraction Rules An Inductive Logic Programming approach.pdf:pdf}, +file = {:home/mart/articles/2002/2002\_Aitken\_Learning Information Extraction Rules An Inductive Logic Programming approach.pdf:pdf}, title = {{Learning Information Extraction Rules : An Inductive Logic Programming approach}}, year = {2002} } @article{Berry1986, author = {Berry, Gerard and Sethi, Ravi}, doi = {10.1016/0304-3975(86)90088-5}, -file = {:home/mart/articles/1986/Berry, Sethi\_From regular expressions to deterministic automata.pdf:pdf}, +file = {:home/mart/articles/1986/1986\_Berry, Sethi\_From regular expressions to deterministic automata.pdf:pdf}, issn = {03043975}, journal = {Theoretical Computer Science}, month = jan, @@ -23,7 +23,7 @@ archivePrefix = {arXiv}, arxivId = {arXiv:1010.5318v3}, author = {Berstel, Jean and Boasson, Luc and Carton, Olivier and Fagnot, Isabelle}, eprint = {arXiv:1010.5318v3}, -file = {:home/mart/articles/2010/Berstel et al.\_Minimization of automata.pdf:pdf}, +file = {:home/mart/articles/2010/2010\_Berstel et al.\_Minimization of automata.pdf:pdf}, journal = {Mathematics Subject Classification}, keywords = {finite automata,hopcroft,minimization,s algorithm}, title = {{Minimization of automata}}, @@ -31,7 +31,7 @@ year = {2010} } @article{Bruggemann-klein1993, author = {Br\"{u}ggemann-klein, Anne}, -file = {:home/mart/articles/1993/Br\"{u}ggemann-klein\_Regular expressions into finite automata.pdf:pdf}, +file = {:home/mart/articles/1993/1993\_Br\"{u}ggemann-klein\_Regular expressions into finite automata.pdf:pdf}, journal = {Theoretical Computer Science}, pages = {197--213}, title = {{Regular expressions into finite automata}}, @@ -41,7 +41,7 @@ year = {1993} @article{Daciuk2000, author = {Daciuk, Jan and Mihov, Stoyan and Watson, Bruce W. 
and Watson, Richard E.}, doi = {10.1162/089120100561601}, -file = {:home/mart/articles/2000/Daciuk et al.\_Incremental Construction of Minimal Acyclic Finite-State Automata.pdf:pdf}, +file = {:home/mart/articles/2000/2000\_Daciuk et al.\_Incremental Construction of Minimal Acyclic Finite-State Automata.pdf:pdf}, issn = {0891-2017}, journal = {Computational Linguistics}, month = mar, @@ -52,15 +52,21 @@ url = {http://www.mitpressjournals.org/doi/abs/10.1162/089120100561601}, volume = {26}, year = {2000} } +@article{Grootjen2014, +author = {Grootjen, Franc and Otworowska, Maria}, +file = {:home/mart/articles/2014/2014\_Grootjen, Otworowska\_Proceedings of the 26th Benelux Conference on Artificial Intelligence.pdf:pdf}, +title = {{Proceedings of the 26th Benelux Conference on Artificial Intelligence}}, +year = {2014} +} @techreport{Hopcroft1971, author = {Hopcroft, John}, -file = {:home/mart/articles/1971/Hopcroft\_An N log N algorithm for minimizing states in a finite automaton.pdf:pdf}, +file = {:home/mart/articles/1971/1971\_Hopcroft\_An N log N algorithm for minimizing states in a finite automaton.pdf:pdf}, title = {{An N log N algorithm for minimizing states in a finite automaton}}, year = {1971} } @article{Hromkovic2001, author = {Hromkovic, Juraj and Seibert, Sebastian and Wilke, Thomas}, -file = {:home/mart/articles/2001/Hromkovic, Seibert, Wilke\_Translating Regular Expressions into Small = - Free Nondeterministic Finite Automata.pdf:pdf}, +file = {:home/mart/articles/2001/2001\_Hromkovic, Seibert, Wilke\_Translating Regular Expressions into Small = - Free Nondeterministic Finite Automata.pdf:pdf}, journal = {Journal of Computer and System Sciences}, keywords = {finite automata,regular expressions}, pages = {565--588}, @@ -71,7 +77,7 @@ year = {2001} @article{Knuutila2001, author = {Knuutila, Timo}, doi = {10.1016/S0304-3975(99)00150-4}, -file = {:home/mart/articles/2001/Knuutila\_Re-describing an algorithm by Hopcroft.pdf:pdf}, +file = {:home/mart/articles/2001/2001\_Knuutila\_Re-describing an algorithm by Hopcroft.pdf:pdf}, issn = {03043975}, journal = {Theoretical Computer Science}, keywords = {algorithms,finite automata,minimization}, @@ -83,10 +89,16 @@ url = {http://linkinghub.elsevier.com/retrieve/pii/S0304397599001504}, volume = {250}, year = {2001} } +@article{Roelofs2008, +author = {Roelofs, Wouter}, +file = {:home/mart/articles/2008/2008\_Roelofs\_HyperBot Building Web Robots for Non-Programmers.pdf:pdf}, +title = {{HyperBot Building Web Robots for Non-Programmers}}, +year = {2008} +} @inproceedings{Roelofs2009, author = {Roelofs, Wouter and Paula, Alessandro Tadeo and Grootjen, Franc}, booktitle = {Proceedings of the Dutch Information Retrieval Conference}, -file = {:home/mart/articles/2009/Roelofs, Paula, Grootjen\_Programming by Clicking.pdf:pdf}, +file = {:home/mart/articles/2009/2009\_Roelofs, Paula, Grootjen\_Programming by Clicking.pdf:pdf}, keywords = {levenshtein matching,subtree matching,web crawler}, pages = {2--3}, title = {{Programming by Clicking}}, @@ -94,7 +106,7 @@ year = {2009} } @article{Science1985, author = {Science, Computer}, -file = {:home/mart/articles/1985/Science\_A. e h r e n f e u c h t •.pdf:pdf}, +file = {:home/mart/articles/1985/1985\_Science\_A. e h r e n f e u c h t •.pdf:pdf}, pages = {31--55}, title = {{A. 
e h r e n f e u c h t •}}, volume = {40}, @@ -102,6 +114,6 @@ year = {1985} } @article{Watson, author = {Watson, Bruce W and Science, Computing and Eindhoven, Technische Universiteit}, -file = {:home/mart/articles/Unknown/Watson, Science, Eindhoven\_Constructing minimal acyclic deterministic finite automata.pdf:pdf}, +file = {:home/mart/articles/Unknown/Unknown\_Watson, Science, Eindhoven\_Constructing minimal acyclic deterministic finite automata.pdf:pdf}, title = {{Constructing minimal acyclic deterministic finite automata}} } diff --git a/thesis2/thesis.blg b/thesis2/thesis.blg deleted file mode 100644 index 3a83e1f..0000000 --- a/thesis2/thesis.blg +++ /dev/null @@ -1,48 +0,0 @@ -This is BibTeX, Version 0.99d (TeX Live 2015/dev/Debian) -Capacity: max_strings=35307, hash_size=35307, hash_prime=30011 -The top-level auxiliary file: thesis.aux -The style file: plain.bst -Database file #1: thesis.bib -Warning--empty institution in Hopcroft1971 -You've used 3 entries, - 2118 wiz_defined-function locations, - 516 strings with 4464 characters, -and the built_in function-call counts, 993 in all, are: -= -- 95 -> -- 48 -< -- 1 -+ -- 19 -- -- 16 -* -- 70 -:= -- 173 -add.period$ -- 9 -call.type$ -- 3 -change.case$ -- 18 -chr.to.int$ -- 0 -cite$ -- 4 -duplicate$ -- 38 -empty$ -- 75 -format.name$ -- 16 -if$ -- 202 -int.to.chr$ -- 0 -int.to.str$ -- 3 -missing$ -- 2 -newline$ -- 18 -num.names$ -- 6 -pop$ -- 18 -preamble$ -- 1 -purify$ -- 14 -quote$ -- 0 -skip$ -- 26 -stack$ -- 0 -substring$ -- 47 -swap$ -- 8 -text.length$ -- 1 -text.prefix$ -- 0 -top$ -- 0 -type$ -- 12 -warning$ -- 1 -while$ -- 11 -width$ -- 4 -write$ -- 34 -(There was 1 warning) diff --git a/thesis2/thesis.tex b/thesis2/thesis.tex index b9fa5e1..f4b2066 100644 --- a/thesis2/thesis.tex +++ b/thesis2/thesis.tex @@ -1,7 +1,7 @@ -%\documentclass{scrbook} \documentclass{book} -%\usepackage{scrhack} +\usepackage[british]{babel} + \usepackage{graphicx} % Images \usepackage{float} % Better placement float figures \usepackage{listings} % Source code formatting @@ -28,26 +28,24 @@ showspaces=false } +\newcommand{\cvartitle}{Non IT configurable adaptive data mining solution used +in transforming raw data to structured data} % Setup hyperlink formatting \hypersetup{ - pdftitle={Non IT configurable adaptive data mining solution used in transforming raw data to structured data}, + pdftitle={\cvartitle}, pdfauthor={Mart Lubbers}, pdfsubject={Artificial Intelligence}, } % Describe the frontpage -\author{Mart Lubbers\\s4109053} -\title{Non IT configurable adaptive data mining solution used in transforming raw -data to structured data} -%\subtitle{ -% Bachelor's Thesis in Artificial Intelligence\\ -% Radboud University Nijmegen\\ -% \vspace{15mm} -% \begin{tabular}{cp{5em}c} -% Franc Grootjen && Alessandro Paula\\ -% RU && Hyperleap -% \end{tabular} -% } +\author{ + Mart Lubbers\\ + s4109053\\ + \strut\\ + External supervisor: Alessandro Paula\\ + Internal supervisor: Franc Grootjen\\ +} +\title{\cvartitle} \date{\today} \begin{document} @@ -70,17 +68,20 @@ data to structured data} \chapter{Introduction} \input{1.introduction.tex} -\chapter{Methods} -\input{2.methods.tex} +\chapter{Requirements and design} +\input{2.requirementsanddesign} + +\chapter{Algorithm} +\input{3.methods.tex} \chapter{Conclusion} -\input{3.conclusion.tex} +\input{4.conclusion.tex} \section{Discussion} -\input{4.discussion.tex} +\input{5.discussion.tex} \chapter{Appendices} -\input{5.appendices.tex} +\input{6.appendices.tex} \bibliographystyle{plain} \bibliography{thesis} -- 
2.20.1