From 7d3bf3f5dd316652cbdf8c66e3bfd5066d477d13 Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Tue, 26 May 2015 14:10:31 +0200
Subject: [PATCH] 4 also corrected

---
 thesis2/2.requirementsanddesign.tex |  8 ++++----
 thesis2/4.discussion.tex            | 25 ++++++++++++++-----------
 thesis2/Makefile                    |  2 +-
 3 files changed, 19 insertions(+), 16 deletions(-)

diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
index 61b8fd5..a69ecdc 100644
--- a/thesis2/2.requirementsanddesign.tex
+++ b/thesis2/2.requirementsanddesign.tex
@@ -22,7 +22,7 @@ explanation is also provided.
 \subsection{Functional requirements}
 \subsubsection{Original functional requirements}
 \begin{itemize}
-	\item[I1:] Be able to crawl several source types.
+	\item[I1:] The system should be able to crawl several source types.
 	\begin{itemize}
 		\item[I1a:] Fax/email.
 		\item[I1b:] XML feeds.
@@ -30,9 +30,9 @@ explanation is also provided.
 		\item[I1d:] Websites.
 	\end{itemize}
 	\item[I2:] Apply low level matching techniques on isolated data.
-	\item[I3:] Insert the data in the database.
-	\item[I4:] User interface to train crawlers that is usable someone
-		without a particular computer science background.
+	\item[I3:] Insert data into the database.
+	\item[I4:] The system should have a user interface to train crawlers that is
+		usable by someone without a particular computer science background.
 	\item[I5:] The system should be able to report to the user or maintainer
 		when a source has been changed too much for successful
 		crawling.
diff --git a/thesis2/4.discussion.tex b/thesis2/4.discussion.tex
index 86b616c..5e3d10d 100644
--- a/thesis2/4.discussion.tex
+++ b/thesis2/4.discussion.tex
@@ -10,8 +10,10 @@ can shorten the loop for repairing and adding crawlers which our system. The
 system we have built is tested and can provide the necessary tools for a user
 with no particular programming skills to generate crawlers and thus the number
 of interventions where a programmer is needed is greatly reduced. Although we
-have solved the problem we stated the results are not strictly positive. For a
-problem to be solved the problem must be present.
+have solved the problem we stated, the results are not strictly positive. This
+is because if the problem space is not large, the interest in solving the
+problem is also not large; in practice this means that there is little data to
+apply the solution to.
 
 Although the research question is answered the underlying goal of the project
 has not been completely achieved. The application is an intuitive system that
@@ -43,21 +45,22 @@ to lack of key information in the expected fields and by that lower overall
 extraction performance.
 
 The second most occurring common misuse is to use HTML formatted text in the RSS
-feeds text fields. Our algorithm is designed to detect and extract information
+feeds text fields. The algorithm is designed to detect and extract information
 via patterns in plain text and the performance on HTML is very bad compared to
 plain text. A text field with HTML is almost useless to gather information
-from. Via a small study on available RSS feeds we found that about $50\%$ of
-the RSS feeds misuse the protocol in such a way that extraction of data is
-almost impossible. This reduces the domain of good RSS feeds to less then $5\%$
-of the venues.
+from because it usually includes all kinds of information in modalities other
+than text. Via a small study on a selection of RSS feeds ($N=10$) we found
+that about $50\%$ of the RSS feeds misuse the protocol in such a way that
+extraction of data is almost impossible. This reduces the domain of good RSS
+feeds to less than $5\%$ of the venues.
 
 \section{Discussion \& Future Research}
 \label{sec:discuss}
 % low level stuff
-The application we created does not apply any techniques on the isolated
-chunks. The application is built only to extract and not to process the labeled
-chunks of text. When we would combine the information about the global
-structure and information about structure in a marked area we increase
+The application we created does not apply any techniques to the extracted
+data fields. The application is built only to extract, not to process, the
+labeled data fields. If we were to combine information about the global
+structure with information about structure in a marked area, we could increase
 performance in two ways. A higher levels of performance are reached due to the
 structural information of marked areas. Hereby extra knowledge as extra
 constraint while matching the data in marked areas. The second increase in
diff --git a/thesis2/Makefile b/thesis2/Makefile
index 34a32f5..6f3a41a 100644
--- a/thesis2/Makefile
+++ b/thesis2/Makefile
@@ -10,7 +10,7 @@ GRAPHS:=$(addsuffix .pdf,$(basename $(shell ls img/*.{dot,png})))
 
 .SECONDARY: $(addsuffix .fmt,$(basename $(OUTPUT)))
 .PHONY: clobber graphs
-all: thesis.pdf graphs
+all: thesis.pdf
 
 %.pdf: %.png
 	convert $< $@
-- 
2.20.1