\documentclass[a4paper]{article}

\usepackage[dvipdfmx]{hyperref}
\usepackage{calc}
\usepackage{fullpage}

\author{Mart Lubbers\\ 0651371972\\ s4109503\\
\href{mailto:mart@martlubbers.net}{mart@martlubbers.net}}
\title{Non-IT-configurable adaptive data mining solution for transforming
raw data into structured data\\\small A proposal}
\date{\today}

\begin{document}
\maketitle
\tableofcontents
\newpage
\section{Supervisors}
\begin{center}
\begin{tabular}{cc}
Franc Grootjen & Alessandro Paula\\
Radboud University Nijmegen & Hyperleap\\
Nijmegen, The Netherlands & Nijmegen, The Netherlands\\
\href{mailto:f.grootjen@psych.ru.nl}{f.grootjen@psych.ru.nl} &
\href{mailto:aldo@hyperleap.nl}{aldo@hyperleap.nl}
\\
\\
Signature & Signature\\
\\
\rule{2.5cm}{0.4pt} & \rule{2.5cm}{0.4pt}\\
\end{tabular}
\end{center}

\section{Abstract\tiny 73 words}
Raw data from information providers is usually hard for software to
interpret, and the conversion of raw data into structured data is usually
done by hand. This project aims at an adaptable, configurable data
transformation program, optionally combined with a web crawler, that can
perform the conversion from raw data to structured data. The project is
supervised by Franc Grootjen and Alessandro Paula and commissioned by
Hyperleap.

\section{Project Description\tiny 484 words}
\subsection{Research Question and Motivation}
The main research question is: \textit{How can we make an adaptive,
autonomous and programmable data mining program, which can be set up by a
non-IT professional, that is able to transform raw data into structured
data?}\\
Hyperleap is a small company specialized in infotainment
(information + entertainment) that maintains several websites which bundle
information about entertainment in an ordered and complete way. Right now,
most of the data is entered by hand, which takes a lot of time.

\subsection{Aim}
The practical aim of the project is to build a crawler (for web pages or
other document types) that can autonomously gather information after it
has been set up, via an intuitive interface, by an employee who is not
necessarily IT-trained. Ideally, the crawler should not be susceptible to
small structural changes in a website, should be able to handle advanced
website display techniques such as JavaScript, and should notify the
administrator when a site has become uncrawlable and the crawler needs to
be reprogrammed for that particular site. The main purpose, however, is
the translation from raw data to structured data. The project is in
principle a continuation of a past project by Wouter
Roelofs~\cite{Roelofs2009}, which was also supervised by Franc Grootjen
and Alessandro Paula but was never taken beyond the experimental phase and
is therefore in need of continuation.

\subsection{Research Plan and Schedule}
The plan for the project can be divided into four stages: initiation,
development, testing and writing. These stages are not mutually exclusive
and can therefore overlap.
\begin{itemize}
\item{Initiating stage:}
In this stage we will study the past project and the existing literature
on the subject and create an explicit plan for the eventual software.
There is probably a lot of literature on parsing specific information
fields such as dates, places and artist information; date parsing and
recognition was a main part of the past project.
\item{Developmental stage:}
The developmental stage is the stage in which most of the programming is
done and the algorithms for crawling and transformation are implemented.
For the web front-end the framework of choice is Firefox extensions,
which are written mainly in JavaScript and built with cfx. The data
transformer will probably be written in Python because of its robust
natural language processing tools and its portability.
\item{Testing stage:}
This stage will largely overlap with the developmental stage, since
testing during development saves a lot of time.
\item{Writing stage:}
In the last stage the thesis is written and the project is presented.
During all other stages parts of the thesis can already be written down.
\end{itemize}

\subsection{Weekly planning}
Because of mandatory courses in the first semester of next year, the
schedule should be seen as provisional: there is room to extend it, in
practice up to December 2014 at the latest.

\begin{tabular}{|p{1em}|p{1.2em}|p{5em}|p{16em}|p{15em}|}
\hline
\# & Wk & Date & Task & Deliverables\\\hline
1 & 15 & 2014-04-07 & proposal and references &
proposal signed by both parties\\
2 & 16 & 2014-04-14 & references and test environment setup &
test environment\\
3 & 17 & 2014-04-21 & planning for writing the tool &
software design\\
4 & 18 & 2014-04-28 & writing thesis and programming &
introduction\\
5 & 19 & 2014-05-05 & writing thesis and programming &
\\
6 & 20 & 2014-05-12 & idem &
methods\\
7 & 21 & 2014-05-19 & idem &
first prototype software\\
8 & 22 & 2014-05-26 & testing, programming and thesis &
\\
9 & 23 & 2014-06-02 & testing, implementation bigger picture &
\\
10 & 24 & 2014-06-09 & testing &
working tool, results and abstract\\
11 & 25 & 2014-06-16 & presentation and thesis &
discussion and presentation\\
12 & 26 & 2014-06-23 & presentation &
presentation\\
13 & 27 & 2014-06-30 & presentation &
\\
\hline
\end{tabular}\\
There will also be bi-weekly meetings with both supervisors to make sure
we stay on schedule. If necessary, the frequency of meetings with the
external supervisor can be increased.

\section{Scientific relevance\tiny 52 words}
Current techniques for converting unstructured data into structured data
are static and usable mainly by IT specialists. There is a great need for
mining unstructured data: the data within companies and on the internet is
piling up and is usually left to gather dust.


\bibliographystyle{ieeetr}
\bibliography{proposal}

\end{document}