\documentclass[a4paper]{article}

\usepackage[dvipdfmx]{hyperref}
\usepackage{calc}
\usepackage{fullpage}

\author{Mart Lubbers\\ 0651371972\\ s4109503\\
\href{mailto:mart@martlubbers.net}{mart@martlubbers.net}}
\title{Non-IT-configurable adaptive data mining solution for transforming
raw data into structured data\\\small A proposal}
\date{\today}

\begin{document}
\maketitle
\tableofcontents
\newpage
\section{Supervisors}
\begin{center}
\begin{tabular}{cc}
Franc Grootjen & Alessandro Paula\\
Radboud University Nijmegen & Hyperleap\\
Nijmegen, The Netherlands & Nijmegen, The Netherlands\\
\href{mailto:f.grootjen@psych.ru.nl}{f.grootjen@psych.ru.nl} &
\href{mailto:aldo@hyperleap.nl}{aldo@hyperleap.nl}
\\
\\
Signature & Signature\\
\\
\rule{2.5cm}{0.4pt} & \rule{2.5cm}{0.4pt}\\
\end{tabular}
\end{center}

\section{Abstract\tiny 73 words}
Raw data from information providers is usually hard for software to
interpret, and the conversion of raw data into structured data is usually
done by hand. This project aims at an adaptable, configurable data
transformation program, optionally combined with a web crawler, that can
perform the conversion from raw data to structured data. The project is
supervised by Franc Grootjen and Alessandro Paula and commissioned by
Hyperleap.

\section{Project Description\tiny 484 words}
\subsection{Research Question and Motivation}
The main research question is: \textit{How can we make an adaptive,
autonomous and programmable data mining program, which can be set up by a
non-IT professional, that is able to transform raw data into structured
data?}\\
Hyperleap is a small company specialized in infotainment
(information + entertainment) that maintains several websites which bundle
information about entertainment in an ordered and complete way. Right now,
most of the data is entered by hand, which takes a lot of time.

\subsection{Aim}
The practical aim of the project is to build a crawler (for web pages or
other document types) that can autonomously gather information after it
has been set up, via an intuitive interface, by an employee who is not
necessarily IT-trained. Ideally, the crawler should not be susceptible to
small structural changes in a website, should be able to handle advanced
website display techniques such as JavaScript, and should notify the
administrator when a site has become uncrawlable and the crawler needs to
be reprogrammed for that particular site. The main purpose, however, is
the translation from raw data to structured data. The project is in
principle a continuation of a past project by Wouter
Roelofs~\cite{Roelofs2009}, which was also supervised by Franc Grootjen
and Alessandro Paula but was never taken beyond the experimental phase and
is therefore in need of continuation.

\subsection{Research Plan and Schedule}
The plan for the project can be divided into four stages: initiation,
development, testing and writing. These stages are not mutually exclusive
and can therefore overlap.
\begin{itemize}
\item{Initiating stage:}
In this stage we will study the past project and the existing literature
on the subject and create an explicit plan for the eventual software.
There is probably a lot of literature on parsing specific information
fields such as dates, places and artist information; date parsing and
recognition was a main part of the past project.
\item{Developmental stage:}
The developmental stage is the stage in which most of the programming is
done and the algorithms for crawling and transformation are implemented.
For the web front-end the framework of choice is Firefox extensions,
which are written mainly in JavaScript and built with cfx. The data
transformer will probably be written in Python because of its robust
natural language processing tools and its portability.
\item{Testing stage:}
This stage will largely overlap with the developmental stage, since
testing during development saves a lot of time.
\item{Writing stage:}
In the last stage the thesis is written and the project is presented.
During all other stages parts of the thesis can already be written down.
\end{itemize}

\subsection{Weekly planning}
Because of mandatory courses in the first semester of next year, the
schedule should be seen as provisional: there is room to extend it, in
practice up to December 2014 at the latest.

\begin{tabular}{|p{1em}|p{1.2em}|p{5em}|p{16em}|p{15em}|}
\hline
\# & Wk & Date & Task & Deliverables\\\hline
1 & 15 & 2014-04-07 & proposal and references &
proposal signed by both parties\\
2 & 16 & 2014-04-14 & references and test environment setup &
test environment\\
3 & 17 & 2014-04-21 & planning for writing the tool &
software design\\
4 & 18 & 2014-04-28 & writing thesis and programming &
introduction\\
5 & 19 & 2014-05-05 & writing thesis and programming &
\\
6 & 20 & 2014-05-12 & idem &
methods\\
7 & 21 & 2014-05-19 & idem &
first prototype software\\
8 & 22 & 2014-05-26 & testing, programming and thesis &
\\
9 & 23 & 2014-06-02 & testing, implementation bigger picture &
\\
10 & 24 & 2014-06-09 & testing &
working tool, results and abstract\\
11 & 25 & 2014-06-16 & presentation and thesis &
discussion and presentation\\
12 & 26 & 2014-06-23 & presentation &
presentation\\
13 & 27 & 2014-06-30 & presentation &
\\
\hline
\end{tabular}\\
There will also be bi-weekly meetings with both supervisors to make sure
we stay on schedule. If necessary, the frequency of meetings with the
external supervisor can be increased.

\section{Scientific relevance\tiny 52 words}
Current techniques for converting unstructured data into structured data
are static and usable mainly by IT specialists. There is a great need for
mining unstructured data: the data within companies and on the internet is
piling up and is usually left to gather dust.


\bibliographystyle{ieeetr}
\bibliography{proposal}

\end{document}