From 4b5b9b229886c1a7e5cf25eb227fa95d7cd5294b Mon Sep 17 00:00:00 2001
From: Mart Lubbers
Date: Wed, 12 Nov 2014 21:27:31 +0100
Subject: [PATCH] thesis v0.2

---
 thesis2/2.requirementsanddesign.tex    |   3 +
 thesis2/3.methods.tex                  | 130 +++++++++++++++++++++++++
 thesis2/4.conclusion.tex               |   0
 thesis2/5.discussion.tex               |   0
 thesis2/6.appendices.tex               |   2 +
 thesis2/version/mart_thesis_0.2.tar.gz | Bin 0 -> 9652 bytes
 6 files changed, 135 insertions(+)
 create mode 100644 thesis2/2.requirementsanddesign.tex
 create mode 100644 thesis2/3.methods.tex
 create mode 100644 thesis2/4.conclusion.tex
 create mode 100644 thesis2/5.discussion.tex
 create mode 100644 thesis2/6.appendices.tex
 create mode 100644 thesis2/version/mart_thesis_0.2.tar.gz

diff --git a/thesis2/2.requirementsanddesign.tex b/thesis2/2.requirementsanddesign.tex
new file mode 100644
index 0000000..a1dff98
--- /dev/null
+++ b/thesis2/2.requirementsanddesign.tex
@@ -0,0 +1,3 @@
+\section{Requirements}
+
+\section{Design}
diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex
new file mode 100644
index 0000000..d8c0e3a
--- /dev/null
+++ b/thesis2/3.methods.tex
@@ -0,0 +1,130 @@
+\section{Application overview and workflow}
+The program is divided into two main components: the \textit{Crawler
+application} and the \textit{Input application}. The components are strictly
+separated by task and run as separate applications. The crawler is dedicated
+to the sole task of periodically and asynchronously crawling the sources. The
+input application is a web interface to a set of tools for creating, editing,
+removing and testing crawlers through simple point-and-click interfaces that
+can be used by someone without a computer science background.
+
+\section{Minimizing DAWGs}
+The first algorithm to generate DAGs was proposed by Hopcroft et
+al.~\cite{Hopcroft1971}. The algorithm they described was not incremental and
+had a complexity of $\mathcal{O}(N\log{N})$. Daciuk et al.~\cite{Daciuk2000}
+later extended the algorithm into an incremental one without increasing the
+computational complexity. The non-incremental algorithm from Daciuk et al. is
+used to convert the node-lists to a graph.
+
+For example, constructing a graph from the entries \textit{a.bc} and
+\textit{a,bc} proceeds in the following steps:
+
+\begin{figure}[H]
+    \caption{Sample DAG, first entry}
+    \label{fig:f22}
+    \centering
+    \digraph[]{graph22}{
+        rankdir=LR;
+        1,2,3,4 [shape="circle"];
+        5 [shape="doublecircle"];
+        1 -> 2 [label="a"];
+        2 -> 3 [label="."];
+        3 -> 4 [label="b"];
+        4 -> 5 [label="c"];
+    }
+\end{figure}
+
+\begin{figure}[H]
+    \caption{Sample DAG, second entry}
+    \label{fig:f23}
+    \centering
+    \digraph[]{graph23}{
+        rankdir=LR;
+        1,2,3,4,6 [shape="circle"];
+        5 [shape="doublecircle"];
+        1 -> 2 [label="a"];
+        2 -> 3 [label="."];
+        3 -> 4 [label="b"];
+        4 -> 5 [label="c"];
+
+        2 -> 6 [label=","];
+        6 -> 4 [label="b"];
+    }
+\end{figure}
+
+\section{Input application}
+\subsection{Components}
+Add new crawler
+
+Edit or remove crawlers
+
+Test crawler
+
+Generate XML
+
+
+\section{Crawler application}
+\subsection{Interface}
+
+\subsection{Preprocessing}
+When the crawler receives the data, it is embedded as POST data in an
+HTTP request. The POST data consists of several fields with information about
+the feed and a container that holds the table with the user markers embedded.
+The entries are then extracted and processed line by line.
+
+The line processing converts the raw string of HTML data from a table row to a
+plain string. The string is stripped of all HTML tags and is accompanied by a
+list of marker items. Entries that do not contain any markers are left out
+of the next processing step. All data, including entries without user
+markers, is also stored in the object for possible later reference, for
+example for editing the patterns.
+
+In the last step the entries with markers are processed to build
+node-lists. A node-list is a list of words that, when concatenated,
+form the original entry. A word is not a word in the linguistic sense: it
+can be a single character or a category. The node-list is generated by adding
+the characters of the entry to the list one by one; when a user marking is
+encountered, the marking is translated to its category code and that code is
+added as a single word. The node-lists are then passed to the algorithm that
+converts them to a graph representation.
+
+\subsection{Defining categories}
+pass
+
+\subsection{Process}
+Proposal was written.
+
+
+First HTML/mail/fax/RSS, worst case RSS.
+
+
+After some research and determining the scope of the project we decided to
+support only RSS, because RSS tends to force structure on the data: RSS feeds
+are often generated by the website and are thus reliable and consistent. We
+found a couple of good RSS feeds.
+
+
+At first the general framework was designed and implemented, without a method yet.
+
+
+Started with a method for recognizing separators.
+
+
+Found a research paper about an algorithm that can create directed acyclic
+graphs from strings. Although it was designed to compress word lists, it can
+be (mis)used to extract information.
+
+
+An implementation of the DAG algorithm was found and tied to the program.
+
+
+Command line program ready. After a conversation with both supervisors, a GUI
+had to be made.
+
+Step-by-step GUI created. Web interface as a control center for the crawlers.
+
+
+GUI optimized.
+
+
+Concluded that the program doesn't reach a wide audience due to the lack of
+well-structured RSS feeds.
diff --git a/thesis2/4.conclusion.tex b/thesis2/4.conclusion.tex
new file mode 100644
index 0000000..e69de29
diff --git a/thesis2/5.discussion.tex b/thesis2/5.discussion.tex
new file mode 100644
index 0000000..e69de29
diff --git a/thesis2/6.appendices.tex b/thesis2/6.appendices.tex
new file mode 100644
index 0000000..ee8c8b9
--- /dev/null
+++ b/thesis2/6.appendices.tex
@@ -0,0 +1,2 @@
+ \section{Algorithm}
+ \section{Progress}
diff --git a/thesis2/version/mart_thesis_0.2.tar.gz b/thesis2/version/mart_thesis_0.2.tar.gz
new file mode 100644
index 0000000000000000000000000000000000000000..13784417ca0b34949fa39e2d7617b9049b2523ad
Binary files /dev/null and b/thesis2/version/mart_thesis_0.2.tar.gz differ
-- 
2.20.1