From 21c4bef30ea88446a446212ea6b1cc11218496bf Mon Sep 17 00:00:00 2001 From: Mart Lubbers Date: Mon, 23 Feb 2015 21:08:55 +0100 Subject: [PATCH] thesis update --- thesis2/1.introduction.tex | 3 + thesis2/3.methods.tex | 90 +++++++++------ thesis2/Makefile | 2 +- thesis2/backend.dot | 4 +- thesis2/backend.eps | 168 ++++++++++++++-------------- thesis2/nodelistexample.dot | 1 + thesis2/nodelistexample.eps | 216 ++++++++++++++++++++---------------- 7 files changed, 266 insertions(+), 218 deletions(-) diff --git a/thesis2/1.introduction.tex b/thesis2/1.introduction.tex index 366d13a..74efdbc 100644 --- a/thesis2/1.introduction.tex +++ b/thesis2/1.introduction.tex @@ -62,6 +62,9 @@ the only company in its kind that has such high quality information. The \textit{infotainment} is presented via several websites specialized per genre or category and some sites attract over $500.000$ visitors per month. +\section{Extracting data from plain text} + + \section{Information flow} The reason why Hyperleap is the only in its kind with the high quality data is because Hyperleap spends a lot of time and resources on quality checking, cross diff --git a/thesis2/3.methods.tex b/thesis2/3.methods.tex index 5729b79..4fd871d 100644 --- a/thesis2/3.methods.tex +++ b/thesis2/3.methods.tex @@ -1,9 +1,12 @@ \section{Application overview} The backend consists of several processing steps before a crawler specification -is ready. +is ready. These steps are visualized in Figure~\ref{appinternals} where all the +nodes are important milestones in the processing of the user data. +Arrows indicate information transfer between these steps. The figure is a +detailed explanation of the \textit{Backend} node in Figure~\ref{appoverview}. \begin{figure}[H] - \label{dawg1} + \label{appinternals} \centering \includegraphics[width=\linewidth]{backend.eps} \strut\\ @@ -11,40 +14,51 @@ is ready. 
\end{figure} \section{HTML data} -Data marked by the user will be returned as a \texttt{POST} http request and is -basically the information in the description fields accompanied by the raw html -of the table with the patterns. For every RSS feed entry there is a row in the -table and every marker is a \texttt{SPAN} element containing the color of the -text. Within the HTML code of the table markers are placed to make parsing more -easy and remove the need of an advanced HTML parser. The \texttt{POST} request -that will be send when the user presses the submit button will not be sent -asynchronously, the user has to wait for the processing to finish. Because of -this the user will be immediately notified when the processing has finished -successfully. The preparation of the data for the \texttt{POST} request is all -done in Javascript on the user side, the processing of the data in the backend -is all done server side. +The raw data from the \textit{Frontend} with the user markings will enter the +backend as an HTTP \textit{POST} request and consists of several information +fields extracted from the description boxes in the frontend, accompanied by the +raw \textit{HTML} data from the table with the feed entries that contain the +markings made by the user at the time just before the submit button is pressed. +Within the \textit{HTML} data of the table, markers are placed before sending to +make the parsing of the tables easier and remove the need for an advanced +\textit{HTML} parser. The \textit{POST} request is not sent asynchronously; +this means the user has to wait until the server has processed the request. In +this way the user will be immediately notified when the processing has finished +successfully and will be able to review and test the resulting crawler. All the +data preparation for the \textit{POST} request is done in JavaScript and thus +on the user side; all the processing afterwards is done in the backend and thus +server side. 
All the descriptive fields will be transferred to the final +aggregating dictionary. \section{Table rows} -When the backend receives the HTTP POST data it saves all the table HTML data -and the data from the description fields. The program first extracts the table -rows using simple pattern matching on the added markers. The table rows are -processed one at the time, the processing mainly consists of finding the -\texttt{SPAN} elements, extract the color, remove the elements and save the -plain text and the location and type of the marking. When the table rows are -processed a datastructure with all the entries accompanied by their markers -will go to the next step. All intermediate data will be saved for later use. +The first conversion step is to extract the rows from the \textit{HTML} +data. Because of the table markers this process can be done using very +rudimentary matching techniques that require very little computational power. +All the \texttt{SPAN} \textit{HTML} elements have to be found since the markings +of the user are visualized through colored \texttt{SPAN} elements. To achieve +this, for every row all \texttt{SPAN} elements are extracted, again with simple +matching techniques, to extract the color and afterwards to remove the element +to retrieve the original plain text of the \textit{RSS} feed entry. When this +step is done a data structure containing all the text of the entries together +with the markings will go to the next step. All original data, namely the +\textit{HTML} data per row, will be transferred to the final aggregating +dictionary. \section{Node lists} -Every entry from the datastructure is processed to convert them to node-lists. -A node-list can be seen as a path graph of every character and marking. A path -graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, n_2, \cdots, n_{i-1}, -n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), ... (n_{i-1}, n_{i})\}$. 
A path graph -is basically a graph that is a path of nodes where every node is connected to -the next on and the last on being final. The transitions between two nodes is -either a character or a marking. As an example we take the entry \texttt{19:00, -2014-11-12 - Foobar} and create the corresponding node-lists and make it -visible in Figure~\ref{nodelistexample}. Characters are denoted with single -quotes, spaces with an underscore and markers with angle brackets. +Every entry obtained from the previous step is processed into a so-called +node-list. A node-list can be seen as a path graph of every character +and marking. A path graph $G$ is defined as $G=(V,n_1,E,n_i)$ where $V=\{n_1, +n_2, \cdots, n_{i-1}, n_i\}$ and $E=\{(n_1, n_2), (n_2, n_3), \ldots, (n_{i-1}, +n_{i})\}$. A path graph is basically a graph that is a path of nodes where +every node is connected to the next one and the last one is final. The +transition between two nodes is either a character or a marking. As an example +we take the entry \texttt{19:00, 2014-11-12 - Foobar} and show the +corresponding node-list in Figure~\ref{nodelistexample}. +Characters are denoted with single quotes, spaces with an underscore and +markers with angle brackets. Node-lists are the basic elements from which the +DAWG will be generated. These node-lists will also be available in the final +aggregating dictionary to ensure consistency of the data and the possibility of +regenerating it. \begin{figure}[H] \label{nodelistexample} \centering @@ -52,7 +66,6 @@ quotes, spaces with an underscore and markers with angle brackets. \caption{Node list example} \end{figure} - \section{DAWGs} \subsection{Terminology} \textbf{Parent nodes} are nodes that have an arrow to the child.\\ @@ -63,18 +76,20 @@ We represent the user generated patterns as DAWGs by converting the node-lists to DAWGS. 
Normally DAWGs have single letters from an alphabet as edgelabel but in our implementation the DAWGs alphabet contains all letters, whitespace and punctuation but also the specified user markers which can be multiple -characters in length. +characters long but are treated as a single transition in +the graph. DAWGs are graphs but due to the constraints we can use the DAWG to check if a match occurs by checking if a path exists that creates the word by -concatenating all the edge labels. The first algorithm to generate DAWGs from +concatenating all the edge labels while trying to find a path following the +characters from the entry. The first algorithm to generate DAWGs from words was proposed by Hopcroft et al\cite{Hopcroft1971}. It is an incremental approach in generating the graph. Meaning that entry by entry the graph will be expanded. Hopcrofts algorithm has the constraint of lexicographical ordering. Later on Daciuk et al.\cite{Daciuk2000} improved on the original algorithm and their algorithm is the algorithm we used to minimize our DAWGs. A minimal graph is a graph $G$ for which there is no graph $G'$ that has less -paths and $\mathcal{L}(G)=\mathcal{L }(G')$ where $\mathcal{L}(G)$ is the set +paths and $\mathcal{L}(G)=\mathcal{L}(G')$ where $\mathcal{L}(G)$ is the set of all words present in the DAWG. \subsection{Algorithm} @@ -216,6 +231,9 @@ added. } \end{figure} +\subsection{Minimality of the algorithm} + + \subsection{Appliance on extraction of patterns} The text data in combination with the user markings can not be converted automatically to a DAWG using the algorithm we described. 
This is because the diff --git a/thesis2/Makefile b/thesis2/Makefile index 7c7fbef..ad0c690 100644 --- a/thesis2/Makefile +++ b/thesis2/Makefile @@ -1,5 +1,5 @@ SHELL:=/bin/bash -VERSION:=0.91 +VERSION:=0.95 all: thesis diff --git a/thesis2/backend.dot b/thesis2/backend.dot index eef3f17..1acf7a2 100644 --- a/thesis2/backend.dot +++ b/thesis2/backend.dot @@ -2,12 +2,12 @@ digraph { rankdir=LR q0 [style=invis] q1 [style=invis] - q0 -> "HTML data" + q0 -> "HTML data" [label="From frontend"] "HTML data" -> "Table rows" "Table rows" -> "Node lists" "Node lists" -> "Dawg" "Dawg" -> "Dictionary" "HTML data" -> "Dictionary" [label="Description fields"] "Table rows" -> "Dictionary" [label="Original text"] - "Dictionary" -> q1 + "Dictionary" -> q1 [label="To crawler"] } diff --git a/thesis2/backend.eps b/thesis2/backend.eps index 65c7bc7..5b6bd39 100644 --- a/thesis2/backend.eps +++ b/thesis2/backend.eps @@ -2,7 +2,7 @@ %%Creator: graphviz version 2.38.0 (20140413.2041) %%Title: %3 %%Pages: 1 -%%BoundingBox: 36 36 907 152 +%%BoundingBox: 36 36 1045 152 %%EndComments save %%BeginProlog @@ -179,226 +179,232 @@ def %%EndSetup setupLatin1 %%Page: 1 1 -%%PageBoundingBox: 36 36 907 152 +%%PageBoundingBox: 36 36 1045 152 %%PageOrientation: Portrait 0 0 1 beginpage gsave -36 36 871 116 boxprim clip newpath +36 36 1009 116 boxprim clip newpath 1 1 set_scale 0 rotate 40 40 translate % q0 % HTML data gsave 1 setlinewidth 0 0 0 nodecolor -144.3 18 53.09 18 ellipse_path stroke +223.3 18 53.09 18 ellipse_path stroke 0 0 0 nodecolor 14 /Times-Roman set_font -111.3 14.3 moveto 66 (HTML data) alignedtext +190.3 14.3 moveto 66 (HTML data) alignedtext grestore % q0->HTML data gsave 1 setlinewidth 0 0 0 edgecolor -newpath 54.16 18 moveto -62.16 18 71.33 18 80.68 18 curveto +newpath 54.2 18 moveto +81.07 18 123.86 18 159.54 18 curveto stroke 0 0 0 edgecolor -newpath 80.84 21.5 moveto -90.84 18 lineto -80.84 14.5 lineto +newpath 159.88 21.5 moveto +169.88 18 lineto +159.88 14.5 lineto closepath 
fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 80.84 21.5 moveto -90.84 18 lineto -80.84 14.5 lineto +newpath 159.88 21.5 moveto +169.88 18 lineto +159.88 14.5 lineto closepath stroke +0 0 0 edgecolor +14 /Times-Roman set_font +72 21.8 moveto 80 (From frontend) alignedtext grestore % q1 % Table rows gsave 1 setlinewidth 0 0 0 nodecolor -284.64 47 50.09 18 ellipse_path stroke +363.64 47 50.09 18 ellipse_path stroke 0 0 0 nodecolor 14 /Times-Roman set_font -254.14 43.3 moveto 61 (Table rows) alignedtext +333.14 43.3 moveto 61 (Table rows) alignedtext grestore % HTML data->Table rows gsave 1 setlinewidth 0 0 0 edgecolor -newpath 190.21 27.42 moveto -203.16 30.13 217.41 33.12 230.8 35.92 curveto +newpath 269.21 27.42 moveto +282.16 30.13 296.41 33.12 309.8 35.92 curveto stroke 0 0 0 edgecolor -newpath 230.32 39.4 moveto -240.83 38.03 lineto -231.76 32.55 lineto +newpath 309.32 39.4 moveto +319.83 38.03 lineto +310.76 32.55 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 230.32 39.4 moveto -240.83 38.03 lineto -231.76 32.55 lineto +newpath 309.32 39.4 moveto +319.83 38.03 lineto +310.76 32.55 lineto closepath stroke grestore % Dictionary gsave 1 setlinewidth 0 0 0 nodecolor -723.47 44 48.19 18 ellipse_path stroke +802.47 44 48.19 18 ellipse_path stroke 0 0 0 nodecolor 14 /Times-Roman set_font -694.47 40.3 moveto 58 (Dictionary) alignedtext +773.47 40.3 moveto 58 (Dictionary) alignedtext grestore % HTML data->Dictionary gsave 1 setlinewidth 0 0 0 edgecolor -newpath 191.55 9.45 moveto -218.35 5.22 252.78 1 283.64 1 curveto -283.64 1 283.64 1 607.53 1 curveto -636.05 1 666.18 12.76 688.43 23.98 curveto +newpath 270.55 9.45 moveto +297.35 5.22 331.78 1 362.64 1 curveto +362.64 1 362.64 1 686.53 1 curveto +715.05 1 745.18 12.76 767.43 23.98 curveto stroke 0 0 0 edgecolor -newpath 686.97 27.17 moveto -697.45 28.72 lineto -690.23 20.97 lineto +newpath 765.97 27.17 moveto +776.45 28.72 lineto +769.23 20.97 lineto closepath fill 1 setlinewidth solid 0 0 0 
edgecolor -newpath 686.97 27.17 moveto -697.45 28.72 lineto -690.23 20.97 lineto +newpath 765.97 27.17 moveto +776.45 28.72 lineto +769.23 20.97 lineto closepath stroke 0 0 0 edgecolor 14 /Times-Roman set_font -371.68 4.8 moveto 97 (Description fields) alignedtext +450.68 4.8 moveto 97 (Description fields) alignedtext grestore % Node lists gsave 1 setlinewidth 0 0 0 nodecolor -420.18 90 46.29 18 ellipse_path stroke +499.18 90 46.29 18 ellipse_path stroke 0 0 0 nodecolor 14 /Times-Roman set_font -392.68 86.3 moveto 55 (Node lists) alignedtext +471.68 86.3 moveto 55 (Node lists) alignedtext grestore % Table rows->Node lists gsave 1 setlinewidth 0 0 0 edgecolor -newpath 322.48 58.86 moveto -338.49 64.02 357.38 70.1 374.21 75.52 curveto +newpath 401.48 58.86 moveto +417.49 64.02 436.38 70.1 453.21 75.52 curveto stroke 0 0 0 edgecolor -newpath 373.25 78.89 moveto -383.84 78.62 lineto -375.39 72.22 lineto +newpath 452.25 78.89 moveto +462.84 78.62 lineto +454.39 72.22 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 373.25 78.89 moveto -383.84 78.62 lineto -375.39 72.22 lineto +newpath 452.25 78.89 moveto +462.84 78.62 lineto +454.39 72.22 lineto closepath stroke grestore % Table rows->Dictionary gsave 1 setlinewidth 0 0 0 edgecolor -newpath 334.9 46.66 moveto -416.38 46.1 578.77 44.99 665.22 44.39 curveto +newpath 413.9 46.66 moveto +495.38 46.1 657.77 44.99 744.22 44.39 curveto stroke 0 0 0 edgecolor -newpath 665.4 47.89 moveto -675.38 44.32 lineto -665.36 40.89 lineto +newpath 744.4 47.89 moveto +754.38 44.32 lineto +744.36 40.89 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 665.4 47.89 moveto -675.38 44.32 lineto -665.36 40.89 lineto +newpath 744.4 47.89 moveto +754.38 44.32 lineto +744.36 40.89 lineto closepath stroke 0 0 0 edgecolor 14 /Times-Roman set_font -486.68 48.8 moveto 70 (Original text) alignedtext +565.68 48.8 moveto 70 (Original text) alignedtext grestore % Dawg gsave 1 setlinewidth 0 0 0 nodecolor -606.53 90 31.7 
18 ellipse_path stroke +685.53 90 31.7 18 ellipse_path stroke 0 0 0 nodecolor 14 /Times-Roman set_font -590.03 86.3 moveto 33 (Dawg) alignedtext +669.03 86.3 moveto 33 (Dawg) alignedtext grestore % Node lists->Dawg gsave 1 setlinewidth 0 0 0 edgecolor -newpath 466.75 90 moveto -496.66 90 535.46 90 564.43 90 curveto +newpath 545.75 90 moveto +575.66 90 614.46 90 643.43 90 curveto stroke 0 0 0 edgecolor -newpath 564.62 93.5 moveto -574.62 90 lineto -564.62 86.5 lineto +newpath 643.62 93.5 moveto +653.62 90 lineto +643.62 86.5 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 564.62 93.5 moveto -574.62 90 lineto -564.62 86.5 lineto +newpath 643.62 93.5 moveto +653.62 90 lineto +643.62 86.5 lineto closepath stroke grestore % Dawg->Dictionary gsave 1 setlinewidth 0 0 0 edgecolor -newpath 633.04 79.79 moveto -646.95 74.22 664.53 67.19 680.41 60.83 curveto +newpath 712.04 79.79 moveto +725.95 74.22 743.53 67.19 759.41 60.83 curveto stroke 0 0 0 edgecolor -newpath 682.04 63.95 moveto -690.03 56.98 lineto -679.44 57.45 lineto +newpath 761.04 63.95 moveto +769.03 56.98 lineto +758.44 57.45 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 682.04 63.95 moveto -690.03 56.98 lineto -679.44 57.45 lineto +newpath 761.04 63.95 moveto +769.03 56.98 lineto +758.44 57.45 lineto closepath stroke grestore % Dictionary->q1 gsave 1 setlinewidth 0 0 0 edgecolor -newpath 771.8 44 moveto -780.65 44 789.78 44 798.23 44 curveto +newpath 850.57 44 moveto +877.59 44 911.01 44 936.03 44 curveto stroke 0 0 0 edgecolor -newpath 798.51 47.5 moveto -808.51 44 lineto -798.51 40.5 lineto +newpath 936.2 47.5 moveto +946.2 44 lineto +936.2 40.5 lineto closepath fill 1 setlinewidth solid 0 0 0 edgecolor -newpath 798.51 47.5 moveto -808.51 44 lineto -798.51 40.5 lineto +newpath 936.2 47.5 moveto +946.2 44 lineto +936.2 40.5 lineto closepath stroke +0 0 0 edgecolor +14 /Times-Roman set_font +868.57 47.8 moveto 60 (To crawler) alignedtext grestore endpage showpage diff 
--git a/thesis2/nodelistexample.dot b/thesis2/nodelistexample.dot index f4e5f03..5bffaae 100644 --- a/thesis2/nodelistexample.dot +++ b/thesis2/nodelistexample.dot @@ -2,6 +2,7 @@ digraph { rankdir=LR; n0 [style=invis] n9 [shape=doublecircle] + n0 -> n1 n1 -> n2 [label="