process comments of proofread

diff --git a/methods.tex b/methods.tex

index 96365ae..e166208 100644
--- a/methods.tex
+++ b/methods.tex
@@ -29,16 +29,16 @@ metal there are different spectral patterns visible over time.
  
  The data is collected from three studio albums. The first band is called
  \emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years and
-have been creating the same type every album. The singer of \emph{Cannibal
-Corpse} has a very raspy growls and the lyrics are quite comprehensible. The
-vocals produced by \emph{Cannibal Corpse} are bordering regular shouting. 
+has been creating albums with a consistent style. The singer of \emph{Cannibal
+Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
+vocals produced by \emph{Cannibal Corpse} border on regular shouting.
  
-The second band is called \emph{Disgorge} and make even more violently sounding
-music. The growls of the lead singer sound like a coffee grinder and are more
-shallow. In the spectrals it is clearly visible that there are overtones
-produced during some parts of the growling. The lyrics are completely
+The second band is called \emph{Disgorge} and makes even more violently
+sounding music. The growls of the lead singer sound like a coffee grinder and
+are more shallow. In the spectrograms it is clearly visible that overtones are
+produced during some parts of the growling. The lyrics are completely
  incomprehensible and therefore some parts were not annotated with the actual
-lyrics because it was not possible what was being sung.
+lyrics because it was impossible to hear what was being sung.
  
  Lastly, a band from Moscow is chosen, bearing the name \emph{Who Dies in
  Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
@@ -71,15 +71,15 @@ The training and test data is divided as follows:
  \section{\acrlong{MFCC} Features}
  The waveforms themselves are not very suitable as features due to their
  high dimensionality and correlation. Therefore we use the commonly used
-\glspl{MFCC} feature vectors which has shown to be
-suitable\cite{rocamora_comparing_2007}. It has also been found that altering
-the mel scale to better suit singing does not yield a better
+\glspl{MFCC} feature vectors which have been shown to be suitable%
+\cite{rocamora_comparing_2007}. It has also been found that altering the mel
+scale to better suit singing does not yield a better
  performance\cite{you_comparative_2015}. The actual conversion is done using the
  \emph{python\_speech\_features}%
  \footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
  
-\gls{MFCC} features are inspired by human auditory processing inspired and
-built incrementally in several steps.
+\gls{MFCC} features are inspired by human auditory processing and are
+created from a waveform incrementally using several steps:
  \begin{enumerate}
         \item The first step in the process is converting the time representation
                 of the signal to a spectral representation using a sliding window with
@@ -93,13 +93,13 @@ built incrementally in several steps.
                 using triangular overlapping windows to get a more tonotopic
                 representation trying to match the actual representation in the cochlea
                 of the human ear.
-       \item The \emph{Weber-Fechner} law that describes how humans perceive physical
+       \item The \emph{Weber-Fechner} law describes how humans perceive physical
                 magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
-               Psychophysik} and it was found that energy is perceived in logarithmic
+               Psychophysik}. It states that energy is perceived in logarithmic
                 increments. This means that twice the amount of decibels does not mean
-               twice the amount of perceived loudness. Therefore in this step log is
-               taken of energy or amplitude of the \gls{MS} frequency spectrum to
-               closer match the human hearing.
+               twice the amount of perceived loudness. Therefore we take the log of
+               the energy or amplitude of the \gls{MS} spectrum to more closely match
+               human hearing.
         \item The amplitudes of the spectrum are highly correlated and therefore
                 the last step is a decorrelation step. \Gls{DCT} is applied on the
                 amplitudes interpreted as a signal. \Gls{DCT} is a technique of
@@ -154,9 +154,9 @@ labeled as \texttt{1000, 0100, 0010, 0001}.
  
  \subsection{\acrlong{ANN}}
  The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
-The classification problems are only binary and four-class so therefore it is
-interesting to see where the bottleneck lies. How abstract the abstraction can
-go. The \gls{ANN} is built with the Keras\footnote{\url{https://keras.io}}
+The classification problems are only binary and four-class, so it is
+interesting to see where the bottleneck lies, that is, how abstract the
+abstraction can be made. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
  using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
  backend that provides a high-level interface to the highly technical networks.
  
@@ -177,7 +177,7 @@ activation function suitable for multiple output nodes. The definition is given
  in Equation~\ref{eq:softmax}.
  
  The data is shuffled before being fed to the network to mitigate the risk of
-over fitting on one album. Every model was trained using $10$ epochs and a
+overfitting on one album. Every model was trained using $10$ epochs and a
  batch size of $32$.
  
  \begin{equation}\label{eq:relu}
@@ -203,6 +203,7 @@ batch size of $32$.
         \end{subfigure}%
  %
         \begin{subfigure}{.5\textwidth}
+               \centering
                 \includegraphics[width=.8\linewidth]{mcann}
                 \caption{Multiclass classifier network architecture}\label{fig:mcann}
         \end{subfigure}
@@ -269,7 +270,7 @@ frequency range from $0$ to $3000Hz$.
         \caption{Plotting the classifier under similar alien data}\label{fig:alien1}
  \end{figure}
  
-To really test the limits a song from the highly atmospheric doom metal band
+To really test the limits, a song from the highly atmospheric doom metal band
  called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
  Through the Catacombs} is an album that has a lot of synthesizers, heavy
  droning guitars and bass lines. The vocals are not mixed in a way that makes
@@ -280,6 +281,6 @@ singing from non singing.
  
  \begin{figure}[H]
         \centering
-       \includegraphics[width=.6\linewidth]{alien1}.
+       \includegraphics[width=.6\linewidth]{alien2}
         \caption{Plotting the classifier under different alien data}\label{fig:alien2}
  \end{figure}
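
The last two steps of the MFCC enumeration in methods.tex (log compression of the filterbank energies, then decorrelation with a DCT) can be illustrated with a minimal pure-Python sketch. This is not the implementation in \emph{python\_speech\_features}, which the text names as the package actually used. The function names, the energy floor, the unnormalised DCT-II, the default of 13 coefficients and the example frame values are all assumptions made for illustration.

```python
import math


def log_compress(energies, floor=1e-10):
    """Natural log of each filterbank energy, floored to avoid log(0)."""
    return [math.log(max(e, floor)) for e in energies]


def dct_ii(values):
    """Unnormalised DCT-II: X_k = sum_n x_n * cos(pi / N * (n + 0.5) * k)."""
    n_total = len(values)
    return [
        sum(x * math.cos(math.pi / n_total * (n + 0.5) * k)
            for n, x in enumerate(values))
        for k in range(n_total)
    ]


def cepstral_coefficients(filterbank_energies, num_coefficients=13):
    """Log-compress one frame of filterbank energies and decorrelate it."""
    log_energies = log_compress(filterbank_energies)
    return dct_ii(log_energies)[:num_coefficients]


# Hypothetical mel filterbank energies for a single frame (not real data).
frame_energies = [1.0, 4.0, 9.0, 16.0, 9.0, 4.0, 1.0, 0.5]
coefficients = cepstral_coefficients(frame_energies, num_coefficients=4)
```

A frame with identical energies in every filter produces a non-zero value only in coefficient 0, which shows how the DCT separates the overall level from the spectral shape. Scaling and liftering, which practical MFCC implementations often apply, are left out of this sketch.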