--- /dev/null
+\newacronym{ANN}{ANN}{Artificial Neural Network}
+\newacronym{DCT}{DCT}{Discrete Cosine Transform}
+\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
+\newacronym{FA}{FA}{Forced alignment}
+\newacronym{GMM}{GMM}{Gaussian Mixture Model}
+\newacronym{HMM}{HMM}{Hidden Markov Model}
+\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
+\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
+\newacronym{LPCC}{LPCC}{\acrlong{LPC} derived cepstrum}
+\newacronym{LPC}{LPC}{Linear Prediction Coefficients}
+\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
+\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
+\newacronym{MLP}{MLP}{Multi-layer Perceptron}
+\newacronym{PLP}{PLP}{Perceptual Linear Prediction}
+\newacronym{PPF}{PPF}{Posterior Probability Features}
+\newacronym{ZCR}{ZCR}{Zero-crossing Rate}
+\newacronym{RELU}{ReLU}{Rectified Linear Unit}
\usepackage{rutitlepage/rutitlepage} % Titlepage
\usepackage{hyperref} % Hyperlinks
\usepackage{booktabs} % Better looking tables
-\usepackage{todonotes} % Todo's
\usepackage{float} % Floating tables
\usepackage{csquotes} % Typeset quotes
\usepackage{subcaption} % Subfigures and captions
%&asr
\usepackage[toc,nonumberlist,acronyms]{glossaries}
\makeglossaries%
-\newacronym{ANN}{ANN}{Artificial Neural Network}
-\newacronym{DCT}{DCT}{Discrete Cosine Transform}
-\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
-\newacronym{FA}{FA}{Forced alignment}
-\newacronym{GMM}{GMM}{Gaussian Mixture Models}
-\newacronym{HMM}{HMM}{Hidden Markov Model}
-\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
-\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
-\newacronym{LPCC}{LPCC}{\acrlong{LPC} derivec cepstrum}
-\newacronym{LPC}{LPC}{Linear Prediction Coefficients}
-\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
-\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
-\newacronym{MLP}{MLP}{Multi-layer Perceptron}
-\newacronym{PLP}{PLP}{Perceptual Linear Prediction}
-\newacronym{PPF}{PPF}{Posterior Probability Features}
-\newacronym{ZCR}{ZCR}{Zero-crossing Rate}
-\newacronym{RELU}{ReLU}{Rectified Linear Unit}
-\newglossaryentry{dm}{name={Death Metal},
- description={is an extreme heavy metal music style with growling vocals and
- pounding drums}}
-\newglossaryentry{dom}{name={Doom Metal},
- description={is an extreme heavy metal music style with growling vocals and
- pounding drums played very slowly}}
-\newglossaryentry{FT}{name={Fourier Transform},
- description={is a technique of converting a time representation signal to a
- frequency representation}}
-\newglossaryentry{MS}{name={Mel-Scale},
- description={is a human ear inspired scale for spectral signals}}
-\newglossaryentry{Viterbi}{name={Viterbi},
- description={is a dynamic programming algorithm for finding the most likely
- sequence of hidden states in a \gls{HMM}}}
+\input{acronyms}
+\input{glossaries}
\begin{document}
\frontmatter{}
righttextheader={Supervisor:},
righttext={Louis ten Bosch},
pagenr=1]
-\listoftodos[Todo]
\tableofcontents
-\mainmatter{}
-%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
-%detector for singing lines. They achive 80\% accuracy for forty 15 second
-%exerpts. They mention people that wrote signal features that discriminate
-%between speech and music. Neural net
-%\glspl{HMM}~\cite{berenzweig_locating_2001}.
-%
-%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
-%polyphonic turkish music, this might be interesting to use for heavy metal.
-%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
-%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
-%detection, then melody extraction, then alignment. They compare results with
-%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
-%specialize in long syllables in a capella. They use \glspl{DHMM} with
-%\glspl{GMM} and show that adding knowledge increases alignment (bejing opera
-%has long syllables)~\cite{dzhambazov_automatic_2016}.
-%
+\glsaddall{}
+\printglossaries{}
+\mainmatter{}
-%Introduction, leading to a clearly defined research question
\chapter{Introduction}
-\input{intro.tex}
+\input{intro}
\chapter{Methods}
-\input{methods.tex}
+\input{methods}
\chapter{Conclusion \& Discussion}
-\input{conclusion.tex}
+\input{conclusion}
%(Appendices)
\appendix
-\input{appendices.tex}
-
-\newpage
-%Glossaries
-\glsaddall{}
-\begingroup
-\let\clearpage\relax
-\let\cleardoublepage\relax
-\printglossaries{}
-\endgroup
+\input{appendices}
\bibliographystyle{ieeetr}
\bibliography{asr}
\section{Conclusion \& Future Research}
-This research shows that existing techniques for singing-voice detection
-designed for regular singing voices also work respectably on extreme singing
+This study shows that existing techniques for singing-voice detection
+designed for regular singing voices also work respectably on extreme singing
styles like grunting. With a standard \gls{ANN} classifier using \gls{MFCC}
features, a performance of $85\%$ can be achieved, which is similar to the
performance of the same techniques on regular singing. This means that it
might be suitable as a
-pre-processing step for lyrics forced alignment. The model performs pretty good
-on alien data that uses similar singing techniques as the trainingset. However,
-the model is not coping very good with different singing techniques or with
-data that contains a lot of atmospheric noise and accompaniment.
+pre-processing step for lyrics forced alignment. The model performs well on
+alien data that uses singing techniques similar to those in the training set.
+However, the model does not cope very well with different singing techniques or
+with data that contains a lot of atmospheric noise and accompaniment.
+\subsection{Future research}
+\paragraph{Forced alignment: }
Interesting future research includes performing the actual forced alignment. This
probably requires entirely different models. The models used for real speech
-are probably not suitable because the acoustic properties of a regular singing
-voice is very different from a growling voice, let alone speech.
+are probably not suitable because the acoustic properties of a regular
+singing voice are very different from those of a growling voice, let alone
+speech.
+\paragraph{Generalization: }
Secondly, it would be interesting to see whether a model can be trained to
detect a singing voice across all styles of singing, including growling.
Moreover, one could investigate how well models trained on regular singing
voices detect growling, and the other way around.
+\paragraph{Decorrelation: }
Another interesting continuation of this research would be to investigate
whether the decorrelation step of the feature extraction is necessary. This
transformation might be inefficient or unnatural. The first layer of weights
in the model could take over the role of the decorrelation, as sketched below.
The downside of this is that training the model is harder because there are
many more weights to train.
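A minimal sketch of this alternative, assuming the same
\emph{python\_speech\_features} package as used for the feature extraction:
the log \gls{MS} filterbank energies are taken before the \gls{DCT} and fed
to the network directly, leaving the decorrelation to the first layer.
\begin{verbatim}
import numpy as np
from python_speech_features import logfbank

# Placeholder: one second of silence stands in for a real track.
rate = 44100
signal = np.zeros(rate)

# Log mel filterbank energies: the MFCC pipeline without the final DCT
# decorrelation step. The first network layer would now have to learn
# the decorrelation itself, at the cost of more weights to train.
features = logfbank(signal, samplerate=rate, nfilt=26, nfft=2048)
print(features.shape)  # (number of windows, 26)
\end{verbatim}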
-\emph{Singing}-voice detection and \emph{singer}-voice Singing-voice detection
-can be seen as a crude way of genre-detection. Therefore it might be
-interesting to figure out whether this is generalizable to general genre
-recognition. This requires more data from different genres to be added to the
-dataset and the models to be retrained.
+\paragraph{Genre detection: }
+\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen
+as a crude form of genre detection. Therefore it might be interesting to
+figure out whether this approach generalizes to genre recognition in general.
+This requires more data from different genres to be added to the dataset and
+the models to be retrained.
+\paragraph{\glspl{HMM}: }
A lot of similar research on singing-voice detection uses \glspl{HMM} and
-existing phone models. It would be fruitful to try the same approach on extreme
-singing styles to see whether the phone models can say anything about a
+existing phone models. It would be interesting to try the same approach on
+extreme singing styles to see whether the phone models can say anything about a
growling voice.
%Discussion section
--- /dev/null
+\newglossaryentry{dm}{name={Death Metal},
+ description={is an extreme heavy metal music style with growling vocals and
+ pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+ description={is an extreme heavy metal music style with growling vocals and
+ pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+ description={is a technique for converting a signal from a time
+ representation to a frequency representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+ description={is a frequency scale for spectral signals inspired by the
+ human ear}}
+\newglossaryentry{Viterbi}{name={Viterbi},
+ description={is a dynamic programming algorithm for finding the most likely
+ sequence of hidden states in a \gls{HMM}}}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
-revenue rises from digital distribution. Another $39\%$ arises from the
+revenue arises from digital distribution. Another $39\%$ arises from the
physical sales and the remaining $16\%$ is made through performance and
synchronisation revenues. Digital formats overtook physical formats somewhere
in 2015. Moreover, for the past twenty years the music
There has always been an interest in lyrics-to-music alignment, to be used in
for example karaoke. As early as the late 1980s, karaoke machines were
available to consumers. While the lyrics for a track are almost always
-available, a alignment is not and it involves manual labour to create such an
+available, an alignment is not and it involves manual labour to create such an
alignment.
A lot of this music distribution goes via unofficial channels such as
\section{Related work}
Applying speech-related processing and classification techniques to music
already started in the late 90s. Saunders et al.\ devised a technique to
-classify audio in the categories \emph{Music} and \emph{Speech}. They was found
+classify audio in the categories \emph{Music} and \emph{Speech}. It was found
+classify audio in the categories \emph{Music} and \emph{Speech}. They was found
that music has different properties than speech. Music has more bandwidth,
tonality and regularity. Multivariate Gaussian classifiers were used to
-discriminate the categories with an average performance of $90\%$.
+discriminate the categories with an average performance of $90\%%
+$\cite{saunders_real-time_1996}.
Williams and Ellis were inspired by the aforementioned research and tried to
separate the singing segments from the instrumental
already segmented.
Later, Berenzweig showed singing voice segments to be more useful for artist
-classification and used a \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients to
-separate detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed
-that there is not much difference in accuracy when using different features
-founded in speech processing. They tested several features and found accuracies
-differ less that a few percent. Moreover, they found that others have tried to
-tackle the problem using myriads of different approaches such as using
-\gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM}
-as classifiers\cite{nwe_singing_2004}.
+classification and used an \gls{ANN} (\gls{MLP}) with \gls{PLP} coefficients
+to detect a singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that
+there is not much difference in accuracy between the different features rooted
+in speech processing. They tested several features and found that accuracies
+differ by less than a few percent. Moreover, they noted that others have tried
+to tackle the problem using a myriad of approaches, such as \gls{ZCR},
+\gls{MFCC} and \gls{LPCC} features combined with \glspl{HMM} or \glspl{GMM}
+classifiers\cite{nwe_singing_2004}.
Fujihara et al.\ took the idea to the next level by attempting to do \gls{FA} on
-music. Their approach is a three step approach. First step is reducing the
-accompaniment levels, secondly the vocal segments are
-separated from the non-vocal segments using a simple two-state \gls{HMM}.
-The chain is concluded by applying \gls{Viterbi} alignment on the segregated
-signals with the lyrics. The system showed accuracy levels of $90\%$ on
-Japanese music\cite{fujihara_automatic_2006}. Later they improved
-hereupon\cite{fujihara_three_2008} and even made a ready to use karaoke
-application that can do the this online\cite{fujihara_lyricsynchronizer:_2011}.
+music. They used a three-step approach. The first step is reducing the
+accompaniment levels; secondly, the vocal segments are separated from the
+non-vocal segments using a simple two-state \gls{HMM}. The chain is concluded
+by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
+The system showed accuracy levels of $90\%$ on Japanese music%
+\cite{fujihara_automatic_2006}. Later they improved hereupon%
+\cite{fujihara_three_2008} and even made a ready-to-use karaoke application
+that can do this online\cite{fujihara_lyricsynchronizer:_2011}.
Singing voice detection can also be seen as a binary genre recognition problem.
Therefore the techniques used in that field might be of use. Genre recognition
classical Turkish music\cite{dzhambazov_automatic_2014}.
\section{Research question}
-It is discutable whether the aforementioned techniques work because the
+It is debatable whether the aforementioned techniques work because the
spectral properties of a growling voice are different from the spectral
properties of a clean singing voice. It has been found that growling voices
have less prominent peaks in the frequency representation and are closer to
-noise then clean singing\cite{kato_acoustic_2013}. This leads us to the
+noise than clean singing\cite{kato_acoustic_2013}. This leads us to the
research question:
\begin{center}\em%
Are standard \gls{ANN} based techniques for singing voice detection
- suitable for non-standard musical genres like \gls{dm} and \gls{dom}.
+ suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
\end{center}
The data is collected from three studio albums. The first band is called
\emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years and
-have been creating the same type every album. The singer of \emph{Cannibal
-Corpse} has a very raspy growls and the lyrics are quite comprehensible. The
-vocals produced by \emph{Cannibal Corpse} are bordering regular shouting.
+has been creating albums with a consistent style. The singer of \emph{Cannibal
+Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
+vocals produced by \emph{Cannibal Corpse} border on regular shouting.
-The second band is called \emph{Disgorge} and make even more violently sounding
-music. The growls of the lead singer sound like a coffee grinder and are more
-shallow. In the spectrals it is clearly visible that there are overtones
-produced during some parts of the growling. The lyrics are completely
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are more
+shallow. In the spectral representation it is clearly visible that overtones
+are produced during some parts of the growling. The lyrics are completely
incomprehensible and therefore some parts were not annotated with the actual
-lyrics because it was not possible what was being sung.
+lyrics because it was impossible to hear what was being sung.
Lastly, a band from Moscow is chosen bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
\section{\acrlong{MFCC} Features}
The waveforms in themselves are not very suitable to be used as features due
to the high dimensionality and correlation. Therefore we use the widely used
-\glspl{MFCC} feature vectors which has shown to be
-suitable\cite{rocamora_comparing_2007}. It has also been found that altering
-the mel scale to better suit singing does not yield a better
+\gls{MFCC} feature vectors, which have been shown to be suitable%
+\cite{rocamora_comparing_2007}. It has also been found that altering the mel
+scale to better suit singing does not yield a better
performance\cite{you_comparative_2015}. The actual conversion is done using the
\emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
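For illustration, extracting such features with this package could look as
follows. This is a minimal sketch: the silent placeholder signal and the
window settings are assumptions for the example, not necessarily the exact
configuration used in this research.
\begin{verbatim}
import numpy as np
from python_speech_features import mfcc

# Placeholder: one second of silence at 44.1 kHz stands in for a track.
rate = 44100
signal = np.zeros(rate)

# 13 cepstral coefficients per 25 ms window with a 10 ms step; nfft is
# enlarged so a whole window fits in one FFT frame at this sample rate.
features = mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                numcep=13, nfft=2048)
print(features.shape)  # (number of windows, 13)
\end{verbatim}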
-\gls{MFCC} features are inspired by human auditory processing inspired and
-built incrementally in several steps.
+\gls{MFCC} features are inspired by human auditory processing and are
+created from a waveform incrementally using several steps:
\begin{enumerate}
\item The first step in the process is converting the time representation
of the signal to a spectral representation using a sliding window with
using triangular overlapping windows to get a more tonotopic
representation, trying to match the actual representation in the
cochlea of the human ear.
- \item The \emph{Weber-Fechner} law that describes how humans perceive physical
+ \item The \emph{Weber-Fechner} law describes how humans perceive physical
magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
- Psychophysik} and it was found that energy is perceived in logarithmic
+ Psychophysik}. It states that energy is perceived in logarithmic
increments. This means that twice the amount of decibels does not mean
- twice the amount of perceived loudness. Therefore in this step log is
- taken of energy or amplitude of the \gls{MS} frequency spectrum to
- closer match the human hearing.
+ twice the amount of perceived loudness. Therefore we take the log of
+ the energy or amplitude of the \gls{MS} spectrum to match human
+ hearing more closely.
\item The amplitudes of the spectrum are highly correlated and therefore
the last step is a decorrelation step. \Gls{DCT} is applied to the
amplitudes interpreted as a signal. \Gls{DCT} is a technique of
\subsection{\acrlong{ANN}}
The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
-The classification problems are only binary and four-class so therefore it is
-interesting to see where the bottleneck lies. How abstract the abstraction can
-go. The \gls{ANN} is built with the Keras\footnote{\url{https://keras.io}}
+The classification problems are only binary and four-class, so it is
+interesting to see where the bottleneck lies and how abstract the
+abstraction can be made. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
backend, which provides a high-level interface to the underlying networks.
in Equation~\ref{eq:softmax}.
The data is shuffled before being fed to the network to mitigate the risk of
-over fitting on one album. Every model was trained using $10$ epochs and a
+overfitting on one album. Every model was trained using $10$ epochs and a
batch size of $32$.
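As an illustration, a Keras sketch of the binary classifier could look as
follows. The single hidden layer and its size are assumptions made for this
example; the \gls{RELU} activation, the number of epochs, the batch size and
the shuffling follow the text, and the four-class variant would end in a
softmax layer (Equation~\ref{eq:softmax}) instead of the sigmoid.
\begin{verbatim}
import numpy as np
from tensorflow import keras

# Placeholder data standing in for MFCC feature windows with binary
# singing/non-singing labels.
x_train = np.random.rand(1000, 13).astype('float32')
y_train = np.random.randint(0, 2, size=(1000, 1))

# One ReLU hidden layer is an assumption for this sketch; the four-class
# variant would use keras.layers.Dense(4, activation='softmax') instead.
model = keras.Sequential([
    keras.layers.Dense(13, activation='relu', input_shape=(13,)),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Shuffle to mitigate overfitting on one album; 10 epochs, batch size 32.
model.fit(x_train, y_train, epochs=10, batch_size=32, shuffle=True)
\end{verbatim}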
\begin{equation}\label{eq:relu}
\end{subfigure}%
%
\begin{subfigure}{.5\textwidth}
+ \centering
\includegraphics[width=.8\linewidth]{mcann}
\caption{Multiclass classifier network architecture}\label{fig:mcann}
\end{subfigure}
\caption{Plotting the classifier under similar alien data}\label{fig:alien1}
\end{figure}
-To really test the limits a song from the highly atmospheric doom metal band
+To really test the limits, a song from the highly atmospheric doom metal band
called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
Through the Catacombs} has a lot of synthesizers, heavy
droning guitars and bass lines. The vocals are not mixed in a way that makes
\begin{figure}[H]
\centering
- \includegraphics[width=.6\linewidth]{alien1}.
+ \includegraphics[width=.6\linewidth]{alien2}
\caption{Plotting the classifier under different alien data}\label{fig:alien2}
\end{figure}