From: Mart Lubbers
Date: Tue, 30 May 2017 13:39:15 +0000 (+0200)
Subject: process comments of proofread
X-Git-Url: https://git.martlubbers.net/?a=commitdiff_plain;h=e007abb755ee420db83c261c570600fdcbe324ae;p=asr1617.git

process comments of proofread
---

diff --git a/acronyms.tex b/acronyms.tex
new file mode 100644
index 0000000..7a564df
--- /dev/null
+++ b/acronyms.tex
@@ -0,0 +1,17 @@
+\newacronym{ANN}{ANN}{Artificial Neural Network}
+\newacronym{DCT}{DCT}{Discrete Cosine Transform}
+\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
+\newacronym{FA}{FA}{Forced alignment}
+\newacronym{GMM}{GMM}{Gaussian Mixture Models}
+\newacronym{HMM}{HMM}{Hidden Markov Model}
+\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
+\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
+\newacronym{LPCC}{LPCC}{\acrlong{LPC} derived cepstrum}
+\newacronym{LPC}{LPC}{Linear Prediction Coefficients}
+\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
+\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
+\newacronym{MLP}{MLP}{Multi-layer Perceptron}
+\newacronym{PLP}{PLP}{Perceptual Linear Prediction}
+\newacronym{PPF}{PPF}{Posterior Probability Features}
+\newacronym{ZCR}{ZCR}{Zero-crossing Rate}
+\newacronym{RELU}{ReLU}{Rectified Linear Unit}
diff --git a/asr.pre b/asr.pre
index d396631..1af71cc 100644
--- a/asr.pre
+++ b/asr.pre
@@ -8,7 +8,6 @@
 \usepackage{rutitlepage/rutitlepage} % Titlepage
 \usepackage{hyperref} % Hyperlinks
 \usepackage{booktabs} % Better looking tables
-\usepackage{todonotes} % Todo's
 \usepackage{float} % Floating tables
 \usepackage{csquotes} % Typeset quotes
 \usepackage{subcaption} % Subfigures and captions
diff --git a/asr.tex b/asr.tex
index 273cdfa..938078f 100644
--- a/asr.tex
+++ b/asr.tex
@@ -1,37 +1,8 @@
 %&asr
 \usepackage[toc,nonumberlist,acronyms]{glossaries}
 \makeglossaries%
-\newacronym{ANN}{ANN}{Artificial Neural Network}
-\newacronym{DCT}{DCT}{Discrete Cosine Transform}
-\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
-\newacronym{FA}{FA}{Forced alignment}
-\newacronym{GMM}{GMM}{Gaussian Mixture Models}
-\newacronym{HMM}{HMM}{Hidden Markov Model}
-\newacronym{HTK}{HTK}{\acrlong{HMM} Toolkit}
-\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
-\newacronym{LPCC}{LPCC}{\acrlong{LPC} derivec cepstrum}
-\newacronym{LPC}{LPC}{Linear Prediction Coefficients}
-\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
-\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
-\newacronym{MLP}{MLP}{Multi-layer Perceptron}
-\newacronym{PLP}{PLP}{Perceptual Linear Prediction}
-\newacronym{PPF}{PPF}{Posterior Probability Features}
-\newacronym{ZCR}{ZCR}{Zero-crossing Rate}
-\newacronym{RELU}{ReLU}{Rectified Linear Unit}
-\newglossaryentry{dm}{name={Death Metal},
- description={is an extreme heavy metal music style with growling vocals and
- pounding drums}}
-\newglossaryentry{dom}{name={Doom Metal},
- description={is an extreme heavy metal music style with growling vocals and
- pounding drums played very slowly}}
-\newglossaryentry{FT}{name={Fourier Transform},
- description={is a technique of converting a time representation signal to a
- frequency representation}}
-\newglossaryentry{MS}{name={Mel-Scale},
- description={is a human ear inspired scale for spectral signals}}
-\newglossaryentry{Viterbi}{name={Viterbi},
- description={is a dynamic programming algorithm for finding the most likely
- sequence of hidden states in a \gls{HMM}}}
+\input{acronyms}
+\input{glossaries}
 \begin{document}
 \frontmatter{}
@@ -43,51 +14,26 @@
 righttextheader={Supervisor:}, righttext={Louis ten Bosch}, pagenr=1]
-\listoftodos[Todo]
 \tableofcontents
-\mainmatter{}
-%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
-%detector for singing lines. They achive 80\% accuracy for forty 15 second
-%exerpts. They mention people that wrote signal features that discriminate
-%between speech and music. Neural net
-%\glspl{HMM}~\cite{berenzweig_locating_2001}.
-%
-%In 2014 Dzhambazov et al.\ applied state of the art segmentation methods to
-%polyphonic turkish music, this might be interesting to use for heavy metal.
-%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
-%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
-%detection, then melody extraction, then alignment. They compare results with
-%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
-%specialize in long syllables in a capella. They use \glspl{DHMM} with
-%\glspl{GMM} and show that adding knowledge increases alignment (bejing opera
-%has long syllables)~\cite{dzhambazov_automatic_2016}.
-%
+\glsaddall{}
+\printglossaries{}
+\mainmatter{}
-%Introduction, leading to a clearly defined research question
 \chapter{Introduction}
-\input{intro.tex}
+\input{intro}
 \chapter{Methods}
-\input{methods.tex}
+\input{methods}
 \chapter{Conclusion \& Discussion}
-\input{conclusion.tex}
+\input{conclusion}
 %(Appendices)
 \appendix
-\input{appendices.tex}
-
-\newpage
-%Glossaries
-\glsaddall{}
-\begingroup
-\let\clearpage\relax
-\let\cleardoublepage\relax
-\printglossaries{}
-\endgroup
+\input{appendices}
 \bibliographystyle{ieeetr}
 \bibliography{asr}
diff --git a/conclusion.tex b/conclusion.tex
index b71301b..97ba7de 100644
--- a/conclusion.tex
+++ b/conclusion.tex
@@ -1,24 +1,29 @@
 \section{Conclusion \& Future Research}
-This research shows that existing techniques for singing-voice detection
-designed for regular singing voices also work respectably on extreme singing
+This study shows that existing techniques for singing-voice detection
+designed for regular singing-voices also work respectably on extreme singing
 styles like grunting. With a standard \gls{ANN} classifier using \gls{MFCC}
 features a performance of $85\%$ can be achieved which is similar to the same
 techniques on regular singing. This means that it might be suitable as a
-pre-processing step for lyrics forced alignment. The model performs pretty good
-on alien data that uses similar singing techniques as the trainingset. However,
-the model is not coping very good with different singing techniques or with
-data that contains a lot of atmospheric noise and accompaniment.
+pre-processing step for lyrics forced alignment. The model performs pretty well
+on alien data that uses singing techniques similar to those in the training set.
+However, the model does not cope very well with different singing techniques or
+with data that contains a lot of atmospheric noise and accompaniment.
+\subsection{Future research}
+\paragraph{Forced alignment: }
 Future interesting research includes doing the actual forced alignment. This
 probably requires entirely different models. The models used for real speech
-are probably not suitable because the acoustic properties of a regular singing
-voice is very different from a growling voice, let alone speech.
+are probably not suitable because the acoustic properties of a regular
+singing-voice are very different from those of a growling voice, let alone
+speech.
+\paragraph{Generalization: }
 Secondly, it would be interesting if a model could be trained that could
 discriminate a singing voice for all styles of singing including growling.
 Moreover, it is possible to investigate the performance of detecting growling
 on regular singing-voice trained models and the other way around.
+\paragraph{Decorrelation: }
 Another interesting research continuation would be to investigate whether the
 decorrelation step of the feature extraction is necessary. This transformation
 might be inefficient or unnatural. The first layer of weights in the model
@@ -26,15 +31,17 @@ could be seen as a first processing step. If another layer is added that layer
 could take over the role of the decorrelating. The downside of this is that
 training the model is tougher because there are a many more weights to train.
-\emph{Singing}-voice detection and \emph{singer}-voice Singing-voice detection
-can be seen as a crude way of genre-detection. Therefore it might be
-interesting to figure out whether this is generalizable to general genre
-recognition. This requires more data from different genres to be added to the
-dataset and the models to be retrained.
+\paragraph{Genre detection: }
+\emph{Singing}-voice detection and \emph{singer}-voice detection can be seen
+as a crude way of genre detection. Therefore it might be interesting to figure
+out whether this is generalizable to general genre recognition. This requires
+more data from different genres to be added to the dataset and the models to
+be retrained.
+\paragraph{\glspl{HMM}: }
 A lot of similar research on singing-voice detection uses \glspl{HMM} and
-existing phone models. It would be fruitful to try the same approach on extreme
-singing styles to see whether the phone models can say anything about a
+existing phone models. It would be interesting to try the same approach on
+extreme singing styles to see whether the phone models can say anything about a
 growling voice.
 %Discussion section
diff --git a/glossaries.tex b/glossaries.tex
new file mode 100644
index 0000000..b00ee19
--- /dev/null
+++ b/glossaries.tex
@@ -0,0 +1,14 @@
+\newglossaryentry{dm}{name={Death Metal},
+ description={is an extreme heavy metal music style with growling vocals and
+ pounding drums}}
+\newglossaryentry{dom}{name={Doom Metal},
+ description={is an extreme heavy metal music style with growling vocals and
+ pounding drums played very slowly}}
+\newglossaryentry{FT}{name={Fourier Transform},
+ description={is a technique for converting a signal from a time
+ representation to a frequency representation}}
+\newglossaryentry{MS}{name={Mel-Scale},
+ description={is a human ear inspired scale for spectral signals}}
+\newglossaryentry{Viterbi}{name={Viterbi},
+ description={is a dynamic programming algorithm for finding the most likely
+ sequence of hidden states in a \gls{HMM}}}
diff --git a/intro.tex b/intro.tex
index a64a9b2..1753534 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,7 +1,7 @@
 \section{Introduction}
 The primary medium for music distribution is rapidly changing from physical
 media to digital media. The \gls{IFPI} stated that about $43\%$ of music
-revenue rises from digital distribution. Another $39\%$ arises from the
+revenue comes from digital distribution. Another $39\%$ arises from the
 physical sale and the remaining $16\%$ is made through performance and
 synchronisation revenieus. The overtake of digital formats on physical formats
 took place somewhere in 2015. Moreover, ever since twenty years the music
@@ -11,7 +11,7 @@ again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.
 There has always been an interest in lyrics to music alignment to be used in
 for example karaoke. As early as in the late 1980s karaoke machines were
 available for consumers. While the lyrics for the track are almost always
-available, a alignment is not and it involves manual labour to create such an
+available, an alignment is not and it involves manual labour to create such an
 alignment.
 A lot of this musical distribution goes via non-official channels such as
@@ -46,10 +46,11 @@ tenth century\cite{friis_vikings_2004}:
 \section{Related work}
 Applying speech related processing and classification techniques on music
 already started in the late 90s. Saunders et al.\ devised a technique to
-classify audio in the categories \emph{Music} and \emph{Speech}. It was found
+classify audio in the categories \emph{Music} and \emph{Speech}. They found
 that music has different properties than speech. Music has more bandwidth,
 tonality and regularity. Multivariate Gaussian classifiers were used to
-discriminate the categories with an average performance of $90\%$.
+discriminate the categories with an average performance of $90\%%
+$\cite{saunders_real-time_1996}.
 Williams and Ellis were inspired by the aforementioned research and tried to
 separate the singing segments from the instrumental
@@ -61,24 +62,24 @@ separating speech from non-speech signals such as music. The data used was
 already segmented.
 Later, Berenzweig showed singing voice segments to be more useful for artist
-classification and used a \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients to
-separate detect singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed
-that there is not much difference in accuracy when using different features
-founded in speech processing. They tested several features and found accuracies
-differ less that a few percent. Moreover, they found that others have tried to
-tackle the problem using myriads of different approaches such as using
-\gls{ZCR}, \gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM}
-as classifiers\cite{nwe_singing_2004}.
+classification and used an \gls{ANN} (\gls{MLP}) using \gls{PLP} coefficients
+to detect a singing voice\cite{berenzweig_using_2002}. Nwe et al.\ showed that
+there is not much difference in accuracy when using different features founded
+in speech processing. They tested several features and found accuracies differ
+less than a few percent. Moreover, they found that others have tried to tackle
+the problem using myriads of different approaches such as using \gls{ZCR},
+\gls{MFCC} and \gls{LPCC} as features and \glspl{HMM} or \glspl{GMM} as
+classifiers\cite{nwe_singing_2004}.
 Fujihara et al.\ took the idea to a next level by attempting to do \gls{FA} on
-music. Their approach is a three step approach. First step is reducing the
-accompaniment levels, secondly the vocal segments are
-separated from the non-vocal segments using a simple two-state \gls{HMM}.
-The chain is concluded by applying \gls{Viterbi} alignment on the segregated
-signals with the lyrics. The system showed accuracy levels of $90\%$ on
-Japanese music\cite{fujihara_automatic_2006}. Later they improved
-hereupon\cite{fujihara_three_2008} and even made a ready to use karaoke
-application that can do the this online\cite{fujihara_lyricsynchronizer:_2011}.
+music. Their approach consists of three steps. The first step is reducing the
+accompaniment levels, secondly the vocal segments are separated from the
+non-vocal segments using a simple two-state \gls{HMM}. The chain is concluded
+by applying \gls{Viterbi} alignment on the segregated signals with the lyrics.
+The system showed accuracy levels of $90\%$ on Japanese music%
+\cite{fujihara_automatic_2006}. Later they improved upon this%
+\cite{fujihara_three_2008} and even made a ready-to-use karaoke application
+that can do this online\cite{fujihara_lyricsynchronizer:_2011}.
 Singing voice detection can also be seen as a binary genre recognition
 problem. Therefore the techniques used in that field might be of use. Genre
 recognition
@@ -94,14 +95,14 @@ growling like vocals.
 Dzhambazov also tried aligning lyrics to audio in classical Turkish
 music\cite{dzhambazov_automatic_2014}.
 \section{Research question}
-It is discutable whether the aforementioned techniques work because the
+It is debatable whether the aforementioned techniques work because the
 spectral properties of a growling voice is different from the spectral
 properties of a clean singing voice. It has been found that growling voices
 have less prominent peaks in the frequency representation and are closer to
-noise then clean singing\cite{kato_acoustic_2013}. This leads us to the
+noise than clean singing\cite{kato_acoustic_2013}. This leads us to the
 research question:
 \begin{center}\em%
 Are standard \gls{ANN} based techniques for singing voice detection
- suitable for non-standard musical genres like \gls{dm} and \gls{dom}.
+ suitable for non-standard musical genres like \gls{dm} and \gls{dom}?
 \end{center}
diff --git a/methods.tex b/methods.tex
index 96365ae..e166208 100644
--- a/methods.tex
+++ b/methods.tex
@@ -29,16 +29,16 @@ metal there are different spectral patterns visible over time.
 The data is collected from three studio albums. The first band is called
 \emph{Cannibal Corpse} and has been producing \gls{dm} for almost 25 years and
-have been creating the same type every album. The singer of \emph{Cannibal
-Corpse} has a very raspy growls and the lyrics are quite comprehensible. The
-vocals produced by \emph{Cannibal Corpse} are bordering regular shouting.
+has been creating albums with a consistent style. The singer of \emph{Cannibal
+Corpse} has a very raspy growl and the lyrics are quite comprehensible. The
+vocals produced by \emph{Cannibal Corpse} border regular shouting.
-The second band is called \emph{Disgorge} and make even more violently sounding
-music. The growls of the lead singer sound like a coffee grinder and are more
-shallow. In the spectrals it is clearly visible that there are overtones
-produced during some parts of the growling. The lyrics are completely
+The second band is called \emph{Disgorge} and makes even more violent-sounding
+music. The growls of the lead singer sound like a coffee grinder and are
+shallower. In the spectrograms it is clearly visible that there are overtones
+produced during some parts of the growling. The lyrics are completely
 incomprehensible and therefore some parts were not annotated with the actual
-lyrics because it was not possible what was being sung.
+lyrics because it was impossible to hear what was being sung.
 Lastly a band from Moscow is chosen bearing the name \emph{Who Dies in Siberian
 Slush}. This band is a little odd compared to the previous \gls{dm}
@@ -71,15 +71,15 @@ The training and test data is divided as follows:
 \section{\acrlong{MFCC} Features}
 The waveforms in itself are not very suitable to be used as features due to
 the high dimensionality and correlation. Therefore we use the often used
-\glspl{MFCC} feature vectors which has shown to be
-suitable\cite{rocamora_comparing_2007}. It has also been found that altering
-the mel scale to better suit singing does not yield a better
+\glspl{MFCC} feature vectors which have been shown to be suitable%
+\cite{rocamora_comparing_2007}. It has also been found that altering the mel
+scale to better suit singing does not yield a better
 performance\cite{you_comparative_2015}. The actual conversion is done using
 the \emph{python\_speech\_features}%
 \footnote{\url{https://github.com/jameslyons/python_speech_features}} package.
-\gls{MFCC} features are inspired by human auditory processing inspired and
-built incrementally in several steps.
+\gls{MFCC} features are inspired by human auditory processing and are
+created from a waveform incrementally using several steps:
 \begin{enumerate}
 \item The first step in the process is converting the time representation
 of the signal to a spectral representation using a sliding window with
@@ -93,13 +93,13 @@ built incrementally in several steps.
 using triangular overlapping windows to get a more tonotopic
 representation trying to match the actual representation in the cochlea
 of the human ear.
-\item The \emph{Weber-Fechner} law that describes how humans perceive physical
+\item The \emph{Weber-Fechner} law describes how humans perceive physical
 magnitudes\footnote{Fechner, Gustav Theodor (1860). Elemente der
-Psychophysik} and it was found that energy is perceived in logarithmic
+Psychophysik}. It states that energy is perceived in logarithmic
 increments. This means that twice the amount of decibels does not mean
-twice the amount of perceived loudness. Therefore in this step log is
-taken of energy or amplitude of the \gls{MS} frequency spectrum to
-closer match the human hearing.
+twice the amount of perceived loudness. Therefore we take the log of
+the energy or amplitude of the \gls{MS} spectrum to more closely match
+human hearing.
 \item The amplitudes of the spectrum are highly correlated and therefore
 the last step is a decorrelation step. \Gls{DCT} is applied on the
 amplitudes interpreted as a signal. \Gls{DCT} is a technique of
@@ -154,9 +154,9 @@ labeled as \texttt{1000, 0100, 0010, 0001}.
 \subsection{\acrlong{ANN}}
 The data is classified using standard \gls{ANN} techniques, namely \glspl{MLP}.
-The classification problems are only binary and four-class so therefore it is
-interesting to see where the bottleneck lies. How abstract the abstraction can
-go. The \gls{ANN} is built with the Keras\footnote{\url{https://keras.io}}
+The classification problems are only binary and four-class so it is
+interesting to see where the bottleneck lies and how abstract the abstraction
+can be made. The \gls{ANN} is built with Keras\footnote{\url{https://keras.io}}
 using the TensorFlow\footnote{\url{https://github.com/tensorflow/tensorflow}}
 backend that provides a high-level interface to the highly technical networks.
@@ -177,7 +177,7 @@ activation function suitable for multiple output nodes. The definition is given
 in Equation~\ref{eq:softmax}.
 The data is shuffled before fed to the network to mitigate the risk of
-over fitting on one album. Every model was trained using $10$ epochs and a
+overfitting on one album. Every model was trained using $10$ epochs and a
 batch size of $32$.
 \begin{equation}\label{eq:relu}
@@ -203,6 +203,7 @@ batch size of $32$.
 \end{subfigure}%
 %
 \begin{subfigure}{.5\textwidth}
+ \centering
 \includegraphics[width=.8\linewidth]{mcann}
 \caption{Multiclass classifier network architecture}\label{fig:mcann}
 \end{subfigure}
@@ -269,7 +270,7 @@ frequency range from $0$ to $3000Hz$.
 \caption{Plotting the classifier under similar alien data}\label{fig:alien1}
 \end{figure}
-To really test the limits a song from the highly atmospheric doom metal band
+To really test the limits, a song from the highly atmospheric doom metal band
 called \emph{Catacombs} has been tested on the system. The album \emph{Echoes
 Through the Catacombs} is an album that has a lot of synthesizers, heavy
 droning guitars and bass lines. The vocals are not mixed in a way that makes
@@ -280,6 +281,6 @@ singing from non singing.
 \begin{figure}[H]
 \centering
-\includegraphics[width=.6\linewidth]{alien1}.
+\includegraphics[width=.6\linewidth]{alien2}
 \caption{Plotting the classifier under different alien data}\label{fig:alien2}
 \end{figure}
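
The methods.tex hunks above only describe the feature-extraction and classification pipeline in prose. The following is a minimal Python sketch of that pipeline: MFCC features extracted with the python_speech_features package named in the text, fed to a small binary Keras MLP with a ReLU hidden layer, trained for 10 epochs with a batch size of 32 on shuffled data, as stated in the diff. The window parameters, number of cepstral coefficients, hidden-layer size, optimizer and the file/label names are illustrative assumptions and are not values taken from the thesis.

# Minimal sketch of the pipeline described in methods.tex; window sizes,
# numcep, hidden-layer size, optimizer and file/label names are assumptions.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc
from keras.models import Sequential
from keras.layers import Dense

def extract_mfcc(path, numcep=13):
    # Sliding-window spectral analysis, mel filter bank, log compression and
    # DCT decorrelation all happen inside python_speech_features' mfcc().
    rate, signal = wavfile.read(path)  # assumes a mono wav file
    return mfcc(signal, samplerate=rate, winlen=0.025, winstep=0.01,
                numcep=numcep)

# Hypothetical pre-annotated input: one feature vector per frame and one
# binary label per frame (1 = singing/growling, 0 = instrumental).
features = extract_mfcc('cannibal_corpse_track01.wav')   # assumed filename
labels = np.load('track01_frame_labels.npy')              # assumed labels

# Binary MLP: one ReLU hidden layer, a single sigmoid output node.
model = Sequential()
model.add(Dense(64, activation='relu', input_dim=features.shape[1]))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Training regime as described in methods.tex: shuffled data, 10 epochs,
# batch size 32.
model.fit(features, labels, epochs=10, batch_size=32, shuffle=True)

The four-class singer-recognition variant mentioned in the same section (labels 1000, 0100, 0010, 0001) would instead end in a four-node softmax output layer with a categorical cross-entropy loss; everything else in the sketch stays the same.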