%&asr
\usepackage[nonumberlist,acronyms]{glossaries}
%\makeglossaries%
\newacronym{ANN}{ANN}{Artificial Neural Network}
\newacronym{HMM}{HMM}{Hidden Markov Model}
\newacronym{GMM}{GMM}{Gaussian Mixture Model}
\newacronym{DHMM}{DHMM}{Duration-explicit \acrlong{HMM}}
\newacronym{FA}{FA}{Forced alignment}
\newacronym{MFC}{MFC}{Mel-frequency cepstrum}
\newacronym{MFCC}{MFCC}{\acrlong{MFC} coefficient}
\newacronym{IFPI}{IFPI}{International Federation of the Phonographic Industry}
\newglossaryentry{dm}{name={Death Metal},
    description={is an extreme heavy metal music style with growling vocals and
    pounding drums}}
\newglossaryentry{dom}{name={Doom Metal},
    description={is an extreme heavy metal music style with growling vocals and
    pounding drums played very slowly}}
\newglossaryentry{FT}{name={Fourier Transform},
    description={is a technique for converting a signal from a time
    representation to a frequency representation}}
\newglossaryentry{MS}{name={Mel-Scale},
    description={is a scale for spectral signals inspired by the human ear}}
\begin{document}
\frontmatter{}

\maketitleru[
    course={(Automatic) Speech Recognition},
    institute={Radboud University Nijmegen},
    authorstext={Author:},
    pagenr=1]
\listoftodos[Todo]
\tableofcontents

%Glossaries
%\glsaddall{}
%\printglossaries

\mainmatter{}
%Berenzweig and Ellis use acoustic classifiers from speech recognition as a
%detector for singing lines. They achieve 80\% accuracy for forty 15 second
%excerpts. They mention people that wrote signal features that discriminate
%between speech and music. Neural net
%\glspl{HMM}~\cite{berenzweig_locating_2001}.
%
%In 2014 Dzhambazov et al.\ applied state-of-the-art segmentation methods to
%polyphonic Turkish music, this might be interesting to use for heavy metal.
%They mention Fujihara (2011) to have a similar \gls{FA} system. This method uses
%phone level segmentation, first 12 \gls{MFCC}s. They first do vocal/non-vocal
%detection, then melody extraction, then alignment. They compare results with
%Mesaros \& Virtanen, 2008~\cite{dzhambazov_automatic_2014}. Later they
%specialize in long syllables in a capella. They use \glspl{DHMM} with
%\glspl{GMM} and show that adding knowledge increases alignment (Beijing opera
%has long syllables)~\cite{dzhambazov_automatic_2016}.
%
%Introduction, leading to a clearly defined research question
\chapter{Introduction}
\section{Introduction}
The primary medium for music distribution is rapidly changing from physical
media to digital media. The \gls{IFPI} stated that about $43\%$ of music
revenue arises from digital distribution. Another $39\%$ comes from physical
sales and the remaining $16\%$ is made through performance and synchronisation
revenues. Digital formats overtook physical formats somewhere in 2015.
Moreover, for the first time in twenty years the music industry has seen
significant growth
again\footnote{\url{http://www.ifpi.org/facts-and-stats.php}}.

There has always been an interest in lyrics-to-music alignment, for example
for karaoke. As early as the late 1980s, karaoke machines were available to
consumers. While the lyrics for a track are almost always available, an
alignment is usually not, and creating such an alignment involves manual
labour.

A lot of music distribution goes through unofficial channels such as
YouTube\footnote{\url{https://youtube.com}}, on which fans of the performers
often accompany the music with synchronised lyrics. This means that there is
an enormous treasure of lyrics-annotated music available, but it is not within
our reach since the subtitles are almost always hardcoded into the video
stream and thus not directly usable as data. Because of this interest it is
very useful to devise automatic techniques for segmenting the instrumental and
vocal parts of a song, applying forced alignment, or even performing lyrics
recognition on the audio file.

Such techniques are heavily researched and working systems have been created.
However, these techniques are designed to detect a clean singing voice and
have not been tested on so-called \emph{extended vocal techniques} such as
grunting or growling. Growling is heavily used in extreme metal genres such as
\gls{dm}, but it must be noted that grunting is not a technique used only in
extreme metal styles. Similar or identical techniques have been used in
\emph{Beijing opera}, Japanese \emph{Noh} and also in more western styles such
as the jazz singing of Louis Armstrong~\cite{sakakibara_growl_2004}. It might
even be traced back to Viking times. For example, an Arab merchant visiting a
village in Denmark wrote in the tenth century~\cite{friis_vikings_2004}:

\begin{displayquote}
    Never before have I heard uglier songs than those of the Vikings in
    Slesvig. The growling sound coming from their throats reminds me of dogs
    howling, only more untamed.
\end{displayquote}

\section{\gls{dm}}

%Literature overview / related work
\section{Related work}
The field of applying standard speech processing techniques to music started
in the late 1990s~\cite{saunders_real-time_1996,scheirer_construction_1997},
and it was found that music has different discriminating features than normal
speech.

Berenzweig and Ellis expanded on the aforementioned research by trying to
separate singing from instrumental music~\cite{berenzweig_locating_2001}.

\todo{Incorporate this in literary framing}%
~\cite{fujihara_automatic_2006}%
~\cite{fujihara_lyricsynchronizer:_2011}%
~\cite{fujihara_three_2008}%
~\cite{mauch_integrating_2012}%
~\cite{mesaros_adaptation_2009}%
~\cite{mesaros_automatic_2008}%
~\cite{mesaros_automatic_2010}%
~%\cite{muller_multimodal_2012}%
~\cite{pedone_phoneme-level_2011}%
~\cite{yang_machine_2012}%

\section{Research question}
It is debatable whether the aforementioned techniques also work for growling,
because the spectral properties of a growling voice differ from the spectral
properties of a clean singing voice. It has been found that growling voices
have less prominent peaks in the frequency representation and are closer to
noise than clean singing~\cite{kato_acoustic_2013}. This leads us to the
research question:

\begin{center}\em%
    Are standard \gls{ANN}-based techniques for singing voice detection
    suitable for non-standard musical genres like \gls{dm}?
\end{center}

\chapter{Methods}
%Methodology
%Experiment(s) (set-up, data, results, discussion)
\section{Data \& Preprocessing}
To run the experiments, data has been collected from several \gls{dm} albums.
The exact data used is listed in Appendix~\ref{app:data}. The albums are
extracted from the audio CDs and converted to a mono-channel waveform with the
correct sample rate using
\emph{SoX}\footnote{\url{http://sox.sourceforge.net/}}. Every file is
annotated using Praat~\cite{boersma_praat_2002}, in which the utterances are
manually aligned to the audio. Examples of such utterances are shown in
Figure~\ref{fig:bloodstained} and Figure~\ref{fig:abominations}, where the
waveform, the $1-8000$Hz spectrogram and the annotations are shown. It is
clearly visible that within the genre of death metal different spectral
patterns occur.
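As an illustration of the conversion step mentioned above (audio CD tracks to
mono waveforms with \emph{SoX}), a minimal sketch is given below. The
directory layout, file names and the $16$kHz target sample rate are
assumptions made for illustration only and are not taken from the experimental
setup.

\begin{verbatim}
# Sketch: batch-convert extracted album tracks to mono waveforms with SoX.
# Paths and the 16 kHz target rate are illustrative assumptions.
import subprocess
from pathlib import Path

SRC = Path("albums/raw")    # tracks extracted from the audio CDs
DST = Path("albums/mono")   # mono waveforms used for annotation and features
DST.mkdir(parents=True, exist_ok=True)

for track in sorted(SRC.glob("*.wav")):
    out = DST / track.name
    # -c 1 downmixes to one channel, -r 16000 resamples (assumed rate)
    subprocess.run(["sox", str(track), "-c", "1", "-r", "16000", str(out)],
                   check=True)
\end{verbatim}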
\begin{figure}[ht]
    \centering
    \includegraphics[width=.7\linewidth]{cement}
    \caption{A vocal segment of the \emph{Cannibal Corpse} song
        \emph{Bloodstained Cement}}\label{fig:bloodstained}
\end{figure}

\begin{figure}[ht]
    \centering
    \includegraphics[width=.7\linewidth]{abominations}
    \caption{A vocal segment of the \emph{Disgorge} song
        \emph{Enthroned Abominations}}\label{fig:abominations}
\end{figure}

The data is collected from three studio albums. The first band is
\emph{Cannibal Corpse}, which has been producing \gls{dm} for almost 25 years
and has stayed close to the same style on every album. The singer of
\emph{Cannibal Corpse} has a very raspy growl and the lyrics are quite
comprehensible. The vocals produced by \emph{Cannibal Corpse} border on
regular shouting.

The second band is \emph{Disgorge}, which makes even more violent-sounding
music. The growls of the lead singer sound like a coffee grinder and are more
shallow. In the spectrogram it is clearly visible that overtones are produced
during some parts of the growling. The lyrics are completely incomprehensible
and therefore some parts were not annotated with the actual lyrics, because it
was not possible to make out what was being sung.

Lastly, a band from Moscow was chosen, bearing the name \emph{Who Dies in
Siberian Slush}. This band is a little odd compared to the previous \gls{dm}
bands because it creates \gls{dom}. \gls{dom} is characterised by very slow
tempos and low-tuned guitars. The vocalist has a very characteristic growl and
performs in several Moscow-based bands. This band also stands out because it
uses pianos and synthesizers. The droning synthesizers often operate in the
same frequency range as the vocals.

\section{\gls{MFCC} Features}
The waveforms themselves are not very suitable as features due to their high
dimensionality and correlation. Therefore we use the often-used \glspl{MFCC}
as feature vectors.\todo{cite which papers use this} The actual conversion is
done using the \emph{python\_speech\_features}%
\footnote{\url{https://github.com/jameslyons/python_speech_features}} package.

\gls{MFCC} features are nature-inspired and built incrementally in several
steps.
\begin{enumerate}
    \item The first step in the process is converting the time representation
        of the signal to a spectral representation using a sliding window with
        overlap. The width of the window and the step size are two important
        parameters in the system. In classical phonetic analysis, window sizes
        of $25ms$ with a step of $10ms$ are often chosen because they are
        small enough to contain only subphone entities. Since no phone can be
        sung in just $25ms$, it is arguable that for singing this window size
        is on the small side.
    \item The standard \gls{FT} gives a spectral representation with linearly
        scaled frequencies. This scale is converted to the \gls{MS} using
        triangular overlapping windows; a common form of the Mel mapping is
        given after this list.
    \item The logarithm is taken of the energy in every \gls{MS} band,
        mimicking the roughly logarithmic loudness perception of the human
        ear.
    \item Finally, a discrete cosine transform is applied to decorrelate the
        log filterbank energies; the resulting coefficients are the
        \glspl{MFCC}, of which usually only the first handful are kept.
\end{enumerate}
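For reference, a commonly used form of the mapping from frequency $f$ in Hz to
Mel $m$ is shown below; this is the conventional formula, not necessarily the
exact variant implemented by \emph{python\_speech\_features}:
\[
    m = 2595\log_{10}\left(1 + \frac{f}{700}\right)
\]
The scale is roughly linear below about $1000$Hz and logarithmic above it,
which matches the frequency resolution of human hearing.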
\todo{Explain why MFCC and which parameters}
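As a sketch of how the feature extraction could look with the
\emph{python\_speech\_features} package mentioned above, using the $25ms$
window and $10ms$ step from the list above (the file name and the choice of
$13$ coefficients are illustrative assumptions, not taken from the
experimental setup):

\begin{verbatim}
# Sketch: extract MFCC vectors from a mono waveform produced by SoX.
from python_speech_features import mfcc
import scipy.io.wavfile as wav

rate, signal = wav.read("track01_mono.wav")  # illustrative file name
features = mfcc(signal,
                samplerate=rate,
                winlen=0.025,   # 25 ms analysis window
                winstep=0.01,   # 10 ms step between successive windows
                numcep=13)      # keep the first 13 coefficients
# 'features' has one row (a 13-dimensional MFCC vector) per analysis window.
\end{verbatim}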
\section{\gls{ANN} Classifier}
\todo{Spectrals might be enough, no decorrelation}

\section{Model training}

\section{Experiments}

\section{Results}

\chapter{Conclusion \& Discussion}
\section{Conclusion}
%Discussion section

\section{Discussion}

\todo{Novelty}
\todo{Weaknesses}
\todo{Dataset is not very varied but\ldots}

\todo{Doom metal}
%Conclusion section

%Acknowledgements

%Statement on authors' contributions

%(Appendices)
\appendix
\chapter{Experimental data}\label{app:data}
\begin{table}[h]
    \centering
    \begin{tabular}{cllll}
        \toprule
        Num. & Artist & Album & Song & Duration\\
        \midrule
        00 & Cannibal Corpse & A Skeletal Domain & High Velocity Impact Spatter & 04:06.91\\
        01 & Cannibal Corpse & A Skeletal Domain & Sadistic Embodiment & 03:17.31\\
        02 & Cannibal Corpse & A Skeletal Domain & Kill or Become & 03:50.67\\
        03 & Cannibal Corpse & A Skeletal Domain & A Skeletal Domain & 03:38.77\\
        04 & Cannibal Corpse & A Skeletal Domain & Headlong Into Carnage & 03:01.25\\
        05 & Cannibal Corpse & A Skeletal Domain & The Murderer's Pact & 05:05.23\\
        06 & Cannibal Corpse & A Skeletal Domain & Funeral Cremation & 03:41.89\\
        07 & Cannibal Corpse & A Skeletal Domain & Icepick Lobotomy & 03:16.24\\
        08 & Cannibal Corpse & A Skeletal Domain & Vector of Cruelty & 03:25.15\\
        09 & Cannibal Corpse & A Skeletal Domain & Bloodstained Cement & 03:41.99\\
        10 & Cannibal Corpse & A Skeletal Domain & Asphyxiate to Resuscitate & 03:47.40\\
        11 & Cannibal Corpse & A Skeletal Domain & Hollowed Bodies & 03:05.80\\
        12 & Disgorge & Parallels of Infinite Torture & Revealed in Obscurity & 05:13.20\\
        13 & Disgorge & Parallels of Infinite Torture & Enthroned Abominations & 04:05.39\\
        14 & Disgorge & Parallels of Infinite Torture & Atonement & 02:57.36\\
        15 & Disgorge & Parallels of Infinite Torture & Abhorrent Desecration of Thee Iniquity & 04:17.20\\
        16 & Disgorge & Parallels of Infinite Torture & Forgotten Scriptures & 02:01.72\\
        17 & Disgorge & Parallels of Infinite Torture & Descending Upon Convulsive Devourment & 04:38.85\\
        18 & Disgorge & Parallels of Infinite Torture & Condemned to Sufferance & 04:57.59\\
        19 & Disgorge & Parallels of Infinite Torture & Parallels of Infinite Torture & 05:03.33\\
        20 & Disgorge & Parallels of Infinite Torture & Asphyxiation of Thee Oppressed & 05:42.37\\
        21 & Disgorge & Parallels of Infinite Torture & Ominous Sigils of Ungodly Ruin & 04:59.15\\
        22 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Leave Me & 06:35.60\\
        23 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & The Woman We Are Looking For & 06:53.63\\
        24 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & M\"obius Ring & 07:20.56\\
        25 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Interlude & 04:26.49\\
        26 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Завещание Гумилёва & 08:46.76\\
        27 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & An Old Road Through The Snow & 02:31.56\\
        28 & Who Dies In Siberian Slush & Bitterness Of The Years That Are Lost & Bitterness Of The Years That Are Lost & 09:10.49\\
        \midrule
         & & & Total: & 02:13:40\\
        \bottomrule
    \end{tabular}
    \caption{Songs used in the experiments (durations in mm:ss.cc, total in
        hh:mm:ss)}
\end{table}

\bibliographystyle{ieeetr}
\bibliography{asr}