Music Source Separation
Francisco Javier Cifuentes García
I. INTRODUCTION
2) Spectrogram Representation: The spectrogram is the graphical representation of the time-varying magnitude spectrum of a given signal, i.e. the time-varying representation of the Fourier Transform. For each sampled time window, the amplitude or intensity of each frequency is usually represented as a heat map in dB or sound pressure level with respect to a maximum amplitude. This is because most sounds humans hear are concentrated in very small frequency and amplitude ranges, so in order to better visualize the spectrum both magnitude and frequencies should be represented on a logarithmic scale. In the case of song processing it is applied as follows: the time domain data is split into overlapping windows and the STFT is applied to compute the amplitude of each frequency in the given window; the magnitude is then converted into dB scale for each frequency (in the case of a power spectrogram this is done over the square of the magnitude) and the windows are overlapped in order to create a visual representation of the time- and frequency-varying power of the track. Figure 2 depicts an example spectrogram of a 14 second vocal track where the fundamental frequency of the voice, its harmonics and unvoiced speech are noticeable in each time frame. The harmonic nature is due to the fact that vocal sound is produced by the vibration of the vocal folds and filtered by the vocal tract, so melodies are mostly harmonic.

Fig. 2: Spectrogram of a sampled vocal track.
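As an illustration of the procedure just described, the following minimal sketch (not part of the original implementation) computes a log-magnitude spectrogram with SciPy; the window length, hop size and the variable names are assumptions chosen for illustration only.

import numpy as np
from scipy.signal import stft

def log_spectrogram(x, fs, n_fft=4096, hop=1024):
    # Split the signal into overlapping windows and apply the STFT
    f, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(X)
    # Magnitude in dB with respect to the maximum amplitude, as in the heat-map view
    mag_db = 20 * np.log10(np.maximum(mag, 1e-12) / mag.max())
    return f, t, mag_db

if __name__ == "__main__":
    fs = 44100
    x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone
    f, t, S = log_spectrogram(x, fs)
    print(S.shape)  # (n_fft // 2 + 1, number of frames)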
III. LITERATURE REVIEW

In this section a concise overview of different methods for voice separation in music tracks is presented. For a detailed explanation of signal processing, audio modelling and a comprehensive discussion of lead and accompaniment separation the author recommends reference [6].

A. Model-based techniques

These methods are grounded on tracking the pitch of the vocals and estimating the energy at the harmonics of the fundamental in order to reconstruct the voice. In other words, considering the vocals to be perfectly harmonic and with the strongest amplitude in the mixture, the fundamental of the lead signal is first extracted at each time window and then, by resynthesis or by filtering the track, the extraction of the voice is concluded [7]. These methods prove their superior performance when the assumption of a harmonic lead signal is valid, but for most songs this is not the case, as vocals present unvoiced speech, whispers and saturation. Furthermore, when another instrument is louder or the singer is silent, the incorrect sound is isolated. Thus, these techniques do not fully handle vocals that are not strictly harmonic or not dominant in the mix, resulting in a lower, non-acceptable performance.

Another approach is based on modelling the accompaniment, supposing a redundant structure of the piece and considering the limited range of fundamental frequencies of the instrumental notes. One common technique is based on non-negative matrix factorization to spot the groups of the mixture and then aggregate them into voice or accompaniment. Once the spectra are clustered, the separation and time domain reconstruction are performed by a Wiener filter [8], [9]. Detecting the redundant structure of the instrumentals is based on the repeating pattern extraction technique [10] and its generalization called kernel additive modelling [11], followed by robust principal component analysis or singular value decomposition [12]. Using the models of the vocals and the accompaniment the separation is successfully performed when the assumptions hold; however, this is not always the case and therefore these methods often exhibit low performance.
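To make the accompaniment-modelling idea concrete, the sketch below factorizes a magnitude spectrogram with scikit-learn's NMF and builds a Wiener-style soft mask. It is not the method of [8]-[12]: the clustering of components into voice (vocal_idx), the complex mixture STFT X and the number of components are assumed inputs chosen for illustration.

import numpy as np
from sklearn.decomposition import NMF

def nmf_wiener_separation(V, X, vocal_idx, n_components=30):
    # V: non-negative magnitude spectrogram of the mixture, shape (freq_bins, frames)
    # X: complex STFT of the mixture with the same shape (assumed precomputed)
    model = NMF(n_components=n_components, init="random", max_iter=200, random_state=0)
    W = model.fit_transform(V)   # spectral templates (freq_bins, n_components)
    H = model.components_        # activations        (n_components, frames)

    # Aggregate the components previously clustered as voice / accompaniment
    V_voice = W[:, vocal_idx] @ H[vocal_idx, :]
    V_accomp = np.maximum(W @ H - V_voice, 0.0)

    # Wiener-style soft mask applied to the complex mixture STFT
    mask = V_voice ** 2 / (V_voice ** 2 + V_accomp ** 2 + 1e-12)
    return mask * X              # complex STFT estimate of the voice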
B. Data-driven techniques

These methods do not make any assumptions about the structure of the accompaniment or the frequency domain characteristics of the voice; instead, they are based on learning the extraction rules from numerous representative examples. The output of these methods is usually the voice spectrogram or the frequency domain transfer function (TF) mask that maps the vocals from the mixed track. These methods show the same issues as every data-driven problem solving method: they rely on a large representative dataset, require tuning several parameters and are prone to overfitting.

One group of techniques rests on probabilistic methods such as Bayesian models [13], Gaussian mixture models [14] and
The output of the LSTM is concatenated with the skip connection from the hyperbolic tangent, similarly to what the literature proposes, increasing the dimensionality. Then another encoder consisting of a fully connected network with an output size of 512 maps the stretched vector to the size of the output decoder (symmetric in size to the input encoder), and a rectified linear unit (ReLU) is applied beforehand so that the output is in the range [0, ∞). After the decoder fully connected network mentioned previously is applied, the output scaling is performed, followed by another ReLU that outputs the vocal mask. This mask is applied to the spectrogram of the mixed song and the vocal track is recovered in the time domain by inverting the spectrogram, either by Wiener filtering [31] as in the usual method [8], [9], [29] or by using the fast Griffin-Lim algorithm for an approximate reconstruction [32].
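A minimal sketch of this masking-and-inversion step is given below, using SciPy's inverse STFT instead of the Wiener filtering of [31] or the fast Griffin-Lim of [32]; the model argument is a placeholder for the trained network and the mono input is an assumption made to keep the example short.

import numpy as np
from scipy.signal import stft, istft

N_FFT, HOP, FS = 4096, 1024, 44100

def separate_vocals(mixture, model):
    # mixture: 1-D (mono) waveform of the mixed song; model: callable returning a
    # non-negative vocal mask for a magnitude spectrogram (both assumed available)
    _, _, X = stft(mixture, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    mask = model(np.abs(X))     # values in [0, inf) after the final ReLU
    vocals_stft = mask * X      # apply the mask to the complex mixture STFT
    _, vocals = istft(vocals_stft, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return vocals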
The size of each internal layer is determined by the number of channels of the track, the STFT parameters and the size of the output of the first NN, which is set to 512 based on the aforementioned state of the art implementations. Half of this value is the hidden layer size of the LSTM. Therefore all layer sizes can be computed from the dimension of the first NN output (512) and the size of the STFT, n_FFT = 4096, as it determines the input and output layer sizes to be ⌊n_FFT/2⌋ + 1 = 2049 in the case of mono tracks and double that value, 4098, for stereo songs. The second parameter of the STFT is the distance between neighboring sliding window frames, which is set to 1024.
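This layer-size bookkeeping can be reproduced in a few lines; the variable names below are illustrative only.

n_fft = 4096
channels = 2                    # stereo track

freq_bins = n_fft // 2 + 1      # 2049 bins per channel
io_size = freq_bins * channels  # 4098 for stereo, 2049 for mono
encoder_out = 512               # output size of the first fully connected layer
lstm_hidden = encoder_out // 2  # 256, half of the encoder output

print(freq_bins, io_size, encoder_out, lstm_hidden)  # 2049 4098 512 256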
The loss function selected is called smooth L1 loss or Huber loss and it is a modified version of the L1 loss used in [26] and [27], which is simply the mean absolute error (MAE) or norm-1 between each element in the input x and target output y. This modification of the MAE is less sensitive to outliers and prevents exploding gradients [33]. The smooth L1 loss is computed by the expression (2).

\[
\mathrm{loss} = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad
z_i =
\begin{cases}
\tfrac{1}{2}(x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\
|x_i - y_i| - \tfrac{1}{2}, & \text{otherwise}
\end{cases}
\tag{2}
\]
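In PyTorch this corresponds to torch.nn.SmoothL1Loss with its default threshold of 1; the tensor shapes below are arbitrary placeholders, not the shapes used in the actual implementation.

import torch

criterion = torch.nn.SmoothL1Loss()         # smooth L1 / Huber loss of (2)

x = torch.rand(32, 2, 2049, 256, requires_grad=True)  # network output (placeholder shape)
y = torch.rand(32, 2, 2049, 256)                       # target spectrogram / mask
loss = criterion(x, y)                                 # mean of z_i over all elements
loss.backward()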
V. CASE STUDY

For this work the largest public dataset for music source separation, MUSDB [34], has been requested for educational purposes. It is composed of 150 professionally produced, high quality songs: 10 hours and 5.3 GB of audio. Five audio files are available for each song: the mix and four separated tracks called stems (drums, bass, vocals and other). Only western music genres are present, with a vast majority of pop/rock songs, along with some hip-hop, rap and metal songs. A recommended split of 100 songs for the training set and 50 for the test set is provided beforehand. As some authors have noticed, this data set is genre-biased towards pop and rock music [35].
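A sketch of how the dataset can be parsed with the sigsep musdb package is given below; the root path is a placeholder and the keyword and attribute names follow the package as the author recalls them, so they should be checked against the musdb documentation.

import musdb

# Parse the 100-song training split of MUSDB18 (path is a placeholder)
mus = musdb.DB(root="/data/MUSDB18", subsets="train")

for track in mus:
    mixture = track.audio                   # (nb_samples, 2) stereo mixture
    vocals = track.targets["vocals"].audio  # isolated vocal stem
    drums = track.targets["drums"].audio
    print(track.name, mixture.shape, track.rate)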
The details of the training procedure are very influential on the performance. In this case, the training tracks are cropped into smaller 8 s excerpts and, in order to fit the data of each batch into memory, the number of tracks sampled is set to 32. Following the guidelines of [26] and [27], the adopted optimizer is Adam, which is a widely used algorithm for first-order gradient-based optimization of stochastic objective functions and is included in most software packages due to its computational efficiency and low memory requirements [36]. The main parameters of the neural network training are summarized in Table I.

Training samples    102400
Sample duration     8 s
Channels            2
Frequency           44.1 kHz
Optimizer           Adam
Learning rate       1e-3
Weight decay        1e-5

TABLE I: Training parameters.
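The optimizer settings of Table I translate directly to PyTorch. The network below is a trivial stand-in (not the encoder-LSTM-decoder described earlier) and the batch is random data, included only so the snippet runs end to end.

import torch
import torch.nn as nn

# Illustrative stand-in for the actual encoder-LSTM-decoder network
model = nn.Sequential(nn.Linear(4098, 512), nn.Tanh(), nn.Linear(512, 4098), nn.ReLU())

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.SmoothL1Loss()

# One illustrative training step on a random batch of 32 excerpts (placeholder data)
x = torch.rand(32, 4098)
y = torch.rand(32, 4098)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()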
Three widely used metrics to assess separation quality are the Source to Distortion, Source to Artefact and Source to Interference ratios (SDR, SAR, SIR) [28], [37]. In this case the SIR is computed to compare the implemented method with the state of the art techniques. This metric is defined by formula (3).

\[
\mathrm{SIR} = 20 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert}{\lVert e_{\mathrm{interference}} \rVert}
\tag{3}
\]

The argument of the logarithm is the ratio between the energy of the target signal and the energy of the interference signal; the higher the value of the SIR, the better the performance of the method. The computation is done on the time domain representation of the target signal with respect to the ground truth sources using the package museval [38].
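Formula (3) can be evaluated directly once the target and interference components of an estimate are available; the decomposition itself (and the metrics reported here) is produced with museval [38], so the function below is only a literal transcription of (3) with assumed array inputs.

import numpy as np

def sir_db(s_target, e_interference):
    # Literal transcription of (3): 20*log10(||s_target|| / ||e_interference||)
    num = np.linalg.norm(s_target)
    den = np.linalg.norm(e_interference) + 1e-12  # avoid division by zero
    return 20.0 * np.log10(num / den)

# Toy check: interference 10x weaker in norm gives +20 dB
s = np.ones(44100)
e = 0.1 * np.ones(44100)
print(round(sir_db(s, e), 1))  # 20.0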
VI. RESULTS

The training of the neural network took 12 hours and the evolution of the loss function during this process is shown in Figure 4, with a final value of 0.0028. The results over the test set show, both from direct listening to the tracks and from the comparison presented in Table II, that the performance is lower than that of the state of the art trained neural networks for music source separation. This is mostly due to the fact that in this case a total of 228 hours of tracks (randomly cropped) were used to train the network, while in the referenced techniques the total duration of training audio (also randomly cropped) is considerably higher, around 150000 hours at least, obtained by training for several epochs on an augmented data set constructed by randomly mixing the stems of different tracks and swapping the left/right channels of each instrument, resulting in a training time of a full week [29], [26]. Therefore the lower performance obtained with this reduced training is not unexpected, although it shows that the increase of performance with the number of epochs and training time is not linear.
Fig. 4: Loss function value during training.

Method          Signal to Interference Ratio [dB]
Case study       9.15
Demucs [24]     12.26
UMX [29]        13.33
Spleeter [26]   15.86

TABLE II: Test results.

The implemented algorithm is about 3 dB worse than the U-Net of Demucs [27], which is in turn about 3 dB below the performance of the best method, Spleeter [26]. In light of this, the capability gap may be considered acceptable for a first approach to the problem, regardless of the limitations due to different computing capabilities.

VII. CONCLUSION

The problem of blind music source separation has been introduced and the specialised literature reviewed, showing that the state of the art techniques are able to separate vocals from accompaniment successfully, although for the overlapping frequencies of the components they tend to exhibit low performance. The approach that shows the best compromise between complexity and execution among the reviewed techniques is a neural network encoder-LSTM-decoder architecture, which has been implemented in Python under the PyTorch framework, and trained and tested on the largest public professional quality data set, namely MUSDB. The resulting method is capable of separating the vocal track from any given song's digital audio. The test results point out that limitations due to memory and computational effort lead to lower performance than the state of the art techniques. Further improvements may be made in the definition of the training strategy: parallel computing, more training epochs and different loss functions and training parameters.

REFERENCES

[1] K. Brown and P. Auslander, Karaoke Idols: Popular Music and the Performance of Identity. Intellect, 2015, ISBN: 9781783204441. [Online]. Available: https://books.google.es/books?id=9I-JCgAAQBAJ.
[2] E. Pollastri, "A pitch tracking system dedicated to process singing voice for music retrieval," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1, Aug. 2002, pp. 341–344. DOI: 10.1109/ICME.2002.1035788.
[3] M. Bosi and R. Goldberg, Introduction to Digital Audio Coding and Standards, ser. The Springer International Series in Engineering and Computer Science. Springer US, 2002, ISBN: 9781402073571. [Online]. Available: http://books.google.es/books?id=oHWIRmHpi8YC.
[4] N. Kehtarnavaz, "Chapter 7 - Frequency domain processing," in Digital Signal Processing System Design (Second Edition), N. Kehtarnavaz, Ed., Burlington: Academic Press, 2008, pp. 175–196, ISBN: 978-0-12-374490-6. DOI: https://doi.org/10.1016/B978-0-12-374490-6.00007-6. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123744906000076.
[5] J. Guckert, The use of FFT and MDCT in MP3 audio compression, 2012. [Online]. Available: http://www.math.utah.edu/~gustafso/s2012/2270/web-projects/Guckert-audio-compression-svd-mdct-MP3.pdf.
[6] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1307–1335, Aug. 2018, ISSN: 2329-9304. DOI: 10.1109/TASLP.2018.2825440. [Online]. Available: http://dx.doi.org/10.1109/TASLP.2018.2825440.
[7] J. Salamon, E. Gomez, D. P. W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, Mar. 2014, ISSN: 1558-0792. DOI: 10.1109/MSP.2013.2271648.
[8] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proceedings of the 6th International Conference on Music Information Retrieval, http://ismir2005.ismir.net/proceedings/1028.pdf, London, UK, Sep. 2005, pp. 337–344.
[9] C. Puntonet and A. Prieto, Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004, Proceedings, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2004, ISBN: 9783540301103. [Online]. Available: https://books.google.es/books?id=84X0BwAAQBAJ.
[10] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 221–224.
[11] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, "Kernel additive models for source separation," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298–4310, 2014.
[12] P. Seetharaman, F. Pishdadian, and B. Pardo, "Music/voice separation using the 2D Fourier transform," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 36–40.
[13] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," Trans. Audio, Speech and Lang. Proc., vol. 15, no. 5, pp. 1564–1578, Jul. 2007, ISSN: 1558-7916. DOI: 10.1109/TASL.2007.899291. [Online]. Available: https://doi.org/10.1109/TASL.2007.899291.
[14] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed., IGI Global, 2010, pp. 162–185. DOI: 10.4018/978-1-61520-919-4.ch007. [Online]. Available: https://hal.inria.fr/inria-00544016.
[15] G. J. Mysore, P. Smaragdis, and B. Raj, "Non-negative hidden Markov modeling of audio with application to source separation," in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation, ser. LVA/ICA'10, St. Malo, France: Springer-Verlag, 2010, pp. 140–148, ISBN: 364215994X.
[16] P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, ser. ICA'07, London, UK: Springer-Verlag, 2007, pp. 414–421, ISBN: 3540744932.
[17] F. G. Zafar Rafii and D. Sun, Combining modeling of singing voice and background music for automatic separation of musical mixtures, Nov. 2013. [Online]. Available: https://ccrma.stanford.edu/~gautham/Site/Publications_files/rafii-ismir2013.pdf.
[18] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," CoRR, vol. abs/1607.02173, 2016. arXiv: 1607.02173. [Online]. Available: http://arxiv.org/abs/1607.02173.
[19] S. Nie, W. Xue, S. Liang, X. Zhang, and W.-J. Liu, "Joint optimization of recurrent networks exploiting source auto-regression for source separation," Sep. 2015.
[20] E. Grais, G. Roma, A. Simpson, and M. Plumbley, "Combining mask estimates for single channel audio source separation using deep neural networks," Sep. 2016, pp. 3339–3343. DOI: 10.21437/Interspeech.2016-216.
[21] S. Uhlich, F. Giron, and Y. Mitsufuji, "Deep neural network based instrument extraction from music," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2135–2139.
[22] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 261–265.
[23] A. Défossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, Sing: Symbol-to-instrument neural generator, 2018. arXiv: 1810.09785 [cs.SD].
[24] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," ArXiv, vol. abs/1806.03185, 2018.
[25] S. Mimilakis, E. Cano, J. Abeßer, and G. Schuller, "New sonorities for jazz recordings: Separation and mixing using deep neural networks," Sep. 2016.
[26] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models, Late-Breaking/Demo ISMIR 2019, Deezer Research, Nov. 2019.
[27] A. Défossez, N. Usunier, L. Bottou, and F. Bach, Demucs: Deep extractor for music sources with extra unlabeled data remixed, 2019. arXiv: 1909.01174 [cs.SD].
[28] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," Lecture Notes in Computer Science, pp. 293–305, 2018, ISSN: 1611-3349. DOI: 10.1007/978-3-319-93764-9_28. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-93764-9_28.
[29] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - a reference implementation for music source separation," Journal of Open Source Software, 2019. DOI: 10.21105/joss.01667. [Online]. Available: https://doi.org/10.21105/joss.01667.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Dec. 1997. DOI: 10.1162/neco.1997.9.8.1735.
[31] A. Liutkus and F.-R. Stöter, Sigsep/norbert: First official norbert release, version v0.2.0, Jul. 2019. DOI: 10.5281/zenodo.3269749. [Online]. Available: https://doi.org/10.5281/zenodo.3269749.
[32] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[33] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[34] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, Dec. 2017. DOI: 10.5281/zenodo.1117372. [Online]. Available: https://doi.org/10.5281/zenodo.1117372.
[35] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 506–510.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," Trans. Audio, Speech and Lang. Proc., vol. 14, no. 4, pp. 1462–1469, Jul. 2006, ISSN: 1558-7916. DOI: 10.1109/TSA.2005.858005. [Online]. Available: https://doi.org/10.1109/TSA.2005.858005.
[38] F.-R. Stöter and A. Liutkus, Museval 0.3.0, version v0.3.0, Aug. 2019. DOI: 10.5281/zenodo.3376621. [Online]. Available: https://doi.org/10.5281/zenodo.3376621.

Francisco Javier Cifuentes García received the Bachelor's degree in Energy Engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2018 and has since been doing an internship at CITCEA UPC while pursuing a Master's degree in Industrial Engineering specialized in Electronics. His research interests include power electronics dominated power systems, renewable energy technologies and machine learning, among others. He is also a music enthusiast, which is the main motivation for this work.