Music Source Separation
Francisco Javier Cifuentes García
I. INTRODUCTION
2) Spectrogram Representation: The spectrogram is the graphical representation of the time-varying magnitude spectrum of a given signal, i.e. the time-varying representation of the Fourier Transform. For each sampled time window, the amplitude or intensity of each frequency is usually represented as a heat map in dB or sound pressure level with respect to a maximum amplitude. This is because most sounds humans hear are concentrated in very small frequency and amplitude ranges, so in order to better visualize the spectrum both magnitude and frequencies should be represented on a logarithmic scale. In the case of song processing it is applied as follows: the time domain data is split into overlapping windows and the STFT is applied to compute the amplitude of each frequency in the given window; the magnitude is then converted into dB scale for each frequency (in the case of a power spectrogram this is done over the square of the magnitude) and the windows are overlapped in order to create a visual representation of the time- and frequency-varying power of the track. Figure 2 depicts an example spectrogram of a 14 second vocal track where the fundamental frequency of the voice, its harmonics and unvoiced speech are noticeable in each time frame. The harmonic nature is due to the fact that vocal sound is produced by the vibration of the vocal folds and filtered by the vocal tract, so melodies are mostly harmonic.

Fig. 2: Spectrogram of a sampled vocal track.
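As an illustration of the procedure just described, the following minimal sketch (not part of the original implementation) computes a log-magnitude spectrogram with SciPy; the window length, hop size and the variable names are assumptions chosen for illustration only.

import numpy as np
from scipy.signal import stft

def log_spectrogram(x, fs, n_fft=4096, hop=1024):
    # Split the signal into overlapping windows and apply the STFT
    f, t, X = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    mag = np.abs(X)
    # Magnitude in dB with respect to the maximum amplitude, as in the heat-map view
    mag_db = 20 * np.log10(np.maximum(mag, 1e-12) / mag.max())
    return f, t, mag_db

if __name__ == "__main__":
    fs = 44100
    x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # 1 s test tone
    f, t, S = log_spectrogram(x, fs)
    print(S.shape)  # (n_fft // 2 + 1, number of frames)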
III. LITERATURE REVIEW

In this section a concise overview of different methods for voice separation in music tracks is presented. For a detailed explanation of signal processing, audio modelling and a comprehensive discussion of lead and accompaniment separation the author recommends reference [6].

A. Model-based techniques

These methods are grounded on tracking the pitch of the vocals and estimating the energy at the harmonics of the fundamental in order to reconstruct the voice. In other words, considering the vocals to be perfectly harmonic and with the strongest amplitude in the mixture, the fundamental of the lead signal is first extracted at each time window and then, by resynthesis or by filtering the track, the extraction of the voice is concluded [7]. These methods prove their superior performance when the assumption of a harmonic lead signal is valid, but for most songs this is not the case, as vocals present unvoiced speech, whispers and saturation. Furthermore, when another instrument is louder or the singer is silent, the incorrect sound is isolated. Thus, these techniques do not fully handle vocals that are not strictly harmonic or not dominant in the mix, resulting in a lower, non-acceptable performance.

Another approach is based on modelling the accompaniment, supposing a redundant structure of the piece and considering the limited range of fundamental frequencies of the instrumental notes. One common technique is based on non-negative matrix factorization to spot the groups of the mixture and then aggregate them into voice or accompaniment. Once the spectra are clustered, the separation and time domain reconstruction are performed by a Wiener filter [8], [9]. Detecting the redundant structure of the instrumentals is based on the repeating pattern extraction technique [10] and its generalization called kernel additive modelling [11], followed by robust principal component analysis or singular value decomposition [12]. Using the models of the vocals and the accompaniment the separation is successfully performed when the assumptions hold; however, this is not always the case and therefore these methods often exhibit low performance.
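To make the accompaniment-modelling idea concrete, the sketch below factorizes a magnitude spectrogram with scikit-learn's NMF and builds a Wiener-style soft mask. It is not the method of [8]-[12]: the clustering of components into voice (vocal_idx), the complex mixture STFT X and the number of components are assumed inputs chosen for illustration.

import numpy as np
from sklearn.decomposition import NMF

def nmf_wiener_separation(V, X, vocal_idx, n_components=30):
    # V: non-negative magnitude spectrogram of the mixture, shape (freq_bins, frames)
    # X: complex STFT of the mixture with the same shape (assumed precomputed)
    model = NMF(n_components=n_components, init="random", max_iter=200, random_state=0)
    W = model.fit_transform(V)   # spectral templates (freq_bins, n_components)
    H = model.components_        # activations        (n_components, frames)

    # Aggregate the components previously clustered as voice / accompaniment
    V_voice = W[:, vocal_idx] @ H[vocal_idx, :]
    V_accomp = np.maximum(W @ H - V_voice, 0.0)

    # Wiener-style soft mask applied to the complex mixture STFT
    mask = V_voice ** 2 / (V_voice ** 2 + V_accomp ** 2 + 1e-12)
    return mask * X              # complex STFT estimate of the voice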
B. Data-driven techniques

These methods do not make any assumptions about the structure of the accompaniment or the frequency domain characteristics of the voice; instead, they are based on learning the extraction rules from numerous representative examples. The output of these methods is usually the voice spectrogram or the frequency domain transfer function (TF) mask that maps the vocals from the mixed track. These methods show the same issues as every data-driven problem solving method: they rely on a large representative dataset, require tuning several parameters and are prone to overfitting.

One group of techniques rests on probabilistic methods such as Bayesian models [13], Gaussian mixture models [14] and
The output of the LSTM is concatenated with the skip connection from the hyperbolic tangent, similarly to what the literature proposes, increasing the dimensionality. Then another encoder consisting of a fully connected network with an output size of 512 maps the stretched vector to the size of the output decoder (symmetric in size to the input encoder), and a rectified linear unit (ReLU) is applied beforehand so that the output is in the range [0, ∞). After the decoder fully connected network mentioned previously is applied, the output scaling is performed, followed by another ReLU that outputs the vocal mask. This mask is applied to the spectrogram of the mixed song and the vocal track is recovered in the time domain by inverting the spectrogram, either by Wiener filtering [31] as in the usual method [8], [9], [29] or by using the fast Griffin-Lim algorithm for an approximate reconstruction [32].
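A minimal sketch of this masking-and-inversion step is given below, using SciPy's inverse STFT instead of the Wiener filtering of [31] or the fast Griffin-Lim of [32]; the model argument is a placeholder for the trained network and the mono input is an assumption made to keep the example short.

import numpy as np
from scipy.signal import stft, istft

N_FFT, HOP, FS = 4096, 1024, 44100

def separate_vocals(mixture, model):
    # mixture: 1-D (mono) waveform of the mixed song; model: callable returning a
    # non-negative vocal mask for a magnitude spectrogram (both assumed available)
    _, _, X = stft(mixture, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    mask = model(np.abs(X))     # values in [0, inf) after the final ReLU
    vocals_stft = mask * X      # apply the mask to the complex mixture STFT
    _, vocals = istft(vocals_stft, fs=FS, nperseg=N_FFT, noverlap=N_FFT - HOP)
    return vocals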
The size of each internal layer is determined by the number of channels of the track, the STFT parameters and the size of the output of the first NN, which is set to 512 based on the aforementioned state of the art implementations. Half of this value is the hidden layer size of the LSTM. Therefore all layer sizes can be computed from the dimension of the first NN output (512) and the size of the STFT, n_FFT = 4096, as it determines the input and output layer sizes to be ⌊n_FFT/2⌋ + 1 = 2049 in the case of mono tracks and double that value, 4098, for stereo songs. The second parameter of the STFT is the distance between neighboring sliding window frames, which is set to 1024.
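This layer-size bookkeeping can be reproduced in a few lines; the variable names below are illustrative only.

n_fft = 4096
channels = 2                    # stereo track

freq_bins = n_fft // 2 + 1      # 2049 bins per channel
io_size = freq_bins * channels  # 4098 for stereo, 2049 for mono
encoder_out = 512               # output size of the first fully connected layer
lstm_hidden = encoder_out // 2  # 256, half of the encoder output

print(freq_bins, io_size, encoder_out, lstm_hidden)  # 2049 4098 512 256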
The loss function selected is called smooth L1 loss or Huber loss and it is a modified version of the L1 loss used in [26] and [27], which is simply the mean absolute error (MAE) or norm-1 between each element in the input x and target output y. This modification of the MAE is less sensitive to outliers and prevents exploding gradients [33]. The smooth L1 loss is computed by the expression (2).

\[
\mathrm{loss} = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad
z_i =
\begin{cases}
\tfrac{1}{2}(x_i - y_i)^2, & \text{if } |x_i - y_i| < 1 \\
|x_i - y_i| - \tfrac{1}{2}, & \text{otherwise}
\end{cases}
\tag{2}
\]
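In PyTorch this corresponds to torch.nn.SmoothL1Loss with its default threshold of 1; the tensor shapes below are arbitrary placeholders, not the shapes used in the actual implementation.

import torch

criterion = torch.nn.SmoothL1Loss()         # smooth L1 / Huber loss of (2)

x = torch.rand(32, 2, 2049, 256, requires_grad=True)  # network output (placeholder shape)
y = torch.rand(32, 2, 2049, 256)                       # target spectrogram / mask
loss = criterion(x, y)                                 # mean of z_i over all elements
loss.backward()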
V. CASE STUDY

For this work the largest public dataset for music source separation, MUSDB [34], has been requested for educational purposes. It is composed of 150 professionally produced, high quality songs: 10 hours and 5.3 GB of audio. Five audio files are available for each song: the mix and four separated tracks called stems (drums, bass, vocals and other). Only western music genres are present, with a vast majority of pop/rock songs, along with some hip-hop, rap and metal songs. A recommended split of 100 songs for the training set and 50 for the test set is provided beforehand. As some authors have noticed, this data set is genre-biased towards pop and rock music [35].
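A sketch of how the dataset can be parsed with the sigsep musdb package is given below; the root path is a placeholder and the keyword and attribute names follow the package as the author recalls them, so they should be checked against the musdb documentation.

import musdb

# Parse the 100-song training split of MUSDB18 (path is a placeholder)
mus = musdb.DB(root="/data/MUSDB18", subsets="train")

for track in mus:
    mixture = track.audio                   # (nb_samples, 2) stereo mixture
    vocals = track.targets["vocals"].audio  # isolated vocal stem
    drums = track.targets["drums"].audio
    print(track.name, mixture.shape, track.rate)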
The details of the training procedure are very influential on the performance. In this case, the training tracks are cropped into smaller 8 s excerpts and, in order to fit the data of each batch into memory, the number of tracks sampled is set to 32. Following the guidelines of [26] and [27], the adopted optimizer is Adam, which is a widely used algorithm for first-order gradient-based optimization of stochastic objective functions and is included in most software packages due to its computational efficiency and low memory requirements [36]. The main parameters of the neural network training are summarized in Table I.

Training samples    102400
Sample duration     8 s
Channels            2
Frequency           44.1 kHz
Optimizer           Adam
Learning rate       1e-3
Weight decay        1e-5

TABLE I: Training parameters.
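The optimizer settings of Table I translate directly to PyTorch. The network below is a trivial stand-in (not the encoder-LSTM-decoder described earlier) and the batch is random data, included only so the snippet runs end to end.

import torch
import torch.nn as nn

# Illustrative stand-in for the actual encoder-LSTM-decoder network
model = nn.Sequential(nn.Linear(4098, 512), nn.Tanh(), nn.Linear(512, 4098), nn.ReLU())

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = nn.SmoothL1Loss()

# One illustrative training step on a random batch of 32 excerpts (placeholder data)
x = torch.rand(32, 4098)
y = torch.rand(32, 4098)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()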
Three widely used metrics to assess separation quality are the Source to Distortion, Source to Artefact and Source to Interference ratios (SDR, SAR, SIR) [28], [37]. In this case the SIR is computed to compare the implemented method with the state of the art techniques. This metric is defined by formula (3).

\[
\mathrm{SIR} = 20 \log_{10} \frac{\lVert s_{\mathrm{target}} \rVert}{\lVert e_{\mathrm{interference}} \rVert}
\tag{3}
\]

The argument of the logarithm is the ratio between the energy of the target signal and the energy of the interference signal; the higher the value of the SIR, the better the performance of the method. The computation is done on the time domain representation of the target signal with respect to the ground truth sources using the package museval [38].
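Formula (3) can be evaluated directly once the target and interference components of an estimate are available; the decomposition itself (and the metrics reported here) is produced with museval [38], so the function below is only a literal transcription of (3) with assumed array inputs.

import numpy as np

def sir_db(s_target, e_interference):
    # Literal transcription of (3): 20*log10(||s_target|| / ||e_interference||)
    num = np.linalg.norm(s_target)
    den = np.linalg.norm(e_interference) + 1e-12  # avoid division by zero
    return 20.0 * np.log10(num / den)

# Toy check: interference 10x weaker in norm gives +20 dB
s = np.ones(44100)
e = 0.1 * np.ones(44100)
print(round(sir_db(s, e), 1))  # 20.0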
VI. RESULTS

The training of the neural network took 12 hours and the evolution of the loss function during this process is shown in Figure 4, with a final value of 0.0028. The results over the test set show, both from direct listening to the tracks and from the comparison presented in Table II, that the performance is lower than that of the state of the art trained neural networks for music source separation. This is mostly due to the fact that in this case a total of 228 hours of tracks (randomly cropped) were used to train the network, while in the referenced techniques the total duration of training audio (also randomly cropped) is considerably higher, around 150000 hours at least, obtained by training for several epochs on an augmented data set constructed by randomly mixing the stems of different tracks and swapping the left/right channels of each instrument, resulting in a training time of a full week [29], [26]. Therefore the lower performance obtained with this reduced training is not unexpected, although it shows that the increase of performance with the number of epochs and training time is not linear.
Fig. 4: Loss function value during training.

Method          Signal to Interference Ratio [dB]
Case study       9.15
Demucs [24]     12.26
UMX [29]        13.33
Spleeter [26]   15.86

TABLE II: Test results.

The implemented algorithm is about 3 dB worse than the U-Net of Demucs [27], which is in turn about 3 dB below the performance of the best method, Spleeter [26]. In light of this, the capability gap may be considered acceptable for a first approach to the problem, regardless of the limitations due to different computing capabilities.

VII. CONCLUSION

The problem of blind music source separation has been introduced and the specialised literature reviewed, showing that the state of the art techniques are able to separate vocals from accompaniment successfully, although for the overlapping frequencies of the components they tend to exhibit low performance. The approach that shows the best compromise between complexity and execution among the reviewed techniques is a neural network encoder-LSTM-decoder architecture, which has been implemented in Python under the PyTorch framework, and trained and tested on the largest public professional quality data set, namely MUSDB. The resulting method is capable of separating the vocal track from any given song's digital audio. The test results point out that limitations due to memory and computational effort lead to lower performance than the state of the art techniques. Further improvements may be made in the definition of the training strategy: parallel computing, more training epochs and different loss functions and training parameters.

REFERENCES

[1] K. Brown and P. Auslander, Karaoke Idols: Popular Music and the Performance of Identity. Intellect, 2015, ISBN: 9781783204441. [Online]. Available: https://books.google.es/books?id=9I-JCgAAQBAJ.
[2] E. Pollastri, "A pitch tracking system dedicated to process singing voice for music retrieval," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1, Aug. 2002, pp. 341–344. DOI: 10.1109/ICME.2002.1035788.
[3] M. Bosi and R. Goldberg, Introduction to Digital Audio Coding and Standards, ser. The Springer International Series in Engineering and Computer Science. Springer US, 2002, ISBN: 9781402073571. [Online]. Available: http://books.google.es/books?id=oHWIRmHpi8YC.
[4] N. Kehtarnavaz, "Chapter 7 - Frequency domain processing," in Digital Signal Processing System Design (Second Edition), N. Kehtarnavaz, Ed., Burlington: Academic Press, 2008, pp. 175–196, ISBN: 978-0-12-374490-6. DOI: https://doi.org/10.1016/B978-0-12-374490-6.00007-6. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123744906000076.
[5] J. Guckert, The use of FFT and MDCT in MP3 audio compression, 2012. [Online]. Available: http://www.math.utah.edu/~gustafso/s2012/2270/web-projects/Guckert-audio-compression-svd-mdct-MP3.pdf.
[6] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1307–1335, Aug. 2018, ISSN: 2329-9304. DOI: 10.1109/TASLP.2018.2825440. [Online]. Available: http://dx.doi.org/10.1109/TASLP.2018.2825440.
[7] J. Salamon, E. Gomez, D. P. W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, Mar. 2014, ISSN: 1558-0792. DOI: 10.1109/MSP.2013.2271648.
[8] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proceedings of the 6th International Conference on Music Information Retrieval, http://ismir2005.ismir.net/proceedings/1028.pdf, London, UK, Sep. 2005, pp. 337–344.
[9] C. Puntonet and A. Prieto, Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004, Proceedings, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2004, ISBN: 9783540301103. [Online]. Available: https://books.google.es/books?id=84X0BwAAQBAJ.
[10] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 221–224.
[11] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, "Kernel additive models for source separation," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298–4310, 2014.
[12] P. Seetharaman, F. Pishdadian, and B. Pardo, "Music/voice separation using the 2D Fourier transform," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 36–40.
[13] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," Trans. Audio, Speech and Lang. Proc., vol. 15, no. 5, pp. 1564–1578, Jul. 2007, ISSN: 1558-7916. DOI: 10.1109/TASL.2007.899291. [Online]. Available: https://doi.org/10.1109/TASL.2007.899291.
[14] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed., IGI Global, 2010, pp. 162–185. DOI: 10.4018/978-1-61520-919-4.ch007. [Online]. Available: https://hal.inria.fr/inria-00544016.
[15] G. J. Mysore, P. Smaragdis, and B. Raj, "Non-negative hidden Markov modeling of audio with application to source separation," in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation, ser. LVA/ICA'10, St. Malo, France: Springer-Verlag, 2010, pp. 140–148, ISBN: 364215994X.
[16] P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, ser. ICA'07, London, UK: Springer-Verlag, 2007, pp. 414–421, ISBN: 3540744932.
[17] F. G. Zafar Rafii and D. Sun, Combining modeling of singing voice and background music for automatic separation of musical mixtures, Nov. 2013. [Online]. Available: https://ccrma.stanford.edu/~gautham/Site/Publications_files/rafii-ismir2013.pdf.
[18] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," CoRR, vol. abs/1607.02173, 2016. arXiv: 1607.02173. [Online]. Available: http://arxiv.org/abs/1607.02173.
[19] S. Nie, W. Xue, S. Liang, X. Zhang, and W.-J. Liu, "Joint optimization of recurrent networks exploiting source auto-regression for source separation," Sep. 2015.
[20] E. Grais, G. Roma, A. Simpson, and M. Plumbley, "Combining mask estimates for single channel audio source separation using deep neural networks," Sep. 2016, pp. 3339–3343. DOI: 10.21437/Interspeech.2016-216.
[21] S. Uhlich, F. Giron, and Y. Mitsufuji, "Deep neural network based instrument extraction from music," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2135–2139.
[22] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 261–265.
[23] A. Défossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, Sing: Symbol-to-instrument neural generator, 2018. arXiv: 1810.09785 [cs.SD].
[24] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," ArXiv, vol. abs/1806.03185, 2018.
[25] S. Mimilakis, E. Cano, J. Abeßer, and G. Schuller, "New sonorities for jazz recordings: Separation and mixing using deep neural networks," Sep. 2016.
[26] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models, Late-Breaking/Demo ISMIR 2019, Deezer Research, Nov. 2019.
[27] A. Défossez, N. Usunier, L. Bottou, and F. Bach, Demucs: Deep extractor for music sources with extra unlabeled data remixed, 2019. arXiv: 1909.01174 [cs.SD].
[28] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," Lecture Notes in Computer Science, pp. 293–305, 2018, ISSN: 1611-3349. DOI: 10.1007/978-3-319-93764-9_28. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-93764-9_28.
[29] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - a reference implementation for music source separation," Journal of Open Source Software, 2019. DOI: 10.21105/joss.01667. [Online]. Available: https://doi.org/10.21105/joss.01667.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Dec. 1997. DOI: 10.1162/neco.1997.9.8.1735.
[31] A. Liutkus and F.-R. Stöter, Sigsep/norbert: First official norbert release, version v0.2.0, Jul. 2019. DOI: 10.5281/zenodo.3269749. [Online]. Available: https://doi.org/10.5281/zenodo.3269749.
[32] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[33] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[34] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, Dec. 2017. DOI: 10.5281/zenodo.1117372. [Online]. Available: https://doi.org/10.5281/zenodo.1117372.
[35] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 506–510.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," Trans. Audio, Speech and Lang. Proc., vol. 14, no. 4, pp. 1462–1469, Jul. 2006, ISSN: 1558-7916. DOI: 10.1109/TSA.2005.858005. [Online]. Available: https://doi.org/10.1109/TSA.2005.858005.
[38] F.-R. Stöter and A. Liutkus, Museval 0.3.0, version v0.3.0, Aug. 2019. DOI: 10.5281/zenodo.3376621. [Online]. Available: https://doi.org/10.5281/zenodo.3376621.

Francisco Javier Cifuentes García received the Bachelor's degree in Energy Engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2018 and has since been doing an internship at CITCEA UPC while pursuing a Master's degree in Industrial Engineering specialized in Electronics. His research interests include power electronics dominated power systems, renewable energy technologies and machine learning, among others. He is also a music enthusiast, which is the main motivation for this work.