
PATTERN RECOGNITION AND MACHINE LEARNING, UPC, JUNE 2020

Music Source Separation


Francisco Javier Cifuentes García†

Abstract—In this paper, the task of isolating the vocal recording from the instrumental components that together form a mixed song is approached with a data-driven method, based on a frequency-domain representation of the tune and a neural network that masks the voice track.
Index Terms—Music Source Separation, Blind Source Separation, Blind Audio Separation, Signal Separation, Source Separation, Machine Learning, Deep Neural Network, Supervised Learning.

I. INTRODUCTION

The problem of music source separation is key to a billion-dollar industry, namely the karaoke business, since a song with its vocals removed constitutes its main piece: the backing track. It is also the basis for the systematic application of lyric
extraction techniques [1]. In addition, the interest in listening to the a cappella version of a song has increased since raw studio recordings of famous songs have been released, e.g. Under Pressure by Queen, with more than 12.8M views on YouTube. Other typical applications are remixing and pitch tracking [2]. Therefore, the aim of this work is to present a useful tool to extract the vocal track from a given song.

The article is structured as follows: first, the problem is described and common audio data processing techniques are explained; second, a literature review of different approaches to develop such a tool is presented; third, the adopted approach is defined; and lastly, the specific data set and performance results are analyzed before the conclusion.

† Contact: franlikestheblues@yahoo.com. June 5, 2020, Barcelona, Spain.

II. PROBLEM DESCRIPTION AND DATA PREPROCESSING

What humans perceive as sound are pressure waves, i.e. analogue information by nature, which needs to be captured and converted into digital data for processing, transmission and storage purposes. In the case of music tracks the usual coding format for digital audio is MPEG-1 Audio Layer III, commonly known as MP3. The audio signal is encoded at a certain sampling frequency, usually 44.1 kHz, which has an effect on the quality of the encoder algorithm as well as the complexity of the signal being encoded, because the MP3 standard allows for freedom regarding encoding algorithms: different encoders exhibit different quality, even with identical bit rates. The encoding relies primarily on masking curves, used to calculate frequency and temporal masking [3]. Looking at the information directly in the time domain, the aspect of a 14-second sample from the track Superstition by Stevie Wonder is shown in figure 1.

Fig. 1: Time domain aspect of a music track.

In light of this example it is clear that the time domain representation of the signal does not offer sufficient insight into the structure that builds it; in other words, it is not clear what the contribution of each instrument is at each instant. Therefore, in order to represent a digital music track in a way that makes it easier to identify the different components as perceived by the human ear, the following two techniques are applied: the Short Time Fast Fourier Transform and the Energy/Power Spectrogram Representation.

1) Short Time Fast Fourier Transform (STFT): This technique allows one to determine the sinusoidal frequency and phase content of local sections of a signal as it changes over time. In digital tracks, the time signal is divided into shorter overlapping segments of equal length (Short Time), called windows, and then the Discrete Fourier Transform is computed separately on each segment. Therefore, both the signal and the time window function are discrete quantities. The "fast" adjective means that the computation of the Fourier Transform is based on an optimized algorithm that quantizes and discretizes the frequency spectrum. Although most scientific software packages, such as SciPy or PyTorch, include this algorithm, its performance relies on the selection of the proper parameters of the transformation, so for further clarification the mathematical formula (1) is presented.

X(m, \omega) = \sum_{n=-\infty}^{\infty} x[n]\, w[n-m]\, e^{-j\omega n}    (1)

Where X(m, ω) is the Discrete Fourier Transform of the windowed data over the segmentation window function w with shift m, x[n] denotes the signal and ω is the frequency. The STFT is the real-time implementation, making use of the symmetry properties of the Discrete Fourier Transform, and can be interpreted as the Fourier transform of x[n]w[n − m]. It is worth noting that there is a trade-off between time and frequency resolution: a narrow window results in better resolution in the time domain but poor resolution in the frequency domain, and vice versa [4]; hence the window parameters should be selected according to the signal and the application. In addition, it is interesting to note that
this transformation is also applied in the encoding algorithm
of MP3 with a number of 1024 points per sample [5].
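
As an illustration of formula (1) and of the time/frequency trade-off discussed above, the following minimal Python sketch computes an STFT with SciPy and converts the magnitude to dB. The synthetic test signal and the specific parameter values are assumptions chosen for the example, not data or settings taken from this work.

# Illustrative STFT sketch (assumed example signal and parameters).
import numpy as np
from scipy.signal import stft

fs = 44100                               # sampling frequency in Hz
t = np.arange(0, 2.0, 1.0 / fs)          # 2-second synthetic "track"
x = np.sin(2 * np.pi * 440 * t) + 0.01 * np.random.randn(t.size)

# Equation (1): Hann window w[n - m], hop of 1024 samples between frames,
# 4096-point FFT per frame. A larger nperseg gives finer frequency resolution
# but coarser time resolution, and vice versa.
freqs, frames, X = stft(x, fs=fs, window="hann", nperseg=4096,
                        noverlap=4096 - 1024)

# Magnitude spectrogram in dB relative to the peak, as used for visualization.
spec_db = 20 * np.log10(np.abs(X) + 1e-10)
spec_db -= spec_db.max()

print(X.shape)  # (4096 // 2 + 1 frequency bins, number of time frames)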

2) Spectrogram Representation: The spectrogram is the graphical representation of the time-varying magnitude spectrum of frequencies of a given signal, i.e. the time-varying representation of the Fourier Transform. For each sampled time window, the amplitude or intensity of each frequency is usually represented as a heat map in dB or sound pressure level with respect to a maximum amplitude. This is due to the fact that most sounds humans hear are concentrated in very small frequency and amplitude ranges, so in order to better visualize the spectrum both magnitude and frequencies should be represented in a logarithmic fashion. The way it is applied in the case of song processing is the following: the time domain data is split into overlapping windows and the STFT is applied to compute the amplitude of each frequency for the given window, then the magnitude is converted into dB scale for each frequency (in the case of a power spectrogram this is done over the square of the magnitude) and the windows are overlapped in order to create a visual representation of the time- and frequency-varying power of the track. Figure 2 depicts an example spectrogram of a 14-second vocal track where the fundamental frequency of the voice, its harmonics and unvoiced speech are noticeable for each time frame. The harmonic nature is due to the fact that vocal sound is produced by the vibration of the vocal folds and filtered by the vocal tract, so melodies are mostly harmonic.

Fig. 2: Spectrogram of a sampled vocal track.

III. LITERATURE REVIEW

In this section a concise overview of different methods for voice separation in music tracks is presented. For a detailed explanation of signal processing and audio modeling, and a comprehensive discussion of lead and accompaniment separation, the author recommends reference [6].

A. Model-based techniques

These methods are grounded on tracking the pitch of the vocals and estimating the energy at the harmonics of the fundamental for reconstructing the voice. In other words, considering the vocals perfectly harmonic and with the strongest amplitude in the mixture, first the fundamental of the lead signal is extracted at each time window and then, by resynthesis or by filtering the track, the extraction of the voice is concluded [7]. These methods prove their superior performance when the assumption of a harmonic lead signal is valid, but for most songs this is not the case, as vocals present unvoiced speech, whispers and saturation. Furthermore, when another instrument is louder or the singer is silent, the incorrect sound is isolated. Thus, these techniques do not fully handle the cases of not-perfectly-harmonic speech and a non-dominant singing voice, resulting in a lower, non-acceptable performance.

Another approach is based on modelling the accompaniment, supposing a redundant structure of the piece and considering the limited range of fundamental frequencies of the instrumental notes. One common technique is based on non-negative matrix factorization to spot the groups of the mixture and then aggregate them into voice or accompaniment. Once the spectra are clustered, the separation and time domain reconstruction are performed by a Wiener filter [8], [9]. Detecting the redundant structure of the instrumentals is based on the repeating pattern extraction technique [10] and its generalization called kernel additive modelling [11], and then applying robust principal component analysis or singular value decomposition [12]. Using the model of the vocals and the accompaniment, the separation is successfully performed when the assumptions hold; however, this is not always the case and therefore these methods often exhibit low performance.

B. Data-driven techniques

These methods do not make any assumptions on the structure of the accompaniment or the frequency domain characteristics of the voice; instead, they are based on learning the extraction rules from numerous representative examples. The output of these methods is usually the voice spectrogram or the frequency domain transfer function (TF) mask that maps the vocals from the mixed track. These methods show the same issues as every data-driven problem solving method: they rely on a large representative dataset, require tuning several parameters and are prone to overfitting.

One group of techniques rests on probabilistic methods such as Bayesian models [13], Gaussian mixture models [14] and

Hidden Markov models [15] to estimate the vocals. There are also mixed approaches using some model information in the learning strategy, such as the ones presented in [16] and [17].
Other approaches are based on deep neural networks (DNN) and, although they represent a vast share of the research effort on this topic, they are still in the early stages of development [6]. A neural network is a set of sequential transformations of the input which applies the transformation parameters learnt during a training stage, when the network is calibrated by optimizing some function of the input and target output, called the loss function. There are techniques that rely on deep clustering [18] for the estimation of the TF mask. Another remarkable approach is to make the construction of the TF mask part of the neural network itself in a non-linear fashion, thus including the filtering inside the recurrent NN and resulting in high performance [19]. The optimal
network architecture may vary from one track to another,
therefore some authors considered a fusion of methods aggre-
gating the results from an ensemble of feed-forward DNNs to
predict TF masks for separation [20]. A feed-forward neural
network (FNN) with consecutive bins from the spectrogram
as inputs is proposed in [21] and improved in [22] by using
bidirectional long short-term memory (LSTM). The LSTM architecture of [23], used for music note synthesis, served as inspiration for later model architectures such as [24]. Spectral information propagation is used in [24], feeding the convolution of the spectrogram to intermediate output layers of the network to estimate a mask of the input track. The introduction of skip connections in [25] allows for propagation of the spectrogram to transition sections in the network. More recent developments proved the effectiveness of architectures with skip connections, using encoder/decoder Convolutional Neural Networks (CNN) with LSTM (U-Net) and direct synthesis, and noted the importance of adequate scaling [26], [27].

IV. ADOPTED APPROACH


As the 2018 Signal Separation Evaluation Campaign
pointed out, the vast majority of methods submitted are
based on deep learning, reflecting a shift in the community’s
methodology. In addition, the spectrogram based methods
outperformed the ones working directly on the waveform
domain [28].

Considering the previous analysis of current trends, the adopted technique is based on deep learning with the objective of constructing a mask. Specifically, it is based on [26], [27] and [29]. Figure 3 shows the architecture of the whole method.
In particular, the input is the time domain stereo song which
is transformed into a spectrogram using the STFT, then the
spectrogram is divided into segments and normalized using
the frequency data of the entire track. The data is then passed through an encoder stage consisting of a fully connected neural network without bias that reduces the dimension. Afterwards, the hyperbolic tangent is applied, mapping the values to the range (-1, 1). The main design choice is then encountered, which consists of three bidirectional LSTM layers. This network allows the evaluation of the current data taking into account information from the previous and next data bins thanks to its recurrent nature [30].

Fig. 3: Neural network architecture and data flow.
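
To make the data flow of figure 3 concrete, the following PyTorch sketch assembles the layers described in this section and continued below (encoder, tanh, three bidirectional LSTM layers, skip connection, second encoder with ReLU, decoder and final ReLU). It is a minimal illustration under assumed sizes, with normalization and the learned output scaling simplified; it is not the exact implementation used in this work.

import torch
import torch.nn as nn

class VocalMaskNet(nn.Module):
    # Sketch of the encoder / BLSTM / decoder masking network described in the text.
    # n_bins = nFFT // 2 + 1 frequency bins per channel; sizes are assumptions.
    def __init__(self, n_bins=2049, n_channels=2, hidden=512):
        super().__init__()
        in_size = n_bins * n_channels                  # 4098 for stereo input
        self.encoder = nn.Linear(in_size, hidden, bias=False)
        self.blstm = nn.LSTM(input_size=hidden, hidden_size=hidden // 2,
                             num_layers=3, bidirectional=True, batch_first=True)
        # BLSTM output (hidden) concatenated with the tanh skip connection (hidden)
        self.fc = nn.Linear(hidden * 2, hidden)
        self.decoder = nn.Linear(hidden, in_size)      # symmetric to the encoder
        self.relu = nn.ReLU()

    def forward(self, spec):
        # spec: (batch, time_frames, n_channels * n_bins) magnitude spectrogram segments
        enc = torch.tanh(self.encoder(spec))           # values in (-1, 1)
        lstm_out, _ = self.blstm(enc)                  # bidirectional context per frame
        joined = torch.cat([lstm_out, enc], dim=-1)    # skip connection
        hid = self.relu(self.fc(joined))               # non-negative hidden activations
        mask = self.relu(self.decoder(hid))            # mask in [0, inf); scaling omitted
        return mask * spec                             # estimated vocal spectrogram

net = VocalMaskNet()
dummy = torch.rand(1, 100, 2 * 2049)                   # one 100-frame stereo segment
print(net(dummy).shape)                                # torch.Size([1, 100, 4098])

The chosen sizes (512-dimensional encoder output, LSTM hidden size of 256) mirror the values discussed in the text; everything else in the sketch is an assumption for illustration.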

The output of the LSTM is concatenated with the skip connection from the hyperbolic tangent, similarly to what the literature proposes, increasing the dimensionality. Then another encoder consisting of a fully connected network with an output size of 512 maps the stretched vector to the size of the output decoder (symmetric in size to the input encoder), and a rectified linear unit (ReLU) is applied before it so the output is in the range [0, ∞). After the previously mentioned decoder fully connected network is applied, the output scaling is performed, followed by another ReLU that outputs the vocal mask. This mask is applied to the spectrogram of the mixed song and the vocal track is recovered in the time domain by inverting the spectrogram, either by Wiener filtering [31] as the usual method [8], [9], [29] or by using the fast Griffin-Lim algorithm for an approximate reconstruction [32].

The size of each internal layer is determined by the number of channels of the track, the STFT parameters and the output size of the first NN, which is set to 512 based on the state-of-the-art implementations mentioned above. Half of this value is the hidden layer size of the LSTM. Therefore all layer sizes can be computed from the dimension of the first NN output (512) and the size of the STFT, nFFT = 4096, as it determines the input and output layer sizes to be ⌊nFFT/2⌋ + 1 = 2049 in the case of mono tracks and double that value, 4098, for stereo songs. The second parameter of the STFT is the distance between neighboring sliding window frames, which is set to 1024.

The loss function selected is called the smooth L1 loss or Huber loss, and it is a modified version of the L1 loss used in [26] and [27], which is simply the mean absolute error (MAE) or 1-norm between each element of the input x and the target output y. This modification of the MAE is less sensitive to outliers and prevents exploding gradients [33]. The smooth L1 loss is computed by expression (2).

\mathrm{loss} = \frac{1}{n} \sum_{i=1}^{n} z_i, \qquad z_i = \begin{cases} \frac{1}{2}(x_i - y_i)^2 & \text{if } |x_i - y_i| < 1 \\ |x_i - y_i| - \frac{1}{2} & \text{otherwise} \end{cases}    (2)

V. CASE STUDY

The largest public dataset for music source separation, MUSDB [34], has been requested for educational purposes. It is composed of 150 professionally produced, high-quality songs: 10 hours and 5.3 GB. Five audio files are available for each song: the mix and four separated tracks called stems (drums, bass, vocals and other). Only Western music genres are present, with a vast majority of pop/rock songs, along with some hip-hop, rap and metal songs. A recommended split of 100 songs for the training set and 50 for the test set is provided beforehand. As some authors have noticed, this data set is genre-biased towards pop and rock music [35].

The details of the training procedure are very influential on the performance. In this case, the training tracks are cropped into smaller audio excerpts of 8 s and, in order to fit the data of each batch into memory, the number of tracks sampled is set to 32. Following the guidelines of [26] and [27], the adopted optimizer is Adam, a widely used algorithm for first-order gradient-based optimization of stochastic objective functions, which is included in most software packages due to its computational efficiency and low memory requirements [36]. The main parameters of the neural network training are summarized in table I.

TABLE I: Training parameters.
Training samples: 102400
Sample duration: 8 s
Channels: 2
Frequency: 44.1 kHz
Optimizer: Adam
Learning rate: 1e-3
Weight decay: 1e-5

Three widely used metrics to assess separation quality are the Source to Distortion, Source to Artefact and Source to Interference ratios (SDR, SAR, SIR) [28], [37]. In this case the SIR is computed to compare the implemented method with the state-of-the-art techniques. This metric is defined by formula (3).

\mathrm{SIR} = 20 \log_{10} \frac{\| s_{\mathrm{target}} \|}{\| e_{\mathrm{interference}} \|}    (3)

Where the argument of the logarithm is the ratio between the energy of the target signal and the energy of the interference signal. The higher the value of the SIR, the better the performance of the method. The computation is done on the time domain representation of the target signal with respect to the ground truth sources using the package museval [38].

VI. RESULTS

The training of the neural network took 12 hours and the evolution of the loss function during this process is shown in figure 4, with a final value of 0.0028. The results of the performance over the test set show, both from direct listening to the tracks and from the comparison presented in table II, that the performance is below that of the state-of-the-art trained neural networks for music source separation. This is mostly due to the fact that in this case a total of 228 hours of tracks (randomly cropped) were used to train the network, while in the referenced techniques the total track duration (also randomly cropped) is considerably higher, around 150000 hours (at least), obtained by training over several epochs with an augmented data set constructed by taking the stems of different tracks and randomly mixing them, as well as swapping the left/right channels of each instrument, resulting in a training time of a full week [29], [26]. Therefore the reduction in performance is not unexpected, although it shows that the increase of performance with the number of epochs and training time is not linear.
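
As a hedged illustration of the separation and evaluation steps, the sketch below applies an estimated mask to a mixture spectrogram, inverts it back to the time domain with the inverse STFT, and evaluates formula (3) directly. The signals are synthetic placeholders and the direct SIR computation is a simplified stand-in for the full museval evaluation [38].

import numpy as np
from scipy.signal import stft, istft

fs, n_fft, hop = 44100, 4096, 1024
rng = np.random.default_rng(0)

# Placeholder "ground truth" vocal and accompaniment (2 s of synthetic audio).
t = np.arange(0, 2.0, 1.0 / fs)
vocals = np.sin(2 * np.pi * 440 * t)
accomp = 0.5 * rng.standard_normal(t.size)
mix = vocals + accomp

# Spectrograms of the sources and a toy mask (here an ideal ratio mask, for clarity).
_, _, V = stft(vocals, fs, nperseg=n_fft, noverlap=n_fft - hop)
_, _, M = stft(mix, fs, nperseg=n_fft, noverlap=n_fft - hop)
mask = np.abs(V) / (np.abs(M) + 1e-10)

# Apply the mask to the mixture and invert the spectrogram (Wiener filtering or
# fast Griffin-Lim would give a better phase estimate in practice).
_, vocals_est = istft(mask * M, fs, nperseg=n_fft, noverlap=n_fft - hop)

# Formula (3), simplified: treat the whole estimation error as interference.
n = min(vocals.size, vocals_est.size)
interference = vocals_est[:n] - vocals[:n]
sir = 20 * np.log10(np.linalg.norm(vocals[:n]) / (np.linalg.norm(interference) + 1e-10))
print(f"SIR: {sir:.2f} dB")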

Fig. 4: Loss function value during training.

TABLE II: Test results (Signal to Interference Ratio [dB]).
Case study: 9.15
Demucs [24]: 12.26
UMX [29]: 13.33
Spleeter [26]: 15.86

The implemented algorithm is about 3 dB worse than the U-Net of Demucs [27], which is in turn about 3 dB below the performance of the best method, called Spleeter [26]. In view of this fact, the capability gap may be considered acceptable for a first approach to the problem, considering the limitations due to different computing capabilities.

VII. CONCLUSION

The problem of blind music source separation is introduced and the specialised literature is reviewed, showing that the state-of-the-art techniques are able to separate vocals from accompaniment successfully, although for the overlapping frequencies of the components they tend to exhibit low performance. The approach that shows the best compromise between complexity and execution among the reviewed techniques is a neural network encoder-LSTM-decoder based architecture, which has been implemented in Python under the PyTorch framework, and trained and tested on the largest public professional-quality data set, namely MUSDB. The resulting method is capable of separating the vocal track from any given song's digital audio. The test results point out that limitations due to memory and computational effort lead to lower performance than the state-of-the-art techniques. Further improvements may be made in the definition of the training strategy: parallel computing, more training epochs, and different loss functions and training parameters.

Concerning related future work, an immediate next step is to develop or apply some lyrics extraction technique to the isolated vocal track. Furthermore, if the model is further improved to isolate each instrument, then in the author's view the most interesting next step is to build an automatic musical score or sheet generator for the main modern instruments: bass, piano and guitar. There is a huge market for music scores, and the transcription of some pieces that may seem difficult even to the most musically trained human ear becomes trivial and fast through a frequency-domain-based analysis. Of course, issues related to harmonic cancellation and other details should be properly analyzed, but the prospects are promising.

REFERENCES

[1] K. Brown and P. Auslander, Karaoke Idols: Popular Music and the Performance of Identity. Intellect, 2015, ISBN: 9781783204441. [Online]. Available: https://books.google.es/books?id=9I-JCgAAQBAJ.
[2] E. Pollastri, "A pitch tracking system dedicated to process singing voice for music retrieval," in Proceedings. IEEE International Conference on Multimedia and Expo, vol. 1, Aug. 2002, pp. 341–344. DOI: 10.1109/ICME.2002.1035788.
[3] M. Bosi and R. Goldberg, Introduction to Digital Audio Coding and Standards, ser. The Springer International Series in Engineering and Computer Science. Springer US, 2002, ISBN: 9781402073571. [Online]. Available: http://books.google.es/books?id=oHWIRmHpi8YC.
[4] N. Kehtarnavaz, "Chapter 7 - Frequency domain processing," in Digital Signal Processing System Design (Second Edition), N. Kehtarnavaz, Ed., Burlington: Academic Press, 2008, pp. 175–196, ISBN: 978-0-12-374490-6. DOI: 10.1016/B978-0-12-374490-6.00007-6. [Online]. Available: http://www.sciencedirect.com/science/article/pii/B9780123744906000076.
[5] J. Guckert, The use of FFT and MDCT in MP3 audio compression, 2012. [Online]. Available: http://www.math.utah.edu/~gustafso/s2012/2270/web-projects/Guckert-audio-compression-svd-mdct-MP3.pdf.
[6] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1307–1335, Aug. 2018, ISSN: 2329-9304. DOI: 10.1109/TASLP.2018.2825440. [Online]. Available: http://dx.doi.org/10.1109/TASLP.2018.2825440.
[7] J. Salamon, E. Gomez, D. P. W. Ellis, and G. Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, Mar. 2014, ISSN: 1558-0792. DOI: 10.1109/MSP.2013.2271648.

[8] S. Vembu and S. Baumann, "Separation of vocals from polyphonic audio recordings," in Proceedings of the 6th International Conference on Music Information Retrieval, London, UK, Sep. 2005, pp. 337–344. [Online]. Available: http://ismir2005.ismir.net/proceedings/1028.pdf.
[9] C. Puntonet and A. Prieto, Independent Component Analysis and Blind Signal Separation: Fifth International Conference, ICA 2004, Granada, Spain, September 22-24, 2004, Proceedings, ser. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2004, ISBN: 9783540301103. [Online]. Available: https://books.google.es/books?id=84X0BwAAQBAJ.
[10] Z. Rafii and B. Pardo, "A simple music/voice separation method based on the extraction of the repeating musical structure," in 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011, pp. 221–224.
[11] A. Liutkus, D. Fitzgerald, Z. Rafii, B. Pardo, and L. Daudet, "Kernel additive models for source separation," IEEE Transactions on Signal Processing, vol. 62, no. 16, pp. 4298–4310, 2014.
[12] P. Seetharaman, F. Pishdadian, and B. Pardo, "Music/voice separation using the 2D Fourier transform," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2017, pp. 36–40.
[13] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," Trans. Audio, Speech and Lang. Proc., vol. 15, no. 5, pp. 1564–1578, Jul. 2007, ISSN: 1558-7916. DOI: 10.1109/TASL.2007.899291. [Online]. Available: https://doi.org/10.1109/TASL.2007.899291.
[14] E. Vincent, M. G. Jafari, S. A. Abdallah, M. D. Plumbley, and M. E. Davies, "Probabilistic modeling paradigms for audio source separation," in Machine Audition: Principles, Algorithms and Systems, W. Wang, Ed., IGI Global, 2010, pp. 162–185. DOI: 10.4018/978-1-61520-919-4.ch007. [Online]. Available: https://hal.inria.fr/inria-00544016.
[15] G. J. Mysore, P. Smaragdis, and B. Raj, "Non-negative hidden Markov modeling of audio with application to source separation," in Proceedings of the 9th International Conference on Latent Variable Analysis and Signal Separation, ser. LVA/ICA'10, St. Malo, France: Springer-Verlag, 2010, pp. 140–148, ISBN: 364215994X.
[16] P. Smaragdis, B. Raj, and M. Shashanka, "Supervised and semi-supervised separation of sounds from single-channel mixtures," in Proceedings of the 7th International Conference on Independent Component Analysis and Signal Separation, ser. ICA'07, London, UK: Springer-Verlag, 2007, pp. 414–421, ISBN: 3540744932.
[17] F. G. Zafar Rafii and D. Sun, Combining modeling of singing voice and background music for automatic separation of musical mixtures, Nov. 2013. [Online]. Available: https://ccrma.stanford.edu/~gautham/Site/Publications_files/rafii-ismir2013.pdf.
[18] Y. Isik, J. L. Roux, Z. Chen, S. Watanabe, and J. R. Hershey, "Single-channel multi-speaker separation using deep clustering," CoRR, vol. abs/1607.02173, 2016. arXiv: 1607.02173. [Online]. Available: http://arxiv.org/abs/1607.02173.
[19] S. Nie, W. Xue, S. Liang, X. Zhang, and W.-J. Liu, "Joint optimization of recurrent networks exploiting source auto-regression for source separation," Sep. 2015.
[20] E. Grais, G. Roma, A. Simpson, and M. Plumbley, "Combining mask estimates for single channel audio source separation using deep neural networks," Sep. 2016, pp. 3339–3343. DOI: 10.21437/Interspeech.2016-216.
[21] S. Uhlich, F. Giron, and Y. Mitsufuji, "Deep neural network based instrument extraction from music," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 2135–2139.
[22] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 261–265.
[23] A. Défossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, Sing: Symbol-to-instrument neural generator, 2018. arXiv: 1810.09785 [cs.SD].
[24] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," ArXiv, vol. abs/1806.03185, 2018.
[25] S. Mimilakis, E. Cano, J. Abeßer, and G. Schuller, "New sonorities for jazz recordings: Separation and mixing using deep neural networks," Sep. 2016.
[26] R. Hennequin, A. Khlif, F. Voituret, and M. Moussallam, Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models, Late-Breaking/Demo ISMIR 2019, Deezer Research, Nov. 2019.
[27] A. Défossez, N. Usunier, L. Bottou, and F. Bach, Demucs: Deep extractor for music sources with extra unlabeled data remixed, 2019. arXiv: 1909.01174 [cs.SD].
[28] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," Lecture Notes in Computer Science, pp. 293–305, 2018, ISSN: 1611-3349. DOI: 10.1007/978-3-319-93764-9_28. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-93764-9_28.
[29] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - a reference implementation for music source separation," Journal of Open Source Software, 2019. DOI: 10.21105/joss.01667. [Online]. Available: https://doi.org/10.21105/joss.01667.
[30] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, pp. 1735–1780, Dec. 1997. DOI: 10.1162/neco.1997.9.8.1735.

[31] A. Liutkus and F.-R. Stöter, Sigsep/norbert: First official norbert release, version v0.2.0, Jul. 2019. DOI: 10.5281/zenodo.3269749. [Online]. Available: https://doi.org/10.5281/zenodo.3269749.
[32] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, 2013, pp. 1–4.
[33] R. Girshick, "Fast R-CNN," in 2015 IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
[34] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, The MUSDB18 corpus for music separation, Dec. 2017. DOI: 10.5281/zenodo.1117372. [Online]. Available: https://doi.org/10.5281/zenodo.1117372.
[35] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 506–510.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] E. Vincent, R. Gribonval, and C. Fevotte, "Performance measurement in blind audio source separation," Trans. Audio, Speech and Lang. Proc., vol. 14, no. 4, pp. 1462–1469, Jul. 2006, ISSN: 1558-7916. DOI: 10.1109/TSA.2005.858005. [Online]. Available: https://doi.org/10.1109/TSA.2005.858005.
[38] F.-R. Stöter and A. Liutkus, Museval 0.3.0, version v0.3.0, Aug. 2019. DOI: 10.5281/zenodo.3376621. [Online]. Available: https://doi.org/10.5281/zenodo.3376621.

Francisco Javier Cifuentes García received the Bachelor's degree in Energy Engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2018, and ever since he has been doing an internship at CITCEA UPC while pursuing a Master's degree in Industrial Engineering specialized in Electronics. His research interests include power-electronics-dominated power systems, renewable energy technologies and machine learning, among others. He is also a music enthusiast, which is the main motivation for this work.
