Separating the human voice from the music in an audio recording has long been a challenge in signal processing. Numerous so-called artificial intelligence (AI) tools can now do this with varying degrees of accuracy. The task is difficult because of the complexity of music, which involves multiple overlapping sources across the audible frequency spectrum. There is a need to improve the resolution and clarity of systems that can separate a vocal from the instrumental for a wide range of applications, such as post-production remixes of music and singing instruction and rehearsal.
A new method is reported in the International Journal of Reasoning-based Intelligent Systems. The researchers, Maoyuan Yin and Li Pan of the School of Music and Dance at Mudanjiang Normal University in Mudanjiang, China, have, they say, improved upon existing approaches by combining several advanced signal processing techniques. Their starting point is the use of a virtual microphone array. This virtual setup helps them localise the human voice within the overall sound and isolate it from the background.
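To give a feel for how a microphone array, real or virtual, can localise a sound source, here is a minimal delay-and-sum beamforming sketch in Python. It is an illustration under standard assumptions (far-field plane waves and a known microphone geometry), not the authors' implementation; the function name and parameters are invented for the example.

```python
import numpy as np

def localise_delay_and_sum(signals, mic_positions, fs, c=343.0, n_angles=360):
    """Scan candidate directions with a delay-and-sum beamformer and return
    the angle (radians) whose steered output has the highest power.

    signals       : (n_mics, n_samples) array of microphone signals
    mic_positions : (n_mics, 2) array of microphone x/y positions in metres
    fs            : sampling rate in Hz
    c             : speed of sound in m/s
    """
    n_mics, n_samples = signals.shape
    freqs = np.fft.rfftfreq(n_samples, d=1.0 / fs)   # FFT bin frequencies
    spectra = np.fft.rfft(signals, axis=1)           # per-microphone spectra

    angles = np.linspace(0.0, 2.0 * np.pi, n_angles, endpoint=False)
    powers = np.zeros(n_angles)
    for i, theta in enumerate(angles):
        u = np.array([np.cos(theta), np.sin(theta)])  # unit vector towards the source
        advances = mic_positions @ u / c              # plane-wave time advances per mic
        # Undo each microphone's phase advance so a source at theta adds coherently.
        steering = np.exp(-2j * np.pi * freqs[None, :] * advances[:, None])
        beam = (spectra * steering).sum(axis=0)
        powers[i] = np.sum(np.abs(beam) ** 2)
    return angles[np.argmax(powers)]
```

When the candidate angle matches the true source direction, the per-microphone phases cancel and the summed beam is strongest, which is how the scan picks out where the voice is coming from.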
The virtual microphone array creates a spatial representation of the sound, the team explains. To further improve on the results, the team also used near-field and far-field models to simulate the propagation of sound from sources at different distances. This gives them even more precision in localising the vocal within the sound.
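The distinction between the two propagation models can be sketched in a few lines: a far-field (plane-wave) model makes the inter-microphone delays depend only on the arrival direction, while a near-field (spherical-wave) model makes them depend on the actual source position, and hence on distance. The helper functions below are illustrative assumptions, not code from the paper.

```python
import numpy as np

def far_field_delays(mic_positions, theta, c=343.0):
    """Plane-wave (far-field) model: relative delays depend only on the
    arrival direction theta, not on how far away the source is."""
    u = np.array([np.cos(theta), np.sin(theta)])
    return -(mic_positions @ u) / c   # earlier arrival (negative delay) for mics nearer the source

def near_field_delays(mic_positions, source_xy, c=343.0):
    """Spherical-wave (near-field) model: delays follow the true distances
    from the source to each microphone, so distance now matters."""
    dists = np.linalg.norm(mic_positions - np.asarray(source_xy), axis=1)
    return (dists - dists.min()) / c  # delays relative to the closest microphone
```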
Once the voice is accurately located, the system constructs a time-frequency spectrum for both the human voice and the background music. The time-frequency spectrum tracks how the energy of sound signals shifts along the frequency axis over time. The system can then analyse these changes and distinguish between vocal and instrumental, isolating them from one another.
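In practice such a time-frequency spectrum is typically the magnitude of a short-time Fourier transform (STFT), i.e. a spectrogram. A minimal sketch, assuming standard frame and hop sizes (the paper's exact settings are not given here):

```python
import numpy as np
from scipy.signal import stft

def time_frequency_spectrum(x, fs, frame_len=1024, hop=256):
    """STFT magnitude: rows are frequency bins, columns are time frames,
    so the energy of the signal can be tracked over both time and frequency."""
    f, t, Z = stft(x, fs=fs, nperseg=frame_len, noverlap=frame_len - hop)
    return f, t, np.abs(Z)

# Toy example: a 440 Hz tone shows up as a horizontal ridge of energy.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 440 * t)
freqs, times, S = time_frequency_spectrum(x, fs)
print(S.shape)   # (frequency bins, time frames)
```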
The process is further refined by applying a Hamming window function, which reduces spectral leakage and improves the efficiency of the requisite two-dimensional fast Fourier transform (2DFT) processing of the data. This step reduces the number of dimensions of the various extracted sound signals, simplifying the final extraction of the vocal from the music.
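A minimal sketch of that step: window each analysis frame with a Hamming window, form the magnitude spectrogram, then take its two-dimensional FFT. The parameters and the interpretation in the comments are assumptions based on standard 2DFT-based separation pipelines, not the authors' exact implementation.

```python
import numpy as np
from scipy.signal import stft

def spectrogram_2dft(x, fs, frame_len=1024, hop=256):
    """Hamming-window each frame, build the magnitude spectrogram, and
    return its two-dimensional Fourier transform (the '2DFT').
    In this domain, repeating accompaniment tends to concentrate into a
    few strong peaks, which makes the vocal/music split easier."""
    _, _, Z = stft(x, fs=fs, window='hamming',
                   nperseg=frame_len, noverlap=frame_len - hop)
    S = np.abs(Z)             # time-frequency spectrum
    return np.fft.fft2(S)     # 2D Fourier transform of the spectrogram
```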
Test results demonstrate the effectiveness of the new approach, with a localisation error of just 0.50%. For the background music, the feature extraction error is reduced to 0.05%. Overall, the team could reach almost 99% accuracy in separating the vocal from the instrumental. The same approach should also work for isolating a human voice from non-musical background noise, so it could be used to improve automated spoken-word transcription services and help in the development of better hearing aids.
Yin, M. and Pan, L. (2025) ‘Separating voice and background music based on 2DFT transform’, Int. J. Reasoning-based Intelligent Systems, Vol. 17, No. 1, pp.50–57.