Gleefully pitch perfect

A powerful algorithm that can automatically classify different singing voices by vocal characteristics is described in the International Journal of Bio-Inspired Computation. Balachandra Kumaraswamy of the B.M.S. College of Engineering in Bangalore, India, suggests that the development is an important step forward in music technology, allowing a system to quickly and accurately distinguish one voice from another without human intervention.

Everyone’s singing voice is shaped by a range of physiological characteristics: the vocal folds, lung capacity and diaphragm, the shape of the nose and mouth, the tongue and teeth, and more. Add to that the emotional delivery and stylistic choices a singer makes, and each of us sounds unique. It is fairly easy for people to tell singers apart, even when the voice sits within a complex and textured musical arrangement. For machine learning, however, distinguishing voices has remained challenging. Kumaraswamy’s system performs well and could be employed in a wide range of contexts, from music cataloguing, streaming, recommendation, and production to legal purposes such as copyright control.

The new approach takes four steps to distinguish between singers. The first is pre-processing, in which an advanced convolutional neural network (CNN) identifies and isolates the vocals in a complex audio recording, discarding instrumentation and other non-vocal sounds.
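The article does not describe the network at this level of detail, but a common pattern for CNN-based vocal isolation is to predict a soft time-frequency mask over the mixture’s spectrogram. The PyTorch sketch below illustrates that general idea with a toy, untrained network (MaskCNN is our placeholder name, not the paper’s architecture):

```python
# Illustrative only: CNN vocal isolation via spectrogram masking.
# MaskCNN is a toy stand-in, not the architecture from the paper.
import torch
import torch.nn as nn

class MaskCNN(nn.Module):
    """Tiny CNN that predicts a soft vocal mask over a magnitude spectrogram."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, padding=1), nn.Sigmoid(),  # mask in [0, 1]
        )

    def forward(self, spec):  # spec: (batch, 1, freq_bins, time_frames)
        return self.net(spec)

# Toy usage: mask a mixture spectrogram to keep (putative) vocal energy.
mixture = torch.rand(1, 1, 257, 200)   # stand-in magnitude spectrogram
mask = MaskCNN()(mixture)              # soft mask, same shape as input
vocals_only = mask * mixture           # element-wise masking suppresses accompaniment
print(vocals_only.shape)               # torch.Size([1, 1, 257, 200])
```

In a trained system the masked spectrogram would be inverted back to audio; here the network serves only to show the data flow.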

The second step is feature extraction, in which key characteristics of the voice are computed from the isolated vocal track. Metrics such as the zero-crossing rate (ZCR), which measures how often the signal changes sign, capture the character of the singer’s voice.
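The ZCR itself is simple to compute: count how often consecutive samples change sign, then normalise per frame. A minimal NumPy sketch (the frame length and hop size below are arbitrary choices for illustration):

```python
# Zero-crossing rate (ZCR) over short frames of an audio signal.
import numpy as np

def zero_crossing_rate(signal, frame_len=1024, hop=512):
    """Fraction of consecutive-sample sign changes within each frame."""
    rates = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = signal[start:start + frame_len]
        crossings = np.sum(np.abs(np.diff(np.sign(frame))) > 0)
        rates.append(crossings / (frame_len - 1))
    return np.array(rates)

# Sanity check: a 440 Hz tone sampled at 16 kHz crosses zero about 880
# times per second, so the per-sample rate should be near 880/16000 = 0.055.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
print(zero_crossing_rate(tone).mean())
```

Noisy or unvoiced sounds give a high ZCR, while sustained pitched singing gives a low, pitch-related one, which is why the metric helps characterise a voice.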

In the third step, an algorithm identifies the vibration patterns of the notes being sung and from these builds a distribution of the harmonics, mapping the timbre, or texture, of the voice.
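Kumaraswamy’s harmonic spectral envelope extraction is more elaborate than we can reproduce here, but the underlying idea can be shown with a crude sketch: estimate the fundamental frequency of a frame, then read off the spectrum’s magnitude at each integer multiple of it. The resulting amplitude profile across harmonics is one simple fingerprint of timbre (the naive strongest-bin pitch estimate below is an assumption for illustration):

```python
# Crude harmonic amplitude profile for a monophonic frame (illustration only;
# the paper's harmonic spectral envelope method is more sophisticated).
import numpy as np

def harmonic_profile(frame, sr, n_harmonics=8):
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    f0 = freqs[np.argmax(spectrum[1:]) + 1]        # naive f0: strongest non-DC bin
    amps = []
    for k in range(1, n_harmonics + 1):
        bin_k = np.argmin(np.abs(freqs - k * f0))  # nearest bin to the k-th harmonic
        amps.append(spectrum[bin_k])
    amps = np.array(amps)
    return f0, amps / (amps.max() + 1e-12)         # normalised harmonic amplitudes

# Toy usage: a 220 Hz note whose harmonics fall off in amplitude.
sr, n = 16000, 4096
t = np.arange(n) / sr
note = (np.sin(2 * np.pi * 220 * t) + 0.6 * np.sin(2 * np.pi * 440 * t)
        + 0.2 * np.sin(2 * np.pi * 660 * t))
f0, profile = harmonic_profile(note, sr)
print(round(f0), profile.round(2))  # ~220 Hz; profile roughly [1.0, 0.6, 0.2, ...]
```

Two voices singing the same note at the same loudness will still show different harmonic profiles, which is what lets the profile map timbre.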

The final step uses yet more neural networking, in the form of bidirectional gated recurrent units (BI-GRU) and long short-term memory (LSTM) networks, to analyse the vocal data. These two models can process sequences and so reveal the flow of a singer’s performance over time. This last step is key to the success of Kumaraswamy’s approach.
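Again, the precise hybrid wiring is in the paper rather than this article, but the general shape of such a model is easy to sketch: a bidirectional GRU reads the frame-by-frame vocal features in both directions, an LSTM distils the result, and a linear layer scores each candidate singer. All layer sizes below are placeholder assumptions:

```python
# Sketch of a BI-GRU + LSTM sequence classifier (placeholder sizes;
# the paper's hybrid architecture may be wired differently).
import torch
import torch.nn as nn

class SingerClassifier(nn.Module):
    def __init__(self, n_features=40, hidden=64, n_singers=10):
        super().__init__()
        self.bigru = nn.GRU(n_features, hidden, batch_first=True,
                            bidirectional=True)   # reads the sequence both ways
        self.lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_singers)  # one score per candidate singer

    def forward(self, x):              # x: (batch, time_frames, n_features)
        x, _ = self.bigru(x)           # -> (batch, time_frames, 2 * hidden)
        x, _ = self.lstm(x)            # -> (batch, time_frames, hidden)
        return self.head(x[:, -1, :])  # classify from the final time step

# Toy usage: two clips, 100 frames of 40-dim features each -> scores over 10 singers.
features = torch.rand(2, 100, 40)
print(SingerClassifier()(features).shape)  # torch.Size([2, 10])
```

Because both recurrent layers carry state across time, the model can pick up on how a performance unfolds, the temporal flow the article credits for the approach’s success.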

At this point in the system’s development, the neural networks require extensive computational resources and large datasets for training, which might limit scalability for now. However, such issues could be addressed by optimising how the algorithms are applied and how the training data are used.

Kumaraswamy, B. (2024) ‘Improved harmonic spectral envelope extraction for singer classification with hybridised model’, Int. J. Bio-Inspired Computation, Vol. 24, No. 3, pp.150–163.