Automated classification of music genre, sound objects, and speech by machine learning.
A software system, MediaMined, is described for the efficient analysis and classification of auditory signals. This system has been applied to the tasks of musical instrument identification, classifying musical genre, distinguishing between music and speech, and detection of the gender of human speakers. For each of these tasks, the same algorithm is applied, consisting of low-level signal analysis, statistical processing and perceptual modeling for feature extraction, and then supervised learning of sound classes. Given a ground truth dataset of audio examples, textual descriptive classification labels are then produced. Such labels are suitable for use in automating content interpretation (auditioning) and content retrieval, mixing and signal processing. A multidimensional feature vector is calculated from statistical and perceptual processing of low level signal analysis in the spectral and temporal domains. Machine learning techniques such as support vector machines are applied to produce classification labels given a selected taxonomy. The system is evaluated on large annotated ground truth datasets (n > 30000) and demonstrates success rates (F-measures) greater than 70% correct retrieval, depending on the task. Issues arising from labeling and balancing training sets are discussed. The performance of classification of audio using machine learning methods demonstrates the relative contribution of bottom-up signal derived features and data oriented classification processes to human cognition. Such demonstrations then sharpen the question as to the contribution of top-down, expectation based processes in human auditory cognition.
Leigh M. Smith, Stephen T. Pope, Jay Leboeuf and Steve Tjoa
Proceedings of the 12th International Conference on Music Perception and Cognition, page 943, Thessaloniki, Greece, July 2012. ICMPC/ESCOM. (abstract).