Long-time Music Information Retrieval researcher Pedro Cano has a new book out, based on his dissertation: “Content-based Audio Search: From Audio Fingerprinting to Semantic Audio Retrieval”. From the review:
Music search sound engines rely on metadata, mostly human generated, to manage collections of audio assets. Even though time-consuming and error-prone, human labeling is a common practice. Audio content-based methods, algorithms that automatically extract description from audio files, are generally not mature enough to provide the user friendly representation that users demand when interacting with audio content. This dissertation has two parts. In a first part we explore the strengths and limitation of a pure low-level audio description technique: audio fingerprinting. In the second part, we hypothesize that one of the problems that hinders the closing the semantic gap is the lack of intelligence that encodes common sense knowledge and that such a knowledge base is a primary step toward bridging the semantic gap. We present a sound effects retrieval system which leverages both low-level and semantic technologies.
I am partial to Pedro’s goal, in large part because it mirrors much of my own work in the music IR area. Over the past ten years there has been lots of work in the field on metadata and tagging methods: everything from Pandora’s manual expert tagging, to Luis von Ahn-style tagging games, to Last.fm’s community-generated tags. There has also been a lot of work in low-level raw audio processing, using extracted features such as zero crossings and spectral flux to automatically determine the genre of a piece of music. And there has been a fair amount of work in symbolic retrieval that never leaves the music score notation and/or MIDI domain.
But what has been sorely underrepresented in the community is work that ties together low-level signal processing with higher-level, musicological and musically intuitive semantic features and methods. Some of my early work (ISMIR 2002) looked at one particular hybrid approach: finding song variations using symbolic/semantic “guitar chords” extracted from raw audio. But the number of researchers pursuing this kind of hybrid low-level plus semantic matching is small. It’s nice to see more work in this area.