How does SoundHound work?

Shazam, SoundHound, and other music-identification services all work in basically the same way: they have a big database of song information, an algorithm that can quickly extract information from your song sample, and an app that lets you interface with those things. Technically, you don’t even need a smartphone. Shazam was originally usable on old-fashioned flip phones: you recorded a song and texted it to the service. SoundHound has actually gone a few steps further by also letting you sing or hum into its app, which it matches against a user-submitted database of other singing and humming recordings.

In simple terms, the process looks like this:

1. The app’s database holds a massive collection of song “fingerprints,” small pieces of data describing each song’s unique sound patterns.
2. When a user hits the “Record” button, the app listens to the music and creates a fingerprint from the few seconds of audio it hears.
3. This fingerprint is checked against the database of existing fingerprints.
4. If your ten-second fingerprint matches part of a song, you get your (hopefully correct) song result.

If you’re just looking for a surface-level explanation, that’s all you need to know. The really interesting part is how you actually get that fingerprint.

It all starts with a spectrogram, like the one in the graph above, taken from a paper written by one of Shazam’s founders, Avery Wang. A spectrogram is essentially a graph with time on the x-axis (horizontal), frequency on the y-axis (vertical), and amplitude represented by different levels of color intensity. Any sequence of sounds can thus be converted into a spectrogram, and any point on the spectrogram can be assigned a set of coordinates.
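To make those coordinates concrete, here is a minimal sketch of the spectrogram step in Python. This is not Shazam’s or SoundHound’s actual code: the file name `song.wav` is a placeholder, and scipy’s default window settings stand in for whatever the production systems really use.

```python
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("song.wav")  # placeholder file name
if samples.ndim > 1:
    samples = samples.mean(axis=1)        # mix stereo down to mono

# freqs: the y-axis (Hz), times: the x-axis (seconds), and
# amplitudes[i, j]: the intensity at frequency freqs[i] and time times[j]
freqs, times, amplitudes = spectrogram(samples, fs=rate)

# Every point in the sound now has a set of coordinates:
print(f"{len(freqs)} frequency bins x {len(times)} time slices")
print("intensity at (bin 10, slice 5):", amplitudes[10, 5])
```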
If all you needed to do was match a few sounds to each other, you could stop here. If you want to look through a database of millions of songs, though, a full-detail spectrogram has far too many data points to search at any sort of speed. The big breakthrough in music recognition was the realization that you can identify sounds with only a few pieces of data: the peaks, or the most intense parts. Not only does getting rid of most of a song’s lower-energy parts shrink the spectrogram, it also makes the apps less susceptible to identifying dull, consistent background noise as part of the target sounds. Imagine a city skyline: the most identifiable parts are the tops of the buildings, not the middle floors, and they are what you can see from farthest away. In the same way, every second of every song is stripped down to just a few of its most intense data points; everything on the city skyline is removed except the very top.
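One simple way to sketch that “skyline” filter is to keep only the strongest few points in each time slice. The keep-five-per-slice cutoff below is an illustrative assumption, not a documented setting of either app.

```python
import numpy as np

def extract_peaks(amplitudes, per_slice=5):
    """Keep only the most intense (freq_bin, time_slice) points.

    `amplitudes` is a 2-D spectrogram array (frequencies x times), like
    the one computed above; `per_slice` is an illustrative assumption.
    """
    peaks = []
    for t in range(amplitudes.shape[1]):
        column = amplitudes[:, t]                    # one time slice
        strongest = np.argsort(column)[-per_slice:]  # top-N frequency bins
        peaks.extend((int(f), t) for f in strongest)
    return peaks  # the skyline: only the building tops survive
```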
But a bare list of peaks is still not quite efficient enough to be immediately searchable, so the next step is to “hash” this sequence of peaks. Hashing simply takes a set of inputs, runs them through an algorithm, and assigns them an integer output. In this case, the hash is generated by taking two of the high-intensity peaks, measuring the time between them, and adding their two frequencies together. The result is a string of numbers, easily storable and searchable, and when a computer reads these hashes, it recognizes them as representing frequencies and time-distances. Once all the peaks in the song have been identified and hashed, the transformation is complete: the song is now a set of 32-bit numbers that serves as its ID in the database. More importantly, every second of the song is represented by those numbers.
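Taken literally, that recipe (two peaks, the time between them, their summed frequencies) can be packed into a single 32-bit integer. The bit layout below, with the frequency sum in the high 16 bits and the time gap in the low 16, is an assumption made for illustration, not the apps’ documented format.

```python
def peak_pair_hash(peak_a, peak_b):
    """Toy 32-bit hash built from two high-intensity peaks.

    Each peak is a (frequency_bin, time_slice) pair, as produced by
    extract_peaks() above. The bit layout is an illustrative assumption.
    """
    freq_a, time_a = peak_a
    freq_b, time_b = peak_b
    freq_sum = (freq_a + freq_b) & 0xFFFF     # add the two frequencies
    time_gap = abs(time_b - time_a) & 0xFFFF  # measure the time between them
    return (freq_sum << 16) | time_gap        # one 32-bit integer

# Example: peak_pair_hash((830, 12), (1250, 19)) -> 136314887
```

Because identical inputs always hash to identical integers, matching a ten-second clip against millions of songs becomes a series of fast integer lookups rather than a comparison of raw audio.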