The Mythical Shazam

ryanmcg86
Mar 9, 2022

When I first got interested in Computer Science, I could reason out how most technologies probably worked. The idea of a tree structure is fairly intuitive when you've used Microsoft Windows and browsed through its folders. I didn't know how to code, but I had a reasonable understanding of at least a few data structures. But one app completely blew my mind whenever I tried to understand how it functioned the way it did. That app was Shazam.

When I use something like IMDB, I type a famous person's name into the search bar, and they pop right up, with a list of every movie they've ever acted in. You can then click on any of those movies or TV shows and get a list of every actor involved with that project. This is an impressive bit of technology, but the how of it never seemed too wild to me. Very likely, there is a database somewhere, and since you're typing in the name, some matching process finds what you typed with relative ease. Some sort of relationship then connects that actor to all of their projects, which are displayed when you hit enter on your search. I didn't know all the details, but I had a basic understanding of what was likely happening behind the scenes.

With Shazam, I had no concept of what was going on, and as a result, it always seemed like magic. Think about it for a second through the lens of someone who doesn't understand anything about Computer Science. You hear a song playing, likely in a place with a fair amount of ambient sound, and you open up Shazam and hit listen. It doesn't matter what part of the song it is. It just has to get any 15 second clip of the song, and within another couple of seconds, it's telling you the title, album, and artist of that song. Incredible. The only thing I could imagine was that a database exists with every single song ever, and not just every single song, but every 15 second increment of every single song ever. When you send it a sound file, it somehow filters out the ambient sound, compares that sound clip to its entire database, finds a result, and returns it to you so quickly that the song you're listening to is very likely still playing.

One of the things that made me so interested in Computer Science was wanting to understand the secret sauce that went into making something like Shazam possible. It wasn't until I took Data Structures during my 3rd semester at Nassau Community College that I finally found my answer. It turns out that the creators of the Shazam app actually wrote a fairly succinct paper that explains exactly how they are capable of getting such a quick answer to such a seemingly difficult problem.

The first major step was the development of a 'fingerprinting' process. When a song is put through this process, it outputs a hash token that differs from the token of every other song, but if the same song is put through the process more than once, it always returns the same token. Like a fingerprint, each song's token is unique, and importantly, the process is repeatable. On the back end, they run through their roughly two million song database, applying this process to each song and storing the resulting hash token. This part very likely took quite some time, but as an end user, you never experience that cost. Which part of the song you capture doesn't matter so much when creating a hash, because of the repetitive nature of most music. Most songs have a consistent rhythm and a repetitive melody, as well as a chorus that repeats. If a song doesn't do any of that, it is even more distinctive, and even easier to create a unique hash token for in a short time span.
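To make the "same input, same token" idea concrete, here is a minimal Python sketch. It is a toy and not Shazam's actual algorithm: the fingerprint function, the 16-character token length, and the lists of numbers standing in for audio features are all assumptions I made up for illustration.

```python
import hashlib


def fingerprint(features):
    """Toy fingerprint: hash a list of integers into a short hex token.

    The integers are a hypothetical stand-in for whatever features
    a real system would extract from the audio.
    """
    data = ",".join(str(value) for value in features).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:16]


# Made-up "songs", written as short sequences of frequencies in Hz.
song_a = [440, 494, 523, 440, 494, 523]
song_b = [330, 392, 440, 330, 392, 440]

token_a_first = fingerprint(song_a)
token_a_second = fingerprint(song_a)  # same song, processed again
token_b = fingerprint(song_b)

print(token_a_first == token_a_second)  # repeatable: same song, same token
print(token_a_first == token_b)         # different songs, different tokens
```

The important property is that the function has no randomness in it, so running it on the same song twice can never give two different tokens.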

The music is filtered for noise pollution through a complicated spectrogram. I'm not going to pretend I fully understand their process, because I haven't studied it enough, but the gist of it is that they were able to tell music apart from ambient sound based on the higher energy content in music than in outside sounds. Because of that, they could strip away anything that wasn't classified as music.
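As a very rough sketch of the "keep the high-energy parts" idea, and only that, the Python below filters a list of made-up spectrogram points by energy. The keep_strong_points function, the 0.5 threshold, and the sample values are all hypothetical; the real process described in the paper is much more involved.

```python
def keep_strong_points(points, min_energy):
    """Keep only the (time, frequency) pairs whose energy meets the threshold.

    Each point is a (time, frequency_hz, energy) tuple. The values and the
    threshold are invented for illustration.
    """
    return [(t, freq) for (t, freq, energy) in points if energy >= min_energy]


# Hypothetical spectrogram points: loud musical notes mixed with quiet noise.
points = [
    (0, 440, 0.9),   # loud note
    (0, 1200, 0.1),  # quiet background noise
    (1, 494, 0.8),   # loud note
    (1, 60, 0.2),    # quiet hum
]

strong = keep_strong_points(points, min_energy=0.5)
print(strong)  # [(0, 440), (1, 494)]
```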

In terms of data structures, the real secret sauce here was the use of a hash table. Essentially, because every song is pre-loaded into the back-end database with a unique hash token, all that happens on the user's end is that the application listens to a song, builds a hash token, searches the database for that token, and returns the value associated with it if it exists in the database. This works nearly instantly, regardless of the size of the database, because a hash table knows exactly where the data it's looking for is stored based on the given hash. In computer science speak, the lookup time is O(1) on average. It's as if you had a magic door, where what was behind the door changed based on the key you used to open it. In this analogy, the key you have is the question, and whatever is behind the door will give you the answer.
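Here is a minimal Python sketch of that lookup, using a dict as the hash table. Everything in it is an assumption for illustration: the toy fingerprint function, the made-up catalog of two songs, and the identify helper are mine, not anything from the Shazam paper.

```python
import hashlib


def fingerprint(features):
    """Toy fingerprint (hypothetical): hash a list of integers to a hex token."""
    data = ",".join(str(value) for value in features).encode("utf-8")
    return hashlib.sha256(data).hexdigest()[:16]


# Made-up catalog: song title -> made-up audio features.
catalog = {
    "Song A": [440, 494, 523],
    "Song B": [330, 392, 440],
}

# Done once, ahead of time, on the back end: token -> song title.
index = {fingerprint(features): title for title, features in catalog.items()}


def identify(features):
    """Fingerprint what was heard and look it up; None means no match."""
    return index.get(fingerprint(features))


print(identify([440, 494, 523]))  # Song A
print(identify([100, 200, 300]))  # None
```

The dict lookup takes about the same time whether the index holds two songs or two million, which is the magic door from the analogy above: the token is the key, and the title is what's behind the door.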

The true answer is a lot more technical than that, and I highly recommend reading the paper about it here. The point is that by understanding how hash tables work, you can gain a basic understanding of what exactly was accomplished in the design of the Shazam app. Even knowing how it works, it still feels like magic to me.
