Visit my blog for further content. jmtirado.net

Introduction

The Free Music Archive (FMA) offers free access to openly licensed music. This makes the FMA a very good dataset for music analysis: it provides not only metadata about the track, artist, album, etc., but also the music files themselves. The project was originally described in a paper presented at ISMIR in 2017. Additionally, the authors maintain a GitHub repo with plenty of details and Jupyter notebooks explaining how to process the data.

We are going to use PySpark from a data analyst's perspective to explore this dataset. The aforementioned GitHub repo shows how to process the dataset using pandas. Remember that Spark now supports the pandas API.

This notebook shows how to manipulate a real dataset using PySpark to answer the following question:

Does a song with a long title have a longer duration?

Disclaimer

This notebook does not claim to be scientific work. It is only an example of how to use PySpark for data analysis and should therefore be read as a tutorial.

The dataset

The FMA dataset contains metadata and features in a single zipped file fma_metadata.zip (342 MiB). A brief description of the dataset:

  • tracks.csv: per track metadata such as ID, title, artist, genres, tags and play counts, for all 106,574 tracks.
  • genres.csv: all 163 genres with name and parent (used to infer the genre hierarchy and top-level genres).
  • features.csv: common features extracted with librosa.
  • echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.

Run the code below to download and uncompress this dataset into a temporary folder. It may take a while.
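A minimal sketch of such a download cell, using only the standard library. The function name and its parameters are illustrative, not the notebook's own, and the archive URL is deliberately left as an argument: take the actual link from the FMA GitHub repo.

```python
import os
import urllib.request
import zipfile


def download_and_unzip(url, zip_path, out_dir):
    """Download the archive at `url` to `zip_path` and extract it into `out_dir`."""
    print(f"Download to {zip_path}...")
    urllib.request.urlretrieve(url, zip_path)
    print("Done")

    print("Unzip...")
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(out_dir)
    print("Done")

    # Return the extracted entries so they can be listed afterwards.
    return sorted(os.listdir(out_dir))
```

Called with the metadata archive link, a path such as /tmp/fma_metadata.zip and an output folder, it returns the names of the extracted entries.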

Download fma_metadata to /tmp/fma_metadata.zip...
Done
Unzip...
Done
genres.csv
raw_albums.csv
checksums
not_found.pickle
README.txt
raw_artists.csv
raw_genres.csv
raw_tracks.csv
raw_echonest.csv
tracks.csv
echonest.csv
features.csv

Now we can read the files.

For our purposes, the raw_tracks.csv file is enough.

Brief data analysis and cleaning

Like in any dataset, we should take a first look at the content and prepare it for our purposes. This will probably require some cleaning.

Let's take a look at the number of tracks we have in this dataset.

There is a total of 117393 tracks

Interestingly, this does not match the number given in the paper (106574 tracks). Let's remove repeated entries, if any.

After removing duplicated entries, we have 115092 tracks

We have already removed some entries, but we can do better. We are not interested in any track without genres.

This could be a bit tricky. The genres column contains a JSON array stored as a string. If no genres are set, the array is empty. Luckily, the json_array_length function returns the length of a JSON array, so we can filter by the array length.

106747

OK. We removed a good number of entries.

What about tracks with no duration? Let's filter them out. Interestingly, track_duration is a string with the format mm:ss (or HH:mm:ss for tracks of an hour or more). This is not the best format for us. For now, we simply compare it with '00:00' and remove the matching tracks.

106729

After removing tracks with no duration we have 106729 entries. We said that we want to understand the relationship between a track's duration and its title. This means that we have to discard songs with an unknown title. We can do this with the isnotnull function from the pyspark.sql.functions package.

After removing tracks with unknown title we have 105771 tracks

Finally, we can get rid of the columns we are not going to use during our analysis.

+--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|track_id|track_duration|track_title                                   |track_genres                                                                                                                                                                                                        |
+--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|177     |02:09         |Petty Hate Machine                            |[{'genre_id': '25', 'genre_title': 'Punk', 'genre_url': 'http://freemusicarchive.org/genre/Punk/'}]                                                                                                                 |
|497     |00:59         |Appear To Be                                  |[{'genre_id': '27', 'genre_title': 'Lo-Fi', 'genre_url': 'http://freemusicarchive.org/genre/Lo-fi/'}, {'genre_id': '66', 'genre_title': 'Indie-Rock', 'genre_url': 'http://freemusicarchive.org/genre/Indie-Rock/'}]|
|625     |03:54         |Climbing To The Top                           |[{'genre_id': '17', 'genre_title': 'Folk', 'genre_url': 'http://freemusicarchive.org/genre/Folk/'}]                                                                                                                 |
|1679    |00:36         |40 Seconds After Albany - A Bridge Called Hate|[{'genre_id': '15', 'genre_title': 'Electronic', 'genre_url': 'http://freemusicarchive.org/genre/Electronic/'}]                                                                                                     |
|1797    |01:29         |Jaws Drop Baby Side                           |[{'genre_id': '12', 'genre_title': 'Rock', 'genre_url': 'http://freemusicarchive.org/genre/Rock/'}]                                                                                                                 |
+--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 5 rows

Track duration and title length

We are trying to understand whether the length of a track's title has an impact on its duration. For this we need the duration and the title length. We have already done some cleaning and preparation. Unfortunately, track_duration is a string in mm:ss or HH:mm:ss format, where something like a number of seconds would be more useful. For the title length, it is enough to count the number of characters. Let's see how we can compute both features.

Compute track duration

We currently have the track duration expressed as a time string: '00:10' for durations shorter than one hour, or '01:12:45' for longer ones. This format is not especially handy; a value in seconds would be more useful. To compute it, we define a udf that processes track_duration for every row. Basically, we split the string into its elements (hours, minutes, and seconds) and compute the number of seconds.
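The conversion itself is plain Python. A minimal sketch of a function that could serve as the body of such a udf (the name duration_to_seconds is an assumption, not necessarily the one used in the notebook):

```python
def duration_to_seconds(duration):
    """Convert a 'mm:ss' or 'HH:mm:ss' string into a number of seconds."""
    seconds = 0
    # Read the fields left to right: every new field multiplies
    # what has been accumulated so far by 60.
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(duration_to_seconds("02:09"))     # 129
print(duration_to_seconds("01:12:45"))  # 4365
```

Because the loop does not care how many fields there are, the same function handles both the short and the long format.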

+--------------+-------------------+
|track_duration|track_duration_secs|
+--------------+-------------------+
|02:09         |129                |
|00:59         |59                 |
|03:54         |234                |
|00:36         |36                 |
|01:29         |89                 |
+--------------+-------------------+
only showing top 5 rows

There is something important to consider here. The type of the returned column is not an int.

StringType()

The new column is a string, like the initial track_duration column, because a udf returns StringType unless a return type is declared. We were looking for an integer, so we cast the column type.

IntegerType()
+--------------+-------------------+
|track_duration|track_duration_secs|
+--------------+-------------------+
|02:09         |129                |
|00:59         |59                 |
|03:54         |234                |
|00:36         |36                 |
|01:29         |89                 |
+--------------+-------------------+
only showing top 5 rows

Compute the title length

For the title length, we can use one of the functions available in the pyspark.sql.functions package. The length function will return the length of any string. We add the track_title_length column with the corresponding value.

+----------------------------------------------+------------------+
|track_title                                   |track_title_length|
+----------------------------------------------+------------------+
|Petty Hate Machine                            |18                |
|Appear To Be                                  |12                |
|Climbing To The Top                           |19                |
|40 Seconds After Albany - A Bridge Called Hate|46                |
|Jaws Drop Baby Side                           |19                |
+----------------------------------------------+------------------+
only showing top 5 rows

That was easy.

Analysis of track duration and title length

Let's do some exploratory analysis of the track duration and title length to see what we have here.

Track duration

For a better understanding of the track duration, let's take a look at the data distribution with a histogram. For this purpose, we are going to use the histogram_numeric function. This could also be done with pyplot or pandas. However, for large datasets that would mean bringing the data to a single machine, and we would not take advantage of Spark's distributed computation.

The histogram_numeric function returns, for each bin, its position (x) and the number of occurrences (y).

[Row(histogram_numeric(track_duration_secs, 20)=[Row(x=81, y=95.0), Row(x=184, y=81990.0), Row(x=387, y=15996.0), Row(x=601, y=3663.0), Row(x=807, y=1328.0), Row(x=1005, y=711.0), Row(x=1228, y=586.0), Row(x=1465, y=421.0), Row(x=1755, y=339.0), Row(x=1998, y=148.0), Row(x=2227, y=114.0), Row(x=2466, y=79.0), Row(x=2686, y=50.0), Row(x=2864, y=35.0), Row(x=3078, y=39.0), Row(x=3345, y=32.0), Row(x=3661, y=132.0), Row(x=7346, y=2.0), Row(x=11023, y=2.0), Row(x=18333, y=9.0)])]

Now we will extract the bins with their corresponding counts and use the hist function from pyplot.

Well, we have a long-tailed data distribution. What are the quantiles?

Track duration Quantiles for 10: 91.0, 50: 217.0, 90: 455.0, and  99: 1553.0

99% of our observations are shorter than 1553 seconds. We discard everything longer than that.

This looks better now. However, we cannot forget that we have a number of outliers, and we may need a deeper understanding of what is going on here. This is an exercise we are not going to do for now.

Title length

We are going to run the same analysis we did in the previous section for the track duration. Since we are repeating the operation, we had better put it into a function.

We have another long-tailed distribution. Let's check quantiles.

Track title length Quantiles for 10: 6.0, 50: 14.0, 90: 32.0, and  99: 60.0

99% of the titles have fewer than 60 characters. We can ignore anything longer than that.

Is the song title relevant for the track duration?

In the previous section we conducted a basic exploratory analysis of track_duration_secs and track_title_length. Now the idea is to see whether there is a correlation between the two variables. Again, our hypothesis is that songs with longer titles tend to have longer durations.

First, we plot the duration against the title length.

At first sight, I would not say that there is any linear dependency between the two variables. Let's remove the observations beyond the 99th percentile to get a better view.

Again, nothing clear from a visual point of view. Now the moment of truth: what about the correlation value?

The correlation between duration and title length is 0.08265139112077975
The correlation below the 99 percentile is 0.06314520555385048

The correlation is very low in both cases. We cannot claim that there is a correlation between the two variables, so our hypothesis could not be confirmed. The good thing is that we now know how to test this kind of hypothesis using PySpark and some statistics :)
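For reference, PySpark's corr computes the Pearson correlation coefficient. Assuming that is the coefficient reported above, the plain-Python sketch below spells out the calculation on the five sample rows shown earlier. The function name is illustrative, and five rows are far too few to say anything about the dataset.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Title lengths and durations (in seconds) of the five sample tracks.
title_lengths = [18, 12, 19, 46, 19]
durations = [129, 59, 234, 36, 89]
print(pearson(title_lengths, durations))
```

The result always lies between -1 and 1. Values near 0, like the ones obtained for the whole dataset, mean that a straight line explains almost none of the variation.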

Genres

We have shown that there is not a clear relation between the title length and the track duration. However, our previous analysis was a bit naive and ignored the fact that music genres can be difficult to compare. What if we repeat our analysis segregating by music genres?

First, we need to extract the genre from the track_genres column. This column contains a JSON array with the list of genres a song belongs to. A song can belong to several genres. For the sake of simplicity, we are going to extract only the id and name of the first music genre and add them as columns. For this, we use the get_json_object function, which interprets JSON paths and returns the corresponding value.

+----------------------------------------------+-----------+--------+
|track_title                                   |genre_title|genre_id|
+----------------------------------------------+-----------+--------+
|Petty Hate Machine                            |Punk       |25      |
|Appear To Be                                  |Lo-Fi      |27      |
|Climbing To The Top                           |Folk       |17      |
|40 Seconds After Albany - A Bridge Called Hate|Electronic |15      |
|Jaws Drop Baby Side                           |Rock       |12      |
+----------------------------------------------+-----------+--------+
only showing top 5 rows

Let's make a first attempt to see the correlation for a particular genre. Let's say "Rock".

0.14907336659230266

This is interesting. With this correlation value we cannot claim that title length and track duration are related, but the value is higher than the one we found for the entire dataset.

We can compute the correlation between track_duration_secs and track_title_length for every genre and see what we have.

+--------------------+-------------------+-----+
|         genre_title|               corr|occur|
+--------------------+-------------------+-----+
|          Space-Rock|  0.678568121316952|    7|
|               Tango| 0.6329100564262486|    7|
|           Chill-out| 0.6033554398842059|   61|
|               Radio| 0.6014651648355663|   27|
|              Skweee| 0.5934473444699694|    5|
|              Thrash| 0.5831407529547351|    6|
|         Black-Metal| 0.5827209790337382|   45|
|           Brazilian|   0.55335408763509|   17|
|               Polka|  0.527741890814964|   33|
|        Spoken Weird|   0.50783901267039|  125|
|           Jazz: Out| 0.5057760861825155|   70|
|             No Wave| 0.4710368929166127|  135|
|       Asia-Far East| 0.4026402194110572|   46|
|              Indian|0.38254761413518046|   13|
|      Composed Music|0.37494838797172636|   77|
|              Banter|0.36901299214158506|    9|
|          Rockabilly| 0.3614560030633164|   17|
|         Drum & Bass|0.35126986885972183|   43|
|           Christmas| 0.3485185783716703|    3|
|20th Century Clas...|  0.340679389684775|    8|
+--------------------+-------------------+-----+
only showing top 20 rows

Interestingly, if we compute the correlation for every genre separately, we get higher correlation values. That said, the values are not especially high, and neither is the number of observations. For example, "Space-Rock" has a correlation of 0.68 with only 7 occurrences, while "Chill-out" has 0.60 with 61 observations. Let's sort the results by the number of occurrences per genre instead.

+------------------+--------------------+-----+
|       genre_title|                corr|occur|
+------------------+--------------------+-----+
|        Electronic| 0.09422485149167782|20540|
|       Avant-Garde| 0.11075413682221796| 9073|
|      Experimental| 0.09004022527246255| 6716|
|              Rock| 0.14907336659230266| 6622|
|               Pop| 0.10700131090439918| 6158|
|              Folk|  0.0855429284045298| 4244|
|           Hip-Hop|  0.1322582147786929| 4083|
|              Punk|  0.1391367336446937| 3462|
|             Noise| 0.05587489409976939| 3343|
|        Soundtrack|  0.1463408907615455| 3057|
|             Lo-Fi| 0.10475194414906629| 2907|
|  Experimental Pop| 0.08095682384459792| 1996|
|              Jazz| 0.12437740801815118| 1895|
|         Classical| 0.17043937455621433| 1765|
|Ambient Electronic|-0.01489813168950...| 1669|
|     International| 0.04523395912526749| 1653|
|             Blues|-0.02420189404801...| 1646|
|        Indie-Rock|0.003130712416778...| 1607|
|  Field Recordings| 0.03294717555760772| 1149|
|        Psych-Rock| 0.14658993630838896| 1078|
+------------------+--------------------+-----+
only showing top 20 rows
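To get a feeling for how much the number of tracks matters, we can attach an approximate 95% confidence interval to a correlation value. The sketch below uses the Fisher z-transformation, a standard approximation that assumes roughly bivariate-normal data, which our long-tailed variables do not strictly satisfy. It is an illustration added to the analysis, not part of the original computation, and the function name is an assumption.

```python
import math


def correlation_interval(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation r over n points."""
    z = math.atanh(r)                        # Fisher z-transformation
    half_width = z_crit / math.sqrt(n - 3)   # standard error of z is 1 / sqrt(n - 3)
    return math.tanh(z - half_width), math.tanh(z + half_width)


# Correlation and number of tracks taken from the two tables above.
for genre, r, n in [
    ("Space-Rock", 0.678568121316952, 7),
    ("Rock", 0.14907336659230266, 6622),
]:
    low, high = correlation_interval(r, n)
    print(f"{genre}: r={r:.2f}, n={n}, interval=({low:.2f}, {high:.2f})")
```

With only 7 tracks, the interval for "Space-Rock" runs from slightly below zero to above 0.9, so it is compatible with no correlation at all. The interval for "Rock" is narrow and stays above zero, but around a small value.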

Conclusion

We have explored the FMA dataset to answer the question "does the title length increase the duration of a song?". For this purpose, we have used PySpark and pyplot. Our humble analysis shows that we cannot claim that the title length affects the duration of a track. For the aggregated dataset, we find a low correlation (about 0.08) between track duration and title length. We then repeated the correlation analysis per music genre and found sizeable correlation values (around 0.6) for certain genres. However, the number of tracks in these genres is too small to support any additional finding.

I hope you found this notebook useful.
