Visit my blog for further content. jmtirado.net
Introduction¶
The Free Music Archive (FMA) offers free access to openly licensed music. This makes the FMA a great dataset for music analysis: it provides not only metadata about the track, artist, album, etc. but also the music file itself. The project was originally described in a paper presented at ISMIR 2017. Additionally, the authors maintain a GitHub repo with plenty of details and Jupyter notebooks explaining how to process the data.
We are going to use PySpark from a data analyst's perspective to explore this dataset. In the aforementioned GitHub repo, you can find how to process this dataset using pandas. Remember that Spark now supports the pandas API.
This notebook shows how to manipulate a real dataset using PySpark to answer the following question:
Does a song with a long title have a longer duration?
Disclaimer¶
This notebook is not intended to be a scientific work. It is only an example of how to use PySpark for data analysis, and therefore it should be read as a tutorial.
The dataset¶
The FMA dataset contains metadata and features in a single zipped file fma_metadata.zip (342 MiB). A brief description of the dataset:
- tracks.csv: per track metadata such as ID, title, artist, genres, tags and play counts, for all 106,574 tracks.
- genres.csv: all 163 genres with name and parent (used to infer the genre hierarchy and top-level genres).
- features.csv: common features extracted with librosa.
- echonest.csv: audio features provided by Echonest (now Spotify) for a subset of 13,129 tracks.
Run the code below to download and uncompress this dataset into a temporary folder. It may take a while.
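The download cell itself is not reproduced here, so the following is only a minimal sketch of what it could look like using the Python standard library. The URL is an assumption taken from the FMA GitHub README, and the helper name download_and_unzip is hypothetical.

```python
import urllib.request
import zipfile
from pathlib import Path
from typing import List

# Assumed download location (from the FMA GitHub README, not from this notebook).
FMA_METADATA_URL = "https://os.unil.cloud.switch.ch/fma/fma_metadata.zip"


def download_and_unzip(url: str, dest_dir: str) -> List[str]:
    """Download a zip archive into dest_dir, extract it there, and return its member names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    zip_path = dest / "fma_metadata.zip"
    # urlretrieve streams the remote file to disk instead of holding it in memory.
    urllib.request.urlretrieve(url, str(zip_path))
    with zipfile.ZipFile(zip_path) as archive:
        archive.extractall(dest)
        return archive.namelist()


# Example call (downloads roughly 342 MiB, so it is left commented out):
# download_and_unzip(FMA_METADATA_URL, "/tmp")
```

Keeping the download behind a function makes it easy to point the same code at a local copy of the archive.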
Download fma_metadata to /tmp/fma_metadata.zip... Done
Unzip... Done
genres.csv raw_albums.csv checksums not_found.pickle README.txt raw_artists.csv raw_genres.csv raw_tracks.csv raw_echonest.csv tracks.csv echonest.csv features.csv
Now we can read the files.
For our purposes, the raw_tracks file is enough.
Brief data analysis and cleaning¶
Like in any dataset, we should take a first look at the content and prepare it for our purposes. This will probably require some cleaning.
Let's take a look at the number of tracks we have in this dataset.
There is a total of 117393 tracks
Interestingly, this does not match the number given in the paper (106574 tracks). Let's remove repeated entries, if any.
After removing duplicated entries, we have 115092 tracks
We have already removed some entries, but we can do better. We are not interested in any track without genres.
This could be a bit tricky. The track_genres column contains a JSON array. If no genres are set, the array is empty. Luckily, the json_array_length function returns the length of a JSON array, so we can filter by the array length.
106747
OK. We removed a good number of entries.
What about tracks with no duration? Let's filter them out. Interestingly, track_duration is a string with format mm:ss (or HH:mm:ss for tracks of one hour or longer). This is not the best format for us. For now, we simply compare the value with '00:00' and remove the matching rows.
106729
After removing tracks with no duration, we have 106729 entries. We said that we want to understand the relationship between the track duration and the song's title. This means that we have to discard songs with an unknown title. We can do this with the isnotnull function from the pyspark.sql.functions package.
After removing tracks with unknown title we have 105771 tracks
Finally, we can get rid of the columns we are not going to use during our analysis.
+--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |track_id|track_duration|track_title |track_genres | +--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |177 |02:09 |Petty Hate Machine |[{'genre_id': '25', 'genre_title': 'Punk', 'genre_url': 'http://freemusicarchive.org/genre/Punk/'}] | |497 |00:59 |Appear To Be |[{'genre_id': '27', 'genre_title': 'Lo-Fi', 'genre_url': 'http://freemusicarchive.org/genre/Lo-fi/'}, {'genre_id': '66', 'genre_title': 'Indie-Rock', 'genre_url': 'http://freemusicarchive.org/genre/Indie-Rock/'}]| |625 |03:54 |Climbing To The Top |[{'genre_id': '17', 'genre_title': 'Folk', 'genre_url': 'http://freemusicarchive.org/genre/Folk/'}] | |1679 |00:36 |40 Seconds After Albany - A Bridge Called Hate|[{'genre_id': '15', 'genre_title': 'Electronic', 'genre_url': 'http://freemusicarchive.org/genre/Electronic/'}] | |1797 |01:29 |Jaws Drop Baby Side |[{'genre_id': '12', 'genre_title': 'Rock', 'genre_url': 'http://freemusicarchive.org/genre/Rock/'}] | +--------+--------------+----------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ only showing top 5 rows
Track duration and title length¶
We are trying to understand whether the track's title has an impact on the track duration. For this we need the duration and the title length. We have already done some cleaning and preparation. Unfortunately, track_duration is a string with format mm:ss or HH:mm:ss, where a plain number of seconds would be more useful. For the title length, it is enough to count the number of characters. Let's see how we can compute both features.
Compute track duration¶
We currently have the track duration expressed in a time format: '00:10' for durations shorter than one hour, or '01:12:45' for longer ones. This format is not especially handy. It would be more useful to express this value in seconds, so we are going to compute the equivalent in seconds. For this operation, we define a udf function to process the track_duration of every row. Basically, we split the string into its elements (hours, minutes, and seconds) and compute the number of seconds.
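As a minimal sketch of the conversion logic described above (the original udf cell is not shown, so the function name duration_to_seconds is hypothetical), the string can be split on ':' and accumulated from the most significant part to the least. This handles both the mm:ss and HH:mm:ss variants with the same loop.

```python
def duration_to_seconds(duration: str) -> int:
    """Convert a 'mm:ss' or 'HH:mm:ss' string into a number of seconds."""
    seconds = 0
    # Each new part is 60 times less significant than the previous one:
    # hours -> minutes -> seconds.
    for part in duration.split(":"):
        seconds = seconds * 60 + int(part)
    return seconds


print(duration_to_seconds("02:09"))
print(duration_to_seconds("01:12:45"))
```

In PySpark, a plain function like this one can be registered with pyspark.sql.functions.udf and applied to the track_duration column.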
+--------------+-------------------+
|track_duration|track_duration_secs|
+--------------+-------------------+
|02:09         |129                |
|00:59         |59                 |
|03:54         |234                |
|00:36         |36                 |
|01:29         |89                 |
+--------------+-------------------+
only showing top 5 rows
There is something important to consider here. The type of the returned column is not an int.
StringType()
The new column is still a string, like the initial track_duration column: a udf declared without an explicit return type returns StringType by default. We were looking for an integer, so we cast the column type.
IntegerType()
+--------------+-------------------+
|track_duration|track_duration_secs|
+--------------+-------------------+
|02:09         |129                |
|00:59         |59                 |
|03:54         |234                |
|00:36         |36                 |
|01:29         |89                 |
+--------------+-------------------+
only showing top 5 rows
Compute the title length¶
For the title length, we can use one of the functions available in the pyspark.sql.functions package. The length function returns the length of any string. We add the track_title_length column with the corresponding value.
+----------------------------------------------+------------------+
|track_title                                   |track_title_length|
+----------------------------------------------+------------------+
|Petty Hate Machine                            |18                |
|Appear To Be                                  |12                |
|Climbing To The Top                           |19                |
|40 Seconds After Albany - A Bridge Called Hate|46                |
|Jaws Drop Baby Side                           |19                |
+----------------------------------------------+------------------+
only showing top 5 rows
That was easy.
Analysis of track duration and title length¶
Let's run some exploratory analysis of the track duration and the title length to know what we have here.
Track duration¶
For a better understanding of the track duration, let's take a look at the data distribution with a histogram. For this purpose, we are going to use the histogram_numeric function. This could also be done with the functionality included in pyplot or pandas. However, for large datasets that would mean collecting the data first, and we would not take advantage of Spark's computational performance.
The histogram_numeric function returns the bins and the number of occurrences for each bin.
[Row(histogram_numeric(track_duration_secs, 20)=[Row(x=81, y=95.0), Row(x=184, y=81990.0), Row(x=387, y=15996.0), Row(x=601, y=3663.0), Row(x=807, y=1328.0), Row(x=1005, y=711.0), Row(x=1228, y=586.0), Row(x=1465, y=421.0), Row(x=1755, y=339.0), Row(x=1998, y=148.0), Row(x=2227, y=114.0), Row(x=2466, y=79.0), Row(x=2686, y=50.0), Row(x=2864, y=35.0), Row(x=3078, y=39.0), Row(x=3345, y=32.0), Row(x=3661, y=132.0), Row(x=7346, y=2.0), Row(x=11023, y=2.0), Row(x=18333, y=9.0)])]
Now we will extract the bins with their corresponding counts and use the hist function from pyplot.
Well, we have a long-tailed data distribution. What are the quantiles?
Track duration Quantiles for 10: 91.0, 50: 217.0, 90: 455.0, and 99: 1553.0
99% of our observations are shorter than 1553 seconds. We discard everything longer than that.
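To make the cut-off step concrete, here is a small pure-Python sketch of the idea: compute a percentile and keep only the values at or below it. It uses the nearest-rank definition, which is an assumption made for illustration and may differ slightly from the approximation Spark uses on the full dataset. The function names are hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def drop_above_percentile(values, q=99):
    """Keep only the values at or below the q-th percentile."""
    cutoff = percentile(values, q)
    return [v for v in values if v <= cutoff]


# Toy durations in seconds (made up for illustration): 1, 2, ..., 100.
durations = list(range(1, 101))
trimmed = drop_above_percentile(durations, 99)
print(len(durations), len(trimmed), max(trimmed))
```

With a long-tailed distribution like ours, trimming the top 1% removes the few extreme tracks that would otherwise stretch the histogram axis.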
This looks better now. However, we cannot forget that we have a number of outliers, and we may need a deeper understanding of what is going on with them. This is an exercise we are not going to do for now.
Title length¶
We are going to run the same analysis we did in the previous section for the track duration. Since we are repeating the operation, we had better put it into a function.
We have another long-tailed distribution. Let's check quantiles.
Track title length Quantiles for 10: 6.0, 50: 14.0, 90: 32.0, and 99: 60.0
99% of the titles have fewer than 60 characters. We can ignore anything longer than that.
Is the song title relevant for the track duration?¶
In the previous section we conducted a basic exploratory analysis of track_duration_secs and track_title_length. Now the idea is to see whether there is a correlation between the two variables. Again, our hypothesis is that songs with longer titles have longer durations.
First, we plot the duration against the title length.
At first sight, I would not say that there is any linear dependency between the two variables. Let's remove the observations beyond the 99th percentile to get a better view.
Again, nothing clear from a visual point of view. The moment of truth. What about the correlation value?
The correlation between duration and title length is 0.08265139112077975
The correlation below the 99 percentile is 0.06314520555385048
The correlation is very low (below 0.1 in both cases), so we cannot claim that there is a linear relationship between the two variables. Our hypothesis could not be confirmed. The good news is that we now know how to test this kind of hypothesis using PySpark and some statistics :)
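For readers who want to see what is behind the number, the sketch below computes Pearson's correlation coefficient by hand: the covariance of the two variables divided by the product of their spreads. We assume this is the coefficient reported above, since it is the one Spark's corr function computes. The five data points are the sample rows shown earlier in this notebook, far too few to mean anything; they only illustrate the mechanics.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how the two variables move together around their means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each variable moves on its own.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Title lengths and durations (seconds) of the five sample rows shown earlier.
title_lengths = [18, 12, 19, 46, 19]
durations = [129, 59, 234, 36, 89]
print(pearson(title_lengths, durations))
```

The result always lies between -1 and 1, and values close to 0, like the ones we obtained for the whole dataset, indicate no linear relationship.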
Genres¶
We have shown that there is not a clear relation between the title length and the track duration. However, our previous analysis was a bit naive and ignored the fact that music genres can be difficult to compare. What if we repeat our analysis segregating by music genres?
First, we need to extract the genre from the track_genres column. This column contains a JSON array with the list of genres a song belongs to, and a song can belong to several genres. For the sake of simplicity, we are going to extract only the id and name of the first genre and add them as new columns. For this, we use the get_json_object function, which interprets JSON paths and returns the corresponding value.
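To illustrate what "the first genre" means on a concrete value, here is a pure-Python sketch applied to one of the track_genres strings shown earlier. It uses ast.literal_eval because the displayed value uses single quotes, which Python's strict json module would reject. The helper name first_genre is hypothetical. With get_json_object, the equivalent paths would presumably be '$[0].genre_id' and '$[0].genre_title', which is an assumption since the original cell is not shown.

```python
import ast


def first_genre(track_genres: str):
    """Return (genre_id, genre_title) of the first listed genre, or None for an empty list."""
    genres = ast.literal_eval(track_genres)
    if not genres:
        return None
    return genres[0]["genre_id"], genres[0]["genre_title"]


# Value taken from the sample rows shown earlier (track 497).
sample = (
    "[{'genre_id': '27', 'genre_title': 'Lo-Fi', "
    "'genre_url': 'http://freemusicarchive.org/genre/Lo-fi/'}, "
    "{'genre_id': '66', 'genre_title': 'Indie-Rock', "
    "'genre_url': 'http://freemusicarchive.org/genre/Indie-Rock/'}]"
)
print(first_genre(sample))
```

Note that keeping only the first genre drops information: this track is also tagged as Indie-Rock, which our simplified analysis ignores.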
+----------------------------------------------+-----------+--------+
|track_title                                   |genre_title|genre_id|
+----------------------------------------------+-----------+--------+
|Petty Hate Machine                            |Punk       |25      |
|Appear To Be                                  |Lo-Fi      |27      |
|Climbing To The Top                           |Folk       |17      |
|40 Seconds After Albany - A Bridge Called Hate|Electronic |15      |
|Jaws Drop Baby Side                           |Rock       |12      |
+----------------------------------------------+-----------+--------+
only showing top 5 rows
Let's make a first attempt to see the correlation for a particular genre. Let's say "Rock".
0.14907336659230266
This is interesting. With this correlation value we still cannot claim that title length and track duration are related, but the value is higher than the one we found for the entire dataset.
We can compute the correlation between track_duration_secs and track_title_length for every genre and see what we get.
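The per-genre computation boils down to grouping the rows by genre and computing one correlation and one count per group. The pure-Python sketch below shows that logic on made-up toy rows. The function names and the data are hypothetical, and in PySpark the same result would come from a group-by with aggregations.

```python
import math
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation, or None when undefined (fewer than 2 points or zero variance)."""
    n = len(xs)
    if n < 2:
        return None
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


def correlation_by_genre(rows):
    """rows: iterable of (genre_title, title_length, duration_secs) tuples."""
    groups = defaultdict(list)
    for genre, title_length, duration_secs in rows:
        groups[genre].append((title_length, duration_secs))
    result = {}
    for genre, pairs in groups.items():
        lengths, durations = zip(*pairs)
        # Keep the group size next to the correlation: small groups are unreliable.
        result[genre] = (pearson(lengths, durations), len(pairs))
    return result


# Toy rows, made up for illustration only.
toy_rows = [
    ("Rock", 1, 10), ("Rock", 2, 20), ("Rock", 3, 30),
    ("Folk", 5, 50), ("Folk", 6, 40),
    ("Punk", 4, 100),
]
by_genre = correlation_by_genre(toy_rows)
for genre, (corr, occur) in sorted(by_genre.items(), key=lambda kv: kv[1][1], reverse=True):
    print(genre, corr, occur)
```

Returning the number of occurrences together with the correlation matters: a genre with only a handful of tracks can show an extreme value purely by chance.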
+--------------------+-------------------+-----+
|         genre_title|               corr|occur|
+--------------------+-------------------+-----+
|          Space-Rock|  0.678568121316952|    7|
|               Tango| 0.6329100564262486|    7|
|           Chill-out| 0.6033554398842059|   61|
|               Radio| 0.6014651648355663|   27|
|              Skweee| 0.5934473444699694|    5|
|              Thrash| 0.5831407529547351|    6|
|         Black-Metal| 0.5827209790337382|   45|
|           Brazilian|   0.55335408763509|   17|
|               Polka|  0.527741890814964|   33|
|        Spoken Weird|   0.50783901267039|  125|
|           Jazz: Out| 0.5057760861825155|   70|
|             No Wave| 0.4710368929166127|  135|
|       Asia-Far East| 0.4026402194110572|   46|
|              Indian|0.38254761413518046|   13|
|      Composed Music|0.37494838797172636|   77|
|              Banter|0.36901299214158506|    9|
|          Rockabilly| 0.3614560030633164|   17|
|         Drum & Bass|0.35126986885972183|   43|
|           Christmas| 0.3485185783716703|    3|
|20th Century Clas...|  0.340679389684775|    8|
+--------------------+-------------------+-----+
only showing top 20 rows
Interestingly, if we compute the correlation for every genre separately, we get higher correlation values. That said, the values are not especially high, and neither is the number of observations. For example, "Space-Rock" has a correlation of 0.68 with only 7 occurrences, while "Chill-out" reaches 0.60 with 61 observations. Let's sort the results by the number of occurrences per genre.
+------------------+--------------------+-----+
|       genre_title|                corr|occur|
+------------------+--------------------+-----+
|        Electronic| 0.09422485149167782|20540|
|       Avant-Garde| 0.11075413682221796| 9073|
|      Experimental| 0.09004022527246255| 6716|
|              Rock| 0.14907336659230266| 6622|
|               Pop| 0.10700131090439918| 6158|
|              Folk|  0.0855429284045298| 4244|
|           Hip-Hop|  0.1322582147786929| 4083|
|              Punk|  0.1391367336446937| 3462|
|             Noise| 0.05587489409976939| 3343|
|        Soundtrack|  0.1463408907615455| 3057|
|             Lo-Fi| 0.10475194414906629| 2907|
|  Experimental Pop| 0.08095682384459792| 1996|
|              Jazz| 0.12437740801815118| 1895|
|         Classical| 0.17043937455621433| 1765|
|Ambient Electronic|-0.01489813168950...| 1669|
|     International| 0.04523395912526749| 1653|
|             Blues|-0.02420189404801...| 1646|
|        Indie-Rock|0.003130712416778...| 1607|
|  Field Recordings| 0.03294717555760772| 1149|
|        Psych-Rock| 0.14658993630838896| 1078|
+------------------+--------------------+-----+
only showing top 20 rows
Conclusion¶
We have explored the FMA dataset to answer the question "does the title length increase the duration of a song?". For this purpose, we have used PySpark and pyplot. Our humble analysis has determined that we cannot claim that the title length impacts the duration of a track. For the aggregated dataset, we found a low correlation (about 0.08) between the track duration and the title length. Furthermore, we performed a correlation analysis segregating the dataset by music genre. We found notable correlation values (around 0.6) for certain genres. However, the number of occurrences for these genres is too small to support any additional finding.
I hope you found this notebook useful.