
Covid-19 interactive plots

·9 mins
Data science covid19 jupyter matplotlib plotly python

This is the continuation of my series about how to manipulate Covid-19 data with Python (first entry here). This post adds some extra complexity and introduces some interesting options to define interactive plots.

We explore the generation of plots using Matplotlib and Plotly with Jupyter. First, we generate standard plots that permit a small level of interaction using zoom and scroll inside the plot box. Second, we generate an interactive plot with Plotly and ipywidgets where we can select the number of confirmed cases and fatalities for any country in the dataset. This is a useful example that lets you understand the internals of the process.

As usual, I provide the Jupyter Notebook, which you can find in my GitHub repo. The last example in this notebook requires a running Python backend, so the rendered HTML you find here cannot display it. To make the code easier to access, there is also an interactive Binder image where you can play with the notebook.

I hope you find it useful.



Introduction

Needless to say, the Covid-19 crisis is a global challenge that is going to change how we see the world. There is a lot of interest in understanding how the virus propagates, and several disciplines can be really helpful in this task. A lot of data is available and we have very accessible tools to work with it.

For any data scientist this is a nice opportunity to explore and understand time series, graph theory and other fascinating disciplines. Whether you are a newcomer or an experienced practitioner, I have decided to share a series of Jupyter Notebooks with examples of tools and methods that you may find helpful. I will do my best to make all the code available.

Kaggle has opened a challenge to forecast the propagation of the virus. You can check the challenge with more details at the Kaggle site here. I invite you to check the notebooks uploaded by the community. I have not considered participating in the challenge, but it could be a good opportunity if you plan to start with this kind of challenge.

In this part, I will use Kaggle data to show how we can visualize the virus evolution in different ways. You can download the data (after registration) here. After downloading the zip file with the dataset we have three CSV files:

  • train.csv
  • test.csv
  • submission.csv

For this exercise we will only use the train.csv file.

Assumptions

  • You have an already running Jupyter environment
  • You are familiar with Pandas
  • You have heard about Matplotlib
  • The covid19 files are available in the path covid19-global-forecasting-week-2

Loading a CSV with Pandas

There are several ways to read CSV files in Python. However, Pandas is arguably the most suitable option for many scenarios. We import the pandas library and read the CSV file with all the training data.

import pandas as pd
data = pd.read_csv("covid19-global-forecasting-week-2/train.csv")
data

          Id  Country_Region  Province_State        Date  ConfirmedCases  Fatalities
0          1     Afghanistan             NaN  2020-01-22             0.0         0.0
1          2     Afghanistan             NaN  2020-01-23             0.0         0.0
2          3     Afghanistan             NaN  2020-01-24             0.0         0.0
3          4     Afghanistan             NaN  2020-01-25             0.0         0.0
4          5     Afghanistan             NaN  2020-01-26             0.0         0.0
...      ...             ...             ...         ...             ...         ...
18811  29360        Zimbabwe             NaN  2020-03-21             3.0         0.0
18812  29361        Zimbabwe             NaN  2020-03-22             3.0         0.0
18813  29362        Zimbabwe             NaN  2020-03-23             3.0         1.0
18814  29363        Zimbabwe             NaN  2020-03-24             3.0         1.0
18815  29364        Zimbabwe             NaN  2020-03-25             3.0         1.0

18816 rows × 6 columns
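As a small aside, `read_csv` can also parse the dates while loading. The sketch below is only an illustration: it feeds three rows in the same layout as train.csv through `io.StringIO` so that it runs without the Kaggle files, and the names `csv_text` and `sample_rows` are made up for the example. The rest of this post keeps the Date column as strings.

```python
import io

import pandas as pd

# Three rows in the same layout as train.csv (illustrative excerpt, not the full file)
csv_text = """Id,Country_Region,Province_State,Date,ConfirmedCases,Fatalities
1,Afghanistan,,2020-01-22,0.0,0.0
2,Afghanistan,,2020-01-23,0.0,0.0
29364,Zimbabwe,,2020-03-25,3.0,1.0
"""

# parse_dates converts the Date column to datetime64 values instead of plain strings
sample_rows = pd.read_csv(io.StringIO(csv_text), parse_dates=["Date"])
print(sample_rows.dtypes)
```

Empty Province_State fields are read as NaN, which matches what the full table shows.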

We have a six-column DataFrame with a row Id, the country, state, date, number of confirmed cases and number of fatalities. We are going to focus on one country. Let's say Spain.

spain = data[data['Country_Region']=='Spain']
spain

          Id  Country_Region  Province_State        Date  ConfirmedCases  Fatalities
13376  20901           Spain             NaN  2020-01-22             0.0         0.0
13377  20902           Spain             NaN  2020-01-23             0.0         0.0
13378  20903           Spain             NaN  2020-01-24             0.0         0.0
13379  20904           Spain             NaN  2020-01-25             0.0         0.0
13380  20905           Spain             NaN  2020-01-26             0.0         0.0
...      ...             ...             ...         ...             ...         ...
13435  20960           Spain             NaN  2020-03-21         25374.0      1375.0
13436  20961           Spain             NaN  2020-03-22         28768.0      1772.0
13437  20962           Spain             NaN  2020-03-23         35136.0      2311.0
13438  20963           Spain             NaN  2020-03-24         39885.0      2808.0
13439  20964           Spain             NaN  2020-03-25         49515.0      3647.0

64 rows × 6 columns

We have data for 64 days with no information at a province/state level. Now we would like to have a visual representation of the time series.
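Both observations can also be checked in code rather than by reading the table. The following is a minimal sketch on a stand-in frame called `spain_like` (a hypothetical name, built here so the example runs without the dataset); with the real data the same two expressions would be applied to `spain`.

```python
import pandas as pd

# Illustrative stand-in for the Spain rows: 64 consecutive days, no province information
spain_like = pd.DataFrame({
    "Country_Region": "Spain",
    "Province_State": [float("nan")] * 64,
    "Date": pd.date_range("2020-01-22", periods=64).strftime("%Y-%m-%d"),
})

# Number of distinct days covered by the subset
n_days = spain_like["Date"].nunique()
# True only if at least one row carries a province/state name
has_provinces = spain_like["Province_State"].notna().any()
print(n_days, has_provinces)
```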

Matplotlib

The first solution to be considered is Pyplot from the Matplotlib library.

from matplotlib import pyplot

pyplot.plot(spain.ConfirmedCases)
pyplot.title('Confirmed cases in Spain')
pyplot.show()

[Figure: Confirmed cases in Spain]

The figure above shows the number of confirmed cases in Spain until March 25th, the last date in the dataset. We have not set the X axis, so pyplot uses the row index assigned by Pandas (13376 to 13439 for Spain). To get more reasonable X ticks we simply pass a sequence starting from zero with the same number of items as the Y axis.

pyplot.plot(range(0,spain.ConfirmedCases.size),spain.ConfirmedCases)
pyplot.title('Confirmed cases in Spain')
pyplot.show()

[Figure: Confirmed cases in Spain, X axis numbered from zero]
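An alternative to passing a `range` is to renumber the rows of the subset itself. This is only a sketch of that option, using a three-row toy frame named `subset` (an illustrative name) that keeps row labels like those of the Spain rows:

```python
import pandas as pd

# Toy frame that keeps the original row labels, like the Spain subset
subset = pd.DataFrame(
    {"ConfirmedCases": [25374.0, 28768.0, 35136.0]},
    index=[13435, 13436, 13437],
)

# drop=True discards the old labels instead of adding them as a new column
renumbered = subset.reset_index(drop=True)
print(list(renumbered.index))
```

Plotting `renumbered.ConfirmedCases` would then start the X axis at zero without an explicit `range`.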

Now we have a clearer view of the X axis. However, we would like to have a comparison of the number of fatalities vs the number of confirmed cases.

pyplot.plot(range(0,spain.ConfirmedCases.size),spain.ConfirmedCases,label='ConfirmedCases')
pyplot.plot(range(0,spain.Fatalities.size),spain.Fatalities,label='Fatalities')
pyplot.legend()
pyplot.title('Confirmed cases vs fatalities in Spain')
pyplot.show()

[Figure: Confirmed cases vs fatalities in Spain]

The growth looks exponential. A logarithmic scale gives a better view.

pyplot.plot(range(0,spain.ConfirmedCases.size),spain.ConfirmedCases,label='ConfirmedCases')
pyplot.plot(range(0,spain.Fatalities.size),spain.Fatalities,label='Fatalities')
pyplot.yscale('log')
pyplot.title('Confirmed cases vs fatalities in Spain log scale')
pyplot.legend()
pyplot.show()

[Figure: Confirmed cases vs fatalities in Spain, log scale]
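Why does the logarithmic scale help? A series that grows exponentially becomes a straight line once we take logarithms, because every step multiplies the value by the same factor. The sketch below uses a made-up series that doubles every 3 days (not the Spanish data) to show that the steps on a log10 scale are constant:

```python
import math

# Hypothetical series that doubles every 3 days: y(t) = 10 * 2**(t / 3)
days = range(0, 15, 3)
cases = [10 * 2 ** (t / 3) for t in days]

# On a log10 scale consecutive points are separated by a constant step
log_cases = [math.log10(c) for c in cases]
steps = [b - a for a, b in zip(log_cases, log_cases[1:])]
print(steps)
```

Each step is log10(2), so the points fall on a line. Real data is noisier, but a roughly straight segment in the log plot suggests a roughly constant growth rate over that period.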

What about displaying the date on the X axis? To do that we need pyplot to format the X axis, which requires a datetime value for every observation. We already have them, as strings, in the Date column. The main difference is that we set the formatter for the X axis using mdates from Matplotlib.

import matplotlib.dates as mdates

# convert date strings to datenums
dates = mdates.datestr2num(spain.Date)

pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.plot(dates,spain.ConfirmedCases,label='confirmed')
pyplot.plot(dates,spain.Fatalities,label='fatalities')
pyplot.title('Confirmed cases vs fatalities in Spain with datetime in x axis')
pyplot.legend()
pyplot.gcf().autofmt_xdate()
pyplot.show()

[Figure: Confirmed cases vs fatalities in Spain with datetime in x axis]
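A possible alternative to `datestr2num` is to convert the strings with `pd.to_datetime` and hand the resulting datetime values to Matplotlib. The sketch below assumes this route and uses the last five Spanish observations from the table as a small frame named `recent` (an illustrative name). It uses the object-oriented `fig, ax` interface instead of `pyplot.gca()`.

```python
import pandas as pd
from matplotlib import pyplot
import matplotlib.dates as mdates

# Last five observations for Spain, copied from the table shown earlier
recent = pd.DataFrame({
    "Date": ["2020-03-21", "2020-03-22", "2020-03-23", "2020-03-24", "2020-03-25"],
    "ConfirmedCases": [25374.0, 28768.0, 35136.0, 39885.0, 49515.0],
})

# Convert the date strings to datetime64 values
recent["Date"] = pd.to_datetime(recent["Date"])

fig, ax = pyplot.subplots()
ax.plot(recent["Date"], recent["ConfirmedCases"], label="confirmed")
ax.xaxis.set_major_formatter(mdates.DateFormatter("%Y-%m-%d"))
ax.set_title("Confirmed cases in Spain (last five days)")
ax.legend()
fig.autofmt_xdate()  # rotate the date labels so they do not overlap
```

In a notebook the figure is displayed at the end of the cell; in a script, `pyplot.show()` is needed as in the examples above.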

Seaborn

For those familiar with ggplot, Seaborn will look familiar. Seaborn is built on top of Matplotlib and offers a high-level interface for drawing statistical graphics. It is particularly suitable for use in conjunction with Pandas.

We can replicate some of the plots above:

import seaborn as sns

g = sns.relplot(x=range(spain.Date.size),y='ConfirmedCases', data=spain,kind='line',)
g.set_axis_labels(x_var='') # I remove the xlabel for consistency with the previous plot
pyplot.title('Confirmed cases in Spain')
pyplot.show()

[Figure: Confirmed cases in Spain, drawn with Seaborn]

To set the x axis with datetimes we do the same as we did with Matplotlib. However, now we transform the Date column of the Pandas DataFrame directly, so that Seaborn can use it.

# Transform the Date column to matplotlib datenum
spain.Date = spain.Date.apply(lambda x : mdates.datestr2num(x))
/Users/juan/miniconda3/lib/python3.7/site-packages/pandas/core/generic.py:5303: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[name] = value

After this, the Date column contains datenums that can be used to correctly format the x axis.

(By the way, this operation triggers the warning message shown above. I leave it to you to investigate why this is happening ;) )
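If you want a hint: a likely explanation is that `spain` was created by filtering `data`, so Pandas cannot tell whether the assignment is meant to change only the subset or the original frame as well. One common way to make the intention explicit is to take a copy before modifying it. The sketch below shows the idea on a toy frame called `data_like` (a made-up stand-in for `data`) and uses `pd.to_datetime` instead of `datestr2num` to keep the example short.

```python
import pandas as pd

# Toy stand-in for the full dataset (values taken from the tables in this post)
data_like = pd.DataFrame({
    "Country_Region": ["Spain", "Spain", "Iran"],
    "Date": ["2020-03-24", "2020-03-25", "2020-03-25"],
    "ConfirmedCases": [39885.0, 49515.0, 27017.0],
})

# .copy() makes the subset an independent DataFrame, so assigning to it is unambiguous
spain_copy = data_like[data_like["Country_Region"] == "Spain"].copy()
spain_copy["Date"] = pd.to_datetime(spain_copy["Date"])

print(spain_copy.dtypes)
print(data_like["Date"].tolist())  # the original frame still holds the strings
```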

sns.relplot(x='Date',y='ConfirmedCases', data=spain,kind='line',)
pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.gcf().autofmt_xdate()
pyplot.title('Confirmed cases in Spain with datetime in x axis')
pyplot.show()

[Figure: Confirmed cases in Spain with datetime in x axis]

So far we have replicated the plots we already created using pyplot. Why is Seaborn interesting then? I find Seaborn particularly useful for creating plots where we can easily compare different series. What if we try to compare the evolution of cases in different countries? We are going to select a sample of countries and compare their evolution.

To do that we have to run two operations.

  • First, we keep only the rows whose country is in a list.
  • Second, for some countries the values per day reflect observations per province. We are only interested in the observations per country and day, so we sum the confirmed cases and fatalities for every country on the same day.
# sample of countries to study
chosen = ['Spain', 'Iran', 'Singapore', 'France', 'United Kingdom']

# 1) Keep the rows whose country is in the list  2) group by date and country and sum the numeric columns
# numeric_only=True leaves out the text column Province_State
sample = data[data.Country_Region.isin(chosen)].groupby(['Date','Country_Region'], as_index=False).sum(numeric_only=True)
sample

           Date  Country_Region      Id  ConfirmedCases  Fatalities
0    2020-01-22          France  112510             0.0         0.0
1    2020-01-22            Iran   13601             0.0         0.0
2    2020-01-22       Singapore   20401             0.0         0.0
3    2020-01-22           Spain   20901             0.0         0.0
4    2020-01-22  United Kingdom  198807             0.0         0.0
...         ...             ...     ...             ...         ...
315  2020-03-25          France  113140         25600.0      1333.0
316  2020-03-25            Iran   13664         27017.0      2077.0
317  2020-03-25       Singapore   20464           631.0         2.0
318  2020-03-25           Spain   20964         49515.0      3647.0
319  2020-03-25  United Kingdom  199248          9640.0       466.0

320 rows × 5 columns

# As a sanity check we verify that the previous operation was correct.
# Let's check how many confirmed cases France had on 2020-03-24
france = data[(data.Country_Region=='France') & (data.Date=='2020-03-24')]
print('These are the values for France on 2020-03-24 before running the aggregation')
display(france)
print('Total number of confirmed cases: ', france.ConfirmedCases.sum())
print('And this is the aggregation we obtained')
sample[(sample.Country_Region=='France') & (sample.Date=='2020-03-24')]
These are the values for France on 2020-03-24 before running the aggregation

         Id  Country_Region    Province_State        Date  ConfirmedCases  Fatalities
6974  10863          France     French Guiana  2020-03-24            23.0         0.0
7038  10963          France  French Polynesia  2020-03-24            25.0         0.0
7102  11063          France        Guadeloupe  2020-03-24            62.0         1.0
7166  11163          France        Martinique  2020-03-24            57.0         1.0
7230  11263          France           Mayotte  2020-03-24            36.0         0.0
7294  11363          France     New Caledonia  2020-03-24            10.0         0.0
7358  11463          France           Reunion  2020-03-24            94.0         0.0
7422  11563          France  Saint Barthelemy  2020-03-24             3.0         0.0
7486  11663          France         St Martin  2020-03-24             8.0         0.0
7550  11763          France               NaN  2020-03-24         22304.0      1100.0
Total number of confirmed cases:  22622.0
And this is the aggregation we obtained

           Date  Country_Region      Id  ConfirmedCases  Fatalities
310  2020-03-24          France  113130         22622.0      1102.0
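The same check can be reproduced in isolation. The sketch below rebuilds the ten French rows for 2020-03-24 by hand (the `france_rows` and `totals` names are illustrative) and aggregates them. Here the numeric columns are selected explicitly before calling `sum`, which is one way to keep the Province_State text column out of the aggregation.

```python
import pandas as pd

# Rows for France on 2020-03-24, values copied from the table above
france_rows = pd.DataFrame({
    "Country_Region": ["France"] * 10,
    "Province_State": ["French Guiana", "French Polynesia", "Guadeloupe", "Martinique",
                       "Mayotte", "New Caledonia", "Reunion", "Saint Barthelemy",
                       "St Martin", None],
    "Date": ["2020-03-24"] * 10,
    "ConfirmedCases": [23.0, 25.0, 62.0, 57.0, 36.0, 10.0, 94.0, 3.0, 8.0, 22304.0],
    "Fatalities": [0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1100.0],
})

# One row per (Date, Country_Region) pair, summing only the two numeric columns
totals = (
    france_rows
    .groupby(["Date", "Country_Region"], as_index=False)[["ConfirmedCases", "Fatalities"]]
    .sum()
)
print(totals)
```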

We have manually checked that the values we obtained after aggregation are correct. Now we are going to plot a comparison of these values per country.

# remember to transform the date timestamp
sample.Date = sample.Date.apply(lambda x : mdates.datestr2num(x))
# Confirmed cases
sns.relplot(x='Date',y='ConfirmedCases', col='Country_Region', hue='Country_Region', col_wrap=2, data=sample,kind='line',)
pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.gcf().autofmt_xdate()

[Figure: Confirmed cases over time, one panel per country]

# Fatalities
sns.relplot(x='Date',y='Fatalities', col='Country_Region', hue='Country_Region', col_wrap=2, data=sample,kind='line',)
pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.gcf().autofmt_xdate()

[Figure: Fatalities over time, one panel per country]

Additionally, we can compare all the timelines in the same plot.

sns.relplot(x='Date',y='ConfirmedCases', hue='Country_Region', data=sample,kind='line',)
pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.gcf().autofmt_xdate()

sns.relplot(x='Date',y='Fatalities', hue='Country_Region', data=sample,kind='line',)
pyplot.gca().xaxis.set_major_formatter(mdates.DateFormatter('%Y-%m-%d'))
pyplot.gca().xaxis.set_major_locator(mdates.DayLocator(interval=5))
pyplot.gcf().autofmt_xdate()

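[Figure: Confirmed cases over time, all countries in one plot]

[Figure: Fatalities over time, all countries in one plot]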
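Seaborn works here with long-form data, that is, one row per date and country. If you prefer to inspect the numbers behind the lines as a table with one column per country, `DataFrame.pivot` is one option. This is just a sketch on a four-row excerpt with values taken from the tables above; `long_form` and `wide` are illustrative names.

```python
import pandas as pd

# Four rows in the same long layout as the aggregated sample
long_form = pd.DataFrame({
    "Date": ["2020-03-24", "2020-03-24", "2020-03-25", "2020-03-25"],
    "Country_Region": ["France", "Spain", "France", "Spain"],
    "ConfirmedCases": [22622.0, 39885.0, 25600.0, 49515.0],
})

# One row per date, one column per country
wide = long_form.pivot(index="Date", columns="Country_Region", values="ConfirmedCases")
print(wide)
```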

Conclusions

In this notebook we have shown how to use Matplotlib and Seaborn with Pandas in Python to plot the Covid-19 time series.
