Table of Contents

This is a reproduction of a Jupyter notebook you can find here

Introduction
#

The Covid19 has imposed a lot of pressure on epidemiologists to come up with models that can explain the impact of pandemics. There already is lot of theory developed around this topic you can check out. In particular, understanding SIR models is really useful to understand how researchers approach the problem of virus spreading. In this notebook, I introduce some ideas about how to use Pandas, plots and graphs to explore the evolution of pandemics. I show some tools you can use to simulate networks dynamics. Of course, this cannot be considered a contribution to the state of the art. However, you may find some interesting guidelines for other problems.

The question
#

The main question we ask ourselves is has air passengers something to do with Covid19?. If so, can we use air traffic data to know something else about Covid19 evolution?

It is known that the virus started in China and later propagated around the world. If air traffic has contributed to the spread of the virus, this means that we can somehow leverage a model that explains or predict the evolution of the disease. Furthermore, we could even predict what countries could be potentially infected and/or what flights shall be cancelled.

There is a vast amount of work that can be done in this topic. To make it more accessible we will split it into smaller parts. In this first part, we will explore the data we plan to use and how to work with graphs.

Dataset
#

The OpenFlights dataset contains airlines, air routes and airports per country. The information contained in the dataset contains information up to 2014. This may be a bit outdated. However, it is detailed enough for our purposes.

Additionally, we use the Kraggle Covid19 challenge data as we did in our previous Covid19 series.

import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import datetime

routes_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat'
airports_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
countries_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/countries.dat'


# Load the file with the routes
routes = pd.read_csv(routes_dataset_url, names=['Airline','AirlineID','SourceAirport', 
                                                       'SourceAirportID', 'DestinationAirport', 'DestinationAirportID',
                                                      'CodeShare','Stops','Equipment'], 
                     na_values='\\N')
display(routes)


# Load the file with the airports
airports = pd.read_csv(airports_dataset_url, names=['AirportID','Name','City','Country','IATA','ICAO',
                                                         'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST',
                                                         'TzDatabaseTimeZone', 'Type', 'Source'], index_col='AirportID',
                      na_values='\\N')

display(airports)

# Covid19 data
covid_data = pd.read_csv('covid19-global-forecasting-week-2/train.csv',parse_dates=['Date'], 
                   date_parser=lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))

	Airline	AirlineID	SourceAirport	SourceAirportID	DestinationAirport	DestinationAirportID	CodeShare	Stops	Equipment
0	2B	410.0	AER	2965.0	KZN	2990.0	NaN	0	CR2
1	2B	410.0	ASF	2966.0	KZN	2990.0	NaN	0	CR2
2	2B	410.0	ASF	2966.0	MRV	2962.0	NaN	0	CR2
3	2B	410.0	CEK	2968.0	KZN	2990.0	NaN	0	CR2
4	2B	410.0	CEK	2968.0	OVB	4078.0	NaN	0	CR2
...	...	...	...	...	...	...	...	...	...
67658	ZL	4178.0	WYA	6334.0	ADL	3341.0	NaN	0	SF3
67659	ZM	19016.0	DME	4029.0	FRU	2912.0	NaN	0	734
67660	ZM	19016.0	FRU	2912.0	DME	4029.0	NaN	0	734
67661	ZM	19016.0	FRU	2912.0	OSS	2913.0	NaN	0	734
67662	ZM	19016.0	OSS	2913.0	FRU	2912.0	NaN	0	734

67663 rows × 9 columns

	Name	City	Country	IATA	ICAO	Latitude	Longitude	Altitude	Timezone	DST	TzDatabaseTimeZone	Type	Source
AirportID
1	Goroka Airport	Goroka	Papua New Guinea	GKA	AYGA	-6.081690	145.391998	5282	10.0	U	Pacific/Port_Moresby	airport	OurAirports
2	Madang Airport	Madang	Papua New Guinea	MAG	AYMD	-5.207080	145.789001	20	10.0	U	Pacific/Port_Moresby	airport	OurAirports
3	Mount Hagen Kagamuga Airport	Mount Hagen	Papua New Guinea	HGU	AYMH	-5.826790	144.296005	5388	10.0	U	Pacific/Port_Moresby	airport	OurAirports
4	Nadzab Airport	Nadzab	Papua New Guinea	LAE	AYNZ	-6.569803	146.725977	239	10.0	U	Pacific/Port_Moresby	airport	OurAirports
5	Port Moresby Jacksons International Airport	Port Moresby	Papua New Guinea	POM	AYPY	-9.443380	147.220001	146	10.0	U	Pacific/Port_Moresby	airport	OurAirports
...	...	...	...	...	...	...	...	...	...	...	...	...	...
14106	Rogachyovo Air Base	Belaya	Russia	NaN	ULDA	71.616699	52.478298	272	NaN	NaN	NaN	airport	OurAirports
14107	Ulan-Ude East Airport	Ulan Ude	Russia	NaN	XIUW	51.849998	107.737999	1670	NaN	NaN	NaN	airport	OurAirports
14108	Krechevitsy Air Base	Novgorod	Russia	NaN	ULLK	58.625000	31.385000	85	NaN	NaN	NaN	airport	OurAirports
14109	Desierto de Atacama Airport	Copiapo	Chile	CPO	SCAT	-27.261200	-70.779198	670	NaN	NaN	NaN	airport	OurAirports
14110	Melitopol Air Base	Melitopol	Ukraine	NaN	UKDM	46.880001	35.305000	0	NaN	NaN	NaN	airport	OurAirports

7698 rows × 13 columns

# We compute some intermediate dataframes that will be useful later

# Aggregate the number of flights per airport.
# Source_airport -> destination_airport, number of flights
flights_airport = routes.groupby(['SourceAirportID','DestinationAirportID']).size().reset_index(name='num_flights')

# Get the countries for the previous airport IDs
flights_country=flights_airport.join(airports.Country,on='SourceAirportID',lsuffix='Source').\
                join(airports.Country,on='DestinationAirportID',lsuffix='Destination')[['Country','CountryDestination','num_flights']]

# Aggregate the number of flights between countries
# SourceCountry -> DestinationCountry, number of flights
paths = flights_country.groupby(['Country','CountryDestination']).sum()
# Aggregate the number of flights to other countries.
paths_other_countries = flights_country[flights_country.Country != flights_country.CountryDestination].groupby(['Country','CountryDestination']).sum()

# Total number of flights per country
total_flights_country = paths.groupby(['Country']).sum()
# Total number of flights to other countries
total_flights_other_countries = paths_other_countries.groupby(['Country']).sum()

Air routes visualization
#

The OpenFlights offers information about the flights between airports. Right now we are only interested in the number of flights between countries. This will simplify the visualization of the air traffic around the globle and it is enough to spot airflights hubs in the dataset.

Note: generating the plot may take a while. However, we added a fancy loading bar :)

# Some code to have a nice progress bar :)
from IPython.display import clear_output

def update_progress(progress):
    bar_length = 20
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
    if progress < 0:
        progress = 0
    if progress >= 1:
        progress = 1

    block = int(round(bar_length * progress))
    clear_output(wait = True)
    text = 'Progress: [{0}] {1:.1f}%'.format( '#' * block + '-' * (bar_length - block), progress * 100)
    print(text)

fig = go.Figure()
total_rows = float(len(paths_other_countries.index))
num_row=-1
for index, row in paths_other_countries.iterrows():
    
    num_row = num_row + 1
    update_progress(num_row/total_rows)
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'country names',
            locations = index,
            mode = 'lines',
            hoverinfo= 'skip',
            line = dict(width=1,color='red'),
            opacity = float(row['num_flights'])/float(total_flights_other_countries.loc[index[0]]),
            showlegend=False,
        )
    )

fig.update_layout(
    geo = dict(
        showcountries = True,
    ),
    showlegend = False,
    title = 'Aggregated air routes between countries excluding domestic flights'
)

fig.show(renderer='svg')

Progress: [####################] 100.0%

If the air traffic has something to do with the Covid19 spread then, China’s air traffic requires our attention. The figure below shows the routes between China and other countries excluding domestic flights. The wide of every line is representative of the number of flights between countries.

Excluding domestic flights (which are quite a few), China has connections to 62 countries. Assuming daily flights, this means that an infected passenger can propagate the disease around the globe in a few hours.


aux = paths_other_countries.reset_index()
aux = aux[aux.Country=='China']


fig = go.Figure()

total_rows = float(len(aux.index))
num_row=-1

the_min = aux.num_flights.min()
the_max = aux.num_flights.max()
the_mean = aux.num_flights.mean()
the_std = aux.num_flights.std()
scalator = float(the_max - the_min)

for index, row in aux.iterrows():
    num_row = num_row + 1
    update_progress(num_row/total_rows)
    width = ((float(row['num_flights'])-the_min)/scalator)*15
    #width = np.max([0.5, width])
   
    fig.add_trace(
        go.Scattergeo(
            locationmode = 'country names',
            locations = [row.Country,row.CountryDestination],
            mode = 'lines',
            hoverinfo= 'skip',
            line = dict(width=width,color='red'),
            showlegend=False,
        )
    )
    

fig.update_layout(
    geo = dict(
        showcountries = True,
    ),
    showlegend = False,
    title = 'China aggregated air routes between countries excluding domestic flights'
)

fig.show(renderer='svg')

Progress: [####################] 98.4%

Disseminations with graphs
#

Now that we have a better understanding of our dataset and how does it look, it is time to prepare some artifacts to study the spread of the virus. In this case, we are going to translate our Pandas tables into a graph that represents the network of countries connected by their flights.

We can create a directed graph $G(V,E)$ where the set of vertices $V$ represents the countries and the set of edges $E$ represents existing air routes between countries $A$ and $B$. Additionally, we can set weights $W(E)$ with the number of flights between two countries.

There are some nice libraries to work with graphs in Python. However, I particularly like Graph Tool, maintained by Tiago de Paula Peixoto from the Central European University. It has implementations of some sophisticated algorithms done in C++ with OpenMP.

# Run this to ensure that the drawing module is correctly stablished
from graph_tool.all import *
import graph_tool as gt

g = gt.Graph()

# Create the country property with the country names
countries_prop = g.new_vertex_property('string')
g.vertex_properties['country'] = countries_prop

# Total number of outgoing flights per vertex
out_flights_prop = g.new_vertex_property('int')
g.vertex_properties['out_flights'] = out_flights_prop

# Create edge property with the number of flights between countries
num_flights_prop = g.new_edge_property('double')
g.edge_properties['num_flights'] = num_flights_prop

# Get the complete list of countries
list_countries = np.union1d(paths.reset_index().Country.unique(), paths.reset_index().CountryDestination.unique())

g.add_vertex(len(list_countries))

# For a given country, get the vertex
country_vertex = {}
index=0
#for c in total_flights_country.index:
for c in list_countries:
    v = g.vertex(index)
    countries_prop[v] = c
    country_vertex[c] = v
    # skip self-loops
    try: 
        out_flights_prop[v] = total_flights_other_countries.loc[c]['num_flights']
    except:
        out_flights_prop[v] = 0.0
        
    index=index+1


# Add the edges
for index,num_flights in paths.iterrows():
    s = country_vertex[index[0]]
    d = country_vertex[index[1]]
    # No self-loops
    if s != d:
        e = g.add_edge(s,d)
        num_flights_prop[e] = num_flights

We draw our graph using a radial layout with China in the center. The edges correspond to the neighbors that can be reached in a single step from China. This is, the 62 countries with direct flights from China.

# plot with a radial distribution with China in the center
china_vertex = country_vertex['China']

pos = gt.draw.radial_tree_layout(g,china_vertex)
gv = gt.GraphView(g, efilt=lambda e : e.source()==china_vertex)

p=gt.draw.graph_draw(gv,pos)

Considering a dissemination scenario, we draw the edges for all the countries that can be reached after a scale in a flight from China. As you can see the number is particularly large.

second_step = []
for v in g.get_out_neighbors(china_vertex):
    second_step.append(v)
    
gv2 = gt.GraphView(g, efilt=lambda e : e.target!=china_vertex and e.source() in second_step)
p=gt.draw.graph_draw(gv2,pos)

The graph above is a bit misleading. Not all the vertices will be reached from the root vertex (China) with the same probability. In other words, the frequency of flights between countries will increase the probabilities to transfer an infected passenger from China.

If we consider the weight $W$ of every edge to be the number of outgoing flights following the edge direction, everything becomes more interesting. The figure below changes the width of every edge connecting China with its flight-connected neighbors. A few number of countries (Taiwan, South Korea, Japan) accumulates most of China’s outgoing traffic as the wide arrows show.

Note: in the figure below the $W$ weight is normalized. Otherwise, most of the connections would not be plotted.

aux = gv.edge_properties['num_flights'].copy()

aux.a = (aux.a - aux.a.min())/(aux.a.max()-aux.a.min()) * 10
p=gt.draw.graph_draw(gv,pos,edge_pen_width=aux)

SIR epidemic process
#

The SIR model is a simple, yet powerful, compartimental model for epidemics. The model consists of three compartments $S$, $I$ and $R$. $S$ is the number of susceptible, $I$ the number of infectuous and $R$ the number of recovered or deceased individuals. This is the most common model used right now and has many extensions and enhancements. The number of individuals in $S$ falls down as they become infected ($I$) and then recovered $R$. This is one of the famous infection curves used to forecast Covid19 evolution.

How to determine when a new individual moves between compartments depends on transition rates. Continuing with our initial idea about how air traffic can be an important actor in Covid19 dissemination, we can use our air flight connections graph to run a SIR simulation.

Fortunately, graph tool comes with an implementation of the SIR model easily configurable. We run a SIR model in 63 steps which is the number of days the Covid19 required to expand worldwide in our dataset. For every edge in our graph we compute $\beta_e = \sum_{J \in out(A)} W(A,J)$ that basically sets the transition between vertices as the ratio of flights that connection represent among the total number of outgoing flights in vertex $A$. Additionally, we configure SIR to remove spontaneous infections and reinfections and set China as the dissemination seed.


def runSIR(state,num_iter=100):
    '''
    For a graph tool epidemic model run the state and return a dataframe with the countries infected in every step
    '''
    infections = pd.DataFrame()
    
    previous = state.get_state().fa
    initial = np.where(previous==1)[0]
    
    for t in range(num_iter):
        ret=state.iterate_sync()
        new = state.get_state().fa
        # Find new infected countries
        diff = new - previous
        already_infected = state.get_state().fa.sum()
        non_infected = len(state.get_state().fa)-already_infected
        new_infected = [countries_prop[g.vertex(v)] for v in np.where(diff==1)[0]]
        previous = new.copy()
        # collect the results in a Pandas friendly way
        infections = infections.append([{'step':t,'norm_steps':t/float(num_iter),'infected':already_infected,
                                'new_infected': len(new_infected),'non_infected': non_infected,
                                'new_infected_countries': new_infected}], ignore_index=True)
        
    return infections

# Create an edge property map that can help us to define the beta probability between vertices 
beta_prop = g.new_edge_property('double')
for e in g.edges():
    beta_prop[e] = num_flights_prop[e]/out_flights_prop[e.source()]

# Assume that vertices are not susceptible, except the seed.
susceptible_prop = g.new_vertex_property('float')
for v in g.vertices():
       susceptible_prop[v] = 0.0
susceptible_prop[china_vertex]=1.0

state = gt.dynamics.SIRState(g, beta=beta_prop,v0 = china_vertex, constant_beta=True,gamma=0,
                              r=0,s=susceptible_prop,epsilon=0,)

infections_SIR = runSIR(state,64)
display(infections_SIR[infections_SIR.new_infected>0].head(10))

	step	norm_steps	infected	new_infected	non_infected	new_infected_countries
1	1	0.015625	8	5	217	[Australia, India, Indonesia, Kyrgyzstan, Sout...
2	2	0.031250	13	5	212	[Israel, Russia, Saudi Arabia, Thailand, Unite...
3	3	0.046875	20	7	205	[Egypt, Hong Kong, Malaysia, Taiwan, Turkey, U...
4	4	0.062500	27	7	198	[Burma, France, Germany, Pakistan, Peru, Polan...
5	5	0.078125	36	9	189	[Austria, Belarus, Denmark, Kenya, New Zealand...
6	6	0.093750	50	14	175	[Argentina, Bosnia and Herzegovina, Brazil, Ca...
7	7	0.109375	59	9	166	[Ethiopia, Greece, Greenland, Italy, Libya, Ma...
8	8	0.125000	71	12	154	[Bangladesh, Belgium, Cambodia, Chile, Colombi...
9	9	0.140625	79	8	146	[Azerbaijan, Bahrain, Czech Republic, Jordan, ...
10	10	0.156250	86	7	139	[Dominican Republic, Ecuador, Iran, Lebanon, M...

The output above shows the number of infected countries over time. Observe that it takes a while for the network to consider new infected countries.

It is interesting to compare our outcomes with the current Covid19 evolution per country. We consider in our dataset that a country is infected as soon as it confirms a single case. We assume that every day after the first case appeared in China can be considered as a step in the dissemination process. This assumption is done in order to facilitate the comparison with our SIR simulation.

# Compute the number of infected countries
# Get the first date with confirmed cases for every country
first_date = covid_data[covid_data.ConfirmedCases > 0].groupby('Country_Region', as_index=False).Date.min().sort_values(by='Date')
# Compute the number of elapsed days since the beginning of China's outbreak
first_date['step']=first_date.Date-datetime.datetime(2020,1,22)
# Convert this column into an integer with the number of days.
first_date.step = first_date.step.dt.days
# Get the number of infected countries per region
covid_infected = first_date[['step','Country_Region']].groupby('step').agg(new_infected=('step','count'))
covid_infected['infected'] = covid_infected['new_infected'].cumsum()
covid_infected = covid_infected.reset_index()

f = go.Figure()
f.add_trace(
    go.Scatter(x=covid_infected.step,y=covid_infected.infected,name='Covid19')
)
f.add_trace(
    go.Scatter(x=infections_SIR.step,y=infections_SIR.infected,name='SIR')
)
f.update_layout(
    title_text='Comparing SIR infected countries evolution with original Covid19',
    xaxis=dict(title='Step'),
    yaxis=dict(title='Infected countries')
)

f.show(renderer='svg')

The original Covid19 evolution differs in the first steps with our SIR simulation. The initial plateau before the increasing slope, takes much longer in the Covid19 series. There is probably a lag between a country welcoming infected passengers and the declaration of that case. However, it is interesting to observe this difference between the number of declared cases and the evolution of our model. Obviously, there is some remaining exploration work before we can understanding why this is happening.

Conclusions
#

In this notebook we have introduced some interesting tools to analyze the influence of air flights in the dissemination of the Covid19 around the world. Firstly, we have explored the OpenFlights dataset containing air traffic information around the world. Second, we have used this dataset to create an air traffic network. Finally, we have used this network to run a SIR model to understand the dissemination of the virus using air connections.