12 minute read

This is a reproduction of a Jupyter notebook you can find here


The Covid19 has imposed a lot of pressure on epidemiologists to come up with models that can explain the impact of pandemics. There already is lot of theory developed around this topic you can check out. In particular, understanding SIR models is really useful to understand how researchers approach the problem of virus spreading. In this notebook, I introduce some ideas about how to use Pandas, plots and graphs to explore the evolution of pandemics. I show some tools you can use to simulate networks dynamics. Of course, this cannot be considered a contribution to the state of the art. However, you may find some interesting guidelines for other problems.

The question

The main question we ask ourselves is has air passengers something to do with Covid19?. If so, can we use air traffic data to know something else about Covid19 evolution?

It is known that the virus started in China and later propagated around the world. If air traffic has contributed to the spread of the virus, this means that we can somehow leverage a model that explains or predict the evolution of the disease. Furthermore, we could even predict what countries could be potentially infected and/or what flights shall be cancelled.

There is a vast amount of work that can be done in this topic. To make it more accessible we will split it into smaller parts. In this first part, we will explore the data we plan to use and how to work with graphs.


The OpenFlights dataset contains airlines, air routes and airports per country. The information contained in the dataset contains information up to 2014. This may be a bit outdated. However, it is detailed enough for our purposes.

Additionally, we use the Kraggle Covid19 challenge data as we did in our previous Covid19 series.

import pandas as pd
import numpy as np
import plotly.graph_objects as go
import plotly.express as px
import datetime
routes_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/routes.dat'
airports_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/airports.dat'
countries_dataset_url = 'https://raw.githubusercontent.com/jpatokal/openflights/master/data/countries.dat'

# Load the file with the routes
routes = pd.read_csv(routes_dataset_url, names=['Airline','AirlineID','SourceAirport', 
                                                       'SourceAirportID', 'DestinationAirport', 'DestinationAirportID',

# Load the file with the airports
airports = pd.read_csv(airports_dataset_url, names=['AirportID','Name','City','Country','IATA','ICAO',
                                                         'Latitude', 'Longitude', 'Altitude', 'Timezone', 'DST',
                                                         'TzDatabaseTimeZone', 'Type', 'Source'], index_col='AirportID',


# Covid19 data
covid_data = pd.read_csv('covid19-global-forecasting-week-2/train.csv',parse_dates=['Date'], 
                   date_parser=lambda x: datetime.datetime.strptime(x, '%Y-%m-%d'))

Airline AirlineID SourceAirport SourceAirportID DestinationAirport DestinationAirportID CodeShare Stops Equipment
0 2B 410.0 AER 2965.0 KZN 2990.0 NaN 0 CR2
1 2B 410.0 ASF 2966.0 KZN 2990.0 NaN 0 CR2
2 2B 410.0 ASF 2966.0 MRV 2962.0 NaN 0 CR2
3 2B 410.0 CEK 2968.0 KZN 2990.0 NaN 0 CR2
4 2B 410.0 CEK 2968.0 OVB 4078.0 NaN 0 CR2
... ... ... ... ... ... ... ... ... ...
67658 ZL 4178.0 WYA 6334.0 ADL 3341.0 NaN 0 SF3
67659 ZM 19016.0 DME 4029.0 FRU 2912.0 NaN 0 734
67660 ZM 19016.0 FRU 2912.0 DME 4029.0 NaN 0 734
67661 ZM 19016.0 FRU 2912.0 OSS 2913.0 NaN 0 734
67662 ZM 19016.0 OSS 2913.0 FRU 2912.0 NaN 0 734

67663 rows × 9 columns

Name City Country IATA ICAO Latitude Longitude Altitude Timezone DST TzDatabaseTimeZone Type Source
1 Goroka Airport Goroka Papua New Guinea GKA AYGA -6.081690 145.391998 5282 10.0 U Pacific/Port_Moresby airport OurAirports
2 Madang Airport Madang Papua New Guinea MAG AYMD -5.207080 145.789001 20 10.0 U Pacific/Port_Moresby airport OurAirports
3 Mount Hagen Kagamuga Airport Mount Hagen Papua New Guinea HGU AYMH -5.826790 144.296005 5388 10.0 U Pacific/Port_Moresby airport OurAirports
4 Nadzab Airport Nadzab Papua New Guinea LAE AYNZ -6.569803 146.725977 239 10.0 U Pacific/Port_Moresby airport OurAirports
5 Port Moresby Jacksons International Airport Port Moresby Papua New Guinea POM AYPY -9.443380 147.220001 146 10.0 U Pacific/Port_Moresby airport OurAirports
... ... ... ... ... ... ... ... ... ... ... ... ... ...
14106 Rogachyovo Air Base Belaya Russia NaN ULDA 71.616699 52.478298 272 NaN NaN NaN airport OurAirports
14107 Ulan-Ude East Airport Ulan Ude Russia NaN XIUW 51.849998 107.737999 1670 NaN NaN NaN airport OurAirports
14108 Krechevitsy Air Base Novgorod Russia NaN ULLK 58.625000 31.385000 85 NaN NaN NaN airport OurAirports
14109 Desierto de Atacama Airport Copiapo Chile CPO SCAT -27.261200 -70.779198 670 NaN NaN NaN airport OurAirports
14110 Melitopol Air Base Melitopol Ukraine NaN UKDM 46.880001 35.305000 0 NaN NaN NaN airport OurAirports

7698 rows × 13 columns

# We compute some intermediate dataframes that will be useful later

# Aggregate the number of flights per airport.
# Source_airport -> destination_airport, number of flights
flights_airport = routes.groupby(['SourceAirportID','DestinationAirportID']).size().reset_index(name='num_flights')

# Get the countries for the previous airport IDs

# Aggregate the number of flights between countries
# SourceCountry -> DestinationCountry, number of flights
paths = flights_country.groupby(['Country','CountryDestination']).sum()
# Aggregate the number of flights to other countries.
paths_other_countries = flights_country[flights_country.Country != flights_country.CountryDestination].groupby(['Country','CountryDestination']).sum()

# Total number of flights per country
total_flights_country = paths.groupby(['Country']).sum()
# Total number of flights to other countries
total_flights_other_countries = paths_other_countries.groupby(['Country']).sum()

Air routes visualization

The OpenFlights offers information about the flights between airports. Right now we are only interested in the number of flights between countries. This will simplify the visualization of the air traffic around the globle and it is enough to spot airflights hubs in the dataset.

Note: generating the plot may take a while. However, we added a fancy loading bar :)

# Some code to have a nice progress bar :)
from IPython.display import clear_output

def update_progress(progress):
    bar_length = 20
    if isinstance(progress, int):
        progress = float(progress)
    if not isinstance(progress, float):
        progress = 0
    if progress < 0:
        progress = 0
    if progress >= 1:
        progress = 1

    block = int(round(bar_length * progress))
    clear_output(wait = True)
    text = 'Progress: [{0}] {1:.1f}%'.format( '#' * block + '-' * (bar_length - block), progress * 100)
fig = go.Figure()
total_rows = float(len(paths_other_countries.index))
for index, row in paths_other_countries.iterrows():
    num_row = num_row + 1
            locationmode = 'country names',
            locations = index,
            mode = 'lines',
            hoverinfo= 'skip',
            line = dict(width=1,color='red'),
            opacity = float(row['num_flights'])/float(total_flights_other_countries.loc[index[0]]),

    geo = dict(
        showcountries = True,
    showlegend = False,
    title = 'Aggregated air routes between countries excluding domestic flights'

Progress: [####################] 100.0%


If the air traffic has something to do with the Covid19 spread then, China’s air traffic requires our attention. The figure below shows the routes between China and other countries excluding domestic flights. The wide of every line is representative of the number of flights between countries.

Excluding domestic flights (which are quite a few), China has connections to 62 countries. Assuming daily flights, this means that an infected passenger can propagate the disease around the globe in a few hours.

aux = paths_other_countries.reset_index()
aux = aux[aux.Country=='China']

fig = go.Figure()

total_rows = float(len(aux.index))

the_min = aux.num_flights.min()
the_max = aux.num_flights.max()
the_mean = aux.num_flights.mean()
the_std = aux.num_flights.std()
scalator = float(the_max - the_min)

for index, row in aux.iterrows():
    num_row = num_row + 1
    width = ((float(row['num_flights'])-the_min)/scalator)*15
    #width = np.max([0.5, width])
            locationmode = 'country names',
            locations = [row.Country,row.CountryDestination],
            mode = 'lines',
            hoverinfo= 'skip',
            line = dict(width=width,color='red'),

    geo = dict(
        showcountries = True,
    showlegend = False,
    title = 'China aggregated air routes between countries excluding domestic flights'


Progress: [####################] 98.4%


Disseminations with graphs

Now that we have a better understanding of our dataset and how does it look, it is time to prepare some artifacts to study the spread of the virus. In this case, we are going to translate our Pandas tables into a graph that represents the network of countries connected by their flights.

We can create a directed graph $G(V,E)$ where the set of vertices $V$ represents the countries and the set of edges $E$ represents existing air routes between countries $A$ and $B$. Additionally, we can set weights $W(E)$ with the number of flights between two countries.

There are some nice libraries to work with graphs in Python. However, I particularly like Graph Tool, maintained by Tiago de Paula Peixoto from the Central European University. It has implementations of some sophisticated algorithms done in C++ with OpenMP.

# Run this to ensure that the drawing module is correctly stablished
from graph_tool.all import *
import graph_tool as gt
g = gt.Graph()

# Create the country property with the country names
countries_prop = g.new_vertex_property('string')
g.vertex_properties['country'] = countries_prop

# Total number of outgoing flights per vertex
out_flights_prop = g.new_vertex_property('int')
g.vertex_properties['out_flights'] = out_flights_prop

# Create edge property with the number of flights between countries
num_flights_prop = g.new_edge_property('double')
g.edge_properties['num_flights'] = num_flights_prop

# Get the complete list of countries
list_countries = np.union1d(paths.reset_index().Country.unique(), paths.reset_index().CountryDestination.unique())


# For a given country, get the vertex
country_vertex = {}
#for c in total_flights_country.index:
for c in list_countries:
    v = g.vertex(index)
    countries_prop[v] = c
    country_vertex[c] = v
    # skip self-loops
        out_flights_prop[v] = total_flights_other_countries.loc[c]['num_flights']
        out_flights_prop[v] = 0.0

# Add the edges
for index,num_flights in paths.iterrows():
    s = country_vertex[index[0]]
    d = country_vertex[index[1]]
    # No self-loops
    if s != d:
        e = g.add_edge(s,d)
        num_flights_prop[e] = num_flights

We draw our graph using a radial layout with China in the center. The edges correspond to the neighbors that can be reached in a single step from China. This is, the 62 countries with direct flights from China.

# plot with a radial distribution with China in the center
china_vertex = country_vertex['China']

pos = gt.draw.radial_tree_layout(g,china_vertex)
gv = gt.GraphView(g, efilt=lambda e : e.source()==china_vertex)



Considering a dissemination scenario, we draw the edges for all the countries that can be reached after a scale in a flight from China. As you can see the number is particularly large.

second_step = []
for v in g.get_out_neighbors(china_vertex):
gv2 = gt.GraphView(g, efilt=lambda e : e.target!=china_vertex and e.source() in second_step)


The graph above is a bit misleading. Not all the vertices will be reached from the root vertex (China) with the same probability. In other words, the frequency of flights between countries will increase the probabilities to transfer an infected passenger from China.

If we consider the weight $W$ of every edge to be the number of outgoing flights following the edge direction, everything becomes more interesting. The figure below changes the width of every edge connecting China with its flight-connected neighbors. A few number of countries (Taiwan, South Korea, Japan) accumulates most of China’s outgoing traffic as the wide arrows show.

Note: in the figure below the $W$ weight is normalized. Otherwise, most of the connections would not be plotted.

aux = gv.edge_properties['num_flights'].copy()

aux.a = (aux.a - aux.a.min())/(aux.a.max()-aux.a.min()) * 10


SIR epidemic process

The SIR model is a simple, yet powerful, compartimental model for epidemics. The model consists of three compartments $S$, $I$ and $R$. $S$ is the number of susceptible, $I$ the number of infectuous and $R$ the number of recovered or deceased individuals. This is the most common model used right now and has many extensions and enhancements. The number of individuals in $S$ falls down as they become infected ($I$) and then recovered $R$. This is one of the famous infection curves used to forecast Covid19 evolution.

How to determine when a new individual moves between compartments depends on transition rates. Continuing with our initial idea about how air traffic can be an important actor in Covid19 dissemination, we can use our air flight connections graph to run a SIR simulation.

Fortunately, graph tool comes with an implementation of the SIR model easily configurable. We run a SIR model in 63 steps which is the number of days the Covid19 required to expand worldwide in our dataset. For every edge in our graph we compute $\beta_e = \sum_{J \in out(A)} W(A,J)$ that basically sets the transition between vertices as the ratio of flights that connection represent among the total number of outgoing flights in vertex $A$. Additionally, we configure SIR to remove spontaneous infections and reinfections and set China as the dissemination seed.

def runSIR(state,num_iter=100):
    For a graph tool epidemic model run the state and return a dataframe with the countries infected in every step
    infections = pd.DataFrame()
    previous = state.get_state().fa
    initial = np.where(previous==1)[0]
    for t in range(num_iter):
        new = state.get_state().fa
        # Find new infected countries
        diff = new - previous
        already_infected = state.get_state().fa.sum()
        non_infected = len(state.get_state().fa)-already_infected
        new_infected = [countries_prop[g.vertex(v)] for v in np.where(diff==1)[0]]
        previous = new.copy()
        # collect the results in a Pandas friendly way
        infections = infections.append([{'step':t,'norm_steps':t/float(num_iter),'infected':already_infected,
                                'new_infected': len(new_infected),'non_infected': non_infected,
                                'new_infected_countries': new_infected}], ignore_index=True)
    return infections

# Create an edge property map that can help us to define the beta probability between vertices 
beta_prop = g.new_edge_property('double')
for e in g.edges():
    beta_prop[e] = num_flights_prop[e]/out_flights_prop[e.source()]

# Assume that vertices are not susceptible, except the seed.
susceptible_prop = g.new_vertex_property('float')
for v in g.vertices():
       susceptible_prop[v] = 0.0

state = gt.dynamics.SIRState(g, beta=beta_prop,v0 = china_vertex, constant_beta=True,gamma=0,

infections_SIR = runSIR(state,64)
step norm_steps infected new_infected non_infected new_infected_countries
1 1 0.015625 8 5 217 [Australia, India, Indonesia, Kyrgyzstan, Sout...
2 2 0.031250 13 5 212 [Israel, Russia, Saudi Arabia, Thailand, Unite...
3 3 0.046875 20 7 205 [Egypt, Hong Kong, Malaysia, Taiwan, Turkey, U...
4 4 0.062500 27 7 198 [Burma, France, Germany, Pakistan, Peru, Polan...
5 5 0.078125 36 9 189 [Austria, Belarus, Denmark, Kenya, New Zealand...
6 6 0.093750 50 14 175 [Argentina, Bosnia and Herzegovina, Brazil, Ca...
7 7 0.109375 59 9 166 [Ethiopia, Greece, Greenland, Italy, Libya, Ma...
8 8 0.125000 71 12 154 [Bangladesh, Belgium, Cambodia, Chile, Colombi...
9 9 0.140625 79 8 146 [Azerbaijan, Bahrain, Czech Republic, Jordan, ...
10 10 0.156250 86 7 139 [Dominican Republic, Ecuador, Iran, Lebanon, M...

The output above shows the number of infected countries over time. Observe that it takes a while for the network to consider new infected countries.

It is interesting to compare our outcomes with the current Covid19 evolution per country. We consider in our dataset that a country is infected as soon as it confirms a single case. We assume that every day after the first case appeared in China can be considered as a step in the dissemination process. This assumption is done in order to facilitate the comparison with our SIR simulation.

# Compute the number of infected countries
# Get the first date with confirmed cases for every country
first_date = covid_data[covid_data.ConfirmedCases > 0].groupby('Country_Region', as_index=False).Date.min().sort_values(by='Date')
# Compute the number of elapsed days since the beginning of China's outbreak
# Convert this column into an integer with the number of days.
first_date.step = first_date.step.dt.days
# Get the number of infected countries per region
covid_infected = first_date[['step','Country_Region']].groupby('step').agg(new_infected=('step','count'))
covid_infected['infected'] = covid_infected['new_infected'].cumsum()
covid_infected = covid_infected.reset_index()
f = go.Figure()
    title_text='Comparing SIR infected countries evolution with original Covid19',
    yaxis=dict(title='Infected countries')



The original Covid19 evolution differs in the first steps with our SIR simulation. The initial plateau before the increasing slope, takes much longer in the Covid19 series. There is probably a lag between a country welcoming infected passengers and the declaration of that case. However, it is interesting to observe this difference between the number of declared cases and the evolution of our model. Obviously, there is some remaining exploration work before we can understanding why this is happening.


In this notebook we have introduced some interesting tools to analyze the influence of air flights in the dissemination of the Covid19 around the world. Firstly, we have explored the OpenFlights dataset containing air traffic information around the world. Second, we have used this dataset to create an air traffic network. Finally, we have used this network to run a SIR model to understand the dissemination of the virus using air connections.


Tiago P. Peixoto, “The graph-tool python library”, figshare. (2014) DOI: 10.6084/m9.figshare.1164194 sci-hub