Panel data Econometrics – An easy introduction with Python

A panel (or longitudinal) data set comprises a time series for each cross-sectional unit in the data set. In other words, a panel data set follows the same cross-sectional units over multiple points in time. Such units can be countries, cities, firms, households, or individuals. In this context, we can think of pure time-series and pure cross-sectional data as special cases of panel data with only one dimension.

Panel vs. Pooled data

According to the EViews documentation, pooled data refers to data with relatively few cross-sections, where variables are held in cross-section-specific individual series, whereas panel data corresponds to data with large numbers of cross-sections, with variables held in single series in stacked form.

Some experts refer to pooled data as a “time series of cross sections”, where the observations in each cross section do not necessarily refer to the same unit. Panel data, by contrast, refers to samples of the same cross-sectional units at multiple points in time. A panel-data observation has two dimensions:

x_{it}, where i runs from 1 to N and denotes the cross-sectional unit, and t runs from 1 to T and denotes the time of the observation.
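
To make the two indices concrete, here is a minimal sketch that builds a toy panel with N = 3 units and T = 4 periods in pandas. The unit names, years, and values are invented for illustration and do not come from any data set discussed in this post.

```python
import numpy as np
import pandas as pd

# Hypothetical units (i = 1..N) and time periods (t = 1..T)
units = ["A", "B", "C"]           # N = 3
years = [2000, 2001, 2002, 2003]  # T = 4

# One row per (unit, year) pair: N * T = 12 observations
index = pd.MultiIndex.from_product([units, years], names=["unit", "year"])
panel = pd.DataFrame({"x": np.arange(len(index), dtype=float)}, index=index)

# x_{it} for unit i = "B" at time t = 2002
x_it = panel.loc[("B", 2002), "x"]

# Fixing i gives a pure time series; fixing t gives a pure cross-section
time_series_B = panel.xs("B", level="unit")["x"]
cross_section_2002 = panel.xs(2002, level="year")["x"]
```

Selecting along one level of the index recovers the one-dimensional special cases mentioned earlier: a single unit over time, or all units at a single point in time.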

Advantages and Disadvantages of Panel data

There are numerous benefits of panel data over cross-sectional and time-series data:

  • In panel data, multiple observations of the same entities allow us to control for unobserved, time-constant characteristics of those entities.
  • Panel data facilitates causal inference that would be difficult with a single cross-section or a single time series.
  • Such data allows us to study the significance of lags in behaviour and the results of decision-making across time and entities.
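
The first benefit can be illustrated with a small simulation. The sketch below is not part of the original analysis: the data are simulated, and the parameter values (200 units, 5 periods, a true slope of 2) are arbitrary assumptions. Each unit has an unobserved characteristic that is correlated with the regressor, so a regression that ignores the panel structure is biased, while subtracting each unit's own mean removes the unobserved characteristic.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_units, n_periods, true_beta = 200, 5, 2.0

# Unobserved, time-constant characteristic a_i of each unit
a = np.repeat(rng.normal(size=n_units), n_periods)
unit = np.repeat(np.arange(n_units), n_periods)

# The regressor is correlated with a_i, so ignoring a_i biases the estimate
x = a + rng.normal(size=n_units * n_periods)
y = true_beta * x + 3.0 * a + 0.1 * rng.normal(size=n_units * n_periods)
sim = pd.DataFrame({"unit": unit, "x": x, "y": y})

def slope(x, y):
    """OLS slope of y on x (with an intercept)."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Single-regression estimate that ignores the panel structure
pooled_beta = slope(sim["x"], sim["y"])

# Within estimate: subtract each unit's own mean, which removes a_i
demeaned = sim[["x", "y"]] - sim.groupby("unit")[["x", "y"]].transform("mean")
within_beta = slope(demeaned["x"], demeaned["y"])

print(f"pooled: {pooled_beta:.2f}, within: {within_beta:.2f}, true: {true_beta}")
```

With a single cross-section there is no unit mean to subtract, which is why this kind of correction needs repeated observations of the same entities.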

Nevertheless, panel data can be difficult for a data scientist to obtain, since it requires observing the same entities repeatedly over multiple periods.

Types of panel data

Panel data can be balanced or unbalanced. A panel is balanced if each cross-sectional unit is observed in all time periods. An unbalanced panel is one where some units are not observed in all time periods, so the data set contains missing values.
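
A balance check can be sketched in a few lines of pandas. The helper name is_balanced and the toy data are hypothetical, and the function assumes a long-format frame in which a missing observation shows up as an absent row rather than as a NaN value.

```python
import pandas as pd

def is_balanced(long_df, entity_col, time_col):
    """Return True if every entity is observed in every time period."""
    n_periods = long_df[time_col].nunique()
    obs_per_entity = long_df.groupby(entity_col)[time_col].nunique()
    return bool((obs_per_entity == n_periods).all())

balanced = pd.DataFrame({
    "country": ["A", "A", "B", "B"],
    "year": [2000, 2001, 2000, 2001],
    "rate": [10.0, 11.0, 20.0, 21.0],
})
# Dropping the last row leaves country B without an observation for 2001
unbalanced = balanced.iloc[:-1]

print(is_balanced(balanced, "country", "year"))    # True
print(is_balanced(unbalanced, "country", "year"))  # False
```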

Econometricians also distinguish between wide and long panel data. In the wide format, a single row (or column) holds all observations of one unit across time, while in the long format each row holds one observation of one unit in one period.
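
The two layouts hold the same information and can be converted into each other. The following sketch uses a made-up two-country frame (the column names and values are assumptions for illustration) to go from wide to long with melt and back with pivot.

```python
import pandas as pd

# Hypothetical wide panel: one row per country, one column per year
wide_df = pd.DataFrame(
    {"country": ["A", "B"], "2000": [10.0, 20.0], "2001": [11.0, 21.0]}
)

# Wide -> long: one row per (country, year) observation
long_df = wide_df.melt(id_vars="country", var_name="year", value_name="rate")
long_df = long_df.sort_values(["country", "year"]).reset_index(drop=True)

# Long -> wide again: countries in the index, years as columns
wide_again = long_df.pivot(index="country", columns="year", values="rate")
```

Most panel estimators expect the long layout, which is why the preprocessing later in this post reshapes the data with melt.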

An example of a panel data set

An example of a wide, unbalanced panel data set is the World Health Organization crude birth rates data set available on Kaggle. For 10 of the 239 countries and regions, crude birth rate observations span 60 years from 1960 to 2019, while for the others only a few periods are available. The large number of missing values is likely due to the high administrative cost of collecting the necessary information.

WHO crude birth rates per 1000

In the table above, we display crude birth rates for the first 10 countries and regions in alphabetical order. We observe that the United Arab Emirates, Antigua and Barbuda, and Australia have no missing data points from 1960 to 1980.

For more detailed theory and examples, see the Introductory Econometrics textbook by Wooldridge.

Panel data set visualization and preprocessing in Python

We use the same WHO crude births data set to describe features of the data and preprocess it for further analysis. In this example we use the pandas library to load the file into a pandas DataFrame and generate descriptive statistics for each country/region with the describe() method. The descriptive statistics include the count of values, the mean and standard deviation, the minimum and maximum values, and the 25th, 50th (median) and 75th percentiles.

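import pandas as pd
# Keep the country name (column 0) and the remaining columns; skip columns 1-3
cols2skip = [1, 2, 3]
path = '/kaggle/input/birth-rate/API_SP.DYN.IMRT.FE.IN_DS2_en_csv_v2_2253793.csv'
df = pd.read_csv(path, skiprows=3, index_col='Country Name',
                 usecols=[i for i in range(64) if i not in cols2skip]).T
# After the transpose, rows are years and columns are countries/regions
df.describe()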

As we can see, some of the columns consist entirely of NaNs, so it is reasonable to drop them:

df = df.dropna(axis=1, how='all')

We further explore the patterns in the crude birth rates and plot the data for the first 10 countries. For interactive visualizations in Python we use the Matplotlib library.

import matplotlib.pyplot as plt
df.iloc[:,:10].plot(subplots=True, figsize=(20,60))
plt.show()

To conduct statistical analysis and model the birth rates, we have to convert the data into a format suitable for panel data analysis. In the following code we use pandas.melt to reshape the DataFrame into a long format in which one or more columns are identifier variables and all other columns are measured variables. We also drop missing values, although interpolation or other techniques for filling missing values are possible alternatives. Finally, we build a two-level index from the row index of the reshaped frame and the year, and keep the year as a separate categorical column for creating dummy variables.

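# After the transpose the index holds the years; store them as an integer column
year = df.index
df['year'] = year.astype(int)
# Wide -> long: one row per (year, country) pair, dropping missing birth rates
df = df.melt(id_vars=["year"],
        var_name="Country",
        value_name="Birth Rate").dropna().reset_index(drop=True)
year = df.year
# Two-level index: row number as the first level, year as the time level
df = df.set_index([df.index, 'year'])
# Keep year as a categorical column so it can be expanded into dummies
df['year'] = pd.Categorical(year)
df.head()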
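
As an alternative to dropping missing values, gaps could be interpolated on the wide frame before it is melted. The sketch below is only an illustration on invented numbers, not the approach used in this post. It assumes that linear interpolation between neighbouring years is acceptable and that leading or trailing gaps should stay missing.

```python
import numpy as np
import pandas as pd

# Hypothetical wide frame: years in the index, one column per country
wide_df = pd.DataFrame(
    {"A": [10.0, np.nan, 12.0, 13.0], "B": [np.nan, 20.0, np.nan, np.nan]},
    index=[1960, 1961, 1962, 1963],
)

# Linear interpolation down each column; limit_area="inside" fills only gaps
# that lie between two observed values and leaves leading/trailing NaNs alone
filled = wide_df.interpolate(method="linear", limit_area="inside")
```

In this toy frame the single gap in column A is filled, while column B keeps its missing values because they are not enclosed by observations. Whether interpolated values are appropriate depends on the analysis, since they are not real measurements.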

Panel data analysis

Panel data analysis is a statistical method widely used in economics, finance and epidemiology to analyze two-dimensional panel data. In a production environment, regression estimation and data modeling traditionally follow the collection of a data set. The three most common panel data models are the pooled model, the fixed effects model and the random effects model.

Intuition behind fixed effects, first differences and pooled OLS
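
As a rough sketch of the intuition, assume the standard unobserved-effects model with a single regressor. The symbols \beta, a_i, u_{it} and v_{it} are notation introduced here for illustration, with a_i standing for the time-constant, unobserved characteristics of unit i and bars denoting averages over time for a unit.

```latex
\begin{align*}
\text{Model:} \quad & y_{it} = \beta x_{it} + a_i + u_{it} \\
\text{Pooled OLS:} \quad & y_{it} = \beta x_{it} + v_{it}, \qquad v_{it} = a_i + u_{it} \\
\text{Fixed effects (within):} \quad & y_{it} - \bar{y}_i = \beta (x_{it} - \bar{x}_i) + (u_{it} - \bar{u}_i) \\
\text{First differences:} \quad & y_{it} - y_{i,t-1} = \beta (x_{it} - x_{i,t-1}) + (u_{it} - u_{i,t-1})
\end{align*}
```

Pooled OLS leaves a_i in the error term, so it is reliable only when x_{it} is uncorrelated with a_i. The within and first-difference transformations both eliminate a_i, because a quantity that does not change over time drops out when we subtract a unit's mean or its previous value.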

Pooled OLS in Python

To estimate pooled OLS we use the linearmodels library, and to add a constant to the linear equation we use the statsmodels library.

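import statsmodels.api as sm
from linearmodels.panel import PooledOLS

# Explanatory variables: country and the categorical year, plus a constant
exog_vars = ["Country", 'year']
exog = sm.add_constant(df[exog_vars])
mod = PooledOLS(df['Birth Rate'], exog)
pooled_res = mod.fit()
print(pooled_res)

A snapshot of the pooled OLS model results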
