Data preprocessing

Author

Ekaterina Cvetkova

Published

January 27, 2025

The quantity of CO2 in air is described by the chemical term “mole fraction”: the number of carbon dioxide molecules in a given number of molecules of air, after removal of water vapor. For example, 413 parts per million of CO2 (abbreviated as ppm) means that every million molecules of (dry) air contain, on average, 413 CO2 molecules.
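
As a toy illustration of the unit (plain arithmetic, not part of the analysis pipeline):

Code
# Toy check of the ppm definition: 413 CO2 molecules per million molecules of dry air
n_dry_air = 1_000_000
n_co2 = 413
mole_fraction_ppm = n_co2 / n_dry_air * 1_000_000
print(mole_fraction_ppm)  # 413.0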

1 Importing libraries

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import kendalltau
from helpers import (coerce_into_full_datetime, add_missing_one_year_rows, 
                      plot_column, add_missing_dates, plot_rolling_correlations, 
                      interpret_p_value, plot_lagged_correlations, plot_entire_df)
from statsmodels.tsa.stattools import adfuller, grangercausalitytests
import warnings
warnings.filterwarnings('ignore')

2 Importation, checks and formatting of station data

Code
df_station = pd.read_csv('data_project - Sheet1.csv')
df_station
year month day decimal average ndays 1 year ago 10 years ago increase since 1800
0 year 5 19 1974.3795 333.37 5 -999.99 -999.99 50.40
1 year 5 26 1974.3986 332.95 6 -999.99 -999.99 50.06
2 year 6 2 1974.4178 332.35 5 -999.99 -999.99 49.60
3 year 6 9 1974.4370 332.20 7 -999.99 -999.99 49.65
4 year 6 16 1974.4562 332.37 7 -999.99 -999.99 50.06
... ... ... ... ... ... ... ... ... ...
2626 2024 9 15 2024.7063 421.98 7 418.33 395.24 145.49
2627 2024 9 22 2024.7254 421.71 2 418.28 395.47 145.32
2628 2024 9 29 2024.7445 421.95 4 418.35 395.61 145.56
2629 2024 10 6 2024.7637 422.16 4 418.47 395.73 145.68
2630 2024 10 13 2024.7828 422.62 5 419.56 395.86 145.97

2631 rows × 9 columns

Code
invalid_10years = (df_station['10 years ago'] == -999.99).sum()
invalid_10years
np.int64(540)

We can use the information in the other columns to create a valid datetime index.

In the rows where the ‘year’ column contains the literal string ‘year’, the year itself can be recovered from the integer part of the ‘decimal’ date column (e.g. 1974.3795 falls in 1974); the ‘month’ and ‘day’ columns then complete the date.

Code
df_station['year'] = df_station['year'].astype(str).str.strip()
mask = df_station['year'] == 'year'  # rows whose year cell holds the literal string 'year'
# Recover the year from the integer part of the decimal date (e.g. 1974.3795 -> 1974)
df_station.loc[mask, 'year'] = df_station.loc[mask, 'decimal'].fillna(0).apply(lambda x: int(float(x)))
Code
df_station.drop(columns = ['decimal'], inplace = True)
df_station = coerce_into_full_datetime(df_station)
Code
invalid_average = (df_station['average'] == -999.99).sum()
invalid_1year = (df_station['1 year ago'] == -999.99).sum()
invalid_10years = (df_station['10 years ago'] == -999.99).sum()

print(invalid_average)
print(invalid_1year)
print(invalid_10years)
18
70
540

We use the ‘1 year ago’ and ‘10 years ago’ columns to reconstruct rows for dates missing from the series, giving us a richer dataframe that captures the long-term trend more effectively (a sketch of the idea follows the next cell).

Code
df_station = add_missing_dates(df_station) # using the function to create new rows using the '10 year ago' column
df_station = add_missing_one_year_rows(df_station) # using the function to create new rows using the '1 year ago' column
df_station.drop(df_station[df_station['average'] == -999.99].index, inplace=True)
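
Since the helper implementations live in helpers.py and are not shown here, the following is a minimal sketch of the idea behind add_missing_dates, assuming it shifts each valid reading back ten years and uses the ‘10 years ago’ value as that date’s average; the function name and body below are illustrative, not the actual helper:

Code
# Hypothetical sketch of the reconstruction idea (illustrative only)
def add_missing_dates_sketch(df):
    valid = df[df['10 years ago'] != -999.99]
    past = pd.DataFrame(
        {'average': valid['10 years ago'].values},
        index=valid.index - pd.DateOffset(years=10),  # shift each date back ten years
    )
    past = past[~past.index.isin(df.index)]  # keep only dates not already present
    return pd.concat([df, past]).sort_index()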

3 Importation, preprocessing and evaluation of feature importance from Open Meteo weather data

Code
df_history = pd.read_csv(r'open-meteo-19.44N155.62E0m - Sheet1.csv', skiprows=3)
df_history['time'] = pd.to_datetime(df_history['time'])  # parse the timestamp column
df_history.set_index('time', inplace=True)
df_history = df_history.resample('D').mean()  # aggregate sub-daily readings to daily means
Code
# Creating df on dates when station data and Open Meteo data overlap

common_dates = df_history.index.intersection(df_station.index)
df_history = df_history.loc[common_dates]
Code
df_history
temperature_2m (°C) relative_humidity_2m (%) dew_point_2m (°C) precipitation (mm) pressure_msl (hPa) surface_pressure (hPa) et0_fao_evapotranspiration (mm) wind_speed_10m (m/s) soil_temperature_0_to_7cm (°C)
1974-05-19 25.341667 70.666667 19.587500 0.091667 1009.866667 1009.866667 0.194583 7.951667 27.300000
1974-05-25 26.820833 83.041667 23.716667 0.004167 1014.900000 1014.900000 0.213750 5.412083 27.833333
1974-05-26 26.445833 81.000000 22.929167 0.012500 1014.766667 1014.766667 0.220833 6.092083 27.666667
1974-06-01 26.587500 81.416667 23.120833 0.025000 1012.683333 1012.683333 0.217500 4.369167 28.033333
1974-06-02 26.600000 80.833333 23.037500 0.029167 1011.666667 1011.666667 0.208333 3.939167 28.166667
... ... ... ... ... ... ... ... ... ...
2024-09-15 28.470833 76.041667 23.812500 0.425000 1009.612500 1009.612500 0.215417 7.975000 29.450000
2024-09-22 28.937500 76.416667 24.341667 0.045833 1013.208333 1013.208333 0.235833 7.489167 29.591667
2024-09-29 28.983333 75.750000 24.245833 0.004167 1013.491667 1013.491667 0.232083 6.991667 29.691667
2024-10-06 28.591667 76.458333 24.045833 0.095833 1012.320833 1012.320833 0.207917 5.662083 29.608333
2024-10-13 27.341667 83.375000 24.279167 0.437500 1010.179167 1010.179167 0.152500 6.735000 29.416667

5174 rows × 9 columns

Code
df_station_column = df_station[['average']]
df_CO2_meteo = df_history.join(df_station_column, how='inner')

3.1 Evaluating correlations between weather data and CO2 levels from station data

Code
# Computing Kendall's Tau correlation matrix

kendall_corr = df_CO2_meteo.corr(method='kendall')

plt.rcParams.update({'font.size': 13})
plt.figure(figsize=(12, 6))
sns.heatmap(kendall_corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Kendall Correlation Heatmap")
plt.show()

Surprisingly and unfortunately, there seems to be little to no correlation between the weather data and the average levels of CO2 at this particular location.

However, further testing and assessment are needed before drawing final conclusions.
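
Before dropping surface_pressure in the next cell, it is worth confirming what the table above already suggests, namely that it duplicates pressure_msl at this sea-level site; a check along these lines would do it:

Code
# Should print True: at 0 m elevation, mean-sea-level and surface pressure coincide
print(np.allclose(df_CO2_meteo['pressure_msl (hPa)'],
                  df_CO2_meteo['surface_pressure (hPa)']))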

Code
df_CO2_meteo.drop(['surface_pressure (hPa)'], axis=1, inplace=True)  # identical to pressure_msl at this sea-level site
Code
df_CO2_meteo.rename(columns={'average': 'average_CO2', 'temperature_2m (°C)' : 'temperature', 
                             'relative_humidity_2m (%)':'humidity', 'dew_point_2m (°C)' : 'dew_point',
                             'precipitation (mm)' : 'precipitation', 'pressure_msl (hPa)' : 'pressure',
                             'et0_fao_evapotranspiration (mm)' : 'evapotranspiration', 
                             'wind_speed_10m (m/s)' : 'wind_speed', 'soil_temperature_0_to_7cm (°C)' : 'soil_temperature'}, inplace=True)
Code
df_CO2_meteo.to_csv('df_CO2_meteo.csv', index=True)

4 N2O Importation and analysis

Code
df_N2O = pd.read_csv(r'mlo_N2O_Day.csv', skiprows = 1)
Code
new_column_names = ['year', 'month', 'day', 'median_N2O', 'std_dev_N2O', 'samples_N2O']
df_N2O.columns = new_column_names
df_N2O
year month day median_N2O std_dev_N2O samples_N2O
0 1998 11 28 NaN NaN 0
1 1998 11 29 NaN NaN 0
2 1998 11 30 NaN NaN 0
3 1998 12 1 NaN NaN 0
4 1998 12 2 NaN NaN 0
... ... ... ... ... ... ...
8760 2022 11 23 337.97 1.08 24
8761 2022 11 24 337.61 0.98 24
8762 2022 11 25 337.51 1.02 23
8763 2022 11 26 337.02 0.96 23
8764 2022 11 27 337.51 0.94 24

8765 rows × 6 columns

Code
df_N2O = coerce_into_full_datetime(df_N2O) # using the `year`, `month` and `day` columns
df_N2O
median_N2O std_dev_N2O samples_N2O
datetime
1998-11-28 NaN NaN 0
1998-11-29 NaN NaN 0
1998-11-30 NaN NaN 0
1998-12-01 NaN NaN 0
1998-12-02 NaN NaN 0
... ... ... ...
2022-11-23 337.97 1.08 24
2022-11-24 337.61 0.98 24
2022-11-25 337.51 1.02 23
2022-11-26 337.02 0.96 23
2022-11-27 337.51 0.94 24

8765 rows × 3 columns

Code
df_N2O = df_N2O.interpolate(method='time')  # time-weighted interpolation on the datetime index
df_N2O = df_N2O.ffill()  # forward fill gaps interpolation could not reach
df_N2O = df_N2O.bfill()  # back fill the leading gap
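
For reference, method='time' weights by the actual gaps between timestamps, which matters on an irregularly spaced index; a toy comparison on synthetic data:

Code
# 'linear' treats observations as equally spaced; 'time' uses the elapsed time between them
s = pd.Series([1.0, np.nan, 4.0],
              index=pd.to_datetime(['2020-01-01', '2020-01-02', '2020-01-04']))
print(s.interpolate(method='linear').iloc[1])  # 2.5 (positional midpoint)
print(s.interpolate(method='time').iloc[1])    # 2.0 (one of three days elapsed)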
Code
df_N2O
median_N2O std_dev_N2O samples_N2O
datetime
1998-11-28 315.59 0.38 0
1998-11-29 315.59 0.38 0
1998-11-30 315.59 0.38 0
1998-12-01 315.59 0.38 0
1998-12-02 315.59 0.38 0
... ... ... ...
2022-11-23 337.97 1.08 24
2022-11-24 337.61 0.98 24
2022-11-25 337.51 1.02 23
2022-11-26 337.02 0.96 23
2022-11-27 337.51 0.94 24

8765 rows × 3 columns

5 Methane importation and analysis

Code
df_CH4 = pd.read_csv(r'mlo_CH4_Day.csv')
Code
new_column_names = [
    "site_code", "year", "month", "day", "hour", "minute", "second",
    "datetime", "time_decimal", "midpoint_time", "value_CH4", "value_std_dev_CH4",
    "nvalue_CH4", "latitude", "longitude", "altitude", "elevation", "intake_height", "qcflag"
]

df_CH4.columns = new_column_names

df_CH4.drop(
    columns=["site_code", "year", "month", "day", "hour", "minute", "second", "time_decimal",
             "midpoint_time", "latitude", "longitude", "altitude", "elevation", "intake_height", "qcflag"],
    inplace=True
)
df_CH4["datetime"] = pd.to_datetime(df_CH4["datetime"]).dt.date
df_CH4.set_index("datetime", inplace=True)

df_CH4
value_CH4 value_std_dev_CH4 nvalue_CH4
datetime
1987-01-02 -999.99 -99.99 0
1987-01-03 -999.99 -99.99 0
1987-01-04 -999.99 -99.99 0
1987-01-05 -999.99 -99.99 0
1987-01-06 -999.99 -99.99 0
... ... ... ...
2024-04-26 1977.78 0.85 7
2024-04-27 1969.78 2.95 7
2024-04-28 1955.87 7.38 7
2024-04-29 1924.00 0.84 7
2024-04-30 1924.86 0.50 7

13634 rows × 3 columns

Code
invalid_CH4 = (df_CH4['value_CH4'] == -999.99).sum()
invalid_CH4
np.int64(906)

Since the invalid rows make up a minuscule fraction of the dataset (906 of 13,634), removing them entirely does no harm.

Code
df_CH4 = df_CH4.loc[df_CH4["value_CH4"] != -999.99]

6 SF6 importation and checks

Code
df_SF6 = pd.read_csv(r'mlo_SF6_Day.csv', skiprows = 1)
Code
new_column_names = ['year', 'month', 'day', 'median_SF6', 'std.dev_SF6', 'samples']
df_SF6.columns = new_column_names
Code
df_SF6['year'] = df_SF6['year'].astype(str).str.strip()  # normalise stray whitespace in the year column

df_SF6 = coerce_into_full_datetime(df_SF6) # using the `year`, `month` and `day` columns
df_SF6
median_SF6 std.dev_SF6 samples
datetime
1998-11-29 NaN NaN 0
1998-11-30 NaN NaN 0
1998-12-01 NaN NaN 0
1998-12-02 NaN NaN 0
1998-12-03 NaN NaN 0
... ... ... ...
2022-11-23 11.594 0.052 24
2022-11-24 11.518 0.049 24
2022-11-25 11.455 0.047 24
2022-11-26 11.394 0.045 24
2022-11-27 11.405 0.046 24

8764 rows × 3 columns

Code
df_SF6 = df_SF6.interpolate(method='time')
df_SF6 = df_SF6.ffill()  # forward/back fill the gaps interpolation cannot reach
df_SF6 = df_SF6.bfill()

7 Merging and visualising all data

Code
df_CO2_meteo.rename_axis('datetime', inplace=True)
Code
plot_column(df_CO2_meteo, 'average_CO2', 'red')
plot_column(df_N2O, 'median_N2O', 'blue')
plot_column(df_CH4, 'value_CH4', 'green')
plot_column(df_SF6, 'median_SF6', 'magenta')
plt.tight_layout()

Code
dfs = [df.copy() for df in [df_CO2_meteo, df_N2O, df_CH4, df_SF6]]
for i in range(len(dfs)):
    dfs[i].index = pd.to_datetime(dfs[i].index)  # Converting index to proper datetime64[ns]

start_date = df_CO2_meteo.index.min()  # Get the earliest date from df_CO2_meteo
dfs = [df[df.index >= start_date] for df in dfs]

df_combined_outer = pd.concat(
    [dfs[0][['temperature', 'humidity', 'dew_point', 'precipitation', 'pressure',
             'evapotranspiration', 'wind_speed', 'soil_temperature', 'average_CO2']],
     dfs[1][["median_N2O"]],
     dfs[2][["value_CH4"]],
     dfs[3][["median_SF6"]]],
    axis=1, join="outer")
Code
dfs = [df.copy() for df in [df_CO2_meteo, df_N2O, df_CH4, df_SF6]]
for i in range(len(dfs)):
    dfs[i].index = pd.to_datetime(dfs[i].index)  # Converting index to proper datetime64[ns]

start_date = df_CO2_meteo.index.min()  # Get the earliest date from df_CO2_meteo
dfs = [df[df.index >= start_date] for df in dfs]

df_combined_inner = pd.concat(
    [dfs[0][['temperature', 'humidity', 'dew_point', 'precipitation', 'pressure',
             'evapotranspiration', 'wind_speed', 'soil_temperature', 'average_CO2']],
     dfs[1][["median_N2O"]],
     dfs[2][["value_CH4"]],
     dfs[3][["median_SF6"]]],
    axis=1, join="inner")
Code
columns_to_fill = ['temperature', 'humidity', 'dew_point', 'precipitation', 'pressure',
                   'evapotranspiration', 'wind_speed', 'soil_temperature', 'average_CO2']

df_combined_outer[columns_to_fill] = df_combined_outer[columns_to_fill].interpolate(method='time')
df_combined_outer[columns_to_fill] = df_combined_outer[columns_to_fill].ffill() # Forward fill remaining missing values
df_combined_outer[columns_to_fill] = df_combined_outer[columns_to_fill].bfill() # Backward fill remaining missing values
Code
plot_entire_df(df_combined_outer)

8 Visualising Correlations

Code
# Computing Kendall's Tau correlation matrix

kendall_corr = df_combined_outer.corr(method='kendall')

plt.figure(figsize=(14, 7))
sns.heatmap(kendall_corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Kendall Correlation Heatmap")
plt.show()

Code
# Computing Pearson's correlation matrix
pearson_corr = df_combined_outer.corr(method='pearson')

plt.figure(figsize=(14, 7))
sns.heatmap(pearson_corr, annot=True, cmap='coolwarm', fmt=".2f")
plt.title("Pearson Correlation Heatmap")
plt.show()

Pearson’s correlation measures the linear relationship between variables.

The correlation coefficient ranges from -1 to +1, where -1 indicates a perfect negative linear relationship, +1 indicates a perfect positive linear relationship, and 0 indicates no linear relationship.

However, it has a limitation: it only captures linear relationships and can miss other types of relationships between variables.

Kendall’s Tau takes a different approach. Instead of measuring linear relationships, it looks at the concordance between variables - essentially, whether they tend to move in the same direction. It’s measuring the tendency of the variables to increase or decrease together, without assuming anything about the shape of that relationship.

In the context of environmental data like CO2 levels, temperature, and other climate variables, Kendall’s Tau might be particularly useful because environmental relationships aren’t always linear, and the data often contains outliers or follows non-normal distributions.
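
A quick synthetic illustration of the difference (not project data): for a strictly monotonic but non-linear series, Kendall’s Tau is exactly 1 while Pearson’s r falls short of it.

Code
from scipy.stats import pearsonr

x = np.arange(1, 101, dtype=float)
y = np.exp(x / 20)         # strictly increasing, but far from linear

tau, _ = kendalltau(x, y)  # 1.0: every pair of points is concordant
r, _ = pearsonr(x, y)      # below 1: the relationship is not linear
print(f"Kendall tau = {tau:.3f}, Pearson r = {r:.3f}")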

Code
plot_rolling_correlations(df_combined_inner)

Rolling correlations (also called moving correlations) are a dynamic way to measure how the relationship between two variables changes over time. Unlike a single correlation coefficient that shows one number for an entire dataset, rolling correlations show how the correlation evolves throughout a time series.

Rolling correlations can be used in environmental studies to understand how relationships between variables shift with seasonal or long-term changes.
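
plot_rolling_correlations comes from helpers.py; below is a minimal sketch of the kind of computation it presumably performs, assuming a pandas rolling window (the window length here is arbitrary):

Code
# Sketch of a rolling correlation between one predictor and CO2 (illustrative only)
window = 52  # roughly one year of weekly observations
rolling_corr = (df_combined_inner['temperature']
                .rolling(window)
                .corr(df_combined_inner['average_CO2']))
rolling_corr.plot(title=f'{window}-row rolling correlation: temperature vs average_CO2')
plt.show()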

Code
# Checking for stationarity (Augmented Dickey-Fuller Test)
def check_stationarity(series):
    result = adfuller(series.dropna())
    p_value = result[1]  # the ADF p-value; small values let us reject the unit root
    return p_value

stationarity_results = {col: check_stationarity(df_combined_inner[col]) for col in df_combined_inner.columns}

max_lag = 12
granger_results = {}

# Performing Granger Causality tests
for col in df_combined_inner.columns:
    if col != "average_CO2":
        test_result = grangercausalitytests(df_combined_inner[['average_CO2', col]].dropna(), max_lag, verbose=False)
        granger_results[col] = {lag: test_result[lag][0]['ssr_ftest'][1] for lag in range(1, max_lag + 1)}

stationarity_results
{'temperature': np.float64(9.451145188179234e-22),
 'humidity': np.float64(1.2067943715819874e-12),
 'dew_point': np.float64(1.2150381839441513e-19),
 'precipitation': np.float64(7.824982759354124e-20),
 'pressure': np.float64(6.824231580012553e-12),
 'evapotranspiration': np.float64(5.673478319719433e-14),
 'wind_speed': np.float64(8.298941795111292e-30),
 'soil_temperature': np.float64(1.6253595748496084e-23),
 'average_CO2': np.float64(0.8630583937959493),
 'median_N2O': np.float64(0.999004879583799),
 'value_CH4': np.float64(0.6729054440779778),
 'median_SF6': 1.0}

The ADF test checks whether a time series is stationary, which is crucial for many statistical analyses. A stationary time series has consistent statistical properties over time: its mean and variance don’t change.

Granger causality tests explore whether past values of one variable help predict future values of another. They test “statistical causality”, though it’s important to note that Granger causality doesn’t necessarily imply actual causation.
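
A quick sanity check of the ADF test’s behaviour on synthetic data (illustrative only): white noise is stationary, while its cumulative sum (a random walk) is not.

Code
rng = np.random.default_rng(0)
noise = rng.normal(size=1000)  # stationary white noise
walk = noise.cumsum()          # non-stationary random walk

print(adfuller(noise)[1])      # tiny p-value: reject the unit root (stationary)
print(adfuller(walk)[1])       # large p-value: cannot reject the unit root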

Code
# Converting Granger causality test results into a readable DataFrame

granger_df = pd.DataFrame.from_dict(
    {var: [granger_results[var][lag] for lag in range(1, max_lag + 1)] for var in granger_results.keys()},
    orient='index',
    columns=[f'Lag {i}' for i in range(1, max_lag + 1)]
)

granger_df["Interpretation"] = granger_df.apply(lambda row: interpret_p_value(row.values), axis=1)
granger_df
Lag 1 Lag 2 Lag 3 Lag 4 Lag 5 Lag 6 Lag 7 Lag 8 Lag 9 Lag 10 Lag 11 Lag 12 Interpretation
temperature 7.720947e-25 5.497089e-21 7.705556e-26 9.829507e-24 7.675065e-19 3.120083e-17 8.219184e-13 2.203910e-12 5.638884e-10 4.525774e-09 1.018310e-07 2.367689e-07 Strong causality (p < 0.01)
humidity 8.903531e-12 1.981891e-10 8.968556e-14 2.988373e-14 3.172161e-11 6.171779e-10 6.195866e-07 1.348078e-06 5.218207e-04 2.136951e-03 4.108305e-03 6.423288e-03 Strong causality (p < 0.01)
dew_point 8.067610e-25 4.458532e-21 8.264431e-26 3.584386e-24 1.393978e-19 4.411318e-18 1.616901e-13 7.524441e-13 4.799498e-09 3.827182e-08 1.586932e-07 3.114832e-07 Strong causality (p < 0.01)
precipitation 3.819164e-04 5.001872e-03 5.371408e-04 1.094697e-04 8.476688e-03 2.857322e-02 3.756680e-01 5.216954e-01 9.136631e-01 8.859372e-01 3.525676e-01 1.535244e-01 Moderate to low causality (p < 0.05 at some lags)
pressure 4.438619e-11 7.971067e-09 1.926409e-10 6.537783e-09 1.693316e-06 1.212028e-05 1.613346e-03 3.624291e-03 2.254627e-02 3.287454e-02 9.963901e-03 7.508465e-03 Moderate to low causality (p < 0.05 at some lags)
evapotranspiration 2.125455e-04 1.541215e-05 3.031203e-07 2.011486e-06 3.167283e-08 8.742475e-10 4.830566e-09 1.208496e-08 1.646039e-08 2.601857e-09 6.520648e-12 1.285255e-12 Strong causality (p < 0.01)
wind_speed 1.440207e-12 4.899939e-12 5.808182e-14 3.671966e-15 1.207507e-13 1.313350e-12 3.845556e-10 5.768298e-10 1.274597e-08 2.401023e-08 1.356298e-07 6.914299e-08 Strong causality (p < 0.01)
soil_temperature 4.069728e-32 1.903884e-26 1.176250e-39 1.926643e-37 1.831168e-32 1.178714e-30 7.919387e-24 3.344132e-23 6.174720e-22 7.789451e-21 5.191957e-22 2.197068e-21 Strong causality (p < 0.01)
median_N2O 8.008445e-05 4.118513e-05 8.078303e-04 6.303455e-04 1.439309e-04 3.025065e-06 1.612906e-08 1.365409e-08 6.937741e-13 4.135353e-14 9.966893e-18 1.458885e-18 Strong causality (p < 0.01)
value_CH4 6.094348e-06 1.821082e-13 1.783052e-19 2.951395e-22 2.037874e-23 3.249063e-25 1.223335e-22 4.499614e-22 6.773017e-15 8.673936e-14 5.394487e-13 1.508888e-12 Strong causality (p < 0.01)
median_SF6 8.433207e-04 2.402066e-07 7.523613e-06 7.130874e-08 2.532144e-11 1.648528e-12 6.471290e-15 7.232538e-15 5.658483e-17 3.503844e-18 1.710563e-20 1.563221e-22 Strong causality (p < 0.01)

Results

  1. Stationarity check (Augmented Dickey-Fuller test): CO₂ (average_CO2) is non-stationary → it has a trend and may need differencing to make it stationary before modeling (a differencing sketch follows the list below). Although the meteorological data is mostly stationary, the gas-related datasets (N₂O, CH₄, and SF₆) are non-stationary.

  2. Granger causality (p-values across lags 1-12): lower p-values (< 0.05) indicate strong causality. The smaller the p-value, the more significant the predictive relationship.
  • Temperature (temperature_2m (°C)) ✅ Strong causality across all lags - Predicts future CO₂ trends
  • Relative Humidity (relative_humidity_2m (%)) ✅ Significant up to lag 12 - Influences CO₂ levels, but weaker than temperature
  • N₂O (median_N2O) ✅ Significant causality at longer lags - Has a delayed effect on CO₂
  • CH₄ (value_CH4) ✅ Very strong causality - CH₄ changes predict CO₂ variations
  • SF₆ (median_SF6) ✅ Moderate causality at higher lags - SF₆ shows long-term predictive power
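
As noted under the stationarity results, a common remedy for the non-stationary series is first differencing; a sketch (not applied in this notebook):

Code
# First-difference CO2 and re-run the ADF test; differencing should remove the trend
co2_diff = df_combined_inner['average_CO2'].diff().dropna()
print(adfuller(co2_diff)[1])  # expected to fall well below 0.05 once the trend is gone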

Temperature and CH₄ have very strong causal effects and are the strongest predictors of CO₂.

The p-values are extremely low, suggesting that past temperature and CH₄ values contain significant information about future CO₂ levels.

For methane this makes sense, since CH₄ and CO₂ are both greenhouse gases affected by similar processes.

Humidity, N₂O, and SF₆ also predict CO₂, but to a lesser extent.


Why Did Pearson & Kendall Show Weak Correlation, But Granger Shows Strong Causality? The difference comes from how these methods analyze relationships.

  1. Pearson/Kendall correlation (static, instantaneous relationship): these methods only measure direct relationships between variables at the same point in time. Pearson checks linear relationships at one moment; Kendall looks at rank-based (monotonic) relationships, but still without considering time delays. Since CO₂, temperature, and humidity may have delayed effects on each other, Pearson/Kendall can fail to detect a strong relationship.

  2. Granger causality (temporal dependency): this test uses past values of temperature, humidity, and the other variables to see whether they help predict future CO₂. Many climate and atmospheric processes do not act immediately; they take weeks or months to show an impact. For example, higher temperatures today might increase plant respiration or ocean CO₂ release over the following weeks or months. That is why Granger causality detects delayed effects that standard correlation ignores, even where Pearson/Kendall showed weak relationships.
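
A toy illustration of why an instantaneous correlation can miss a lagged dependence (synthetic data):

Code
rng = np.random.default_rng(1)
x = rng.normal(size=500)
y = np.roll(x, 30) + 0.1 * rng.normal(size=500)  # y echoes x with a 30-step delay

print(np.corrcoef(x, y)[0, 1])             # near 0: no instantaneous relationship
print(np.corrcoef(x[:-30], y[30:])[0, 1])  # near 1: strong once the lag is respected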

Real-World Examples of Delayed Causality in Climate Data

  • Temperature & CO₂: when temperatures rise, it may take weeks to months before we see a significant change in CO₂ levels due to ocean-atmosphere exchange.
  • Humidity & CO₂: humidity affects cloud cover, precipitation, and soil moisture, which influence carbon absorption and release, but not immediately.
  • Methane (CH₄) & CO₂: CH₄ breaks down into CO₂ over time, meaning its effects on CO₂ might appear after several months.

8.0.1 1 year in the past - analysis

Code
plot_lagged_correlations(df_combined_inner, 'average_CO2', 365)
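
plot_lagged_correlations is another helper from helpers.py; a minimal sketch of the kind of computation it presumably performs, using row-based shifts (with roughly weekly rows, one lag step is about a week):

Code
# Sketch of a lagged-correlation curve against CO2 (illustrative only)
target = df_combined_inner['average_CO2']
lagged_corr = {lag: df_combined_inner['temperature'].shift(lag).corr(target)
               for lag in range(0, 53)}  # about one year of weekly lags
pd.Series(lagged_corr).plot(xlabel='lag (rows, ~weeks)', ylabel='correlation with average_CO2')
plt.show()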

Key Observations:

🌡 Temperature (temperature_2m (°C)) shows a periodic pattern

  • The correlation peaks at around 90, 180, and 360 days.
  • This suggests a seasonal effect, where temperature changes predict CO₂ levels months later.
  • Possible explanation: seasonal cycles of vegetation, ocean uptake, or industrial activity.

💧 Humidity (relative_humidity_2m (%)) has a weak but noticeable lag effect

  • The correlation is slightly negative, meaning higher humidity may be linked to lower CO₂ later.
  • This could be due to increased plant growth (photosynthesis) reducing CO₂.

🛑 N₂O (median_N2O) shows some delayed correlation

  • N₂O is related to industrial activity and fossil fuel combustion.
  • If N₂O increases, CO₂ might follow due to shared emission sources.

🔥 Methane (value_CH4) has a strong positive correlation with CO₂ over time

  • CH₄ and CO₂ both contribute to greenhouse effects.
  • CH₄ breaks down into CO₂ over time, explaining why higher CH₄ leads to increased CO₂ later.

🌎 SF₆ (median_SF6) shows the strongest overall correlation

  • SF₆ is a long-lived greenhouse gas, and its correlation with CO₂ is almost constant. This suggests shared sources or long-term emission trends.

🔬 Interpretation

  • Temperature and CH₄ are the strongest predictors of CO₂ over time.
  • The delayed effects (~90 to 360 days) explain why Pearson/Kendall correlation missed these relationships.
  • There is a clear seasonal component, especially for temperature and humidity.
  • Industrial gases (N₂O, SF₆) show long-term trends in relation to CO₂.

8.0.2 5 years in the past - analysis

Code
plot_lagged_correlations(df_combined_inner, 'average_CO2', 1825)

📉 Temperature (temperature_2m (°C)) Shows Strong Multi-Year Cycles

  • Clear periodic pattern every ~365 days → suggests annual climate cycles affecting CO₂.
  • Peaks every ~1 year, aligning with seasonal and yearly CO₂ fluctuations.

💧 Humidity (relative_humidity_2m (%)) Shows Opposite Cycles

  • The correlation oscillates inversely to temperature.
  • This suggests that higher humidity is associated with lower future CO₂ levels, likely linked to vegetation absorption and precipitation cycles.

🔥 CH₄ (value_CH4) and SF₆ (median_SF6) Show Strong Long-Term Correlations

  • They maintain high correlation with CO₂ for multiple years.

📈 N₂O (median_N2O) Shows Long-Term Influence

  • It maintains a high correlation with CO₂ for several years.

The yearly cycles suggest that seasonality must be considered in CO₂ forecasting models.

8.0.3 10 years in the past - analysis

Code
plot_lagged_correlations(df_combined_inner, 'average_CO2', 3650)

🔍 Very Long-Term Lagged Correlation Analysis (10 Years)

This plot extends the lagged correlation window to 10 years (3650 days) to examine even longer-term dependencies between CO₂ and its predictors.

🌡 Temperature (temperature_2m (°C)) Shows a Strong Multi-Year Cycle

  • Clear oscillations approximately every year (365 days) - suggests strong seasonal and multi-year trends in CO₂.

💧 Humidity (relative_humidity_2m (%)) Shows an Inverse Correlation to Temperature

  • Indicates that higher humidity leads to lower CO₂ after several months/years, likely due to enhanced vegetation growth, precipitation, and CO₂ absorption.

🔥 N₂O (median_N2O) and SF₆ (median_SF6) Maintain Strong Long-Term Correlations

  • They consistently correlate with CO₂ across multiple years - suggests shared emission sources or slow accumulation effects over time.

📈 CH₄ (value_CH4) Shows a Gradual Decrease in Correlation Over Time

  • The influence on CO₂ appears strongest in the first few years but then declines.

📉 After ~8-10 Years, Correlations Become More Unstable

Around 2500-3500 days (7-10 years), the correlations become noisier. This could be due to external climate variability, policy changes, or model limitations in capturing such long-term effects.

🌍 Implications for Long-Term CO₂ Forecasting

  • Annual cycles are clearly visible, meaning that any CO₂ forecasting model should incorporate seasonality.
  • Humidity has a delayed inverse effect, possibly due to its influence on carbon sinks.
  • Industrial pollutants show shorter-term influence, making them more useful for mid-term (1-5 year) forecasting.

Code
df_combined_inner = df_combined_inner.drop(columns = ['precipitation', 'wind_speed', 'dew_point', 'pressure'])  # drop() returns a copy, so assign it back
df_combined_outer = df_combined_outer.drop(columns = ['precipitation', 'wind_speed', 'dew_point', 'pressure'])
df_full_CO2 = df_combined_outer[['average_CO2']]
Code
df_combined_inner.to_csv('df_combined_inner.csv', index=True)
df_combined_outer.to_csv('df_combined_outer.csv', index=True)
df_full_CO2.to_csv('df_full_CO2.csv', index=True)