Linear Regression Analysis in Geoscience Studies with Python

Leonides Netto
Aug 12
4 min read

Understanding relationships between variables is a fundamental part of geoscientific research. For example, in soil studies, we often want to know whether there is a measurable relationship between the carbon content of a sample and its organic matter content. Linear regression is a statistical method that helps us quantify this relationship, allowing us to:

Identify correlations between measured parameters;
Predict values of one variable based on another;
Support interpretations about environmental processes;
Provide numerical evidence for trends observed in the field.

In geoscience, such analyses can be important for:

Relating geochemical or geophysical parameters;
Assessing environmental changes over time;
Supporting modeling of natural processes.

The following Python code uses pandas, matplotlib, seaborn, and scikit-learn to:

Merge two datasets by Sample and Days;
Plot the measured data points;
Perform a linear regression between carbon content (%) and organic matter content (%);
Plot the regression line;
Display the regression equation and the coefficient of determination (R²).

import pandas as pd

import matplotlib.pyplot as plt

import seaborn as sns

from sklearn.linear_model import LinearRegression

from sklearn.metrics import r2_score

What it does: imports the libraries used.

pandas for table manipulation (DataFrame).

matplotlib.pyplot and seaborn for plotting (seaborn facilitates aesthetics and the use of hue).

LinearRegression (scikit-learn) to adjust the linear model.

r2_score to calculate R² (coefficient of determination).

Note: ensure that the versions of seaborn/sklearn installed are compatible with your code.

combined_data = pd.merge(dmo, dca, on=['Sample', 'Days'])

What it does: merges two DataFrames (dmo and dca) using the Sample and Days columns as keys. The result (combined_data) will have rows where the keys exist in both — it is an inner join by default.

Important points:

The Sample and Days columns must exist in both DataFrames and be in the same format (same type, no extraneous spaces).

If there are columns with the same name in both DataFrames in addition to the keys, the merge will automatically add the suffixes _x and _y.

Check for duplicates: if there are multiple rows with the same key in one of the DataFrames, the merge may generate a Cartesian product (more rows than expected).

plt.figure(figsize=(10, 6))

What it does: creates the matplotlib figure with size in inches (width 10 × height 6). Affects the space available for the graph.

Note: if you are going to use seaborn with subplots or multiple graphs, it is sometimes better to create fig, ax = plt.subplots(...).

sns.lineplot(data=combined_data, x="carbono", y="materia", hue="Sample", linewidth=0.5, marker='o')

What it does: plots the points and connects them with lines, grouping them by Sample (each Sample becomes a colored series thanks to hue). marker=‘o’ draws points.

Note: lineplot connects points in the order they appear in the data. If the carbon values are not sorted for each Sample, the lines may “zigzag.”

To simply visualize the points (recommended when there are few values per sample), use sns.scatterplot(...) or sns.lineplot(..., sort=True) (or sort the data first).

If the main goal is to show the scatter plot for a global regression, scatterplot is usually the clearest choice.

Example of an alternative:

sns.scatterplot(data=combined_data, x="carbono", y="materia", hue="Sample")

plt.xlabel('Carbon content (%)')

plt.ylabel('Organic matter content (%)')

plt.title('Carbon vs Organic matter')

What it does: labels the axes and adds a title. Simple and essential for reading.

Suggestion: add units (as you did) and, if necessary, adjust the font/size for presentations.

regressor = LinearRegression()

X = combined_data[['carbono']]

y = combined_data['materia']

regressor.fit(X, y)

What it does: creates the linear model and fits it to the data (X is the independent variable, y is the dependent variable).

Technical details

X = combined_data[[‘carbon’]] creates a 2D DataFrame (shape (n_samples, 1)), which is the format expected by scikit-learn. If you use X = combined_data[‘carbon’] (1D series), reshaping will be necessary. Its shape is correct.

Check for NaNs: if there are missing values in carbon or matter, fit() will give an error. Do combined_data.dropna(subset=[‘carbon’,'matter']) or handle the NaNs first.

By default, LinearRegression() adjusts the intercept (fit_intercept=True). If you want to force a line passing through the origin, use fit_intercept=False.

y_pred = regressor.predict(X)

What it does: calculates the values predicted by the model (the fitted line) for each value of X in the data set.

plt.plot(X, y_pred, color='black', linestyle='--', linewidth=1.5, label='Linear regression')

What it does: plots the regression line on the graph.

r_squared = r2_score(y, y_pred)

What it does: calculates the coefficient of determination R², which indicates the fraction of variance in y explained by the linear model.

Quick interpretation

R² close to 1 → good fit (much of the variability explained).

R² close to 0 → the linear model does not explain the variability.

Caution: R² does not prove causality and can be misleading if there are outliers; always check the scatter plot and residuals.

formula = f'Linear regression y = {regressor.coef_[0]:.2f} * x {regressor.intercept_:.2f}\nR² = {r_squared:.2f}'

plt.annotate(formula, xy=(0.6, 0.05), xycoords='axes fraction', fontsize=12)

What it does: assembles a string with the coefficients and R² and places it on the graph.

plt.legend(title="Sample")

plt.rc('font', size=12)

plt.tick_params(axis='y', rotation=0)

plt.grid(True)

What it does:

plt.legend(title=“Sample”): displays legend (items come from seaborn hue and regression line label).

plt.rc(‘font’, size=12): changes global font settings (can be set at the beginning of the script to affect everything).

plt.tick_params(...): controls the appearance of ticks (here without vertical rotation).

plt.grid(True): shows background grid.

plt.show()

What it does: renders and displays the graph. In Jupyter notebooks, this usually appears automatically, but in standalone scripts, plt.show() is necessary.

Linear regression graph for two parameters obtained with the above code. Source: Guimarães, C.C., Guireli Netto, L. & Barbosa, A.M. Monitoring physicochemical parameters in the bioremediation of soil contaminated by petroleum hydrocarbons using poultry manure. Int. J. Environ. Sci. Technol. 21, 8113–8130 (2024). https://doi.org/10.1007/s13762-024-05567-y — Linear regression graph for two parameters obtained with the above code. Source: Guimarães, C.C., Guireli Netto, L. & Barbosa, A.M. Monitoring physicochemical parameters in the bioremediation of soil contaminated by petroleum hydrocarbons using poultry manure. *Int. J. Environ. Sci. Technol.* 21, 8113–8130 (2024). https://doi.org/10.1007/s13762-024-05567-y

Full code: