Fitting linear regression and computing metrics in Python

Question

I have two data series: model predictions and observations. I can make line plots of both series. I would like to add a linear regression fit of the two series, and I would also like to annotate the plot with metrics. I need help configuring the fit formula: it is currently not accepted (it raises an error), and the MAE (-18.1903) and R squared (-6.4282) are incorrect.

I have not been able to solve my problem from similar posts. Any assistance would be appreciated.

Date,HumFc,HumOb
20260201, 74.5, 78.2
20260201, 74.5, 78.2
20260202, 71.4, 93.9
20260203, 60.1, 80.2
20260204, 67.9, 91.4
20260205, 71.4, 89.5
20260206, 62.9, 97.7
20260207, 64.5, 89.7
20260208, 76.1, 88.2
20260209, 75.8, 83.7
20260210, 73.8, 90.2
20260211, 65.4, 89.9
20260212, 50.4, 80.7
20260213, 60.8, 75.6
20260214, 65.0, 93.9
20260215, 64.3, 85.3
20260216, 69.1, 86.2
20260217, 74.0, 95.0
20260218, 81.8, 87.7
20260219, 71.6, 89.9
20260220, 65.9, 86.5
20260221, 52.9, 90.9
20260222, 86.2, 87.8
20260223, 75.4, 68.9
20260224, 80.2, 87.6
20260225, 70.4, 90.6
20260226, 70.4, 87.2
20260227, 74.8, 86.1
20260228, 65.2, 84.9
20260301, 65.6, 94.5
20260302, 71.9, 88.1
20260303, 68.4, 92.0

import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from datetime import datetime

df = pd.read_csv('/Hum_h5d_06.csv', skipinitialspace=True)
print(df.columns)
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d')
df['Date'] = pd.to_datetime(df['Date'], format="%b %d")
#quit()

#1. Create and fit the regression model:
model = LinearRegression()

#2. fit regression model
X = np.array(df['HumOb']).reshape((-1, 1))
Y = df['HumFc']
model.fit(X, Y)

#3. calculate metrics
RMSE = np.sqrt(((df['HumFc'] - df['HumOb'])**2).mean())
print(f"RMSE: {RMSE:.4f}")
MAE = mean_absolute_error(df['HumFc'], df['HumOb'])
print(f"MAE: {MAE:.4f}")
RSQ = r2_score(df['HumFc'], df['HumOb'])
print(f"RSQ: {RSQ:.4f}")
quit()

#4. Create plot and get the axes object
fig, ax = plt.subplots()
ax.plot(df['Date'], df['HumOb'])

date_format = mdates.DateFormatter('%b %d')

#5. Apply the formatter to the x-axis
ax.xaxis.set_major_formatter(date_format)

# Optional: automatically format and rotate the date labels for better visibility
fig.autofmt_xdate()

#6. make the plot
plt.plot(df['Date'], df['HumOb'], color='b', label='HumOb')
plt.plot(df['Date'], df['HumFc'], color='r', label='HumFc')
plt.title('5-Day Humidity Forecast and AWS Recorded Humidity')
plt.xlabel('Date_of_Forecast')
plt.ylabel('Humidity')
plt.legend(loc = "lower right", bbox_to_anchor=(1.01, 0.04), fontsize=10)
plt.grid(True)

#7. Annotation text
metrics_text = (
    f"RMSE: {RMSE:.3f}\n"
    f"MAE: {MAE:.3f}\n"
    f"RSQ: {RSQ:.3f}"
)

#8. Add the metrics as an annotation on the plot
plt.annotate(
    metrics_text,
    xy=(0.025, 0.065),  # Position (x, y) relative to plot axes (0,0 is bottom left, 1,1 is top right)
    xycoords='axes fraction',
    fontsize=8,
    bbox=dict(boxstyle="round,pad=0.5", fc="white", alpha=0.5)
)
plt.show()

Always include the full error message (traceback), because it contains other useful information.
– furas, Commented Mar 14 at 9:03
You could put the example data in ``` ``` to make it more readable. Of course, remove the empty lines in the data.
– furas, Commented Mar 14 at 9:03
If you get an error, show it in the question, not in the comments. More people may see it and more people may help.
– furas, Commented Mar 15 at 20:06

JohnM · Accepted Answer · 2026-03-30 00:21:57Z

You can build a two-column design matrix from the independent variable, HumFc, and a column of ones. np.linalg.lstsq() then returns the slope and intercept, and multiplying the design matrix by that solution gives the fitted values.

I prefer NumPy for numeric operations over scikit-learn because it has less overhead and is more transparent about which operations are being performed. Scikit-learn's functions are simpler, though. You may want to use the regression metrics functions in sklearn.metrics if NumPy is giving you trouble.

from io import StringIO
import hvplot.pandas  # noqa: F401
import numpy as np
import pandas as pd

# Data for the original dataframe
raw_data = """
    Date,HumFc,HumOb
    20260201, 74.5, 78.2
    20260201, 74.5, 78.2
    20260202, 71.4, 93.9
    20260203, 60.1, 80.2
    20260204, 67.9, 91.4
    20260205, 71.4, 89.5
    20260206, 62.9, 97.7
    20260207, 64.5, 89.7
    20260208, 76.1, 88.2
    20260209, 75.8, 83.7
    20260210, 73.8, 90.2
    20260211, 65.4, 89.9
    20260212, 50.4, 80.7
    20260213, 60.8, 75.6
    20260214, 65.0, 93.9
    20260215, 64.3, 85.3
    20260216, 69.1, 86.2
    20260217, 74.0, 95.0
    20260218, 81.8, 87.7
    20260219, 71.6, 89.9
    20260220, 65.9, 86.5
    20260221, 52.9, 90.9
    20260222, 86.2, 87.8
    20260223, 75.4, 68.9
    20260224, 80.2, 87.6
    20260225, 70.4, 90.6
    20260226, 70.4, 87.2
    20260227, 74.8, 86.1
    20260228, 65.2, 84.9
    20260301, 65.6, 94.5
    20260302, 71.9, 88.1
    20260303, 68.4, 92.0
"""


def reproduce_dataframe(raw_data: str) -> pd.DataFrame:
    df = pd.read_csv(
        StringIO(raw_data.strip()),
        skipinitialspace=True,
        parse_dates=["Date"],
    )
    df[["HumFc", "HumOb"]] = df[["HumFc", "HumOb"]].astype(float)
    return df


def add_fitted_values(df: pd.DataFrame) -> pd.DataFrame:
    result = df.copy()
    x = result["HumFc"].to_numpy()
    y = result["HumOb"].to_numpy()

    # Design matrix of shape (len(df), 2): x plus a column of ones for the intercept
    scaffold = np.column_stack([x, np.ones(len(x))])

    # Best-fit slope and intercept, found by least squares
    solution = np.linalg.lstsq(scaffold, y, rcond=None)[0]

    result["HumOb_Fitted"] = np.matmul(scaffold, solution)
    return result


# Plot
def plot_humidity(df_with_fit: pd.DataFrame):
    points_ = df_with_fit.hvplot.scatter(
        x="HumFc", y="HumOb", ylim=(0, 110), xlabel="HumFc", ylabel="HumOb"
    )
    lines_ = df_with_fit.sort_values("HumFc").hvplot.line(
        x="HumFc", y="HumOb_Fitted", ylim=(0, 110), line_dash="dashed"
    )
    return points_, lines_


df = reproduce_dataframe(raw_data)
df_with_fit = add_fitted_values(df).set_index("Date").sort_values("HumFc")

points_, lines_ = plot_humidity(df_with_fit)
points_ * lines_
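As a quick cross-check of the least-squares step above, the same slope and intercept should come out of np.polyfit with degree 1. This is a minimal sketch, assuming only the first five rows of the sample data as an illustrative subset. The helper name fit_line is hypothetical and not part of the answer's code.

```python
import numpy as np

# First rows of the sample data (an illustrative subset, not the full series)
hum_fc = np.array([74.5, 71.4, 60.1, 67.9, 71.4])
hum_ob = np.array([78.2, 93.9, 80.2, 91.4, 89.5])


def fit_line(x: np.ndarray, y: np.ndarray) -> tuple[float, float]:
    """Return (slope, intercept) of the least-squares line y = slope*x + intercept."""
    # Design matrix: x values plus a column of ones for the intercept
    design = np.column_stack([x, np.ones(len(x))])
    slope, intercept = np.linalg.lstsq(design, y, rcond=None)[0]
    return float(slope), float(intercept)


slope, intercept = fit_line(hum_fc, hum_ob)

# np.polyfit with deg=1 solves the same problem; coefficients are returned
# highest power first, i.e. [slope, intercept]
poly_slope, poly_intercept = np.polyfit(hum_fc, hum_ob, deg=1)

# Fitted values and residuals of the regression line
fitted = slope * hum_fc + intercept
residuals = hum_ob - fitted

print(f"slope={slope:.4f}, intercept={intercept:.4f}")
print(f"RMSE of fit: {np.sqrt(np.mean(residuals ** 2)):.4f}")
```

The RMSE printed here measures how far the observations are from the fitted line, which is a different quantity from the RMSE between the raw forecast and the observations.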

Rabia Naz · Accepted Answer · 2026-03-31 15:12:57Z

This code has several problems that explain both the wrong metrics and the missing regression line.

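r2_score and mean_absolute_error in scikit-learn expect (y_true, y_pred) as the argument order. Your code passes the forecast first and the observations second, so R² is computed against the wrong reference series. MAE and RMSE are symmetric in their arguments, so the swap does not change their values, but R² is not symmetric. Change the call to r2_score(df['HumOb'], df['HumFc']) and use the same order for MAE and RMSE for consistency. A strongly negative R² can also mean the forecast is biased relative to the observations, so check for that once the order is fixed.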
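To see how argument order affects each metric, the standard definitions can be written out by hand. This is a sketch in plain Python, not scikit-learn's implementation. The function names mae and r_squared are illustrative, and the data is just the first five rows of the sample.

```python
def mae(y_true, y_pred):
    """Mean absolute error: average of |true - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    SS_tot is measured around the mean of y_true, which is why
    swapping the two arguments changes the result.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# First rows of the sample data (illustrative subset)
observed = [78.2, 93.9, 80.2, 91.4, 89.5]
forecast = [74.5, 71.4, 60.1, 67.9, 71.4]

print("MAE (obs, fc):", round(mae(observed, forecast), 4))
print("MAE (fc, obs):", round(mae(forecast, observed), 4))
print("R2  (obs, fc):", round(r_squared(observed, forecast), 4))
print("R2  (fc, obs):", round(r_squared(forecast, observed), 4))
```

On this subset the two MAE values are identical, while the two R² values differ. Both R² values are negative because every forecast in the subset sits well below its observation.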

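The date column gets parsed twice. The first pd.to_datetime call with format='%Y%m%d' already converts everything correctly. The second call passes the resulting datetime column through pd.to_datetime again with format="%b %d", which does not describe the data; it is redundant at best and may raise an error. Remove that second line, since the '%b %d' display format is already applied by mdates.DateFormatter. Separately, quit() stops the script before any plotting code runs.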
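On the date handling, it may help to separate the format used for parsing from the format used for display. The sketch below uses only the standard library's datetime. In the question's code, pd.to_datetime(format=...) plays the parsing role and mdates.DateFormatter plays the display role, both using the same strftime-style codes.

```python
from datetime import datetime

raw = "20260201"  # same layout as the Date column in the sample data

# The format passed when parsing describes how the INPUT text is laid out
parsed = datetime.strptime(raw, "%Y%m%d")

# A display format such as "%b %d" is applied afterwards, when rendering labels
label = parsed.strftime("%b %d")

print(parsed.date(), "->", label)
```

The month abbreviation produced by %b depends on the active locale.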

The regression model also gets fitted, but model.predict(X) is never called, so there are no fitted values to plot. Remove quit() and add the predict call.

Your first two data rows are identical (20260201, 74.5, 78.2 twice). That might be intentional, but duplicates will pull the regression, so it is worth double-checking.

import pandas as pd
import numpy as np 
from sklearn.linear_model import LinearRegression 
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score 
import matplotlib.pyplot as plt 
import matplotlib.dates as mdates 

df = pd.read_csv('/Hum_h5d_06.csv', skipinitialspace=True) 
df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m%d') 

model = LinearRegression() 

X = np.array(df['HumOb']).reshape((-1, 1)) 
Y = df['HumFc'] 

model.fit(X, Y) 
y_reg = model.predict(X) 

# observations first, forecast second 
RMSE = np.sqrt(mean_squared_error(df['HumOb'], df['HumFc'])) 
MAE = mean_absolute_error(df['HumOb'], df['HumFc']) 
RSQ = r2_score(df['HumOb'], df['HumFc']) 

fig, ax = plt.subplots() 

ax.plot(df['Date'], df['HumOb'], color='b', label='HumOb') 
ax.plot(df['Date'], df['HumFc'], color='r', label='HumFc') 
ax.plot(df['Date'], y_reg, color='g', linestyle='--', label='Regression Fit') 

ax.xaxis.set_major_formatter(mdates.DateFormatter('%b %d')) 
fig.autofmt_xdate() 

plt.title('5-Day Humidity Forecast and AWS Recorded Humidity') 
plt.xlabel('Date_of_Forecast') 
plt.ylabel('Humidity') 

plt.legend(loc="lower right", bbox_to_anchor=(1.01, 0.04), fontsize=10) 
plt.grid(True) 

metrics_text = f"RMSE: {RMSE:.3f}\nMAE: {MAE:.3f}\nR²: {RSQ:.3f}" 

plt.annotate(metrics_text, xy=(0.025, 0.065), xycoords='axes fraction', fontsize=8, bbox=dict(boxstyle="round,pad=0.5", fc="white", alpha=0.5)) 

plt.show()
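The question also asks about showing the fit formula, which the code above does not add to the annotation. One possible approach is to format the slope and intercept into a string and append it to metrics_text. The helper format_fit_equation is hypothetical, and the numbers in the example call are made up for illustration. With a fitted scikit-learn LinearRegression, the slope would come from model.coef_[0] and the intercept from model.intercept_.

```python
def format_fit_equation(
    slope: float,
    intercept: float,
    x_name: str = "x",
    y_name: str = "y",
) -> str:
    """Return a readable 'y = a * x + b' string for a fitted line."""
    # Print the intercept's sign explicitly so negative values read naturally
    sign = "+" if intercept >= 0 else "-"
    return f"{y_name} = {slope:.3f} * {x_name} {sign} {abs(intercept):.3f}"


# Illustrative values only; substitute model.coef_[0] and model.intercept_
equation = format_fit_equation(0.25, 47.5, x_name="HumOb", y_name="HumFc")
print(equation)
```

The resulting string can be joined to the existing annotation text with a newline, for example metrics_text + "\n" + equation, before calling plt.annotate.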

See also r2_score, mean_absolute_error, LinearRegression, pd.to_datetime
