I am working on a personal machine-learning (ML) project to predict the weather, built in a Jupyter Notebook. Once it is done, I will start converting it into a Flask app.
I have completed the code in the Jupyter Notebook, and all of it runs fine. However, I am not sure whether everything is done correctly.
Would you please review my code on GitHub and tell me whether I am doing it the right way?
Here is my code:
# %% [markdown]
# # Seattle Weather Category Prediction - Jupyter Notebook
#
# This notebook aims to predict the weather category (e.g., sun, rain, snow)
# based on other meteorological features. It follows a structure similar to
# the provided Kaggle example, using Gaussian Naive Bayes.
#
# ## Steps:
# (The original numbered steps were removed per feedback; the structure is implied by the headers below.)
# - Load and Explore Data
# - Visualize Data
# - Preprocess Data (Label Encoding for target, add lagged features)
# - Train and Evaluate Models
# - Conduct Ablation Study
# - Save the Model and Label Encoder for Flask App
# %% [markdown]
# ## Setup and Load Data
# %%
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib
# %%
# Display plots inline
%matplotlib inline
# Set some display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# %% [markdown]
# ### Helper Functions
# %%
def print_header(message):
"""Prints a formatted header message."""
print(f"\n{message}:")
def load_and_preprocess_data(file_path):
"""Loads the dataset and performs initial date conversion."""
df = pd.read_csv(file_path)
print_header("Dataset loaded successfully")
df['date'] = pd.to_datetime(df['date']) # Convert date column to datetime
return df
def create_visualization_df(dataframe):
"""Creates a copy of the dataframe for visualization and extracts year/month."""
df_vis = dataframe.copy()
df_vis['year'] = df_vis['date'].dt.year
df_vis['month'] = df_vis['date'].dt.month
return df_vis
# --- Feedback Implementation: Helper Defaults and Splitting ---
_DEFAULT_FEATURES_TO_LAG = ("precipitation", "temp_max", "temp_min", "wind", "weather_encoded")
_DEFAULT_TEMP_FEATURES_FOR_DELTA = ("temp_max", "temp_min")
def _create_lag_features(df_input, features_to_lag=_DEFAULT_FEATURES_TO_LAG, lag_period=1):
"""
Adds lagged versions of specified features.
Sorts by 'date' column if present before lagging.
"""
df_out = df_input.copy()
if 'date' in df_out.columns:
df_out = df_out.sort_values(by='date').reset_index(drop=True) # Sort and reset index
else:
print("Warning: 'date' column not present for sorting in _create_lag_features. Assuming pre-sorted data for lag features.")
for feature in features_to_lag:
if feature in df_out.columns:
df_out[f'{feature}_lag{lag_period}'] = df_out[feature].shift(lag_period)
else:
print(f"Warning: Feature '{feature}' not found in DataFrame for lagging.")
return df_out # Returns DataFrame with NaNs from shifting
def _create_delta_features(df_input, temp_features_for_delta=_DEFAULT_TEMP_FEATURES_FOR_DELTA, lag_period=1):
"""
Adds delta (difference) features for specified temperature-like features.
Assumes df_input already contains the necessary current and lagged features.
"""
df_out = df_input.copy()
for temp_feature in temp_features_for_delta:
current_feature_name = temp_feature
lagged_feature_name = f'{temp_feature}_lag{lag_period}'
if current_feature_name in df_out.columns and lagged_feature_name in df_out.columns:
df_out[f'delta_{temp_feature}'] = df_out[current_feature_name] - df_out[lagged_feature_name]
else:
if current_feature_name not in df_out.columns:
print(f"Warning: Current feature {current_feature_name} not found for delta calculation.")
if lagged_feature_name not in df_out.columns:
print(f"Warning: Lagged feature {lagged_feature_name} not found for delta calculation of {current_feature_name}.")
return df_out
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test, target_names, model_name="Model"):
"""Trains a model and prints evaluation metrics."""
print_header(f"--- {model_name} Training and Evaluation ---")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=target_names, zero_division=0)
print(f"Accuracy: {accuracy:.4f}")
print_header("Confusion Matrix")
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(f'Confusion Matrix - {model_name}')
plt.show()
print_header("Classification Report")
print(classification_rep)
return model, accuracy
# %%
# --- Load the dataset ---
file_path = 'seattle-weather.csv'
df = load_and_preprocess_data(file_path)
df.head()
# %% [markdown]
# ## Initial Data Exploration & Visualization
# %%
print_header("Dataset Info")
df.info()
# %%
print_header("Statistical Summary")
print(df.describe())
# %%
print_header("Missing Values Check")
print(df.isnull().sum())
print(f"Any NA values present: {df.isna().sum().any()}")
# %%
print_header("Duplicate Rows Check")
print(f"Number of duplicated rows: {df.duplicated().sum()}")
# %%
print_header("Day with Minimum temp_min")
print(df[df['temp_min'] == df['temp_min'].min()])
# %%
print_header("Day with Maximum temp_max")
print(df[df['temp_max'] == df['temp_max'].max()])
# %%
t_min_overall = pd.concat([df['temp_min'], df['temp_max']]).min()
t_max_overall = pd.concat([df['temp_min'], df['temp_max']]).max()
bins = np.arange(np.floor(t_min_overall), np.ceil(t_max_overall) + 1, 1)
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_max', bins=bins, kde=True, color="steelblue") # Use color, not palette: histplot ignores palette when no hue is mapped
plt.title('Distribution of Maximum Temperature')
plt.xlabel('Max Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max())
plt.xticks(bins[::2])
plt.show()
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_min', bins=bins, kde=True, color="steelblue") # Use color, not palette: histplot ignores palette when no hue is mapped
plt.title('Distribution of Minimum Temperature')
plt.xlabel('Min Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max())
plt.xticks(bins[::2])
plt.show()
# %% [markdown]
# ### FacetGrid Visualizations (Month vs. Weather Variables by Year)
# %%
df_vis = create_visualization_df(df)
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_max', errorbar=None)
g.set_axis_labels('Month', 'Max Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Max Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_min', errorbar=None)
g.set_axis_labels('Month', 'Min Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Min Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'precipitation', errorbar=None)
g.set_axis_labels('Month', 'Precipitation (mm)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Precipitation by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'wind', errorbar=None)
g.set_axis_labels('Month', 'Wind Speed')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Wind Speed by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %% [markdown]
# ### Weather Category Distribution
# %%
print_header("Weather Category Counts")
weather_counts = df['weather'].value_counts()
print(weather_counts)
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='weather', order=weather_counts.index, hue='weather', palette="viridis", legend=False)
plt.title('Distribution of Weather Types')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
# %%
plt.figure(figsize=(10, 8))
plt.pie(weather_counts, labels=weather_counts.index, autopct='%1.1f%%', startangle=140,
colors=sns.color_palette("viridis", len(weather_counts)))
plt.title('Distribution of Weather Types (Pie Chart)')
plt.axis('equal')
plt.show()
# %% [markdown]
# ## Data Preprocessing for Classification
# %%
df_processed = df.copy()
if 'date' in df_processed.columns:
df_processed = df_processed.drop('date', axis=1)
print_header("DataFrame columns before modeling")
print(df_processed.columns.tolist())
df_processed.head()
# %%
le = LabelEncoder()
df_processed['weather_encoded'] = le.fit_transform(df_processed['weather'])
print_header("Label Encoding Mapping for 'weather'")
for i, class_name in enumerate(le.classes_):
print(f"{class_name} -> {i}")
# %%
joblib.dump(le, 'weather_label_encoder.joblib')
print_header("Saved weather_label_encoder.joblib")
df_processed.head()
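# %% [markdown]
# As a quick sanity check, the saved encoder can map encoded values back to the
# original labels with `inverse_transform` (a minimal round-trip demonstration):
# %%
sample_encoded = df_processed['weather_encoded'].head()
print(le.inverse_transform(sample_encoded))  # Should print the original weather labels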
# %% [markdown]
# ### Adding Lagged Time Series Features
# %%
df_for_lagged_processing = df.copy() # Start with the original df that includes 'date'
le_lagged = LabelEncoder() # Separate encoder so this cell is self-contained; it yields the same mapping as le because both are fit on the same 'weather' column
df_for_lagged_processing['weather_encoded'] = le_lagged.fit_transform(df_for_lagged_processing['weather'])
# --- Feedback Implementation: Using split helper functions ---
# Step 1: Create lag features.
# The _create_lag_features function will sort by 'date' internally.
# Default features_to_lag includes 'weather_encoded'.
df_temp_lags = _create_lag_features(
df_input=df_for_lagged_processing,
lag_period=1
)
# Step 2: Create delta features using the DataFrame that now includes lags.
# Default temp_features_for_delta are 'temp_max', 'temp_min'.
df_with_lags_and_deltas_raw = _create_delta_features(
df_input=df_temp_lags,
lag_period=1
)
# Step 3: Drop rows with NaN values introduced by shifting and reset index
df_with_lags = df_with_lags_and_deltas_raw.dropna().reset_index(drop=True)
print_header("DataFrame with Lagged and Delta Features")
print(df_with_lags.head())
df_processed_lagged = df_with_lags.drop(columns=['date', 'weather'], errors='ignore')
print_header("df_processed_lagged head")
print(df_processed_lagged.head())
# %% [markdown]
# ## Feature Selection and Train-Test Split
# %%
original_features = ['temp_min', 'temp_max', 'precipitation', 'wind']
X_original = df_processed[original_features] # df_processed does not have 'date'
y_original = df_processed['weather_encoded']
lagged_features_input_list = ['temp_min', 'temp_max', 'precipitation', 'wind',
'precipitation_lag1', 'temp_max_lag1', 'temp_min_lag1',
'wind_lag1', 'weather_encoded_lag1',
'delta_temp_max', 'delta_temp_min']
# Keep only the features that actually exist in df_processed_lagged.columns
lagged_features_input_list = [col for col in lagged_features_input_list if col in df_processed_lagged.columns]
X_lagged = df_processed_lagged[lagged_features_input_list]
y_lagged = df_processed_lagged['weather_encoded']
feature_names_for_model = X_original.columns.tolist()
joblib.dump(feature_names_for_model, 'classifier_feature_names.joblib')
print(f"Saved classifier_feature_names.joblib with features: {feature_names_for_model}")
# --- Feedback Implementation: Train/Test Split ratio ---
# Changed test_size from 0.2 to 0.25
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(
X_original, y_original, test_size=0.25, random_state=42, stratify=y_original
)
print_header("Original Data Split Shapes")
print(f"Original X_train shape: {X_train_original.shape}, y_train shape: {y_train_original.shape}")
print(f"Original X_test shape: {X_test_original.shape}, y_test shape: {y_test_original.shape}")
# Changed test_size from 0.2 to 0.25
X_train_lagged, X_test_lagged, y_train_lagged, y_test_lagged = train_test_split(
X_lagged, y_lagged, test_size=0.25, random_state=42, stratify=y_lagged
)
print_header("Lagged Data Split Shapes")
print(f"Lagged X_train shape: {X_train_lagged.shape}, y_train shape: {y_train_lagged.shape}")
print(f"Lagged X_test shape: {X_test_lagged.shape}, y_test shape: {y_test_lagged.shape}")
# %% [markdown]
# ## Naïve Model (Climate Prediction)
# %%
df_naive = df.copy()
df_naive['month'] = df_naive['date'].dt.month
monthly_most_frequent_weather = df_naive.groupby('month')['weather'].agg(lambda x: x.mode()[0] if not x.mode().empty else "Unknown")
print_header("Most frequent weather type per month (Naïve Model)")
print(monthly_most_frequent_weather)
df_naive['predicted_weather_naive'] = df_naive['month'].map(monthly_most_frequent_weather)
naive_accuracy = accuracy_score(df_naive['weather'], df_naive['predicted_weather_naive'])
print(f"\nNaïve Model Accuracy (Predicting most frequent weather by month): {naive_accuracy:.4f}")
print("This simple model predicts the 'climate' for each month, rather than specific 'weather'.")
# %% [markdown]
# ## Model Training and Evaluation
# %%
nb_model_original, nb_accuracy_original = train_and_evaluate_model(
GaussianNB(), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Gaussian Naive Bayes (Original Features)"
)
# le_lagged.classes_ is passed for the lagged model's evaluation in case its
# encoding differs; here both encoders were fit on the same 'weather' column, so the class orderings are identical.
nb_model_lagged, nb_accuracy_lagged = train_and_evaluate_model(
GaussianNB(), X_train_lagged, y_train_lagged, X_test_lagged, y_test_lagged,
le_lagged.classes_, "Gaussian Naive Bayes (Lagged Features)"
)
lr_model_original, lr_accuracy_original = train_and_evaluate_model(
LogisticRegression(max_iter=1000, random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Logistic Regression (Original Features)"
)
svm_model_original, svm_accuracy_original = train_and_evaluate_model(
SVC(random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Support Vector Machine (Original Features)"
)
# %% [markdown]
# ## Ablation Study
# %%
features_to_ablate = [
['temp_min', 'temp_max', 'precipitation', 'wind'],
['temp_min', 'temp_max', 'precipitation'],
['temp_min', 'temp_max', 'wind'],
['precipitation', 'wind'],
['temp_max', 'precipitation', 'wind'],
['temp_min', 'precipitation', 'wind'],
['temp_max'],
['wind']
]
ablation_results = {}
print_header("--- Ablation Study (Gaussian Naive Bayes with Original Features) ---")
for i, current_features in enumerate(features_to_ablate):
print(f"\nTraining with features: {current_features}")
X_ablation = df_processed[current_features]
y_ablation = df_processed['weather_encoded']
# --- Feedback Implementation: Train/Test Split ratio ---
# Changed test_size from 0.2 to 0.25
X_train_ab, X_test_ab, y_train_ab, y_test_ab = train_test_split(
X_ablation, y_ablation, test_size=0.25, random_state=42, stratify=y_ablation
)
model = GaussianNB()
model.fit(X_train_ab, y_train_ab)
y_pred_ab = model.predict(X_test_ab)
accuracy_ab = accuracy_score(y_test_ab, y_pred_ab)
ablation_results[tuple(current_features)] = accuracy_ab
print(f"Accuracy: {accuracy_ab:.4f}")
print_header("--- Ablation Study Summary ---")
for features, acc in ablation_results.items():
print(f"Features: {features} -> Accuracy: {acc:.4f}")
# %% [markdown]
# ## Save the Model for Flask App
# %%
joblib.dump(nb_model_original, 'weather_prediction_model.joblib')
print_header("Saved weather_prediction_model.joblib (Gaussian Naive Bayes with original features)")
# %% [markdown]
# ## Note on Further Model Exploration (Feedback Suggestion)
#
# The feedback included a suggestion to explore a Random Forest classifier.
# Random Forests can be powerful and offer insights into feature importance.
#
# Key benefits:
# - Often provide good performance with less hyperparameter tuning.
# - Can handle a mix of numerical and categorical features (though scikit-learn requires encoding).
# - Provide `feature_importances_` attribute, which can be used to understand which features the model found most predictive. This automates some aspects of an ablation study or feature selection.
# - Generally robust to overfitting, especially with more trees.
#
# This would be a good next step to potentially improve predictive performance and gain further insights from the data.
# You could train it similarly to the other models:
#
# ```python
# from sklearn.ensemble import RandomForestClassifier
#
# rf_model, rf_accuracy = train_and_evaluate_model(
#     RandomForestClassifier(random_state=42, n_estimators=100),
#     X_train_original, y_train_original, X_test_original, y_test_original,
#     le.classes_, "Random Forest (Original Features)"
# )
# print(f"Random Forest Feature Importances: {rf_model.feature_importances_}")
# ```
# %%
print_header("Notebook execution complete!")
The code above is based on feedback I received on my previous question about this project. Here is the link:
ML Project on Jupyter Notebook - Weather Prediction
I just want to be sure that I have implemented the feedback accurately before converting the Jupyter Notebook code into a Flask app.