I am working on a personal machine-learning (ML) project to predict the weather, built in a Jupyter Notebook. Once it is done, I will start converting it into a Flask app.
I have completed the code in the Jupyter Notebook, and all of it runs fine. However, I am not sure whether everything is done correctly.
Would you please review my code on GitHub and tell me whether I am doing it the right way?
Here is my code:
# %% [markdown]
# # Seattle Weather Category Prediction - Jupyter Notebook
#
# This notebook aims to predict the weather category (e.g., sun, rain, snow)
# based on other meteorological features. It follows a structure similar to
# the provided Kaggle example, using Gaussian Naive Bayes.
#
# ## Steps:
# (The original numbered steps were removed per feedback; the structure is implied by the headers below.)
# - Load and Explore Data
# - Visualize Data
# - Preprocess Data (Label Encoding for target, add lagged features)
# - Train and Evaluate Models
# - Conduct Ablation Study
# - Save the Model and Label Encoder for Flask App
# %% [markdown]
# ## Setup and Load Data
# %%
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.preprocessing import LabelEncoder
import joblib
# %%
# Display plots inline
%matplotlib inline
# Set some display options for Pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)
# %% [markdown]
# ### Helper Functions
# %%
def print_header(message):
"""Prints a formatted header message."""
print(f"\n{message}:")
def load_and_preprocess_data(file_path):
"""Loads the dataset and performs initial date conversion."""
df = pd.read_csv(file_path)
print_header("Dataset loaded successfully")
df['date'] = pd.to_datetime(df['date']) # Convert date column to datetime
return df
def create_visualization_df(dataframe):
"""Creates a copy of the dataframe for visualization and extracts year/month."""
df_vis = dataframe.copy()
df_vis['year'] = df_vis['date'].dt.year
df_vis['month'] = df_vis['date'].dt.month
return df_vis
# --- Feedback Implementation: Helper Defaults and Splitting ---
_DEFAULT_FEATURES_TO_LAG = ("precipitation", "temp_max", "temp_min", "wind", "weather_encoded")
_DEFAULT_TEMP_FEATURES_FOR_DELTA = ("temp_max", "temp_min")
def _create_lag_features(df_input, features_to_lag=_DEFAULT_FEATURES_TO_LAG, lag_period=1):
"""
Adds lagged versions of specified features.
Sorts by 'date' column if present before lagging.
"""
df_out = df_input.copy()
if 'date' in df_out.columns:
df_out = df_out.sort_values(by='date').reset_index(drop=True) # Sort and reset index
else:
print("Warning: 'date' column not present for sorting in _create_lag_features. Assuming pre-sorted data for lag features.")
for feature in features_to_lag:
if feature in df_out.columns:
df_out[f'{feature}_lag{lag_period}'] = df_out[feature].shift(lag_period)
else:
print(f"Warning: Feature '{feature}' not found in DataFrame for lagging.")
return df_out # Returns DataFrame with NaNs from shifting
def _create_delta_features(df_input, temp_features_for_delta=_DEFAULT_TEMP_FEATURES_FOR_DELTA, lag_period=1):
"""
Adds delta (difference) features for specified temperature-like features.
Assumes df_input already contains the necessary current and lagged features.
"""
df_out = df_input.copy()
for temp_feature in temp_features_for_delta:
current_feature_name = temp_feature
lagged_feature_name = f'{temp_feature}_lag{lag_period}'
if current_feature_name in df_out.columns and lagged_feature_name in df_out.columns:
df_out[f'delta_{temp_feature}'] = df_out[current_feature_name] - df_out[lagged_feature_name]
else:
if current_feature_name not in df_out.columns:
print(f"Warning: Current feature {current_feature_name} not found for delta calculation.")
if lagged_feature_name not in df_out.columns:
print(f"Warning: Lagged feature {lagged_feature_name} not found for delta calculation of {current_feature_name}.")
return df_out
def train_and_evaluate_model(model, X_train, y_train, X_test, y_test, target_names, model_name="Model"):
"""Trains a model and prints evaluation metrics."""
print_header(f"--- {model_name} Training and Evaluation ---")
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
classification_rep = classification_report(y_test, y_pred, target_names=target_names, zero_division=0)
print(f"Accuracy: {accuracy:.4f}")
print_header("Confusion Matrix")
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=target_names, yticklabels=target_names)
plt.xlabel('Predicted Label')
plt.ylabel('True Label')
plt.title(f'Confusion Matrix - {model_name}')
plt.show()
print_header("Classification Report")
print(classification_rep)
return model, accuracy
# %%
# --- Load the dataset ---
file_path = 'seattle-weather.csv'
df = load_and_preprocess_data(file_path)
df.head()
# %% [markdown]
# ## Initial Data Exploration & Visualization
# %%
print_header("Dataset Info")
df.info()
# %%
print_header("Statistical Summary")
print(df.describe())
# %%
print_header("Missing Values Check")
print(df.isnull().sum())
print(f"Any NA values present: {df.isna().sum().any()}")
# %%
print_header("Duplicate Rows Check")
print(f"Number of duplicated rows: {df.duplicated().sum()}")
# %%
print_header("Day with Minimum temp_min")
print(df[df['temp_min'] == df['temp_min'].min()])
# %%
print_header("Day with Maximum temp_max")
print(df[df['temp_max'] == df['temp_max'].max()])
# %%
t_min_overall = pd.concat([df['temp_min'], df['temp_max']]).min()
t_max_overall = pd.concat([df['temp_min'], df['temp_max']]).max()
bins = np.arange(np.floor(t_min_overall), np.ceil(t_max_overall) + 1, 1)
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_max', bins=bins, kde=True, color="steelblue") # Use color, not palette: histplot ignores palette when no hue is mapped
plt.title('Distribution of Maximum Temperature')
plt.xlabel('Max Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max())
plt.xticks(bins[::2])
plt.show()
# %%
plt.figure(figsize=(12,6))
sns.histplot(data=df, x='temp_min', bins=bins, kde=True, color="steelblue") # Use color, not palette: histplot ignores palette when no hue is mapped
plt.title('Distribution of Minimum Temperature')
plt.xlabel('Min Temperature (°C)')
plt.ylabel('Frequency')
plt.xlim(bins.min(), bins.max())
plt.xticks(bins[::2])
plt.show()
# %% [markdown]
# ### FacetGrid Visualizations (Month vs. Weather Variables by Year)
# %%
df_vis = create_visualization_df(df)
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_max', errorbar=None)
g.set_axis_labels('Month', 'Max Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Max Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'temp_min', errorbar=None)
g.set_axis_labels('Month', 'Min Temperature (°C)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Min Temperature by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'precipitation', errorbar=None)
g.set_axis_labels('Month', 'Precipitation (mm)')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Precipitation by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %%
g = sns.FacetGrid(df_vis, col='year', col_wrap=4, height=3.5, aspect=1.2)
g.map(sns.lineplot, 'month', 'wind', errorbar=None)
g.set_axis_labels('Month', 'Wind Speed')
g.set_titles(col_template="{col_name}")
g.fig.suptitle('Wind Speed by Month for Each Year', y=1.03)
plt.tight_layout()
plt.show()
# %% [markdown]
# ### Weather Category Distribution
# %%
print_header("Weather Category Counts")
weather_counts = df['weather'].value_counts()
print(weather_counts)
# %%
plt.figure(figsize=(10, 6))
sns.countplot(data=df, x='weather', order=weather_counts.index, hue='weather', palette="viridis", legend=False)
plt.title('Distribution of Weather Types')
plt.xlabel('Weather Type')
plt.ylabel('Frequency')
plt.xticks(rotation=45)
plt.show()
# %%
plt.figure(figsize=(10, 8))
plt.pie(weather_counts, labels=weather_counts.index, autopct='%1.1f%%', startangle=140,
colors=sns.color_palette("viridis", len(weather_counts)))
plt.title('Distribution of Weather Types (Pie Chart)')
plt.axis('equal')
plt.show()
# %% [markdown]
# ## Data Preprocessing for Classification
# %%
df_processed = df.copy()
if 'date' in df_processed.columns:
df_processed = df_processed.drop('date', axis=1)
print_header("DataFrame columns before modeling")
print(df_processed.columns.tolist())
df_processed.head()
# %%
le = LabelEncoder()
df_processed['weather_encoded'] = le.fit_transform(df_processed['weather'])
print_header("Label Encoding Mapping for 'weather'")
for i, class_name in enumerate(le.classes_):
print(f"{class_name} -> {i}")
# %%
joblib.dump(le, 'weather_label_encoder.joblib')
print_header("Saved weather_label_encoder.joblib")
df_processed.head()
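# %% [markdown]
# As a quick sanity check, the saved encoder can map encoded values back to the
# original labels with `inverse_transform` (a minimal round-trip demonstration):
# %%
sample_encoded = df_processed['weather_encoded'].head()
print(le.inverse_transform(sample_encoded))  # Should print the original weather labels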
# %% [markdown]
# ### Adding Lagged Time Series Features
# %%
df_for_lagged_processing = df.copy() # Start with the original df that includes 'date'
le_lagged = LabelEncoder() # Separate encoder so this cell is self-contained; it yields the same mapping as le because both are fit on the same 'weather' column
df_for_lagged_processing['weather_encoded'] = le_lagged.fit_transform(df_for_lagged_processing['weather'])
# --- Feedback Implementation: Using split helper functions ---
# Step 1: Create lag features.
# The _create_lag_features function will sort by 'date' internally.
# Default features_to_lag includes 'weather_encoded'.
df_temp_lags = _create_lag_features(
df_input=df_for_lagged_processing,
lag_period=1
)
# Step 2: Create delta features using the DataFrame that now includes lags.
# Default temp_features_for_delta are 'temp_max', 'temp_min'.
df_with_lags_and_deltas_raw = _create_delta_features(
df_input=df_temp_lags,
lag_period=1
)
# Step 3: Drop rows with NaN values introduced by shifting and reset index
df_with_lags = df_with_lags_and_deltas_raw.dropna().reset_index(drop=True)
print_header("DataFrame with Lagged and Delta Features")
print(df_with_lags.head())
df_processed_lagged = df_with_lags.drop(columns=['date', 'weather'], errors='ignore')
print_header("df_processed_lagged head")
print(df_processed_lagged.head())
# %% [markdown]
# ## Feature Selection and Train-Test Split
# %%
original_features = ['temp_min', 'temp_max', 'precipitation', 'wind']
X_original = df_processed[original_features] # df_processed does not have 'date'
y_original = df_processed['weather_encoded']
lagged_features_input_list = ['temp_min', 'temp_max', 'precipitation', 'wind',
'precipitation_lag1', 'temp_max_lag1', 'temp_min_lag1',
'wind_lag1', 'weather_encoded_lag1',
'delta_temp_max', 'delta_temp_min']
# Keep only the features that actually exist in df_processed_lagged.columns
lagged_features_input_list = [col for col in lagged_features_input_list if col in df_processed_lagged.columns]
X_lagged = df_processed_lagged[lagged_features_input_list]
y_lagged = df_processed_lagged['weather_encoded']
feature_names_for_model = X_original.columns.tolist()
joblib.dump(feature_names_for_model, 'classifier_feature_names.joblib')
print(f"Saved classifier_feature_names.joblib with features: {feature_names_for_model}")
# --- Feedback Implementation: Train/Test Split ratio ---
# Changed test_size from 0.2 to 0.25
X_train_original, X_test_original, y_train_original, y_test_original = train_test_split(
X_original, y_original, test_size=0.25, random_state=42, stratify=y_original
)
print_header("Original Data Split Shapes")
print(f"Original X_train shape: {X_train_original.shape}, y_train shape: {y_train_original.shape}")
print(f"Original X_test shape: {X_test_original.shape}, y_test shape: {y_test_original.shape}")
# Changed test_size from 0.2 to 0.25
X_train_lagged, X_test_lagged, y_train_lagged, y_test_lagged = train_test_split(
X_lagged, y_lagged, test_size=0.25, random_state=42, stratify=y_lagged
)
print_header("Lagged Data Split Shapes")
print(f"Lagged X_train shape: {X_train_lagged.shape}, y_train shape: {y_train_lagged.shape}")
print(f"Lagged X_test shape: {X_test_lagged.shape}, y_test shape: {y_test_lagged.shape}")
# %% [markdown]
# ## Naïve Model (Climate Prediction)
# %%
df_naive = df.copy()
df_naive['month'] = df_naive['date'].dt.month
monthly_most_frequent_weather = df_naive.groupby('month')['weather'].agg(lambda x: x.mode()[0] if not x.mode().empty else "Unknown")
print_header("Most frequent weather type per month (Naïve Model)")
print(monthly_most_frequent_weather)
df_naive['predicted_weather_naive'] = df_naive['month'].map(monthly_most_frequent_weather)
naive_accuracy = accuracy_score(df_naive['weather'], df_naive['predicted_weather_naive'])
print(f"\nNaïve Model Accuracy (Predicting most frequent weather by month): {naive_accuracy:.4f}")
print("This simple model predicts the 'climate' for each month, rather than specific 'weather'.")
# %% [markdown]
# ## Model Training and Evaluation
# %%
nb_model_original, nb_accuracy_original = train_and_evaluate_model(
GaussianNB(), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Gaussian Naive Bayes (Original Features)"
)
# le_lagged.classes_ is passed for the lagged model's evaluation in case its
# encoding differs; here both encoders were fit on the same 'weather' column, so the class orderings are identical.
nb_model_lagged, nb_accuracy_lagged = train_and_evaluate_model(
GaussianNB(), X_train_lagged, y_train_lagged, X_test_lagged, y_test_lagged,
le_lagged.classes_, "Gaussian Naive Bayes (Lagged Features)"
)
lr_model_original, lr_accuracy_original = train_and_evaluate_model(
LogisticRegression(max_iter=1000, random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Logistic Regression (Original Features)"
)
svm_model_original, svm_accuracy_original = train_and_evaluate_model(
SVC(random_state=42), X_train_original, y_train_original, X_test_original, y_test_original,
le.classes_, "Support Vector Machine (Original Features)"
)
# %% [markdown]
# ## Ablation Study
# %%
features_to_ablate = [
['temp_min', 'temp_max', 'precipitation', 'wind'],
['temp_min', 'temp_max', 'precipitation'],
['temp_min', 'temp_max', 'wind'],
['precipitation', 'wind'],
['temp_max', 'precipitation', 'wind'],
['temp_min', 'precipitation', 'wind'],
['temp_max'],
['wind']
]
ablation_results = {}
print_header("--- Ablation Study (Gaussian Naive Bayes with Original Features) ---")
for i, current_features in enumerate(features_to_ablate):
print(f"\nTraining with features: {current_features}")
X_ablation = df_processed[current_features]
y_ablation = df_processed['weather_encoded']
# --- Feedback Implementation: Train/Test Split ratio ---
# Changed test_size from 0.2 to 0.25
X_train_ab, X_test_ab, y_train_ab, y_test_ab = train_test_split(
X_ablation, y_ablation, test_size=0.25, random_state=42, stratify=y_ablation
)
model = GaussianNB()
model.fit(X_train_ab, y_train_ab)
y_pred_ab = model.predict(X_test_ab)
accuracy_ab = accuracy_score(y_test_ab, y_pred_ab)
ablation_results[tuple(current_features)] = accuracy_ab
print(f"Accuracy: {accuracy_ab:.4f}")
print_header("--- Ablation Study Summary ---")
for features, acc in ablation_results.items():
print(f"Features: {features} -> Accuracy: {acc:.4f}")
# %% [markdown]
# ## Save the Model for Flask App
# %%
joblib.dump(nb_model_original, 'weather_prediction_model.joblib')
print_header("Saved weather_prediction_model.joblib (Gaussian Naive Bayes with original features)")
# %% [markdown]
# ## Note on Further Model Exploration (Feedback Suggestion)
#
# The feedback included a suggestion to explore a Random Forest classifier.
# Random Forests can be powerful and offer insights into feature importance.
#
# Key benefits:
# - Often provide good performance with less hyperparameter tuning.
# - Can handle a mix of numerical and categorical features (though scikit-learn requires encoding).
# - Provide `feature_importances_` attribute, which can be used to understand which features the model found most predictive. This automates some aspects of an ablation study or feature selection.
# - Generally robust to overfitting, especially with more trees.
#
# This would be a good next step to potentially improve predictive performance and gain further insights from the data.
# You could train it similarly to the other models:
#
# ```python
# from sklearn.ensemble import RandomForestClassifier
#
# rf_model, rf_accuracy = train_and_evaluate_model(
#     RandomForestClassifier(random_state=42, n_estimators=100),
#     X_train_original, y_train_original, X_test_original, y_test_original,
#     le.classes_, "Random Forest (Original Features)"
# )
# print(f"Random Forest Feature Importances: {rf_model.feature_importances_}")
# ```
# %%
print_header("Notebook execution complete!")
The code above is based on feedback I received on my previous question about this project. Here is the link:
ML Project on Jupyter Notebook - Weather Prediction
I just want to be sure that I have implemented the feedback accurately before converting the Jupyter Notebook code into a Flask app.