
I'm working on a small project that uses modern portfolio theory (MPT) for portfolio optimization. The final required output is a recommendation on which assets to buy and how much of each to buy.

To achieve that goal, I use MPT with the Black–Litterman model, which is non-AI and uses deterministic logic. The Black–Litterman model, however, takes an input called 'views'; as the name suggests, these are opinions or predictions about the real asset returns. I generate those 'views' with an LSTM model, which I trained on data of shape (1518, 484) (i.e., 1518 samples and 484 different equities).
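For reference, this is roughly how the downstream step looks, i.e. how the LSTM forecasts could be fed into Black–Litterman as absolute views. It is only a sketch under assumptions: it uses the PyPortfolioOpt library, and pred_next is a hypothetical pd.Series of one-step-ahead return forecasts indexed by ticker; neither appears in the code below.

# Sketch only: assumes PyPortfolioOpt; pred_next is a hypothetical pd.Series of
# next-period return forecasts indexed by ticker, df is the returns DataFrame below.
from pypfopt import risk_models, BlackLittermanModel, EfficientFrontier

cov = risk_models.sample_cov(df, returns_data=True)  # covariance estimated from the returns
views = pred_next.to_dict()                          # one absolute view per equity
# (views and covariance should be expressed over the same horizon in practice,
#  and a market-implied prior would normally be supplied; pi=None keeps the sketch minimal)
bl = BlackLittermanModel(cov, absolute_views=views)
posterior = bl.bl_returns()                          # posterior expected returns
ef = EfficientFrontier(posterior, cov)
weights = ef.max_sharpe()                            # -> which assets to buy and how much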

Here is reproducible code to predict the returns of some assets:

# Source - https://stackoverflow.com/q/79933560
# Posted by abd klaib, modified by community. See post 'Timeline' for change history
# Retrieved 2026-04-29, License - CC BY-SA 4.0

# Necessary libraries.
import yfinance as yf
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
from tensorflow import keras
import matplotlib.pyplot as plt
from sklearn import preprocessing
from numpy.typing import NDArray
from typing import Any
# Getting Stored data.
equities=['PNC','MDLZ','FAST','FCX','ESI','RGLD','TSLA','PCAR','C','IBM','PPC','CVS','AZO','MTCH','PFGC','PR','VNOM','MAA'] # can add other equities; up to 484.
x = yf.download(equities, start="2020-04-09", auto_adjust=True, threads=True, interval="1d", group_by="ticker")
df = x.xs("Close", level=1, axis=1).pct_change().dropna()
df.to_parquet("latest_close_returns.parquet")


def create_sequences(data: pd.DataFrame, window: int) -> tuple[NDArray[Any], NDArray[Any]]:
    """
    Construct supervised (X, y) pairs from unlabeled time series data.

    Args:
        data (pd.DataFrame): Unlabeled time series data.
        window (int): Number of past time steps (days) used as input to the model to
            predict the next time step. For example, a window of 50 means the model
            uses the previous 50 days of data to forecast day 51.

    Returns:
        X: Three-dimensional array, e.g. (1468, 50, 484): 1468 samples, a window of
            50 time steps, and 484 features.
        y: Two-dimensional labels, e.g. (1468, 484): one label vector of 484 features
            for each training sample.
    """
    
    X, y = [], []
    for i in range(len(data) - window): 
        X.append(data.iloc[i:i + window])
        y.append(data.iloc[i + window,:])
    return np.array(X), np.array(y)



# Constructing labels for the Unlabeled time series data.
X, y = create_sequences(df, 50) # play with window

# Splitting data before preprocessing to prevent data leakage.
split = int(0.85 * len(X))
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


# Preprocessing here is just scaling, as there is no missing data at all.
# Scaling needs 2D data, not 3D. After scaling, reshape back to 3D, since training needs 3D data.
X_train_2D = X_train.reshape(-1, X_train.shape[2])
X_test_2D = X_test.reshape(-1, X_test.shape[2])
scaler = preprocessing.StandardScaler()
# Fitting the scaler only on the training data.
X_train_scaled = scaler.fit_transform(X_train_2D).reshape(X_train.shape)
y_train_scaled = scaler.transform(y_train)
X_test_scaled = scaler.transform(X_test_2D).reshape(X_test.shape)
y_test_scaled = scaler.transform(y_test)
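
# Optional sanity check: the scaler was fit on 2D data of shape (samples * window, n_features),
# so the scaled arrays should keep their original shapes after the reshape round-trip.
assert X_train_scaled.shape == X_train.shape and y_train_scaled.shape == y_train.shape
assert X_test_scaled.shape == X_test.shape and y_test_scaled.shape == y_test.shape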


# Training the model
callback=keras.callbacks.EarlyStopping(
    monitor="val_loss",
    min_delta=0,
    patience=10,
    verbose=0,
    mode="min",
    baseline=None,
    restore_best_weights=True,
    start_from_epoch=0,
)
model = keras.Sequential()
model.add(keras.layers.LSTM(64, input_shape=(
    X_train_scaled.shape[1], X_train_scaled.shape[2])))
model.add(keras.layers.Dropout(0.3))
model.add(keras.layers.Dense(len(df.columns)))
model.compile(loss="mse", optimizer="adam", metrics=["mae"])
model.summary()
# EarlyStopping monitors "val_loss", so a validation split is needed for it to take effect.
model.fit(X_train_scaled, y_train_scaled, validation_split=0.1, epochs=150, callbacks=[callback])

# Predicting; predictions must be inverse-transformed back to the original scale.
pred_scaled=model.predict(X_test_scaled)
pred = scaler.inverse_transform(pred_scaled)    
y_test=scaler.inverse_transform(y_test_scaled) 

# Performance metrics 
mse = mean_squared_error(y_test, pred)
mae = mean_absolute_error(y_test, pred)
r2 = r2_score(y_test, pred)
print(f"mse is :{mse}\n")
print(f"mae is {mae}\n")
print(f"r2 is {r2}")

The overall results across all equities are quite bad:

mse is :0.0005406439183876653

mae is 0.01578705713631846

r2 is -0.20972200159793528

I uploaded a visualization of the prediction results for the first feature.

Questions:
1. Why is R² negative?
2. Is my scaling logically correct?
3. Is my LSTM setup valid for multivariate returns?
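
For context on the first question: scikit-learn's r2_score compares the model's squared error to that of a constant baseline that always predicts the mean of y_test (averaged per column for multi-output data), so a negative value means the model did worse than that baseline on the test set. A standalone toy sketch, separate from the pipeline above, that makes the comparison explicit:

# Minimal illustration of how r2_score can go negative (toy data, not the real returns).
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([0.01, -0.02, 0.015, -0.005])
y_pred = np.array([0.03, 0.02, -0.03, 0.02])      # worse than just predicting the mean

ss_res = np.sum((y_true - y_pred) ** 2)           # model's squared error
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # mean-baseline squared error
print(1 - ss_res / ss_tot)                        # negative
print(r2_score(y_true, y_pred))                   # same value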
