
I am working on a problem where I have multiple CSV files and I need to read them one by one with a sliding window. Assume one CSV file has 330 data points and the window size is 32; then we get 10 full windows (10 * 32 = 320 points) and the last 10 points are discarded.
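In other words, the per-file arithmetic is:

window_size = 32
num_points = 330                          # rows in one CSV file
num_windows = num_points // window_size   # 10 full windows
points_used = num_windows * window_size   # 320 points; the last 10 are discarded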

I started building a dataset along these lines, but after spending a lot of time on it I am not able to get it working. The current code looks like this:

class CustomDataset(Dataset):
    def __init__(self, data_folder, window_size):
        self.data_folder = data_folder
        self.data_file_list = [file for file in os.listdir(data_folder)]
        print(self.data_file_list)
        self.window_size = window_size

    def __len__(self):
        return len(self.data_file_list[0])

    def __getitem__(self, idx):
        filename = self.data_file_list[idx]
        data, label = read_file(filename)
        return data, label

    def read_file(self, filename):
        data = pd.read_csv(filename)
        data = data.drop(["file_name", "class_name"], axis=1)
        features = data.drop(["class_no"], axis=1)
        labels = data["class_no"]
        x = [features[index:index+self.window_size].values for index in range(0, len(features))]
        y = [labels[index:index+self.window_size].values for index in range(0, len(labels))]

        return x, y

Note: I can’t merge all these CSV files into one.

I am getting this error: TypeError: object of type 'type' has no len()

  • The problem is likely in the last two lines before the return. The length of a DataFrame should be determined with features.shape[0] or len(features.index); the same holds for the labels. In my opinion, that is also what the error says (see the short snippet after this comment).
    – TechnicTom
    Commented Nov 9, 2022 at 15:22
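For reference, a minimal illustration of the length check suggested in the comment above (the toy DataFrame is just a stand-in for the question's features):

import pandas as pd

# A toy DataFrame standing in for `features` from read_file
features = pd.DataFrame({"a": range(330), "b": range(330)})

print(features.shape[0])      # 330 -- number of rows
print(len(features.index))    # 330 -- equivalent row count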

1 Answer


I propose the following workaround. With it, the __getitem__ function retrieves a specific window belonging to a CSV file rather than the file itself. To that end, find_num_of_windows computes how many windows a given CSV file yields, and __len__ returns the sum of the windows across all files. This way, the idx input of __getitem__ is no longer bounded by the number of files; its upper limit is the total number of windows over all files. The create_dataset_dict function maps every possible idx value to the corresponding filename and window index.

Comments:

  • The code could be optimized, but I chose a simple approach for easier understanding.

  • I don't know how exactly the read_file function works, so I just tried something as an example.

Hope it helps!

import os
import csv

from torch.utils.data import Dataset


class CustomDataset(Dataset):
    def __init__(self, data_folder, window_size):
        self.data_folder = data_folder
        self.data_file_list = [
            os.path.join(data_folder, file) for file in os.listdir(data_folder)
        ]
        self.window_size = window_size
        self.total_windows, self.dataset_dict = self.create_dataset_dict()

    def find_num_of_windows(self, path_to_file):
        # Count the rows of the CSV (assuming no header row) and return
        # how many full windows fit in it.
        with open(path_to_file) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            rows = sum(1 for row in csv_reader)
        return rows // self.window_size

    def create_dataset_dict(self):
        # Map every global sample index to a (filename, window index) pair.
        total_windows = 0
        idx = 0
        dataset_dict = {}
        for filename in self.data_file_list:
            windows_of_current_file = self.find_num_of_windows(filename)
            total_windows += windows_of_current_file

            for j in range(windows_of_current_file):
                dataset_dict[idx] = {
                    "filename": filename,
                    "window_index": j
                }
                idx += 1

        return total_windows, dataset_dict

    def read_file(self, filename, w_index):
        # Read only the rows of the requested window (window indices are zero-based).
        start = w_index * self.window_size
        end = start + self.window_size
        data = []
        labels = []
        with open(filename) as csv_file:
            csv_reader = csv.reader(csv_file, delimiter=',')
            for line_idx, row in enumerate(csv_reader):
                if start <= line_idx < end:
                    # Placeholder parsing: adapt the column selection (and any
                    # header handling / type conversion) to your actual files.
                    data.append(row[0])
                    labels.append(row[1])

        return data, labels

    def __len__(self):
        return self.total_windows

    def __getitem__(self, idx):
        filename = self.dataset_dict[idx]["filename"]
        w_index = self.dataset_dict[idx]["window_index"]
        data, label = self.read_file(filename, w_index)
        return data, label

  

Example:

Assuming we have 3 CSV files - csv_1, csv_2 and csv_3

From csv_1 we can extract 2 windows, from csv_2 we can extract 3 windows and from csv_3 we can extract 1 window.

Then:

  1. find_num_of_windows(path_to_csv_1) returns 2
  2. find_num_of_windows(path_to_csv_2) returns 3
  3. find_num_of_windows(path_to_csv_3) returns 1

Calling the create_dataset_dict() function creates the dataset dictionary, which looks like the one below (indices are zero-based, matching how a PyTorch DataLoader indexes a Dataset):

{
  0: {
       "filename": path_to_csv_1,
       "window_index": 0
     },
  1: {
       "filename": path_to_csv_1,
       "window_index": 1
     },
  2: {
       "filename": path_to_csv_2,
       "window_index": 0
     },
  3: {
       "filename": path_to_csv_2,
       "window_index": 1
     },
  4: {
       "filename": path_to_csv_2,
       "window_index": 2
     },
  5: {
       "filename": path_to_csv_3,
       "window_index": 0
     }
}

If we now call __getitem__ with an idx in [0, 1, 2, 3, 4, 5], we can retrieve the corresponding window using the aforementioned dictionary. For example, if we pass 3 to __getitem__, we retrieve the second window of file csv_2.
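For completeness, a small usage sketch of the dataset above (the folder path is a placeholder, and the printed values assume the three-file example):

dataset = CustomDataset(data_folder="path/to/csv_folder", window_size=32)

print(len(dataset))        # 6 -- total number of windows across csv_1, csv_2 and csv_3
data, labels = dataset[3]  # the second window of csv_2 in the mapping above
print(len(data))           # 32 -- one window's worth of rows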

  • Thanks @yannis, I changed the code and now I am getting this error: RuntimeError: stack expects each tensor to be equal size, but got [77516, 32, 17] at entry 0 and [64979, 32, 17] at entry 1. I have multiple files, and each file can be [TotalWindowSize, OneWindowSize, Features], where OneWindowSize and Features are the same for all files but TotalWindowSize differs.
    – Pythonic
    Commented Nov 10, 2022 at 13:52
  • Hi @Pythonic! If you follow the exact logic of my solution, you shouldn't face such an issue. Let's assume you have 3 CSV files: from csv_1 we can extract 2 windows, from csv_2 3 windows, and from csv_3 just 1 window. In total we have 6 windows, so __len__ returns 6. The idx passed to __getitem__ should range over the 6 window ids, not over the 3 CSV ids. If you give a CSV id to __getitem__, you will likely end up with items of different sizes. That's why I use create_dataset_dict, to fetch windows directly (see the short illustration after these comments).
    – itzortzis
    Commented Nov 10, 2022 at 14:33
  • I have just updated my answer by providing an explanation of how the code works
    – itzortzis
    Commented Nov 10, 2022 at 14:57
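To illustrate the point made in these comments, a rough sketch of why per-window items batch cleanly while per-file items do not (the tensor conversion is an assumption on top of the answer's code, which returns plain lists; the sizes 32 and 17 come from the error message):

import torch

# Items of identical size (one window each) can be stacked into a batch,
# which is what DataLoader's default collate_fn does behind the scenes:
windows = [torch.zeros(32, 17) for _ in range(4)]
batch = torch.stack(windows)
print(batch.shape)   # torch.Size([4, 32, 17])

# Whole files contain different numbers of windows (e.g. 77516 vs 64979 above),
# so stacking them fails with exactly the RuntimeError quoted in the first comment:
# torch.stack([torch.zeros(77516, 32, 17), torch.zeros(64979, 32, 17)])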
