I propose the following workaround. With this approach, __getitem__ retrieves a specific window of a CSV file rather than the whole file. To that end, find_num_of_windows computes the number of windows that fit in a given CSV file, and __len__ returns the total number of windows across all files. This way, the upper limit of the idx passed to __getitem__ is no longer the number of files but the total number of windows. The create_dataset_dict function maps every possible idx value to the corresponding filename and window index.
Comments:
The code needs optimization, but I chose a simple way for easier understanding.
I don't know exactly how the read_file function works, so I just tried something as an example.
Hope it helps!
    import csv
    import os

    from torch.utils.data import Dataset


    class CustomDataset(Dataset):
        def __init__(self, data_folder, data_list_filename, window_size):
            # data_list_filename is kept from the original signature; it is unused here.
            self.data_folder = data_folder
            # Full paths of all files in the folder, sorted for a stable order.
            self.data_file_list = [
                os.path.join(data_folder, file) for file in sorted(os.listdir(data_folder))
            ]
            self.window_size = window_size
            self.total_windows, self.dataset_dict = self.create_dataset_dict()

        def find_num_of_windows(self, path_to_file):
            # Count the rows of the file, then the number of complete windows.
            with open(path_to_file, newline="") as csv_file:
                csv_reader = csv.reader(csv_file, delimiter=',')
                rows = sum(1 for row in csv_reader)
            return rows // self.window_size

        def create_dataset_dict(self):
            # Map every global idx to a file and a window index within that file.
            total_windows = 0
            idx = 0
            dataset_dict = {}
            for filename in self.data_file_list:
                windows_of_current_file = self.find_num_of_windows(filename)
                total_windows += windows_of_current_file
                for j in range(windows_of_current_file):
                    dataset_dict[idx] = {
                        "filename": filename,
                        "window_index": j
                    }
                    idx += 1
            return total_windows, dataset_dict

        def read_file(self, filename, w_index):
            # Example only: column 0 is assumed to hold the data, column 1 the label.
            start = w_index * self.window_size
            end = start + self.window_size
            data = []
            labels = []
            with open(filename, newline="") as csv_file:
                csv_reader = csv.reader(csv_file, delimiter=',')
                for line_idx, row in enumerate(csv_reader):
                    if line_idx >= end:
                        break  # the window is complete, stop reading
                    if line_idx >= start:
                        data.append(row[0])
                        labels.append(row[1])
            return data, labels

        def __len__(self):
            return self.total_windows

        def __getitem__(self, idx):
            filename = self.dataset_dict[idx]["filename"]
            w_index = self.dataset_dict[idx]["window_index"]
            data, label = self.read_file(filename, w_index)
            return data, label
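Since the comments above mention that the code needs optimization: instead of storing one dictionary entry per window, one option is to keep only the cumulative window counts per file and locate the file with a binary search. The following is a minimal sketch of that idea, not part of the original code. The names build_offsets and locate are made up for illustration, and the per-file window counts are assumed to have been computed already (for example with find_num_of_windows).

```python
from bisect import bisect_right
from itertools import accumulate


def build_offsets(windows_per_file):
    # Cumulative number of windows after each file, e.g. [2, 3, 1] -> [2, 5, 6].
    return list(accumulate(windows_per_file))


def locate(idx, offsets):
    # Return (file position, window index inside that file) for a global idx.
    if not 0 <= idx < offsets[-1]:
        raise IndexError(f"idx {idx} is out of range")
    file_pos = bisect_right(offsets, idx)
    first_idx_of_file = offsets[file_pos - 1] if file_pos > 0 else 0
    return file_pos, idx - first_idx_of_file


offsets = build_offsets([2, 3, 1])
print(locate(0, offsets))  # (0, 0)
print(locate(3, offsets))  # (1, 1)
print(locate(5, offsets))  # (2, 0)
```

The memory cost is one integer per file rather than one dictionary per window, at the price of a logarithmic lookup in __getitem__.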
Example:
Assuming we have 3 CSV files - csv_1, csv_2 and csv_3
From csv_1 we can extract 2 windows, from csv_2 we can extract 3 windows and from csv_3 we can extract 1 window.
Then:
- find_num_of_windows(path_to_csv_1) returns 2
- find_num_of_windows(path_to_csv_2) returns 3
- find_num_of_windows(path_to_csv_3) returns 1
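The window count itself is just the row count divided by the window size, rounded down. Here is a small self-contained sketch, assuming a CSV without a header row; count_windows is a hypothetical helper name, and a temporary file stands in for a real CSV:

```python
import csv
import os
import tempfile


def count_windows(path_to_file, window_size):
    # Count rows lazily, then keep only complete windows.
    with open(path_to_file, newline="") as csv_file:
        rows = sum(1 for _ in csv.reader(csv_file))
    return rows // window_size


# Write a temporary CSV with 7 rows of (value, label).
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    csv.writer(tmp).writerows([[i, i % 2] for i in range(7)])
    tmp_path = tmp.name

num_windows = count_windows(tmp_path, window_size=3)
os.remove(tmp_path)
print(num_windows)  # 2: rows 0-2 and 3-5; the leftover row 6 is dropped
```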
Calling create_dataset_dict() then builds the dataset dictionary, which looks like the one below. Both idx and window_index start at 0, because a DataLoader requests idx values from 0 to len(dataset) - 1:
{
    0: {
        "filename": path_to_csv_1,
        "window_index": 0
    },
    1: {
        "filename": path_to_csv_1,
        "window_index": 1
    },
    2: {
        "filename": path_to_csv_2,
        "window_index": 0
    },
    3: {
        "filename": path_to_csv_2,
        "window_index": 1
    },
    4: {
        "filename": path_to_csv_2,
        "window_index": 2
    },
    5: {
        "filename": path_to_csv_3,
        "window_index": 0
    }
}
If we now call __getitem__ with an idx in [0, 1, 2, 3, 4, 5], we retrieve the corresponding window using the aforementioned dictionary. For example, passing 3 to __getitem__ retrieves the second window of file csv_2 (window_index 1).
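Reading one window, once its file and window index are known, can be sketched as follows. This is an assumption about how read_file might work, not the original implementation: read_window is a hypothetical name, the CSV is assumed to have no header row, column 0 is taken as the data and column 1 as the label, and window_index counts from 0.

```python
import csv
import os
import tempfile
from itertools import islice


def read_window(path_to_file, window_index, window_size):
    # Skip the rows of earlier windows and read exactly one window.
    start = window_index * window_size
    with open(path_to_file, newline="") as csv_file:
        rows = list(islice(csv.reader(csv_file), start, start + window_size))
    data = [row[0] for row in rows]
    labels = [row[1] for row in rows]
    return data, labels


# Temporary CSV with 6 rows: values 0..5 with alternating labels "a" and "b".
with tempfile.NamedTemporaryFile("w", suffix=".csv", newline="", delete=False) as tmp:
    csv.writer(tmp).writerows([[i, "ab"[i % 2]] for i in range(6)])
    tmp_path = tmp.name

data, labels = read_window(tmp_path, window_index=1, window_size=3)
os.remove(tmp_path)
print(data)    # ['3', '4', '5'] (the csv module returns strings)
print(labels)  # ['b', 'a', 'b']
```

The values come back as strings, so they would still need converting (for example with float) before being turned into tensors.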
features.shape[0] or len(features.index). The same should hold for the labels DataFrame. This is, in my opinion, also what the error says.