I am learning how to create CNN models and found a Kaggle competition that seemed like a good way to practice.
They provided a large JSON-like (BSON) file, around 50GB, that I am trying to process. I am trying to train a convolutional neural network using the Keras module. In the file I am iteratively reading the image data, where each image has the array structure (180, 180, 3). The whole file contains around 7,000,000 images, so the final array structure would look like (7000000, 180, 180, 3). However, I cannot read all of this data into memory, so my plan is to read in only 100,000 images at a time, fit the neural network on them, save the model's weights, delete the array to free up memory, then read the next 100,000 images into a new array and re-fit the previously trained model. I would do this iteratively until I reach the last image, roughly like the sketch below.
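The loop I have in mind looks roughly like this, where build_model() and load_chunk() are stand-ins for my actual model definition and the BSON-reading code further down, not real library functions:

model = build_model()  # hypothetical: returns a compiled Keras model
chunk_size = 100000
for start in range(0, 7000000, chunk_size):
    X, y = load_chunk(start, chunk_size)  # hypothetical reader; X has shape (chunk_size, 180, 180, 3)
    model.fit(X, y, epochs=1, batch_size=32)
    model.save_weights('weights.h5')  # checkpoint after each chunk
    del X, y  # free memory before reading the next chunk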
I initially tried to use np.append() to append each image array iteratively, but this was very slow: I only got through 25,000 images, an array of shape (25000, 180, 180, 3), in 10 hours, and the appends kept getting slower toward the end as the array grew.
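In hindsight I understand the slowdown: np.append copies the entire array on every call, so the cost grows with the array's size. Preallocating one chunk-sized array and filling it in place would avoid that; a minimal sketch, where image_iterator stands in for whatever yields the decoded (180, 180, 3) images:

import numpy as np

chunk = np.empty((100000, 180, 180, 3), dtype=np.uint8)  # allocate once
for k, img in enumerate(image_iterator):
    chunk[k] = img  # write in place; no reallocation or copying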
I then tried a different approach using a pandas DataFrame, appending each (1, 180, 180, 3) image array into its own cell in a single column. With this method I was able to iterate through 100,000 images in around 20 minutes. Most of the code comes from this Kaggle kernel - https://www.kaggle.com/inversion/processing-bson-files - which I modified below:
# Simple data processing
import io
import bson  # pymongo's bson module, which provides decode_file_iter
import numpy as np
import pandas as pd
from skimage.io import imread

data = bson.decode_file_iter(open('train.bson', 'rb'))

prod_to_category = dict()
i = 0
j = 1000

# Loop through dataset
for c, d in enumerate(data):
    product_id = d['_id']
    category_id = d['category_id']  # This won't be in Test data
    prod_to_category[product_id] = category_id
    i += 1
    # Progress counter: report every 1000 records
    if i == j:
        print(i, "records loaded")
        print(picture_1.shape)
        j += 1000
    for e, pic in enumerate(d['imgs']):
        # Decode the image, reshape it to (1, 180, 180, 3), and store it in column 'C'
        if i == 1:
            # First record initializes the data frame
            picture_1 = np.reshape(imread(io.BytesIO(pic['picture'])), (1, 180, 180, 3))
            frames = pd.DataFrame({'A': [product_id], 'B': [category_id], 'C': [picture_1]})
            break  # only keep the first image of each product
        else:
            picture_2 = np.reshape(imread(io.BytesIO(pic['picture'])), (1, 180, 180, 3))
            get2 = pd.DataFrame({'A': [product_id], 'B': [category_id], 'C': [picture_2]})
            frames = frames.append(get2)
            break  # only keep the first image of each product
So the head of the pandas DataFrame 'frames' has three columns: 'A' (the product_id), 'B' (the category_id), and 'C' (one (1, 180, 180, 3) image array per cell). Note, in this example pretend that I stopped the loop exactly at 100,000 records:
How would I convert this entire column 'C', where each cell appears to hold an array of structure (1, 180, 180, 3), into a single NumPy array of structure (100000, 180, 180, 3), so that I can feed it into my neural network? I would prefer not to use a for loop for this.
I have looked online and tried multiple things but could not figure out how to do it. Once I have this working, I should be able to re-train my network with a new array of 100,000 images, and repeat the process until I have fit all seven million images to my model. I am really new to this kind of stuff, so any other help or suggestions would be much appreciated.
np.concatenate(frames.C.values)
Since each cell of column 'C' holds its own (1, 180, 180, 3) array, the column is stored with object dtype, so frames.C.values is a 1-D array of arrays and cannot simply be reshaped. np.concatenate joins the per-cell arrays along the first axis and returns one array of shape (100000, 180, 180, 3), with no explicit Python for loop needed.
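A quick sanity check of the result (assuming, as in the question, the loop stopped at exactly 100,000 records):

import numpy as np

X = np.concatenate(frames.C.values)  # images, shape (100000, 180, 180, 3)
y = frames.B.values                  # matching category_id labels
print(X.shape, y.shape)              # -> (100000, 180, 180, 3) (100000,)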