
Avoid iterating over the rows of a DataFrame or array, and avoid copying data.

Process one file at a time: read, then write, the data for each file. There is no need to build a list of all the data first.

Perhaps use np.loadtxt() instead of pd.read_csv(). Use skiprows and max_rows to limit the amount of data read and parsed by np.loadtxt(). Use unpack=True and ndmin=2 so it returns a row instead of a column; np.savetxt() will then append a '\n' after each row.

Something like this (untested):

import glob

import natsort
import numpy as np

# Open in append mode, so each run adds to any existing output.
with open('x_position.txt', 'a') as outfile:
    # natsorted() orders file1, file2, ..., file10 numerically
    # rather than lexically.
    filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))

    for f in filenames:
        # Skip the 9-line header and read at most 20000 rows of
        # column 3; unpack=True with ndmin=2 gives shape (1, n).
        data = np.loadtxt(f, skiprows=9, max_rows=20000,
                          usecols=3, unpack=True, ndmin=2)

        # savetxt() writes one line per row, so each input file
        # becomes one line of output.
        np.savetxt(outfile, data, fmt="%7.3f")

Presuming the data is stored on hard drives: the average rotational latency of a 7200 rpm hard drive is 4.17 ms, so 100k files at 4.17 ms each is about 417 seconds, almost 7 minutes, just to seek to the first sector of all those files. Perhaps using concurrent.futures.ThreadPoolExecutor would let you overlap those accesses and cut down that 7 minutes.
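
A minimal sketch of that idea (also untested, and assuming the same file layout as above). executor.map() yields results in the same order as the input filenames, so the output stays in file order even though the reads overlap:

import concurrent.futures
import glob

import natsort
import numpy as np

def load_column(filename):
    # Same parsing as above: skip the 9-line header and read at
    # most 20000 rows of column 3 as a single row of shape (1, n).
    return np.loadtxt(filename, skiprows=9, max_rows=20000,
                      usecols=3, unpack=True, ndmin=2)

filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))

with open('x_position.txt', 'a') as outfile:
    # max_workers is left at the default here; tuning it to the
    # drive may help.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # map() preserves input order, so rows are written in the
        # natural-sorted file order; only the reads are concurrent.
        for data in executor.map(load_column, filenames):
            np.savetxt(outfile, data, fmt="%7.3f")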
