
Avoid iterating over the rows of a DataFrame or array, and avoid copying data.

Process one file at a time: read, then write, the data for each file. There is no need to build a list of all the data first.

Perhaps use np.loadtxt() instead of pd.read_csv(). Use skiprows and max_rows to limit the amount of data read and parsed by np.loadtxt(). Use unpack=True and ndmin=2 so it returns a row instead of a column; np.savetxt() will then append a '\n' after each row.

Something like this (untested):

import glob

import natsort
import numpy as np

# Open in append mode, so each run adds to any existing output.
with open('x_position.txt', 'a') as outfile:
    # natsorted() orders file1, file2, ..., file10 numerically
    # rather than lexically.
    filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))

    for f in filenames:
        # Skip the 9-line header and read at most 20000 rows of
        # column 3; unpack=True with ndmin=2 gives shape (1, n).
        data = np.loadtxt(f, skiprows=9, max_rows=20000,
                          usecols=3, unpack=True, ndmin=2)

        # savetxt() writes one line per row, so each input file
        # becomes one line of output.
        np.savetxt(outfile, data, fmt="%7.3f")

Presuming the data is stored on hard drives: the average rotational latency of a 7200 rpm hard drive is 4.17 ms, so 100k files at 4.17 ms each is about 417 seconds, almost 7 minutes, just to seek to the first sector of all those files. Perhaps using concurrent.futures.ThreadPoolExecutor would let you overlap those accesses and cut down that 7 minutes.
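
A minimal sketch of that idea (also untested, and assuming the same file layout as above). executor.map() yields results in the same order as the input filenames, so the output stays in file order even though the reads overlap:

import concurrent.futures
import glob

import natsort
import numpy as np

def load_column(filename):
    # Same parsing as above: skip the 9-line header and read at
    # most 20000 rows of column 3 as a single row of shape (1, n).
    return np.loadtxt(filename, skiprows=9, max_rows=20000,
                      usecols=3, unpack=True, ndmin=2)

filenames = natsort.natsorted(glob.glob("CoordTestCode/ParticleCoordU*"))

with open('x_position.txt', 'a') as outfile:
    # max_workers is left at the default here; tuning it to the
    # drive may help.
    with concurrent.futures.ThreadPoolExecutor() as executor:
        # map() preserves input order, so rows are written in the
        # natural-sorted file order; only the reads are concurrent.
        for data in executor.map(load_column, filenames):
            np.savetxt(outfile, data, fmt="%7.3f")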
