
I have a list of files with the .pq extension, whose names are stored in a list. My intention is to read these files, filter them using pandas, and then merge them into a single pandas DataFrame.

Since there are thousands of files, the code currently runs very inefficiently. The biggest bottleneck is reading the .pq files; during the experiments I commented out the filtering part. I've tried three different ways, shown below, but it takes about 1.5 seconds to read each file, which is quite slow. Are there alternative ways to perform these operations?

from tqdm import tqdm
from fastparquet import ParquetFile
import pandas as pd 
import pyarrow.parquet as pq

files = [.....]

#First way
for file in tqdm(files):
    temp = pd.read_parquet(file)
    #filter temp and append 

#Second way
for file in tqdm(files):
    temp = ParquetFile(file).to_pandas()
    # filter temp and append

#Third way

for file in tqdm(files):
    temp = pq.read_table(source=file).to_pandas()
    # filter temp and append

Each read inside the for loop takes quite a long time: for 24 files, I spend about 28 seconds.

 24/24 [00:28<00:00,  1.19s/it]

 24/24 [00:25<00:00,  1.08s/it]

One sample file is on average 90 MB, which corresponds to 667,858 rows and 48 columns. All values are numerical (float64). The number of rows may vary, but the number of columns remains the same.

  • You didn't tell us about the data volume: how many bytes per file? What do df.shape and df.dtypes come back as? Presumably "filter and append" was commented out during the timing run. Is this a local ext4 Linux filesystem, or something else? How many milliseconds does $ dd if=big.pq of=/dev/null bs=1M take to read the bytes from disk?
    – J_H
    Commented Feb 23, 2023 at 20:21
  • @J_H I updated my post accordingly. It's a Windows machine. That's correct. I commented out the filtering and appending operations for now, since reading the files one by one consumes too much time (at least that's what I think).
    – sergey_208
    Commented Feb 23, 2023 at 20:26
  • @sergey_208, "the bottleneck is where I read the pq file" - each iteration of the loop you posted includes filtering logic besides reading a file, so why wouldn't you think that filtering could be the bottleneck? Did you profile it? Commented Feb 23, 2023 at 20:29
  • @RomanPerekhrest As I mentioned, the filtering part is commented out. The time I report is only for reading the files.
    – sergey_208
    Commented Feb 23, 2023 at 20:33
  • @sergey_208, ok, in terms of efficiency and saving space: I can suggest reading all the files at once, but since these are parquet files it's also worth checking whether the filters can be applied at load time. Also, do all the files have the same schema? Commented Feb 23, 2023 at 20:48

2 Answers


Read multiple parquet files (partitions) at once into a pyarrow.parquet.ParquetDataset, which accepts a directory name, a single file name, or a list of file names, and conveniently allows filtering of the scanned data:

import pyarrow.parquet as pq

dataset = pq.ParquetDataset(your_files,
                            use_legacy_dataset=False,
                            filters=[('columnName', 'in', filterList)])
df = dataset.read(use_threads=True).to_pandas()
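
For concreteness, a hypothetical call might look like the sketch below; your_files, the column name col_0, and the filter values are made-up placeholders for whatever you actually filter on:

import pyarrow.parquet as pq

# Hypothetical placeholders -- substitute the real file list, column name and values.
your_files = ["part_0.pq", "part_1.pq", "part_2.pq"]
filterList = [1.0, 2.5, 7.25]

dataset = pq.ParquetDataset(
    your_files,
    use_legacy_dataset=False,               # use the newer Arrow Dataset implementation
    filters=[("col_0", "in", filterList)],  # pushed down to the scan, so non-matching data is skipped
)
df = dataset.read(use_threads=True).to_pandas()  # read all files with multiple threads into one DataFrame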
  • Overall, it took 22 seconds. When I read and filter 48 files, the total time decreased from 55 seconds to 42 seconds
    – sergey_208
    Commented Feb 24, 2023 at 13:49

The original post omits a reprex. I attempted to reproduce the reported symptoms with the enclosed code, but was unable to. The test system is a MacBook Air that mounts apfs SSD storage. The Intel Core i7 is clocked at 2.2 GHz.


#! /usr/bin/env python

# $ python -m cProfile -s tottime geo/ski/bench_parquet.py
from pathlib import Path
from time import time

from tqdm import tqdm
import numpy as np
import pandas as pd
import pyarrow.parquet as pq

K = 24
PQ_DIR = Path("/tmp/parquet.d")


# Generate k parquet files of random float data matching the shape described in the question.
def gen_dfs(k=K, dst_dir=PQ_DIR, shape=(667_858, 48)):
    dst_dir.mkdir(exist_ok=True)
    rng = np.random.default_rng()
    for i in range(k):
        df = pd.DataFrame(
            rng.integers(66_000, size=shape) / 1_000,
            columns=[f"col_{j}" for j in range(shape[1])],
        )
        print(i)
        df.to_parquet(dst_dir / f"{i}.parquet")


# Lazily read each generated parquet file back into a DataFrame.
def read_dfs(src_dir=PQ_DIR):
    for file in src_dir.glob("*.parquet"):
        yield pd.read_parquet(file)


def main():
    gen_dfs()
    t0 = time()

    for df in tqdm(read_dfs()):
        assert len(df) > 0

    files = list(PQ_DIR.glob("*.parquet"))
    dataset = pq.ParquetDataset(files)
    assert dataset
    # df = dataset.read(use_threads=True).to_pandas()

    elapsed = time() - t0
    print(f"{elapsed:.3f} seconds elapsed, {elapsed / K:.3f} per file")


if __name__ == "__main__":
    main()

When reading two dozen files of FP data I observe timings like this:

7.879 seconds elapsed, 0.328 per file

Given that this is about 2 M rows / second, with each row carrying more than a hundred bytes of data that needs to be decompressed, it seems like reasonable throughput to me.
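
Spelling out the arithmetic: 24 files × 667,858 rows ≈ 16.0 M rows read in 7.879 s ≈ 2.0 M rows / second, and 90 MB per file / 667,858 rows ≈ 135 bytes per row on disk, versus 48 columns × 8 bytes = 384 bytes per row of uncompressed float64.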

Enabling the ParquetDataset .read() call shows about the same throughput, with one big caveat. Reading a handful of files works fine. If you want to read the full 2 GiB of compressed data, you'd better have plenty of free RAM to store it, or throughput will plummet by a factor of 10x. Often it's a win to operate on conveniently sized chunks, allocating and freeing as you go.
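
A minimal sketch of that chunked style, assuming the filter-then-merge goal from the question; files is the original list of paths and the col_0 predicate is a placeholder for the real filter:

import pandas as pd

def filtered_frames(files):
    for file in files:
        df = pd.read_parquet(file)
        kept = df[df["col_0"] > 0.5]  # placeholder predicate -- substitute the real filtering logic
        del df                        # free the unfiltered frame before reading the next file
        yield kept

result = pd.concat(filtered_frames(files), ignore_index=True)

This holds at most one unfiltered 90 MB frame in memory at a time, while the merged result grows only by the filtered rows.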


Specifying ... , compression=None on writes does not improve the read timings.

On the positive side, binary parquet format is demonstrably a big win. Reading the same data formatted as a 208 MiB plain text .CSV takes ~ 4 seconds -- an order of magnitude slower. Gzip'ing it yields the same size as parquet, at a cost of ~ 5 seconds to read it.
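
Roughly how that comparison can be reproduced (a sketch; the paths and the choice of a single generated file are assumptions):

import time
import pandas as pd

df = pd.read_parquet("/tmp/parquet.d/0.parquet")
df.to_csv("/tmp/0.csv", index=False)      # plain text, roughly 208 MiB for one file
df.to_csv("/tmp/0.csv.gz", index=False)   # gzip-compressed (inferred from the extension), roughly parquet-sized

for path in ("/tmp/0.csv", "/tmp/0.csv.gz"):
    t0 = time.time()
    pd.read_csv(path)
    print(f"{path}: {time.time() - t0:.1f} s")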


cProfile reveals that we spend ~ 20% of the time on disk I/O and ~ 80% of the time decompressing and marshaling the bits into array format, which sounds about right. I don't notice any terrible inefficiencies in this.

tl;dr: You're doing it correctly already.

  • Thanks for the detailed analysis. It's very useful. So should I make peace with the performance and not spend more time on this?
    – sergey_208
    Commented Feb 24, 2023 at 13:42
