I’m trying to efficiently read about 10 million rows (single column) from a database table in Python and I’m not sure if my current approach is reasonable or if I’m missing some optimizations.

Approach 1: cursor + fetchmany
On average, this takes around 1.2 minutes to read 10 million rows.

sql = f"SELECT {col_id} FROM {table_id}"
raw_conn = engine.raw_connection()
try:
    cursor = raw_conn.cursor()
    cursor.execute(sql)
    
    total_rows = 0
    while True:
        rows = cursor.fetchmany(chunk_size)
        if not rows:
            break
        
        # Direct string conversion - fastest approach
        values.extend(str(row[0]) for row in rows)

Approach 2: pandas read_sql with chunks
On average, this takes around 2 minutes to read 10 million rows.

sql = f"SELECT {col_id} FROM {table_id} WHERE {col_id} IS NOT NULL"
values: List[str] = []
for chunk in pd.read_sql(sql, engine, chunksize=CHUNK_SIZE):
    # nulls are already filtered out in SQL, so .astype(str) never has to stringify NaN
    values.extend(chunk.iloc[:, 0].astype(str).tolist())

What is the most efficient way to read this many rows from the table into Python?
Are these timings (~1.2–2 minutes for 10 million rows) reasonable, or can this be significantly improved with a different pattern (e.g., driver settings, batching strategy, multiprocessing, or a different library)?

6 Replies

Any reason you are reading it into Python instead of doing the work directly in the database?

I think this comes down to sharding the data into chunks and delegating the work across the CPU cores and memory available. As you mentioned, it falls under the multiprocessing and batching-strategy topics.
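For illustration, here is a rough sketch of that idea (the connection string, table, and column names are placeholders, and it assumes the table has an indexed numeric key to partition on and a DB-API driver such as psycopg2):

# Hedged sketch: each worker reads one key range over its own connection.
# DSN, TABLE, COL and the key bounds are hypothetical, not from the question.
from concurrent.futures import ProcessPoolExecutor

import psycopg2  # assumption: PostgreSQL; swap in your own driver

DSN = "dbname=mydb user=me"        # hypothetical connection string
TABLE, COL = "my_table", "my_col"  # hypothetical identifiers

def read_range(bounds):
    lo, hi = bounds
    conn = psycopg2.connect(DSN)   # connections must be opened per process
    try:
        with conn.cursor() as cur:
            cur.execute(
                f"SELECT {COL} FROM {TABLE} WHERE {COL} BETWEEN %s AND %s",
                (lo, hi),
            )
            return [str(r[0]) for r in cur.fetchall()]
    finally:
        conn.close()

def parallel_read(min_id, max_id, workers=4):
    # Split [min_id, max_id] into one contiguous range per worker.
    step = (max_id - min_id) // workers + 1
    ranges = [(lo, min(lo + step - 1, max_id))
              for lo in range(min_id, max_id + 1, step)]
    values = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        for chunk in pool.map(read_range, ranges):
            values.extend(chunk)
    return values

Whether this beats a single fast cursor depends on the database and network; the per-process connections and the final merge add their own overhead.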

Yes, ~1.2 minutes to read ~10M rows into Python is normal, and you’re already close to the practical limit. At that scale, the bottleneck isn’t the database but the combination of network transfer and Python having to allocate millions of objects.

You can shave some time off with driver tweaks (server-side cursor, larger fetch sizes, binary protocol), but you won’t get a 5×–10× speedup unless you change the data handling model entirely (e.g., Arrow/Polars/NumPy to avoid Python object creation).
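As a concrete (hedged) example of that Arrow route, a library like connectorx can pull the result set straight into Arrow memory without creating per-row Python objects. The connection URI, table, and column names below are placeholders, not from the question:

# Hedged sketch: Arrow-based read via connectorx (pip install connectorx).
import connectorx as cx

uri = "postgresql://user:pass@host:5432/mydb"   # hypothetical URI
query = "SELECT my_col FROM my_table"           # hypothetical identifiers

# return_type="arrow" keeps the data in Arrow buffers instead of building
# millions of Python objects; "pandas" and "polars" are also supported.
table = cx.read_sql(uri, query, return_type="arrow")

# Only materialize Python strings if downstream code really needs them -
# this step reintroduces the per-object cost.
values = table.column("my_col").to_pylist()

connectorx can also partition the query across several connections (see its partition_on / partition_num options), which is where the bigger wins usually come from.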

“Crore” is an Indian unit, by the way, that people in other countries may not understand.

Following what Mohamed said, you could try using pandas with an Arrow-backed array so it doesn’t have to allocate Python objects (IIRC), and no chunksize so no time is wasted on batching. Whether that will actually speed up the process, I don’t know, but it might be worth a shot.
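A minimal sketch of that suggestion, assuming pandas ≥ 2.0 with pyarrow installed (dtype_backend is a real read_sql parameter there; whether it actually helps depends on your driver):

# Hedged sketch: one read_sql call, pyarrow-backed dtypes, no chunksize.
import pandas as pd

sql = f"SELECT {col_id} FROM {table_id} WHERE {col_id} IS NOT NULL"
df = pd.read_sql(sql, engine, dtype_backend="pyarrow")

# Converting to a plain list of str re-creates Python objects, so only do
# this if downstream code really needs them; otherwise keep the column.
values = df.iloc[:, 0].astype(str).tolist()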

Check out Andy Hayden's answer on "How to make good reproducible pandas examples" for a few tips on performance issues and on making a reproducible example.

for performance issues, [...] definitely use %timeit and possibly %prun to profile your code.

[To provide an example,] you should generate:

df = pd.DataFrame(np.random.randn(100000000, 10))

Consider using np.random.seed so we have the exact same frame.
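For example (my sketch, not part of the quoted answer), timing and profiling a candidate read in IPython might look like this, where fetch_values is a hypothetical wrapper around whichever approach you are benchmarking:

# In an IPython/Jupyter session:
%timeit values = fetch_values(engine, chunk_size=50_000)
%prun -l 10 fetch_values(engine, chunk_size=50_000)  # show the top 10 hotspots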

You could also take a look at the note from Enhancing performance in the Pandas user guide:

... users interested in enhancing performance are highly encouraged to install the recommended dependencies for pandas. These dependencies are often not installed by default, but will offer speed improvements if present.

Most of the recommended dependencies aren't related to what you're doing, but at least a few are.
