add plot of your results
Gabriel Staples
Update the names in the descriptions to match those in the print() statements
Gabriel Staples
  1. The usual iterrows() is convenient, but damn slow:

    import time

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for _, row in df.iterrows():  # each row is a pandas Series
        result += max(row['B'], row['C'])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  2. Using the default named itertuples() is already much faster, but attribute access such as row.B doesn't work with column names such as My Col-Name is very Strange (avoid this method if your columns are repeated or if a column name cannot simply be converted to a Python variable name):

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for row in df.itertuples(index=False):  # each row is a namedtuple
        result += max(row.B, row.C)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  3. Using nameless itertuples() by setting name=None is even faster, but not really convenient, as you have to define a variable per column:

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    # The first element of each plain tuple is the index, hence the leading "_"
    for (_, col1, col2, col3, col4) in df.itertuples(name=None):
        result += max(col2, col3)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  4. Finally, using polyvalent itertuples(), where each row is indexed by column position, is slower than the previous example, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for row in df.itertuples(index=False):
        # get_loc() maps a column name to its integer position in the row tuple
        result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
    
Fix broken code formatting in all bullets; do general cleanup and minor formatting improvements
Gabriel Staples

How to iterate efficiently

If you really have to iterate over a Pandas DataFrame, you will probably want to avoid iterrows(). There are several methods, and the usual iterrows() is far from the best: itertuples() can be about 100 times faster (see the timings at the bottom).

  • As a general rule, use df.itertuples(name=None), in particular when you have a fixed number of columns and fewer than 255 of them. See bullet (3) below.
  • Otherwise, use df.itertuples(), except if your column names contain special characters such as spaces or -. See bullet (2) below.
  • It is possible to use itertuples() even if your dataframe has strange columns, by using the last example below. See bullet (4) below.
  • Only use iterrows() if you cannot use any of the previous solutions. See bullet (1) below.
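
To make the point about strange column names concrete, here is a small sketch. The two-row frame and the name df_strange are made up for illustration. As far as I know, pandas builds the row namedtuple with rename=True, so a column name that is not a valid Python identifier is replaced by a positional field name, which is why attribute access stops being practical:

```python
import pandas as pd

# Hypothetical frame: one ordinary column name and one that is not a valid identifier.
df_strange = pd.DataFrame({'B': [1, 2], 'My Col-Name is very Strange': [3, 4]})

# index=True is the default, so field 0 is 'Index' and the strange column is field 2.
row = next(df_strange.itertuples())
fields = row._fields
print(fields)

# The strange column is still reachable by position, which is what bullet (4) relies on.
pos = df_strange.columns.get_loc('My Col-Name is very Strange')
value = next(df_strange.itertuples(index=False))[pos]
print(value)
```

The renamed field (for example _2) depends on the column's position, so code that uses it breaks as soon as the column order changes.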

Different methods to iterate over rows in a Pandas DataFrame:

First, for use in all examples below, generate a random DataFrame with a million rows and 4 columns, like this:

    import numpy as np
    import pandas as pd
    df = pd.DataFrame(np.random.randint(0, 100, size=(1000000, 4)), columns=list('ABCD'))
    print(df)

The output of all of these examples is shown at the bottom.

  1. The usual iterrows() is convenient, but damn slow:

    import time

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for _, row in df.iterrows():  # each row is a pandas Series
        result += max(row['B'], row['C'])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("1. Iterrows done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  2. The default itertuples() is already much faster, but attribute access such as row.B doesn't work with column names such as My Col-Name is very Strange (avoid this method if your columns are repeated or if a column name cannot simply be converted to a Python variable name):

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for row in df.itertuples(index=False):  # each row is a namedtuple
        result += max(row.B, row.C)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("2. Named Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  3. itertuples() with name=None is even faster, but not really convenient, as you have to define a variable per column:

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    # The first element of each plain tuple is the index, hence the leading "_"
    for (_, col1, col2, col3, col4) in df.itertuples(name=None):
        result += max(col2, col3)

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("3. Itertuples done in {} seconds, result = {}".format(total_elapsed_time, result))
    
  4. Finally, itertuples() with each row indexed by column position (the "polyvalent" variant) is slower than the previous example, but you do not have to define a variable per column and it works with column names such as My Col-Name is very Strange:

    start_time = time.perf_counter()  # time.clock() was removed in Python 3.8
    result = 0
    for row in df.itertuples(index=False):
        # get_loc() maps a column name to its integer position in the row tuple
        result += max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])

    total_elapsed_time = round(time.perf_counter() - start_time, 2)
    print("4. Polyvalent Itertuples working even with special characters in the column name done in {} seconds, result = {}".format(total_elapsed_time, result))
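
A possible refinement of example 4, offered as a sketch rather than something timed here: df.columns.get_loc() is called twice per row above, but column positions do not change during the loop, so they can be looked up once beforehand. The smaller seeded frame df_small and the names b_pos and c_pos are my own choices for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Smaller, seeded frame so the sketch runs quickly (assumption: the answer uses a million rows).
rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.integers(0, 100, size=(10_000, 4)), columns=list('ABCD'))

# Look the column positions up once instead of on every row.
b_pos = df_small.columns.get_loc('B')
c_pos = df_small.columns.get_loc('C')

result = 0
for row in df_small.itertuples(index=False):
    # Positional indexing works whatever the column is called.
    result += max(row[b_pos], row[c_pos])

print("result = {}".format(result))
```

Whether this closes the gap with example 3 depends on your pandas version and data, so it is worth measuring on your own frame.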
    

Output of all code and examples above:

         A   B   C   D
0       41  63  42  23
1       54   9  24  65
2       15  34  10   9
3       39  94  82  97
4        4  88  79  54
...     ..  ..  ..  ..
999995  48  27   4  25
999996  16  51  34  28
999997   1  39  61  14
999998  66  51  27  70
999999  51  53  47  99

[1000000 rows x 4 columns]

1. Iterrows done in 104.96 seconds, result = 66151519
2. Named Itertuples done in 1.26 seconds, result = 66151519
3. Itertuples done in 0.94 seconds, result = 66151519
4. Polyvalent Itertuples working even with special characters in the column name done in 2.94 seconds, result = 66151519
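
If you want to reproduce the comparison yourself, something like the following sketch may help. It wraps the four loops in functions and times them with time.perf_counter(), because time.clock() no longer exists in Python 3.8 and later. The helper timed(), the function names, the seed, and the reduced row count are all assumptions of mine; your timings will differ from the ones above.

```python
import time

import numpy as np
import pandas as pd

# Reduced, seeded frame so that even iterrows() finishes quickly (assumption).
rng = np.random.default_rng(0)
df_small = pd.DataFrame(rng.integers(0, 100, size=(5_000, 4)), columns=list('ABCD'))


def via_iterrows(df):
    return sum(max(row['B'], row['C']) for _, row in df.iterrows())


def via_named_itertuples(df):
    return sum(max(row.B, row.C) for row in df.itertuples(index=False))


def via_nameless_itertuples(df):
    # The first element of each tuple is the index.
    return sum(max(b, c) for (_, a, b, c, d) in df.itertuples(name=None))


def via_polyvalent_itertuples(df):
    return sum(
        max(row[df.columns.get_loc('B')], row[df.columns.get_loc('C')])
        for row in df.itertuples(index=False)
    )


def timed(func, df):
    """Hypothetical helper: return (result, elapsed seconds) for one call."""
    start = time.perf_counter()
    result = func(df)
    return result, time.perf_counter() - start


results = {}
for func in (via_iterrows, via_named_itertuples,
             via_nameless_itertuples, via_polyvalent_itertuples):
    result, elapsed = timed(func, df_small)
    results[func.__name__] = int(result)
    print("{} done in {:.3f} seconds, result = {}".format(func.__name__, elapsed, result))
```

All four functions should report the same result; only the elapsed times should differ.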

See also

  1. This article is a very interesting comparison between iterrows() and itertuples()
added 4 characters in body
Gabriel Staples
A question mark deserves a question formed as such - missing auxiliary (or helping) verb - see e.g. <https://www.youtube.com/watch?v=t4yWEt0OSpg&t=1m49s> (see also <https://www.youtube.com/watch?v=kS5NfSzXfrI> (QUASM)).
Peter Mortensen
added 3 characters in body
Romain Capron
deleted 3 characters in body
Romain Capron
Structure improved
Romain Capron
Structure improved
Romain Capron
Heading added
Romain Capron
Romain Capron