Revisions to How to make good reproducible pandas examples

link np.random.seed

edited Apr 18 at 0:59

wjandrea

Do include a small example DataFrame, either as runnable code:
```
In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
```
or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+').
```
In [2]: df
Out[2]:
   A  B
0  1  2
1  1  3
2  4  6
```
Test it yourself to make sure it works and reproduces the issue.
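One way to test the copy-and-paste form without a clipboard is to parse the same text from a string. This is only a sketch: the `text` variable and the `io.StringIO` wrapper are assumptions for illustration, and `engine='python'` is passed explicitly because a regex separator needs the Python parser.

```python
import io

import pandas as pd

# The printed frame, exactly as it would be pasted from a question.
text = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# Two or more whitespace characters separate the fields. The header has one
# field fewer than the data rows, so the first field becomes the index.
df = pd.read_csv(io.StringIO(text), sep=r'\s\s+', engine='python')
print(df)
```

If the parsed frame does not match what you pasted, answerers will likely hit the same problem.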
- You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.
- I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows and 6 columns [citation needed], and I bet I can do it in 5x3. Can you reproduce the error with df = df[relevant_columns].head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
  
  But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate:
```
df = pd.DataFrame(np.random.randn(100000000, 10))
```
  Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.
- For getting runnable code, df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the columns and values from df.to_dict('split').
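The last two points can be sketched as follows. The seed value `0` and the reduced shape `(1000, 10)` are arbitrary assumptions so the example runs quickly; a real performance question would use the size that shows the problem.

```python
import numpy as np
import pandas as pd

# Seeding the legacy global generator makes the "random" frame repeatable.
np.random.seed(0)
big = pd.DataFrame(np.random.randn(1000, 10))

np.random.seed(0)
again = pd.DataFrame(np.random.randn(1000, 10))
print(big.equals(again))  # the two frames hold identical values

# to_dict('split') gives the index, columns and data as plain Python objects,
# which can be pasted into a question and turned back into a frame.
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
d = df.to_dict('split')
print(d)

rebuilt = pd.DataFrame(d['data'], index=d['index'], columns=d['columns'])
print(rebuilt)
```

The dictionary printed by `to_dict('split')` is short enough to paste directly into a question for a frame this small.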
Write out the outcome you desire (similarly to above):
```
In [3]: iwantthis
Out[3]:
   A  B
0  1  5
1  4  6
```
Explain where the numbers come from:

The 5 is the sum of the B column for the rows where A is 1.
Do show the code you've tried:
```
In [4]: df.groupby('A').sum()
Out[4]:
   B
A
1  5
4  6
```
But say what's incorrect:

The A column is in the index rather than a column.
Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

The docstring for sum simply states "Compute sum of group values"

The groupby documentation doesn't give any examples for this.

Aside: the answer here is to use df.groupby('A', as_index=False).sum().
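For completeness, here is the attempt and the fix from the aside side by side, as a small runnable sketch. The variable names `tried` and `iwantthis` are just labels for this example.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# The attempt: the group key A ends up in the index.
tried = df.groupby('A').sum()
print(tried)

# The fix: as_index=False keeps A as an ordinary column.
iwantthis = df.groupby('A', as_index=False).sum()
print(iwantthis)
```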
If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.
```
df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
```
Sometimes this is the issue itself: they were strings.
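A quick check along these lines shows whether the column really holds datetimes. The frame and its `date` and `value` columns are made up for illustration.

```python
import pandas as pd

# Hypothetical frame whose dates were read in as plain strings.
df = pd.DataFrame({'date': ['2023-01-01', '2023-01-02'], 'value': [1, 2]})

was_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])
print(was_datetime)  # the strings are not a datetime dtype yet

df['date'] = pd.to_datetime(df['date'])
is_datetime = pd.api.types.is_datetime64_any_dtype(df['date'])
print(is_datetime)
print(df.dtypes)
```

Including the `df.dtypes` output in the question lets answerers see at a glance which columns are still strings.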

+link on "not strictly on topic for the site"

edited Feb 16 at 17:34

wjandrea

Do include a small example DataFrame, either as runnable code:
```
In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
```
or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+').
```
In [2]: df
Out[2]:
   A  B
0  1  2
1  1  3
2  4  6
```
Test it yourself to make sure it works and reproduces the issue.
- You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.
- I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows and 6 columns [citation needed], and I bet I can do it in 5x3. Can you reproduce the error with df = df[relevant_columns].head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
  
  But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate:
```
df = pd.DataFrame(np.random.randn(100000000, 10))
```
  Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.
- For getting runnable code, df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the columns and values from df.to_dict('split').
Write out the outcome you desire (similarly to above):
```
In [3]: iwantthis
Out[3]:
   A  B
0  1  5
1  4  6
```
Explain where the numbers come from:

The 5 is the sum of the B column for the rows where A is 1.
Do show the code you've tried:
```
In [4]: df.groupby('A').sum()
Out[4]:
   B
A
1  5
4  6
```
But say what's incorrect:

The A column is in the index rather than a column.
Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

The docstring for sum simply states "Compute sum of group values"

The groupby documentation doesn't give any examples for this.

Aside: the answer here is to use df.groupby('A', as_index=False).sum().
If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.
```
df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
```
Sometimes this is the issue itself: they were strings.

no sense getting `.head()` of irrelevant columns

edited Feb 16 at 17:26

wjandrea

Do include a small example DataFrame, either as runnable code:
```
In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
```
or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+').
```
In [2]: df
Out[2]:
   A  B
0  1  2
1  1  3
2  4  6
```
Test it yourself to make sure it works and reproduces the issue.
- You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.
- I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows and 6 columns [citation needed], and I bet I can do it in 5x3. Can you reproduce the error with df = df[relevant_columns].head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.
  
  But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate:
```
df = pd.DataFrame(np.random.randn(100000000, 10))
```
  Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.
- For getting runnable code, df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the columns and values from df.to_dict('split').
Write out the outcome you desire (similarly to above):
```
In [3]: iwantthis
Out[3]:
   A  B
0  1  5
1  4  6
```
Explain where the numbers come from:

The 5 is the sum of the B column for the rows where A is 1.
Do show the code you've tried:
```
In [4]: df.groupby('A').sum()
Out[4]:
   B
A
1  5
4  6
```
But say what's incorrect:

The A column is in the index rather than a column.
Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

The docstring for sum simply states "Compute sum of group values"

The groupby documentation doesn't give any examples for this.

Aside: the answer here is to use df.groupby('A', as_index=False).sum().
If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.
```
df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
```
Sometimes this is the issue itself: they were strings.

Add points about number of columns and length of scalars. Generalize "relevant DataFrame" → "relevant data". Minor clarification about "split".

edited Sep 1, 2025 at 19:48

wjandrea

Avoid "SyntaxWarning: invalid escape sequence '\s' ".

edited Dec 29, 2023 at 0:22

wjandrea

Clarify "Test it yourself."

edited Dec 11, 2023 at 21:32

wjandrea

Mention Pandas 1.0 changes too. Clarify version numbers.

edited Nov 23, 2023 at 16:39

wjandrea

Mention what `%prun` does

edited Oct 22, 2023 at 17:52

wjandrea

Add `pd.show_versions()` as an alternative to `session_info`.

edited Sep 13, 2023 at 14:54

wjandrea

Add session_info, following from revision 13.

edited Sep 9, 2023 at 20:09

wjandrea

Move version point from "ugly" to "bad"

edited Sep 8, 2023 at 16:20

wjandrea

Move code formatting help to its own bullet and link the guide. Cover `to_dict`. Add link about "entire stack trace". Add point about version, following from revision 13. Other minor changes. Remove unnecessary CSV link.

edited Sep 7, 2023 at 17:47

wjandrea

Rollback to Revision 12

edited Sep 7, 2023 at 16:44

wjandrea

provide guideline to get session information

edit approved Aug 30, 2023 at 10:40

Brian Tran

Link to specific magics. Improve formatting: avoid footnotes and tons of italics; use consistent quote formatting. Other minor improvements like grammar.

edited Jan 27, 2023 at 18:03

wjandrea

Clarify hatnote and add link to MRE.

edited Jan 27, 2023 at 17:14

wjandrea

Improve grammar and formatting (including reducing overused italics). Update IPython docs link. Remove noise.

edited Sep 28, 2022 at 19:44

wjandrea

Simplify formatting and grammar in notes for readability.

edited Apr 3, 2022 at 17:29

wjandrea

Fix formatting for CommonMark

edited Mar 23, 2022 at 17:32

wjandrea

small grammar fixes and minor correction

edited Sep 13, 2021 at 20:07

ddejohn

Active reading [<https://en.wikipedia.org/wiki/Pandas_%28software%29> <https://en.wikipedia.org/wiki/Comma-separated_values> <https://en.wikipedia.org/wiki/Sentence_clause_structure#Run-on_sentences> <http://stackoverflow.com/legal/trademark-guidance> (the last section)]. Expanded.
