link np.random.seed
wjandrea
  • Do include a small example DataFrame, either as runnable code:

    In [1]: df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
    

    or make it "copy and pasteable" using pd.read_clipboard(sep=r'\s\s+').

    In [2]: df
    Out[2]:
       A  B
    0  1  2
    1  1  3
    2  4  6
    

    Test it yourself to make sure it works and reproduces the issue.

    • You can format the text for Stack Overflow by highlighting and using Ctrl+K (or prepend four spaces to each line), or place three backticks (```) above and below your code with your code unindented.

    • I really do mean small. The vast majority of example DataFrames could be fewer than 6 rows and 6 columns,[citation needed] and I bet I can do it in 5x3. Can you reproduce the error with df = df[relevant_columns].head()? If not, fiddle around to see if you can make up a small DataFrame which exhibits the issue you are facing.

      But every rule has an exception, the obvious one being for performance issues (in which case definitely use %timeit and possibly %prun to profile your code), where you should generate:

      df = pd.DataFrame(np.random.randn(100000000, 10))
      

      Consider using np.random.seed so we have the exact same frame. Having said that, "make this code fast for me" is not strictly on topic for the site.

    • For getting runnable code, df.to_dict is often useful, with the different orient options for different cases. In the example above, I could have grabbed the columns and values from df.to_dict('split').

  • Write out the outcome you desire (similarly to above)

    In [3]: iwantthis
    Out[3]:
       A  B
    0  1  5
    1  4  6
    

    Explain where the numbers come from:

    The 5 is the sum of the B column for the rows where A is 1.

  • Do show the code you've tried:

    In [4]: df.groupby('A').sum()
    Out[4]:
       B
    A
    1  5
    4  6
    

    But say what's incorrect:

    The A column is in the index rather than a column.

  • Do show you've done some research (search the documentation, search Stack Overflow), and give a summary:

    The docstring for sum simply states "Compute sum of group values"

    The groupby documentation doesn't give any examples for this.

    Aside: the answer here is to use df.groupby('A', as_index=False).sum().

  • If it's relevant that you have Timestamp columns, e.g. you're resampling or something, then be explicit and apply pd.to_datetime to them for good measure.

    df['date'] = pd.to_datetime(df['date']) # this column ought to be date.
    

    Sometimes this is the issue itself: they were strings.
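
As a rough sketch of the `df.to_dict('split')` and `sep=r'\s\s+'` suggestions above, the snippet below rebuilds the example frame from its 'split' dict and parses the printed frame back from plain text. Using `io.StringIO` with `pd.read_csv` is my stand-in for the clipboard (an assumption, so it runs on a machine without one), and the names `as_dict`, `rebuilt`, `printed` and `parsed` are made up for illustration.

```python
import io

import pandas as pd

# The small example frame from the post.
df = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# The 'split' orientation gives a dict with 'index', 'columns' and 'data' keys,
# which line up with the DataFrame constructor's keyword arguments.
as_dict = df.to_dict(orient='split')
rebuilt = pd.DataFrame(**as_dict)

# Stand-in for the clipboard (assumption): the printed frame as plain text.
printed = """\
   A  B
0  1  2
1  1  3
2  4  6
"""

# Two or more whitespace characters separate the fields; a regex separator
# needs the Python parsing engine.
parsed = pd.read_csv(io.StringIO(printed), sep=r'\s\s+', engine='python')

print(as_dict)
print(rebuilt)
print(parsed)
```

Pasting the `as_dict` literal into a question gives answerers something they can run directly, while the printed form is easier to read; which one fits better depends on the question.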

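For the performance exception, a seeded frame might be generated along these lines. This is only a sketch: `make_frame` is a hypothetical helper, the seed value 0 is arbitrary, and the row count is kept far below the 100000000 used in the post so that it runs quickly.

```python
import numpy as np
import pandas as pd


def make_frame(n_rows, n_cols=10, seed=0):
    """Hypothetical helper: a random frame that is identical on every call."""
    # Seeding the global NumPy generator makes randn return the same values,
    # so everyone running the snippet gets the exact same frame.
    np.random.seed(seed)
    return pd.DataFrame(np.random.randn(n_rows, n_cols))


# Illustrative size only (assumption); scale n_rows up for real timing runs.
df = make_frame(1_000)
print(df.shape)
```

Including the seed in the question means timings from `%timeit` are at least measured against the same data on every machine.
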
+link on "not strictly on topic for the site"
wjandrea
no sense getting `.head()` of irrelevant columns
wjandrea
Add points about number of columns and length of scalars. Generalize "relevant DataFrame" → "relevant data". Minor clarification about "split".
wjandrea

Avoid "SyntaxWarning: invalid escape sequence '\s' ".
wjandrea

Clarify "Test it yourself."
wjandrea

Mention Pandas 1.0 changes too. Clarify version numbers.
wjandrea

Mention what `%prun` does
wjandrea

Add `pd.show_versions()` as an alternative to `session_info`.
wjandrea

Add session_info, following from revision 13.
wjandrea

Move version point from "ugly" to "bad"
wjandrea

Move code formatting help to its own bullet and link the guide. Cover `to_dict`. Add link about "entire stack trace". Add point about version, following from revision 13. Other minor changes. Remove unnecessary CSV link.
wjandrea

Rollback to Revision 12
wjandrea

Link to specific magics. Improve formatting: avoid footnotes and tons of italics; use consistent quote formatting. Other minor improvements like grammar.
wjandrea

Clarify hatnote and add link to MRE.
wjandrea

Improve grammar and formatting (including reducing overused italics). Update IPython docs link. Remove noise.
wjandrea

Simplify formatting and grammar in notes for readability.
wjandrea

Fix formatting for CommonMark
wjandrea

small grammar fixes and minor correction
ddejohn

Active reading [<https://en.wikipedia.org/wiki/Pandas_%28software%29> <https://en.wikipedia.org/wiki/Comma-separated_values> <https://en.wikipedia.org/wiki/Sentence_clause_structure#Run-on_sentences> <http://stackoverflow.com/legal/trademark-guidance> (the last section)]. Expanded.
Peter Mortensen

added 22 characters in body
MarianD

Updating some information and fixing spelling and capitalization
TylerH

added 3 characters in body
coldspeed95

replaced http://stackoverflow.com/ with https://stackoverflow.com/
URL Rewriter Bot