Revisions to How to make good reproducible pandas examples

Adapted for CommonMark.

edited Jul 24, 2021 at 21:01

Peter Mortensen

## How to create sample datasets

### A kitchen sink example

### Fake stock market data

Active reading [<https://en.wikipedia.org/wiki/NumPy> <https://en.wikipedia.org/wiki/Pandas_%28software%29> <https://en.wikipedia.org/wiki/R_%28programming_language%29>]. Used a more direct cross reference (as user names can change at any time).

edited Jul 24, 2021 at 20:49

Peter Mortensen

This is mainly to expand on AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) NumPy give you a variety of tools for this, so you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing NumPy and Pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.
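As a minimal sketch of why the seed matters (the seed value 42 is an arbitrary choice for illustration, not something from the original answer), re-seeding NumPy's global generator restarts the same pseudo-random sequence, so anyone running the code gets the same numbers:

```python
import numpy as np

np.random.seed(42)           # 42 is arbitrary; any fixed integer works
first = np.random.randn(3)

np.random.seed(42)           # re-seeding restarts the same sequence
second = np.random.randn(3)

# Both draws are identical because they start from the same seed
print(np.array_equal(first, second))  # True
```

Without a seed, each run of the example produces different values, which makes it harder for readers to compare their output with yours.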

import numpy as np
import pandas as pd

np.random.seed(123)  # any fixed integer works; 123 is an arbitrary choice

df = pd.DataFrame({

    # some ways to create random data
    'a': np.random.randn(6),
    'b': np.random.choice([5, 7, np.nan], 6),
    'c': np.random.choice(['panda', 'python', 'shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    'd': np.repeat(range(3), 2),
    'e': np.tile(range(2), 3),

    # a date range and set of random dates
    'f': pd.date_range('1/1/2011', periods=6, freq='D'),
    'g': np.random.choice(pd.date_range('1/1/2011', periods=365,
                          freq='D'), 6, replace=False)
    })

1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For two columns, they can easily duplicate R's expand.grid(), and they are also more flexible because they can produce a subset of all combinations. However, for three or more columns the syntax quickly becomes unwieldy.
2. For a more direct replacement for R's expand.grid(), see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Both allow any number of dimensions.
3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can ensure these dates are unique -- very handy if we want to use this as an index with unique values.
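The expand.grid() replacement mentioned in the notes above can be sketched with itertools.product. This is only an illustration of the idea; the helper name expand_grid and the sample column names and values are assumptions made for this example, not code taken from the original answer:

```python
import itertools
import pandas as pd

def expand_grid(data_dict):
    """Return one row per combination of the values in data_dict (hypothetical helper)."""
    rows = list(itertools.product(*data_dict.values()))
    return pd.DataFrame(rows, columns=list(data_dict.keys()))

# Illustrative inputs: 2 x 3 x 2 = 12 combinations
grid = expand_grid({
    'height': [60, 70],
    'weight': [100, 140, 180],
    'sex': ['Male', 'Female'],
})

print(grid.shape)  # (12, 3)
```

Unlike the np.repeat / np.tile approach, this scales to any number of columns without extra bookkeeping, at the cost of always producing the full set of combinations.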

stocks = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })

>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft
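Note that the single cumsum() in the example above runs across all 100 rows, so each ticker's series continues from where the previous one ended. If independent series per ticker are wanted, one possible variant is sketched below. It is an assumption-laden illustration rather than part of the original answer: the seed, the temporary step column, and the variable names are all made up for this sketch.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, chosen only for this sketch

tickers = ['aapl', 'goog', 'yhoo', 'msft']
n_days = 25
steps = np.random.randn(n_days * len(tickers))  # one random step per row

stocks = pd.DataFrame({
    'ticker': np.repeat(tickers, n_days),
    'date': np.tile(pd.date_range('1/1/2011', periods=n_days, freq='D'),
                    len(tickers)),
    'step': steps,
})

# Accumulate the steps within each ticker so every series starts near 10
stocks['price'] = stocks.groupby('ticker')['step'].cumsum() + 10
stocks = stocks.drop(columns='step')

print(stocks.groupby('ticker').head(2))
```

Each ticker now gets its own random walk starting from 10 plus its first step, which may be closer to what real price data looks like when grouped by ticker.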

replaced http://stackoverflow.com/ with https://stackoverflow.com/

edited May 23, 2017 at 11:47

URL Rewriter Bot

deleted 56 characters in body

edited Jan 14, 2017 at 14:32

JohnE

added 104 characters in body

edited May 25, 2015 at 4:12

JohnE

added 361 characters in body

edited May 24, 2015 at 15:38

JohnE

added 361 characters in body

edited May 24, 2015 at 15:32

JohnE

added 361 characters in body

edited May 24, 2015 at 15:18

JohnE

added 22 characters in body

edited May 24, 2015 at 14:29

JohnE

answered May 24, 2015 at 14:22

JohnE
