Skip to main content
Adapted for CommonMark.
Source Link
Peter Mortensen
  • 31.1k
  • 22
  • 111
  • 134

##How to create sample datasets

How to create sample datasets

###A kitchen sink example

A kitchen sink example

###Fake stock market data

Fake stock market data

##How to create sample datasets

###A kitchen sink example

###Fake stock market data

How to create sample datasets

A kitchen sink example

Fake stock market data

Active reading [<https://en.wikipedia.org/wiki/NumPy> <https://en.wikipedia.org/wiki/Pandas_%28software%29> <https://en.wikipedia.org/wiki/R_%28programming_language%29>]. Used a more direct cross reference (as user names can change at any time).
Source Link
Peter Mortensen
  • 31.1k
  • 22
  • 111
  • 134

This is to mainly to expand on @AndyHayden's answerAndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpyNumPy give you a variety of tools for this such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing numpyNumPy and pandasPandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

df = pd.DataFrame({ 

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to r'sR's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365, 
                          freq='D'), 6, replace=False) 
    })
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r'sR's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6six dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })
>>> stocks.head(5)
 
        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)
 
         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

This is to mainly to expand on @AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) numpy give you a variety of tools for this such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing numpy and pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

df = pd.DataFrame({ 

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to r's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365, 
                          freq='D'), 6, replace=False) 
    })
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
stocks = pd.DataFrame({ 
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })
>>> stocks.head(5)
 
        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)
 
         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft

This is to mainly to expand on AndyHayden's answer by providing examples of how you can create sample dataframes. Pandas and (especially) NumPy give you a variety of tools for this such that you can generally create a reasonable facsimile of any real dataset with just a few lines of code.

After importing NumPy and Pandas, be sure to provide a random seed if you want folks to be able to exactly reproduce your data and results.

df = pd.DataFrame({

    # some ways to create random data
    'a':np.random.randn(6),
    'b':np.random.choice( [5,7,np.nan], 6),
    'c':np.random.choice( ['panda','python','shark'], 6),

    # some ways to create systematic groups for indexing or groupby
    # this is similar to R's expand.grid(), see note 2 below
    'd':np.repeat( range(3), 2 ),
    'e':np.tile(   range(2), 3 ),

    # a date range and set of random dates
    'f':pd.date_range('1/1/2011', periods=6, freq='D'),
    'g':np.random.choice( pd.date_range('1/1/2011', periods=365,
                          freq='D'), 6, replace=False)
    })
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for R's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of six dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
stocks = pd.DataFrame({
    'ticker':np.repeat( ['aapl','goog','yhoo','msft'], 25 ),
    'date':np.tile( pd.date_range('1/1/2011', periods=25, freq='D'), 4 ),
    'price':(np.random.randn(100).cumsum() + 10) })
>>> stocks.head(5)

        date      price ticker
0 2011-01-01   9.497412   aapl
1 2011-01-02  10.261908   aapl
2 2011-01-03   9.438538   aapl
3 2011-01-04   9.515958   aapl
4 2011-01-05   7.554070   aapl

>>> stocks.groupby('ticker').head(2)

         date      price ticker
0  2011-01-01   9.497412   aapl
1  2011-01-02  10.261908   aapl
25 2011-01-01   8.277772   goog
26 2011-01-02   7.714916   goog
50 2011-01-01   5.613023   yhoo
51 2011-01-02   6.397686   yhoo
75 2011-01-01  11.736584   msft
76 2011-01-02  11.944519   msft
replaced http://stackoverflow.com/ with https://stackoverflow.com/
Source Link
URL Rewriter Bot
URL Rewriter Bot
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown herehere. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
  1. np.repeat and np.tile (columns d and e) are very useful for creating groups and indices in a very regular way. For 2 columns, this can be used to easily duplicate r's expand.grid() but is also more flexible in ability to provide a subset of all permutations. However, for 3 or more columns the syntax quickly becomes unwieldy.
  2. For a more direct replacement for r's expand.grid() see the itertools solution in the pandas cookbook or the np.meshgrid solution shown here. Those will allow any number of dimensions.
  3. You can do quite a bit with np.random.choice. For example, in column g, we have a random selection of 6 dates from 2011. Additionally, by setting replace=False we can assure these dates are unique -- very handy if we want to use this as an index with unique values.
deleted 56 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
added 104 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
added 361 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
added 361 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
added 361 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
added 22 characters in body
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading
Source Link
JohnE
  • 30.7k
  • 9
  • 88
  • 116
Loading