Questions tagged [dataset]
Requests for datasets are off-topic on this site. Use this tag for questions concerning creating, processing, or maintaining datasets.
1,934 questions
187 votes, 15 answers, 57k views
Are large data sets inappropriate for hypothesis testing?
In a recent article of Amstat News, the authors (Mark van der Laan and Sherri Rose) stated that "We know that for large enough sample sizes, every study—including ones in which the null hypothesis of ...
103 votes, 25 answers, 43k views
Locating freely available data samples
I've been working on a new method for analyzing and parsing datasets to identify and isolate subgroups of a population without foreknowledge of any subgroup's characteristics. While the method works ...
96 votes, 6 answers, 9k views
Essential data checking tests
In my job role I often work with other people's datasets; non-experts bring me clinical data and I help them summarise it and perform statistical tests.
The problem I am having is that the datasets I ...
95 votes, 2 answers, 193k views
How to normalize data between -1 and 1?
I have seen the min-max normalization formula but that normalizes values between 0 and 1. How would I normalize my data between -1 and 1? I have both negative and positive values in my data matrix.
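For reference, the usual min-max formula generalizes to any target interval [a, b] as x' = a + (b - a)(x - min)/(max - min); with a = -1 and b = 1 it handles data containing both negative and positive values. A minimal sketch in plain Python follows. The function name `scale_to_range` and the sample values are illustrative assumptions, not taken from the question or its answers.

```python
def scale_to_range(values, lo=-1.0, hi=1.0):
    """Linearly rescale values so the minimum maps to lo and the maximum to hi."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant column has no spread, so the rescaling is undefined.
        raise ValueError("all values are equal; cannot rescale")
    span = vmax - vmin
    return [lo + (hi - lo) * (v - vmin) / span for v in values]


# Hypothetical data with both negative and positive entries.
data = [-4.0, 0.0, 2.0, 6.0]
scaled = scale_to_range(data)
print(scaled)
```

Note that this maps the observed minimum and maximum to the endpoints; it does not keep the original zero at zero unless the data happen to be symmetric about it.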
72 votes, 8 answers, 123k views
How to simulate data that satisfy specific constraints such as having specific mean and standard deviation?
This question is motivated by my question on meta-analysis. But I imagine that it would also be useful in teaching contexts where you want to create a dataset that exactly mirrors an existing ...
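One common way to obtain a sample whose mean and standard deviation match target values exactly (not merely in expectation) is to draw any sample, standardize it with its own sample mean and SD, then shift and scale. The sketch below assumes the sample SD with the n - 1 denominator is the one to match; the function name `simulate_exact`, the seed, and the target values are hypothetical, and this is not necessarily the approach the answers propose.

```python
import random
import statistics


def simulate_exact(n, target_mean, target_sd, seed=0):
    """Return n values whose sample mean and sample SD equal the targets."""
    if n < 2:
        raise ValueError("need at least two observations to define a sample SD")
    rng = random.Random(seed)
    raw = [rng.gauss(0.0, 1.0) for _ in range(n)]
    m = statistics.mean(raw)
    s = statistics.stdev(raw)  # sample SD, n - 1 denominator
    # Standardize to mean 0 and SD 1, then rescale to the targets.
    return [target_mean + target_sd * (x - m) / s for x in raw]


sample = simulate_exact(50, target_mean=10.0, target_sd=2.0)
print(statistics.mean(sample), statistics.stdev(sample))
```

The result matches the targets up to floating-point rounding. Only the first two moments are fixed; the shape of the distribution still comes from whatever generator produced the raw draws.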
53 votes, 3 answers, 21k views
Data APIs/feeds available as packages in R
EDIT: The Web Technologies and Services CRAN task view contains a much more comprehensive list of data sources and APIs available in R. You can submit a pull request on github if you wish to add a ...
44 votes, 9 answers, 40k views
Tiny (real) datasets for giving examples in class?
When teaching an introductory level class, the teachers I know tend to invent some numbers and a story in order to exemplify the method they are teaching.
What I would prefer is to tell a real story ...
43 votes, 8 answers, 1k views
How do I get people to take better care of data?
My workplace has employees from a very wide range of disciplines, so we generate data in lots of different forms. Consequently, each team has developed its own system for storing data. Some use ...
41 votes, 2 answers, 7k views
How to draw valid conclusions from "big data"?
"Big data" is everywhere in the media. Everybody says that "big data" is the big thing for 2012, e.g. KDNuggets poll on hot topics for 2012. However, I have deep concerns here. With big data, ...
38 votes, 5 answers, 21k views
Free data set for very high dimensional classification [closed]
What are the freely available data sets for classification with more than 1000 features (or sample points if they contain curves)?
There is already a community wiki about free data sets:
Locating ...
36 votes, 5 answers, 3k views
What if my linear regression data contains several co-mingled linear relationships?
Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, ...
36 votes, 3 answers, 4k views
Datasets constructed for a purpose similar to that of Anscombe's quartet
I've just come across Anscombe's quartet (four datasets that have almost indistinguishable descriptive statistics but look very different when plotted) and I am curious if there are other more or less ...
36 votes, 3 answers, 22k views
Visualizing the intersections of many sets
Is there a visualization model that is good for showing the overlap of many sets?
I am thinking of something like Venn diagrams but that somehow might lend itself better to a larger number ...
36 votes, 2 answers, 2k views
Performing a statistical test after visualizing data - data dredging?
I'll pose this question by means of an example.
Suppose I have a data set, such as the Boston housing price data set, in which I have continuous and categorical variables. Here, we have a "quality"...
31 votes, 5 answers, 89k views
What impact does increasing the training data have on the overall system accuracy?
Can someone summarize, with examples where possible, in what situations increasing the training data improves the overall system?
When do we detect that adding more training data could possibly over-...