Questions tagged [dataset]
A dataset is a collection of data, often in tabular or matrix form. This tag is NOT intended for data requests ("where can I find a dataset about ..."); for those, see OpenData.
1,514 questions
203
votes
35
answers
34k
views
Publicly Available Datasets
One of the common problems in data science is gathering data from various sources in a somewhat cleaned (semi-structured) format and combining metrics from various sources for making a higher level ...
63
votes
5
answers
39k
views
Is it always better to use the whole dataset to train the final model?
A common technique after training, validating and testing the Machine Learning model of preference is to use the complete dataset, including the testing subset, to train a final model to deploy it on, ...
59
votes
6
answers
17k
views
Should I go for a 'balanced' dataset or a 'representative' dataset?
My 'machine learning' task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) Internet traffic is benign. Thus I felt that I should ...
36
votes
10
answers
19k
views
Why is it wrong to train and test a model on the same dataset?
What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?
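A minimal sketch of the "learning by heart" concern raised in this question, assuming a hypothetical toy dataset (`make_data`) and a deliberately naive model (`MemorizingModel`) that stores every training pair. Neither comes from the question itself; they only illustrate why a score measured on the training data says little about unseen data.

```python
import random

def make_data(n, seed):
    """Hypothetical toy data: label is 1 when x > 0.5, with about 20% of labels flipped as noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        y = int(x > 0.5)
        if rng.random() < 0.2:  # label noise
            y = 1 - y
        data.append((x, y))
    return data

class MemorizingModel:
    """Stores the training pairs verbatim and predicts the label of the nearest stored x."""

    def fit(self, data):
        self.data = list(data)
        return self

    def predict(self, x):
        nearest = min(self.data, key=lambda pair: abs(pair[0] - x))
        return nearest[1]

def accuracy(model, data):
    return sum(model.predict(x) == y for x, y in data) / len(data)

train = make_data(200, seed=0)
test = make_data(200, seed=1)  # fresh samples the model has never seen

model = MemorizingModel().fit(train)
train_acc = accuracy(model, train)
test_acc = accuracy(model, test)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {test_acc:.2f}")
```

Because the model simply looks up each training point, it also reproduces the noisy labels, so its training accuracy looks perfect while the held-out accuracy is noticeably lower. Only the held-out number estimates how the model would behave on new data.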
35
votes
4
answers
16k
views
Quick guide into training highly imbalanced data sets
I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set. So this data set is quite unbalanced. A plain random forest is just trying to mark all test ...
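One common starting point for a split like this is to weight each class inversely to its frequency, so that errors on the rare class cost more during training. The sketch below is an illustration rather than the answer given in the thread: the helper name `balanced_class_weights` is hypothetical, and the formula `n_samples / (n_classes * count)` is the usual "balanced" heuristic.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Class sizes taken from the question: about 1000 positives and 10000 negatives.
labels = [1] * 1000 + [0] * 10000
weights = balanced_class_weights(labels)
print(weights)  # {1: 5.5, 0: 0.55}
```

The resulting dictionary can be passed to any learner that accepts per-class weights. Whether weighting, resampling, or a different decision threshold works best depends on the data and the metric being optimized.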
28
votes
7
answers
28k
views
Publicly available social network datasets/APIs
As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if alongside ...
28
votes
3
answers
43k
views
Data Science Project Ideas [closed]
I don't know if this is the right place to ask this question, but a community dedicated to Data Science should be the most appropriate place in my opinion.
I have just started with Data Science and ...
25
votes
4
answers
17k
views
Is there any data tidying tool for Python/pandas similar to R's tidyr?
I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of ...
23
votes
6
answers
135k
views
Uploading images folder from my system into Google Colab
I want to train a deep learning model on a dataset containing around 3000 images. Since the dataset is huge, I want to use Google Colab since it supports GPUs. How do I upload this full image folder ...
23
votes
2
answers
50k
views
Loading my own training data and labels into a DataLoader using PyTorch?
I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader?
I have a dataset that I created and the ...
22
votes
3
answers
14k
views
How to generate a synthetic dataset using a machine learning model trained on the original dataset?
Generally, a machine learning model is built on datasets. I'd like to know if there is any way to generate a synthetic dataset using such a trained machine learning model while preserving the original dataset ...
22
votes
3
answers
10k
views
Dataset for Named Entity Recognition on Informal Text
I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the ...
18
votes
5
answers
28k
views
Downloading a large dataset on the web directly into AWS S3
Does anyone know if it's possible to import a large dataset into Amazon S3 from a URL?
Basically, I want to avoid downloading a huge file and then reuploading it to S3 through the web portal. I ...
18
votes
3
answers
25k
views
When should we consider a dataset as imbalanced?
I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced.
My question is, are there any rules of thumb that tell us when we should subsample the large ...
18
votes
4
answers
31k
views
One hot encoding alternatives for large categorical values
I have a data frame with a categorical variable that has over 1600 categories. Is there any way I can find alternatives so that I don't have over 1600 columns?
I found this interesting link.
But they are ...
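One alternative that keeps a single column is frequency encoding, where each category is replaced by the share of rows in which it appears. This is a minimal sketch under stated assumptions: the helper `frequency_encode` and the sample `cities` values are hypothetical, and this is not necessarily the approach described in the link the question refers to.

```python
from collections import Counter

def frequency_encode(values):
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    mapping = {category: count / total for category, count in counts.items()}
    encoded = [mapping[value] for value in values]
    return encoded, mapping

# Hypothetical column; a real one would have many more distinct categories.
cities = ["paris", "tokyo", "paris", "lima", "paris", "tokyo"]
encoded, mapping = frequency_encode(cities)

print(mapping)
print(encoded)
```

The column stays one-dimensional no matter how many categories exist. The trade-off is that categories with the same frequency become indistinguishable, and the mapping should be computed on the training split only and then reused for new data.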