Highest scored 'dataset' questions - Data Science Stack Exchange

203 votes

35 answers

34k views

Publicly Available Datasets

One of the common problems in data science is gathering data from various sources in a somewhat cleaned (semi-structured) format and combining metrics from various sources for making a higher level ...

Community wiki

6 revs, 4 users 50%
Amir Ali Akbari

63 votes

5 answers

39k views

Is it always better to use the whole dataset to train the final model?

A common technique after training, validating and testing the Machine Learning model of preference is to use the complete dataset, including the testing subset, to train a final model to deploy it on, ...

pcko1

4,050

asked Jun 12, 2018 at 9:54

59 votes

6 answers

17k views

Should I go for a 'balanced' dataset or a 'representative' dataset?

My machine learning task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) Internet traffic is benign. Thus I felt that I should ...

pnp

693

asked Jul 22, 2014 at 12:29

36 votes

10 answers

19k views

Why is it wrong to train and test a model on the same dataset?

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?

karalis1

471

asked Dec 13, 2020 at 14:11
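
The "learning by heart" concern in this question can be illustrated with a small sketch. It is not taken from the thread's answers: it assumes a toy 1-nearest-neighbour classifier and labels that are pure noise, so there is nothing real to learn. The names `make_data`, `predict_1nn` and `accuracy` are hypothetical helpers written for this illustration.

```python
import random


def make_data(n, rng):
    """Random 2-D points with coin-flip labels: no real pattern exists."""
    return [((rng.random(), rng.random()), rng.randint(0, 1)) for _ in range(n)]


def predict_1nn(train, point):
    """Return the label of the closest training point (squared Euclidean distance)."""
    nearest = min(
        train,
        key=lambda item: (item[0][0] - point[0]) ** 2 + (item[0][1] - point[1]) ** 2,
    )
    return nearest[1]


def accuracy(train, data):
    """Fraction of points in `data` whose label the 1-NN model reproduces."""
    hits = sum(predict_1nn(train, x) == y for x, y in data)
    return hits / len(data)


rng = random.Random(0)
train = make_data(200, rng)
held_out = make_data(200, rng)

# Scoring on the training set: every point is its own nearest neighbour,
# so the model simply looks up the label it memorised.
train_acc = accuracy(train, train)

# Scoring on unseen points: the labels are noise, so accuracy should
# hover around chance level.
held_out_acc = accuracy(train, held_out)

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on held-out data: {held_out_acc:.2f}")
```

The score on the training data looks perfect even though the model has learned nothing that transfers, which is why an estimate obtained on the same data used for fitting says little about performance on new data.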

35 votes

4 answers

16k views

Quick guide into training highly imbalanced data sets

I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set. So this data set is quite unbalanced. A plain random forest is just trying to mark all test ...

IgorS

5,484

asked Sep 12, 2014 at 15:20
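
One common way to handle an imbalance like the one described here is to weight each class inversely to its frequency. The sketch below is only an illustration under that assumption, not the approach recommended in the thread. It uses the heuristic n_samples / (n_classes * class_count), which is the same formula scikit-learn applies for `class_weight="balanced"`; the function name `balanced_class_weights` is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Class sizes taken from the question: about 1000 positives, 10000 negatives.
labels = [1] * 1000 + [0] * 10000
weights = balanced_class_weights(labels)

for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.2f}")
```

With these counts the positive class ends up with ten times the weight of the negative class, so errors on the rare class cost proportionally more during training.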

28 votes

7 answers

28k views

Publicly available social network datasets/APIs

As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if alongside ...

Rubens

4,117

asked Jun 17, 2014 at 5:29

28 votes

3 answers

43k views

Data Science Project Ideas [closed]

I don't know if this is the right place to ask this question, but a community dedicated to Data Science should be the most appropriate place in my opinion. I have just started with Data Science and ...

Kevin Desai

383

asked Jul 25, 2014 at 18:36

25 votes

4 answers

17k views

Is there any data tidying tool for python/pandas similar to R tidyr tool?

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of ...

cpumar

815

asked Mar 2, 2016 at 8:54
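
The excerpt is cut off before it names the tidyr function being sought, so the following is only one possible reading: reshaping between a long layout (one row per variable) and a wide layout (one column per variable). It is a minimal sketch using pandas `DataFrame.pivot` and `DataFrame.melt`; the column names `id`, `variable` and `value` and the sample values are invented for illustration and do not come from the Kaggle data.

```python
import pandas as pd

# Hypothetical long-format data: each row holds one variable for one id.
long_df = pd.DataFrame(
    {
        "id": [1, 1, 2, 2],
        "variable": ["severity", "volume", "severity", "volume"],
        "value": [3, 10, 1, 7],
    }
)

# Long -> wide: one row per id, one column per variable.
wide_df = long_df.pivot(index="id", columns="variable", values="value")
print(wide_df)

# Wide -> long again: the inverse reshape.
back_to_long = wide_df.reset_index().melt(
    id_vars="id", var_name="variable", value_name="value"
)
print(back_to_long)
```

`pivot` raises an error if an (id, variable) pair occurs more than once; in that case `pivot_table` with an aggregation function is the usual alternative.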

23 votes

6 answers

135k views

Uploading images folder from my system into Google Colab

I want to train a deep learning model on a dataset containing around 3000 images. Since the dataset is huge, I want to use Google Colab, since it supports GPUs. How do I upload this full image folder ...

chatbot_chakra

341

asked Mar 23, 2018 at 18:52

23 votes

2 answers

50k views

Loading own train data and labels in dataloader using pytorch?

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader? I have a dataset that I created and the ...

Amarnath

361

asked Feb 20, 2019 at 21:13
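
For separate feature and label tensors, one straightforward option is to wrap them in `torch.utils.data.TensorDataset` and pass that to `DataLoader`. This is a minimal sketch, not necessarily the accepted answer: the shapes, the random placeholder data and the batch size are assumptions, since the question does not state them.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder data standing in for the asker's x_data and labels:
# 10 samples with 4 features each, and one integer class label per sample.
x_data = torch.randn(10, 4)
labels = torch.randint(0, 2, (10,))

# TensorDataset pairs the tensors along their first dimension,
# so dataset[i] returns (x_data[i], labels[i]).
dataset = TensorDataset(x_data, labels)

# DataLoader handles batching and shuffling.
loader = DataLoader(dataset, batch_size=4, shuffle=True)

batch_shapes = []
for batch_x, batch_y in loader:
    batch_shapes.append((tuple(batch_x.shape), tuple(batch_y.shape)))

print(batch_shapes)
```

If the data starts out as NumPy arrays, `torch.from_numpy` converts them to tensors first. Both tensors must have the same length along the first dimension.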

22 votes

3 answers

14k views

How to generate synthetic dataset using machine learning model learnt with original dataset?

Generally, a machine learning model is built on datasets. I'd like to know if there is any way to generate a synthetic dataset using such a trained machine learning model, preserving the original dataset ...

m-bhole

323

asked Apr 1, 2015 at 15:23

22 votes

3 answers

10k views

Dataset for Named Entity Recognition on Informal Text

I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the ...

Madison May

2,039

asked Jun 30, 2014 at 21:02

18 votes

5 answers

28k views

Downloading a large dataset on the web directly into AWS S3

Does anyone know if it's possible to import a large dataset into Amazon S3 from a URL? Basically, I want to avoid downloading a huge file and then reuploading it to S3 through the web portal. I ...

Will Stedden

183

asked Apr 22, 2015 at 18:00

18 votes

3 answers

25k views

When should we consider a dataset as imbalanced?

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large ...

Rami

604

asked May 16, 2016 at 11:36

18 votes

4 answers

31k views

One hot encoding alternatives for large categorical values

I have a data frame with a categorical variable that has over 1600 categories. Is there any way I can find alternatives so that I don't have over 1600 columns? I found this interesting link. But they are ...

vinaykva

283

asked Nov 14, 2017 at 17:20
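
One widely used alternative to one-hot encoding for a high-cardinality column is the hashing trick, which maps every category to one of a fixed number of buckets. The sketch below is an illustration only and is not necessarily what the linked resource in the question describes. The helper names `hash_bucket` and `hashed_one_hot`, the bucket count of 64 and the `cat_<n>` category names are all assumptions.

```python
import hashlib


def hash_bucket(category, n_buckets):
    """Map a category string to a stable bucket index in [0, n_buckets)."""
    # hashlib is used instead of the built-in hash(), which is salted
    # per process for strings and so is not reproducible across runs.
    digest = hashlib.md5(category.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets


def hashed_one_hot(category, n_buckets=64):
    """Return a vector of length n_buckets with a single 1 at the hashed index."""
    vector = [0] * n_buckets
    vector[hash_bucket(category, n_buckets)] = 1
    return vector


# 1600 hypothetical category names, matching the scale in the question.
categories = [f"cat_{i}" for i in range(1600)]
encoded = [hashed_one_hot(c) for c in categories]

print(len(encoded), "rows,", len(encoded[0]), "columns")
```

The column count stays at the chosen bucket size regardless of how many categories exist. The trade-off is collisions: several categories share a bucket, so the model can no longer distinguish them individually.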
