
Questions tagged [dataset]

A dataset is a collection of data, often in tabular or matrix form. This tag is NOT intended for data requests ("where can I find a dataset about ..."); for those, see OpenData.

203 votes
35 answers
34k views

One of the common problems in data science is gathering data from various sources in a somewhat cleaned (semi-structured) format and combining metrics from those sources for making a higher-level ...
63 votes
5 answers
39k views

A common technique after training, validating and testing the Machine Learning model of preference is to use the complete dataset, including the testing subset, to train a final model to deploy it on, ...
pcko1 • 4,050
59 votes
6 answers
17k views

My 'machine learning' task is to separate benign Internet traffic from malicious traffic. In a real-world scenario, most (say 90% or more) Internet traffic is benign. Thus I felt that I should ...
pnp • 693
36 votes
10 answers
19k views

What are the pitfalls of doing so and why is it a bad practice? Is it possible that the model starts to learn the images "by heart" instead of understanding the underlying logic?
karalis1 • 471
35 votes
4 answers
16k views

I have a classification problem with approximately 1000 positive and 10000 negative samples in the training set, so the data set is quite unbalanced. Plain random forest is just trying to mark all test ...
IgorS • 5,484
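
One common response to a class imbalance like the one described above is to reweight the classes inversely to their frequency. The sketch below is only an illustration, not the answer given in the thread: it computes weights as n_samples / (n_classes * class_count), the same heuristic scikit-learn documents for `class_weight="balanced"`. The function name `balanced_class_weights` is hypothetical, and the label list is synthetic, built only from the class sizes quoted in the question.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return n_samples / (n_classes * class_count) for every class label."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Synthetic labels matching the sizes in the question:
# about 1000 positives (1) and 10000 negatives (0).
labels = [1] * 1000 + [0] * 10000
weights = balanced_class_weights(labels)
print(weights)
```

With these sizes each positive sample counts ten times as much as a negative one, so a classifier that accepts per-class weights is penalised more heavily for ignoring the minority class.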
28 votes
7 answers
28k views

As an extension to our great list of publicly available datasets, I'd like to know if there is any list of publicly available social network datasets/crawling APIs. It would be very nice if alongside ...
Rubens • 4,117
28 votes
3 answers
43k views

I don't know if this is the right place to ask this question, but a community dedicated to Data Science should be the most appropriate place in my opinion. I have just started with Data Science and ...
Kevin Desai
25 votes
4 answers
17k views

I'm working on a Kaggle challenge where some variables are represented by rows instead of columns (Telstra Network Disruption). I am currently searching for the equivalent of ...
cpumar • 815
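
When variables are stored as rows, as in the question above, the usual fix is a long-to-wide reshape (a pivot). The following is a minimal sketch using only the standard library. The helper `long_to_wide` and the field names `id`, `feature`, and `volume` are hypothetical and are not taken from the Kaggle data or from the thread.

```python
from collections import defaultdict


def long_to_wide(records, id_key, name_key, value_key, fill=0):
    """Turn one-row-per-(id, variable) records into one row per id."""
    wide = defaultdict(dict)
    names = []  # variable names in order of first appearance
    for rec in records:
        # If an (id, variable) pair repeats, the last value wins.
        wide[rec[id_key]][rec[name_key]] = rec[value_key]
        if rec[name_key] not in names:
            names.append(rec[name_key])
    return [
        {id_key: row_id, **{name: values.get(name, fill) for name in names}}
        for row_id, values in wide.items()
    ]


# Hypothetical long-format input: each row holds a single variable for an id.
records = [
    {"id": 1, "feature": "a", "volume": 3},
    {"id": 1, "feature": "b", "volume": 1},
    {"id": 2, "feature": "a", "volume": 7},
]
table = long_to_wide(records, "id", "feature", "volume")
print(table)
```

Missing combinations are filled with `fill`, which assumes that an absent variable can be treated as zero. That choice depends on the data and may not suit every column.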
23 votes
6 answers
135k views

I want to train a deep learning model on a dataset containing around 3000 images. Since the dataset is huge, I want to use Google Colab because it supports GPUs. How do I upload this full image folder ...
chatbot_chakra
23 votes
2 answers
50k views

I have x_data and labels separately. How can I combine and load them in the model using torch.utils.data.DataLoader? I have a dataset that I created and the ...
Amarnath • 361
22 votes
3 answers
14k views

Generally, a machine learning model is built on datasets. I'd like to know if there is any way to generate a synthetic dataset using such a trained machine learning model while preserving the original dataset ...
m-bhole • 323
22 votes
3 answers
10k views

I'm currently searching for labeled datasets to train a model to extract named entities from informal text (something similar to tweets). Because capitalization and grammar are often lacking in the ...
Madison May • 2,039
18 votes
5 answers
28k views

Does anyone know if it's possible to import a large dataset into Amazon S3 from a URL? Basically, I want to avoid downloading a huge file and then reuploading it to S3 through the web portal. I ...
Will Stedden
18 votes
3 answers
25k views

I'm facing a situation where the numbers of positive and negative examples in a dataset are imbalanced. My question is, are there any rules of thumb that tell us when we should subsample the large ...
Rami • 604
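
For readers unfamiliar with the subsampling the question above refers to, here is a minimal sketch of random undersampling of the larger classes. It does not answer the rule-of-thumb question itself. The function name `undersample_majority`, the `ratio` parameter, and the toy data are all assumptions made for illustration.

```python
import random


def undersample_majority(samples, labels, ratio=1.0, seed=0):
    """Randomly drop rows so no class exceeds ratio * (smallest class size)."""
    rng = random.Random(seed)
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    minority_size = min(len(rows) for rows in by_class.values())
    # Never cap below the minority size, so the smallest class is kept whole.
    cap = max(minority_size, int(ratio * minority_size))
    kept = []
    for y, rows in by_class.items():
        chosen = rows if len(rows) <= cap else rng.sample(rows, cap)
        kept.extend((x, y) for x in chosen)
    rng.shuffle(kept)  # avoid returning the data grouped by class
    return kept


# Toy data: 10 positives and 100 negatives, keep at most 2 negatives per positive.
samples = list(range(110))
labels = [1] * 10 + [0] * 100
balanced = undersample_majority(samples, labels, ratio=2.0)
print(len(balanced))
```

Undersampling discards data, so it is usually applied to the training split only; the test split should keep the original class proportions.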
18 votes
4 answers
31k views

I have a data frame with high-cardinality categorical values (over 1600 categories). Is there any way I can find alternatives so that I don't have over 1600 columns? I found this interesting link. But they are ...
vinaykva • 283
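
Two simple alternatives to one column per category, sketched here under the assumption that the categories sit in a single column, are frequency encoding (replace each category with its relative frequency) and grouping rare categories into one bucket before one-hot encoding. Neither is taken from the linked thread. The helper names and the `"__other__"` placeholder are hypothetical.

```python
from collections import Counter


def frequency_encode(values):
    """Replace each category with its share of the rows (one numeric column)."""
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]


def group_rare(values, min_count, other="__other__"):
    """Collapse categories seen fewer than min_count times into one bucket."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Toy column with three categories.
cities = ["a", "a", "b", "c", "a", "b"]
encoded = frequency_encode(cities)
grouped = group_rare(cities, min_count=2)
print(encoded)
print(grouped)
```

Frequency encoding maps categories with equal counts to the same number, which loses information. Grouping rare categories keeps the frequent ones distinct at the cost of more columns.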
