Questions tagged [dataset]
A dataset is a collection of data, often in tabular or matrix form. This tag is NOT intended for data requests ("where can I find a dataset about ...") --> see OpenData
1,514 questions
0
votes
0
answers
35
views
Guide me with my major project titled Satellite-Based Agricultural Vulnerability Monitoring
I am working on a major project titled Utilizing Satellite Data and Deep Learning to Monitor Agricultural Vulnerabilities to Climate Change. My goal is to develop a system to monitor agricultural ...
8
votes
2
answers
113
views
When should we avoid balancing an imbalanced dataset?
I am working on a network security-related project, in which I have to build a deep learning model to detect a specific attack. It's about detecting whether a network system of an organisation is a ...
0
votes
0
answers
24
views
How to extract my fingerprint from my laptop's finger sensor
I have a set of fingerprints as a dataset (provided by my college). I want to use these fingerprints to train a model to understand the different things. That is beside the point....
2
votes
1
answer
51
views
What could be a dataset in which the presence of an outlier or a null value dramatically affects the performance of the decision tree?
I am tasked with giving an example of a dataset in which the presence of an outlier or a null value dramatically affects the performance of a decision tree. I've searched and searched the web and I ...
3
votes
2
answers
91
views
Imbalanced classes and ML setup
I’m working on a MarTech use case (predicting customer conversions to a certain product). I’m not used to working in this domain, so I’m looking for critical questions about my setup.
Context: ...
4
votes
0
answers
36
views
Time-efficient parallelization of masks for pre-processing a dataset
I have a large dataset (~10M points) in Python and I want to filter it using a large number of different custom masks, as part of calculations to create a new but related dataset. Because the dataset ...
3
votes
0
answers
106
views
How can I constrain a fitting parameter to be the same across multiple datasets?
I am doing nonlinear fits on multiple datasets with several fitting parameters. Each dataset is fit with the same equation and same fitting parameters. Specifically, I am using the curve fitting ...
1
vote
0
answers
60
views
Why are there date discrepancies in 2024 North Carolina absentee ballot data?
I've been working with North Carolina's mail-in/absentee ballot data for the 2024 general election. There are 327 rows with ballot request dates prior to 2024, including a few marked in years much ...
2
votes
1
answer
69
views
Large, historical, international news corpus for NLP; open access and Python workflow?
I need a large, historical, international news/articles dataset for an NLP project. Ideal features:
• Coverage from as early as possible to the present; multilingual; public/academic access.
• Full text preferred; URLs +...
6
votes
2
answers
84
views
How to handle irrelevant categorical variables in aggregated data?
I’m working with ad server data where I can’t get user-level data — only aggregated reports. The data is aggregated on multiple categorical dimensions (e.g., day × product × medium × source × campaign ...
2
votes
0
answers
59
views
What is the best approach for future proofing research data against new parameters?
For my research I regularly perform parameter searches. Suppose I have a set of hyperparameters $\textbf{X}=\{X_0, X_1, \dots X_n \}$ and some function $f(\textbf{X}) = \textbf{Y}$ where $\textbf{Y}=\{...
2
votes
0
answers
32
views
How to improve fine-tuning for task dependency extraction?
I'm trying to fine-tune a LLaMA 3.1 Instruct model to adapt it to a specific industrial domain. The goal is to have the model extract direct dependencies between tasks from a list of operational steps ...
4
votes
1
answer
101
views
Sample size distribution for a dataset
This is a more general question regarding the nature of a dataset for any statistical method used afterwards.
Let's say you have a nice, clean dataset that contains values for predicting the maximum ...
5
votes
1
answer
107
views
How to measure whether my dataset is good for training?
I wanted to train a model for this dataset. The inputs dataset is here: https://drive.google.com/file/d/1bbMa7auwYjYxyCB72UMBNv5kaojqV7WH/view?usp=sharing and the outputs dataset is here: https://drive....
0
votes
1
answer
48
views
Matching BDD100K semantic segmentations to the original image
BDD100K is a dataset for autonomous driving. I downloaded the images + labels, and also the semantic segmentations, but I am facing an issue: The image names don't match between the original images ...