Questions tagged [preprocessing]
Data preprocessing is a data mining technique that transforms raw data into a more understandable or more useful format.
539 questions
0
votes
0
answers
10
views
StyleGAN preprocessing
I have a dataset in which each picture contains many objects. How should I preprocess the images for training a GAN (StyleGAN) so that the model can distinguish the objects in each picture? Now the result is not ...
0
votes
1
answer
44
views
Correlated Features in Classification Problem
I'm working on a binary classification problem to identify struggling students in university. I have some features that are correlated such as high_school_grade_1 that represents 75% of ...
1
vote
0
answers
40
views
Splitting the ISIC 2018 Skin Lesion Segmentation Dataset
In the official paper "Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)", they provided a dataset including over ...
4
votes
0
answers
29
views
Time-efficient parallelization of masks for pre-processing a dataset
I have a large dataset (~10M points) in Python and I want to filter it using a large number of different custom masks, as part of calculations to create a new but related dataset. Because the dataset ...
7
votes
1
answer
135
views
Data Drift & Model Comparison in Production MLOps: Handling Scale Changes with AutoML
Background
I'm implementing a production MLOps pipeline for part classification using Databricks AutoML. My pipeline automatically retrains models when new data arrives and compares performance with ...
7
votes
1
answer
100
views
Effects of resizing training images during preprocessing for a CNN classification model
I'm trying to train a CNN model to identify phytoplankton species from a training set. During preprocessing, the images are resized to 224x224, which seems to be stretching or compressing the object ...
0
votes
0
answers
31
views
Discrete Feature Imputation: How to Choose an Appropriate Data Distribution Model?
I am working on a dataset containing features that are discrete frequency counts. I understand that knowing the underlying data distribution is important for selecting an appropriate imputation method....
1
vote
1
answer
68
views
Is it valid to filter features using t-tests before train/test split in high-dimensional biological data?
I'm working with high-dimensional biological data (∼41,000 features × 3,979 samples from RNA-seq for 2 conditions).
Here’s a simplified version of my preprocessing and filtering pipeline before ...
2
votes
0
answers
28
views
Question about preprocessing two time-series datasets from different measurement devices
I have a question regarding the preprocessing step in a project I'm working on. I have two different measurement devices that both collect time-series data. My goal is to analyze the similarity ...
7
votes
1
answer
97
views
Difference between transform('min') vs min() in pandas
I am currently working on a dataset that has two columns: customerID and date.
I want to find the minimum date for each customerID.
Initially, I used the following code:
...
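For readers skimming the listing, the general difference named in the title can be sketched as follows. This is not the asker's elided code: the DataFrame and its values are invented for illustration, and only the column names customerID and date come from the excerpt.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question excerpt.
df = pd.DataFrame({
    "customerID": ["a", "a", "b", "b", "b"],
    "date": pd.to_datetime(
        ["2024-03-01", "2024-01-15", "2024-02-10", "2024-02-01", "2024-05-05"]
    ),
})

# min() aggregates: the result has one row per customerID.
per_customer = df.groupby("customerID")["date"].min()

# transform("min") broadcasts each group's minimum back onto the original rows,
# so the result has the same length and index as df.
df["first_date"] = df.groupby("customerID")["date"].transform("min")

print(per_customer)
print(df)
```

In short, min() is the one to use when you want a summary table, and transform("min") when you want a new column aligned with the original rows.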
0
votes
0
answers
83
views
Opinions on the practice of removing stop words before VADER
I know there is already a question on this topic, but it doesn’t fully address my concerns. I am currently writing my master's thesis and will use VADER for sentiment analysis (the vader package by ...
1
vote
0
answers
29
views
How can I efficiently process and load a large Protobuf dataset for machine learning model training?
I am training a model on multiple cache miss examples from various trace simulations. For every trace I have thousands of miss examples stored and I have many traces. I'm storing the examples in ...
0
votes
0
answers
25
views
Is Negation Handling Necessary in Topic Modeling?
I'm a fourth-semester Informatics Engineering student. Currently, I'm working on a topic modelling project using a Twitter dataset for a college assignment. I've encountered a difficulty where, in one ...
0
votes
0
answers
31
views
String to number in case of having millions of unique values
I am currently working on preprocessing a big-data dataset for ML purposes. I am struggling with encoding strings as numbers. I have a dataset of multiple blockchain transactions and I have addresses of ...
0
votes
0
answers
35
views
Efficient way to remove very similar images from 8752 pictures
I have 8752 pictures that were extracted from a roughly hour-long CCTV video by a Python script taking screenshots. My supervisor told me to clean the data by removing the roughly similar ones. At first I ...