Frequent 'r' Questions - Data Science Stack Exchange

132 votes

1 answer

386k views

How to get correlation between two categorical variable and a categorical variable and continuous variable?

I am building a regression model and need to calculate the following to check for correlations: correlation between 2 multi-level categorical variables; correlation between a multi-level categorical ...

GeorgeOfTheRF

2,088

asked Aug 3, 2014 at 13:07

3 votes

1 answer

9k views

How to interpret Variance Inflation Factor (VIF) results?

From various books and blog posts, I understood that the Variance Inflation Factor (VIF) is used to measure collinearity. They say that a VIF up to 10 is good. But I have a question. As we can see in ...

thewhitetulip

153

asked Dec 8, 2019 at 4:35

55 votes

9 answers

11k views

Is the R language suitable for Big Data

R has many libraries which are aimed at Data Analysis (e.g. JAGS, BUGS, ARULES, etc.), and is mentioned in popular textbooks such as: J. Kruschke, Doing Bayesian Data Analysis; B. Lantz, "Machine ...

akellyirl

723

asked May 14, 2014 at 11:15

7 votes

1 answer

3k views

Random Forest significantly outperforms XGBoost - problem or possible?

I have a dataset of around 180k observations of 13 variables (a mix of numerical and categorical features). It is a binary classification problem, but the classes are imbalanced (25:1 for negative ones). I ...

Filip

73

asked Jan 6, 2022 at 11:09

6 votes

2 answers

14k views

Python or R for implementing machine learning algorithms for fraud detection [closed]

I was wondering which language I should use, R or Python, for my internship in fraud detection for an online banking system: I have to build machine learning algorithms (NN, etc.) that predict transaction ...

Hamza

61

asked Feb 20, 2015 at 14:09

37 votes

7 answers

6k views

Organized processes to clean data

From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis. Are there any best practices or processes for cleaning ...

Jay Godse

471

asked May 14, 2014 at 15:25

33 votes

3 answers

45k views

Hypertuning XGBoost parameters

XGBoost has been doing a great job when it comes to dealing with both categorical and continuous dependent variables. But how do I select the optimal parameters for an XGBoost problem? This is ...

Dawny33

8,506

asked Dec 13, 2015 at 14:19

21 votes

6 answers

17k views

Do modern R and/or Python libraries make SQL obsolete?

I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to ...

AffableAmbler

383

asked Feb 24, 2017 at 19:33

17 votes

2 answers

9k views

Recommending movies with additional features using collaborative filtering

I am trying to build a recommendation system using collaborative filtering. I have the usual [user, movie, rating] information. I would like to incorporate an ...

Sidhha

397

asked Jul 25, 2014 at 0:58

16 votes

1 answer

10k views

What is the difference in xgboost binary:logistic and reg:logistic

What is the difference in R in xgboost between binary:logistic and reg:logistic? Is it only in the evaluation metric? If yes, how does RMSE on binary classification compare to error rate? Is the ...

user2530062

327

asked Jan 15, 2016 at 11:00

16 votes

1 answer

32k views

Do you have to normalize data when building decision trees using R?

So, our data set this week has 14 attributes and each column has very different values. One column has values below 1 while another column has values that go from three to four whole digits. We ...

Jae

163

asked Mar 4, 2015 at 8:05

10 votes

1 answer

4k views

XGBoost custom objective for regression in R

I implemented a custom objective and metric for an xgboost regression. In order to see if I'm doing this correctly, I started with a quadratic loss. The ...

Peter

8,044

asked Sep 9, 2020 at 12:29

7 votes

6 answers

5k views

Is there any way to explicitly measure the complexity of a Machine Learning Model in Python

I'm interested in model debugging, and one of the points that it mentions is to compare your model with a "less complex" one to check if the performance is substantially better on the most ...

Multivac

3,519

asked Aug 19, 2020 at 19:22

7 votes

2 answers

8k views

Recommender system based on purchase history, not ratings

I'm exploring options for recommender systems optimized for the insurance industry, which would take into account i) product holdings ii) user characteristics (segment, age, affluence, etc.). I ...

Kasia Kulma

193

asked Jun 7, 2017 at 7:39

7 votes

1 answer

4k views

Why does logistic regression in Spark and R return different models for the same data?

I've compared the logistic regression models on R (glm) and on Spark (LogisticRegressionWithLBFGS) on a dataset of 390 obs. of ...

SparkUser

113

asked May 7, 2015 at 13:23
