Questions tagged [r]
R is a free, open-source programming language and software environment for statistical computing, bioinformatics, and graphics.
60 questions
132
votes
1
answer
386k
views
How to get the correlation between two categorical variables, and between a categorical variable and a continuous variable?
I am building a regression model and I need to calculate the following to check for correlations:
Correlation between 2 multi-level categorical variables
Correlation between a multi-level categorical ...
3
votes
1
answer
9k
views
How to interpret Variance Inflation Factor (VIF) results?
From various books and blog posts, I understood that the Variance Inflation Factor (VIF) is used to measure collinearity. They say that a VIF of up to 10 is good. But I have a question.
As we can see in ...
55
votes
9
answers
11k
views
Is the R language suitable for Big Data?
R has many libraries which are aimed at Data Analysis (e.g. JAGS, BUGS, ARULES etc.), and is mentioned in popular textbooks such as: J. Kruschke, Doing Bayesian Data Analysis; B. Lantz, "Machine ...
7
votes
1
answer
3k
views
Random Forest significantly outperforms XGBoost - problem or possible?
I have a dataset of around 180k observations of 13 variables (a mix of numerical and categorical features). It is a binary classification problem, but the classes are imbalanced (25:1 for negative ones). I ...
6
votes
2
answers
14k
views
Python or R for implementing machine learning algorithms for fraud detection [closed]
I was wondering which language I should use, R or Python, for my internship in fraud detection in an online banking system. I have to build machine learning algorithms (NN, etc.) that predict transaction ...
37
votes
7
answers
6k
views
Organized processes to clean data
From my limited dabbling with data science using R, I realized that cleaning bad data is a very important part of preparing data for analysis.
Are there any best practices or processes for cleaning ...
33
votes
3
answers
45k
views
Hypertuning XGBoost parameters
XGBoost has been doing a great job when it comes to dealing with both categorical and continuous dependent variables. But how do I select the optimal parameters for an XGBoost problem?
This is ...
21
votes
6
answers
17k
views
Do modern R and/or Python libraries make SQL obsolete?
I work in an office where SQL Server is the backbone of everything we do, from data processing to cleaning to munging. My colleague specializes in writing complex functions and stored procedures to ...
17
votes
2
answers
9k
views
Recommending movies with additional features using collaborative filtering
I am trying to build a recommendation system using collaborative filtering. I have the usual [user, movie, rating] information. I would like to incorporate an ...
16
votes
1
answer
10k
views
What is the difference between binary:logistic and reg:logistic in xgboost?
What is the difference in R in xgboost between binary:logistic and reg:logistic? Is it only in the evaluation metric?
If yes, how does RMSE on binary classification compare to error rate? Is the ...
16
votes
1
answer
32k
views
Do you have to normalize data when building decision trees using R?
So, our data set this week has 14 attributes and each column has very different values. One column has values below 1 while another column has values with three to four digits.
We ...
10
votes
1
answer
4k
views
XGBoost custom objective for regression in R
I implemented a custom objective and metric for an xgboost regression. In order to see if I'm doing this correctly, I started with a quadratic loss. The ...
7
votes
6
answers
5k
views
Is there any way to explicitly measure the complexity of a Machine Learning Model in Python?
I'm interested in model debugging, and one of the points that it mentions is to compare your model with a "less complex" one to check if the performance is substantially better on the most ...
7
votes
2
answers
8k
views
Recommender system based on purchase history, not ratings
I'm exploring options for recommender systems optimized for the insurance industry, which would take into account
i) product holdings
ii) user characteristics (segment, age, affluence, etc.).
I ...
7
votes
1
answer
4k
views
Why does logistic regression in Spark and R return different models for the same data?
I've compared the logistic regression models in R (glm) and in Spark (LogisticRegressionWithLBFGS) on a dataset of 390 obs. of ...