I have a dataset (300 samples × 4000 features) and I'm trying to extract meaningful features related to two conditions, A and B. Both conditions have levels 0, 1 and 2, which stand for none, mild and severe. Since the features don't follow a normal distribution, I used a non-parametric Kruskal-Wallis test. I now have two lists of p-values, pValA and pValB, that contain the p-value of each feature in my dataset for condition A and condition B respectively. How can I merge both lists and select meaningful features? I used to take the average, but my supervisor faked a heart attack when he saw my code.
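To make the setup concrete, here is a minimal sketch of one possible alternative to averaging: adjust each p-value list separately for multiple testing, then report which features pass for A, for B, or for both. The Benjamini-Hochberg adjustment, the threshold `alpha=0.05`, the helper names `benjamini_hochberg` and `select_features`, and the toy p-values are all assumptions for illustration, not something taken from my real analysis.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvals)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(pValA, pValB, alpha=0.05):
    """Split feature indices by which condition(s) they pass for.

    alpha is an assumed false discovery rate threshold.
    """
    qA = benjamini_hochberg(pValA)
    qB = benjamini_hochberg(pValB)
    sig_a = {i for i, q in enumerate(qA) if q <= alpha}
    sig_b = {i for i, q in enumerate(qB) if q <= alpha}
    return {
        "A_only": sorted(sig_a - sig_b),
        "B_only": sorted(sig_b - sig_a),
        "both": sorted(sig_a & sig_b),
    }


# Toy p-values for five features (made up, not real results).
pValA = [0.001, 0.20, 0.02, 0.0004, 0.80]
pValB = [0.60, 0.002, 0.01, 0.0001, 0.90]
selected = select_features(pValA, pValB)
print(selected)
```

Keeping the two conditions separate, instead of collapsing them into one number per feature, preserves the information about which condition each feature is associated with.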
-
I don't think anything meaningful can come out of doing 4000 Kruskal-Wallis tests. You have massive type I error inflation issues. If you are trying to determine which covariates are related to your conditions, you should be using a hypothesis generation technique (i.e. some kind of dimensionality reduction) rather than per-feature tests, although even then it won't be able to handle the disparity between your sample size and the number of covariates you have. Can you post a sample of your data to show folks what you are working with? – André.B, Sep 19, 2019 at 21:14
-
I don't think I can, since this is a private dataset. Another idea would be to use the feature importance of a random forest. – Giuseppe Minardi, Sep 19, 2019 at 21:47
-
If you can't release the data, then try creating a fake set to post here (preferably one that can be directly imported into a stats program). It is very hard to help without seeing what one is dealing with. You could try a random forest, but you would need to standardise the units and ensure that you do not have any overly noisy variables, or they will drive your results, since you have a very small number of observations relative to covariates. – André.B, Sep 19, 2019 at 22:00
-
OK then, I will try to build fake data. – Giuseppe Minardi, Sep 19, 2019 at 22:04
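A minimal sketch of what such a fake set could look like, assuming skewed (exponential) noise, a handful of features shifted by each condition, and CSV as the import format. The function names, the column names `condA`/`condB`/`f0…`, the effect size of 0.5 and the number of informative features are all hypothetical and say nothing about the real data.

```python
import csv
import io
import random


def make_fake_dataset(n_samples=300, n_features=4000, n_informative=20, seed=0):
    """Build a fake table: two ordinal conditions plus non-normal features."""
    rng = random.Random(seed)
    header = ["condA", "condB"] + [f"f{j}" for j in range(n_features)]
    rows = []
    for _ in range(n_samples):
        a = rng.choice([0, 1, 2])  # 0 = none, 1 = mild, 2 = severe
        b = rng.choice([0, 1, 2])
        features = []
        for j in range(n_features):
            value = rng.expovariate(1.0)  # skewed, non-normal noise
            if j < n_informative:
                value += 0.5 * a  # assumed effect of condition A
            elif j < 2 * n_informative:
                value += 0.5 * b  # assumed effect of condition B
            features.append(round(value, 4))
        rows.append([a, b] + features)
    return header, rows


def write_csv(fileobj, header, rows):
    """Write the table to any text file object in CSV format."""
    writer = csv.writer(fileobj)
    writer.writerow(header)
    writer.writerows(rows)


# Small demo written to memory; use the defaults for a 300 x 4000 table.
demo_header, demo_rows = make_fake_dataset(n_samples=5, n_features=6, n_informative=1)
buffer = io.StringIO()
write_csv(buffer, demo_header, demo_rows)
print(buffer.getvalue())
```

For a full-size file, the same two calls can be pointed at a file opened with `open("fake_data.csv", "w", newline="")`; the fixed seed makes the set reproducible for anyone who wants to try it.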
-
I would try ordinal lasso / ridge regression. See cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf. It doesn't give you p-values, but in this case you shouldn't really look for them anyway. – user2974951, Sep 20, 2019 at 6:18