$\begingroup$

I have a dataset (300 samples × 4000 features) and I'm trying to extract meaningful features related to two conditions, A and B. Both conditions have levels 0, 1 and 2, which stand for none, mild and severe. Since the features don't follow a normal distribution, I used a non-parametric Kruskal-Wallis test on each feature. I now have two lists of p-values, pValA and pValB, which contain the p-value of each feature in my dataset for condition A and condition B respectively. How can I merge both lists and select meaningful features? I used to take the average of the two p-values, but my supervisor faked a heart attack when he saw my code.

$\endgroup$
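One hedged alternative to averaging the two lists is to correct each list for multiple testing and then combine the per-condition decisions. The sketch below is an assumption, not something taken from the thread: it applies a Benjamini-Hochberg adjustment to the two p-value lists separately (named `p_val_a` and `p_val_b` here), then keeps the features that pass a threshold `alpha` for either condition, or for both. The helper names `benjamini_hochberg` and `select_features`, the 0.05 threshold, and the toy p-values are all illustrative.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_features(p_val_a, p_val_b, alpha=0.05, require_both=False):
    """Return indices of features that pass the adjusted threshold.

    require_both=False keeps features associated with A or B;
    require_both=True keeps only features associated with A and B.
    """
    adj_a = benjamini_hochberg(p_val_a)
    adj_b = benjamini_hochberg(p_val_b)
    selected = []
    for i, (q_a, q_b) in enumerate(zip(adj_a, adj_b)):
        hit_a = q_a <= alpha
        hit_b = q_b <= alpha
        keep = (hit_a and hit_b) if require_both else (hit_a or hit_b)
        if keep:
            selected.append(i)
    return selected


# Toy p-values for three hypothetical features (not real data).
toy_p_val_a = [0.001, 0.5, 0.9]
toy_p_val_b = [0.8, 0.002, 0.7]
either = select_features(toy_p_val_a, toy_p_val_b)
both = select_features(toy_p_val_a, toy_p_val_b, require_both=True)
```

Keeping the two lists separate until the final step preserves which condition each feature is associated with, which an average of p-values discards.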
  • $\begingroup$ I don't think anything meaningful can come out of running 4000 Kruskal-Wallis tests: you have a massive multiple-testing problem (type I error inflation). If you are trying to determine which covariates are related to your conditions, you should be using a hypothesis-generation technique (i.e. some kind of dimensionality reduction) rather than per-feature tests, although even then it won't be able to handle the disparity between your sample size and the number of covariates you have. Can you post a sample of your data to show folks what you are working with? $\endgroup$ Commented Sep 19, 2019 at 21:14
  • $\begingroup$ I don't think I can, since this is a private dataset. Another idea would be to use the feature importances of a random forest. $\endgroup$ Commented Sep 19, 2019 at 21:47
  • $\begingroup$ If you can't release the data then try creating a fake set to post here (preferably one that can be directly imported into a stats program). It is very hard to help without seeing what one is dealing with. You could try a random forest but you would need to standardise the units and ensure that you do not have any overly noisy variables or they will drive your results as you have a very small number of observations relative to covariates. $\endgroup$ Commented Sep 19, 2019 at 22:00
  • $\begingroup$ Ok then, I will try to build fake data $\endgroup$ Commented Sep 19, 2019 at 22:04
  • $\begingroup$ I would try ordinal lasso / ridge regression. See cran.r-project.org/web/packages/glmnetcr/vignettes/glmnetcr.pdf. It doesn't give you p-values, but in this case you shouldn't really look for them anyway. $\endgroup$ Commented Sep 20, 2019 at 6:18
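For the fake dataset suggested in the comments, a minimal sketch along these lines might do. Everything in it is an assumption made for illustration: the column names `condA`, `condB` and `feat_<j>`, the skewed (exponential) feature distribution, and the planted signal in the first two features. None of it reflects the private data.

```python
import csv
import random


def make_fake_rows(n_samples=300, n_features=4000, seed=0):
    """Build a header and rows for a synthetic, non-normal dataset."""
    rng = random.Random(seed)
    header = ["condA", "condB"] + [f"feat_{j}" for j in range(n_features)]
    rows = []
    for _ in range(n_samples):
        # Ordinal levels: 0 = none, 1 = mild, 2 = severe.
        cond_a = rng.choice([0, 1, 2])
        cond_b = rng.choice([0, 1, 2])
        # Skewed noise, so the features are clearly not normal.
        feats = [rng.expovariate(1.0) for _ in range(n_features)]
        # Hypothetical planted signal: feat_0 tracks A, feat_1 tracks B.
        if n_features > 0:
            feats[0] += cond_a
        if n_features > 1:
            feats[1] += cond_b
        rows.append([cond_a, cond_b] + [round(x, 4) for x in feats])
    return header, rows


def write_csv(path, header, rows):
    """Write the dataset as a plain CSV that most stats programs can import."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)
```

Calling `write_csv("fake_data.csv", *make_fake_rows())` would produce a 300-row file with the same shape as the dataset described in the question; the file name is arbitrary.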
