4
$\begingroup$

I have a question about how to analyse my dataset and would really appreciate your advice.

My data consist of observations of a set of target plant species collected during field surveys. The surveys were not strictly standardized: observers walked through the study area and recorded species when encountered. Therefore, the dataset mainly contains presence records.

In terms of structure, the dataset includes:

  • SPECIESCODE (named SOORTCODE in the code and example data below): species identity
  • Code: sampling unit (there are only two distinct codes, corresponding to two survey areas)
  • Management intensity: Extensive, No management, Moderately intensive, Intensive
  • R10: raster cell (10 m × 10 m grid) in which the observation was located
  • Abundance: number of individuals recorded per observation (if two of the same species were present in the same R10, then the max abundance was taken)

Each row corresponds to a species observation within a given sampling unit, management type, and raster cell.

The goal is to estimate the preference per species per management intensity. That is why I transformed my data into a presence–absence format by assuming that if a species was not recorded within a given sampling unit (Code + management), it was absent.

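library(dplyr)
library(tidyr)

# Mark every recorded observation as a presence, then add a presence = 0 row
# for each species x (Code, management) combination that has no record
data_full_heide <- data_full_heide %>%
  mutate(presence = 1) %>%
  complete(
    SOORTCODE,
    nesting(Code, management), # is nesting with R10 needed?
    fill = list(presence = 0)
  )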
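
To show what this step produces, here is a minimal sketch that applies the same `complete()` call to only the five example rows listed at the end of the question. The object names `toy` and `toy_complete` are made up for this illustration; the real dataset is larger.

```r
library(dplyr)
library(tidyr)

# The five example rows from the question (toy input, not the full dataset)
toy <- data.frame(
  SOORTCODE  = c("carexpan", "carexpan", "carexpan", "carexnig", "carexnig"),
  Code       = c("A", "A", "B", "A", "B"),
  management = c("mid", "mid", "extensive", "mid", "no management"),
  R10        = c(105, 106, 210, 107, 305),
  abundance  = c(12, 8, 3, 5, 2)
)

# Same transformation as above: recorded rows become presences, and every
# missing species x (Code, management) combination becomes one absence row
toy_complete <- toy %>%
  mutate(presence = 1) %>%
  complete(
    SOORTCODE,
    nesting(Code, management),
    fill = list(presence = 0)
  )

toy_complete

# Number of rows each species x unit combination contributes
count(toy_complete, SOORTCODE, Code, management, presence)
```

On these rows the result has seven rows: the five original presences (still one row per raster cell) plus two added absences, whose R10 and abundance are NA. A presence can therefore contribute several rows per sampling unit, while an absence always contributes exactly one.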

Then I fitted a logistic regression, with Code as a fixed effect rather than a random effect because it has only two levels.

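library(lme4)

# One management effect shared by all species, a fixed effect for survey area,
# and a random intercept per species
model <- glmer(
  presence ~ management + Code + (1 | SOORTCODE),
  family = binomial,
  data = data_full_heide
)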

But how can I now test the preference of each individual species? Should I fit a separate GLM per species? When I do, the standard errors are very high and many p-values are not significant.

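library(dplyr)

# One logistic regression per species; broom::tidy() returns one row per
# coefficient (estimate, standard error, z statistic, p-value)
glm_per_species <- data_full_heide %>%
  group_by(SOORTCODE) %>%
  group_modify(~ {
    mod <- glm(
      presence ~ management,
      family = binomial,
      data = .x
    )
    broom::tidy(mod)
  })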

Let me know if you need additional information!

SOORTCODE  Code  management     R10  abundance
carexpan   A     mid            105  12
carexpan   A     mid            106  8
carexpan   B     extensive      210  3
carexnig   A     mid            107  5
carexnig   B     no management  305  2
$\endgroup$
7
  • $\begingroup$ Does preference mean whether a species prefers a certain management style? What is SOORTCODE? For those who don't work with the tidyverse much, it would help to show what the resulting data look like, including their size (I have no idea how many raster cells there are). Standard errors and p-values can be high because of a low effective sample size, but I can't easily tell whether that is the case here. $\endgroup$ Commented 14 hours ago
  • 2
    $\begingroup$ Please edit the question with your helpful new comments. An edited question is easier to read (and easier to edit too). $\endgroup$ Commented 13 hours ago
  • 1
    $\begingroup$ One issue that is apparent to me (but I am not an ecologist) is that you seem to be modelling probability of presence with absent species contributing just a single failed trial. This should probably be some kind of rate instead, with either the species count per area or the probability of finding a species every time a study area was surveyed as response. $\endgroup$ Commented 12 hours ago
  • 1
    $\begingroup$ Any reason why glm_per_species doesn't use Code? If Code makes a difference (and management isn't strictly nested within it, not sure whether that's the case) I think it should be taken into account also at species level. Note further that any modelling that doesn't use spatial information will ignore potential dependences from the neighbourhood structure of raster cells. Furthermore, there may be interactions between species, which are ignored if you model a single species separately. (I'm not saying all this can be incorporated easily...) $\endgroup$ Commented 12 hours ago
  • 1
    $\begingroup$ Information on number of raster cells is still missing as far as I can see. The distribution of management styles may also be informative as the power that any test can have may depend on it. Furthermore I don't quite understand why you think reducing abundance information to presence/absence is appropriate and useful. $\endgroup$ Commented 12 hours ago

1 Answer

2
$\begingroup$

This is not an answer, but a series of remarks which would not fit well in comments.

  1. My biggest issue with your analysis is how you dichotomize the abundance information to simply absence/presence. If your “goal is to estimate the preference per species per management intensity”, you are losing a lot of information (e.g. in your example data you have 1 abundance at 12, another at 2, and you are equating both as “1”, i.e. present). I definitely would not do that.
  2. Then there is how you deal with “unobserved”. You say that “The surveys were not strictly standardized: observers walked through the study area and recorded species when encountered”. And then, you assume “that if a species was not recorded within a given sampling unit (Code + management), it was absent”. That is a bold assumption; lack of observation is not the same as “absence”. I would not try to “fabricate” data on absence; you just do not know that. Your surveys should have been more carefully planned to actually record complete absence. After the fact, you will just have to use the data you have (and not “invent” it).
  3. Moreover, you seem to detect presence/absence only at the Code+Management level, and not at the raster level. Why? And you seem to run your models without using the raster cell? You are losing all the observations at the raster cell level; for a given site (Code) and Management type, you only have 1 observation. Again, lost information.
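  4. You run your mixed model on “overall” presence, as if all species responded to management in the same way: the random intercept (1 | SOORTCODE) lets the baseline occurrence differ per species, but the management effect is shared by all of them. That does not seem to make sense; not all species will perform the same under a given management style.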
  5. And yes, if you want to test the abundance of individual species, you will need to run separate regressions. You say that you get “very high standard errors and many p-values are not significant”. That is most likely due to low power (not enough observations); this is not surprising given all the information you lost along the way (aggregating raster cells; dichotomizing abundance).
  6. Your location Code can only take 2 values: I would not use it in the model. Instead I would bake the Code in the raster cell ID (e.g. all cells for location 1 start with A, or 1, similar for location 2). With only 2 locations, there is not enough data to analyze the outcomes per location (and that is not even a goal of yours).
  7. I would run the model on abundance~Speciescode+Raster+Management, and then individually by species. Hopefully you would have enough observations (at least one per raster?) to get decent power. However see question below on Raster vs Management.
  8. And finally a question; is the management style the same per raster? Or can the cell be divided into different management styles? If it is the same, then you can drop the raster factor from the models, and just keep Code.
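
As a rough illustration of points 5 and 7, a per-species regression on abundance could be sketched as below. This is only a sketch under stated assumptions: the data frame `toy` is invented (it is not the question's data), a Poisson family is assumed for the counts, and the Raster term is left out for simplicity because each toy row is already a separate raster cell. Because abundance was recorded only where a species was seen, such a model compares abundance among occupied cells and says nothing about absences.

```r
# Hypothetical toy data: 4 raster cells per management type for each species
toy <- data.frame(
  SOORTCODE  = rep(c("carexpan", "carexnig"), each = 8),
  management = rep(rep(c("extensive", "intensive"), each = 4), times = 2),
  abundance  = c(12, 8, 10, 9, 3, 2, 4, 1,
                 2, 1, 3, 2, 6, 5, 7, 8)
)

# One Poisson regression per species, using every raster-cell row
fits <- lapply(
  split(toy, toy$SOORTCODE),
  function(d) glm(abundance ~ management, family = poisson, data = d)
)

# Log rate ratio of intensive vs extensive management, per species
effects <- sapply(fits, function(m) coef(m)[["managementintensive"]])

# Back-transformed: ratio of mean abundance, intensive relative to extensive
exp(effects)

# Full coefficient table (estimates, standard errors, p-values) for one species
summary(fits[["carexpan"]])$coefficients
```

With a single categorical predictor, the back-transformed coefficient is simply the ratio of the group means, so a value below 1 indicates lower mean abundance under intensive management in this toy example and a value above 1 indicates higher.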
$\endgroup$
