So this quarter, I took Sarah Converse’s QERM 514 class on linear modeling. As the final project, we chose a data set to model. My choice was, naturally, some data on Hematodinium and Tanner crab - specifically, I chose the 2007-2012 Southeast Alaska Tanner crab survey data we obtained from Pam Jensen. For these surveys, each row equals one crab, with approximately 14,500 crab in our dataset total. A variety of physical and environmental factors are tracked - most notably, the following:
- Hematodinum infection status (infected or uninfected)
- Location code
- Day and time
- Pot depth
- Carapace width
- Shell condition
- Black Mat disease infection status
- Missing legs code
Of these, I treated year, carapace width, day, and pot depth as continuous, and all others as categorical. I also decided to include the random effects of both year and location, and - of course - included Hematodinium infection status as my response variable. Importantly, Hematodinium infection was determined visually in this survey, so only crabs whose infections had developed to a certain point would be determined to be infected.
I combined all these datasets together and removed any evident entry errors or rows with missing data. I then cleaned up my variables, removing extraneous ones and changing some others. Day and time got changed to Julian day, shell condition 1 and 2 - (new and fresh molts) got grouped together, and missing legs code was changed to binary - were crabs missing legs or not.
Now that my data were clean, I checked for correlation between variables with a Pearson’s test (for continuous vs. continuous), a Cramer’s V test (for categorical vs. categorical), or Spearman rank-order correlation (for continuous vs. categorical). I found a correlation between sex and carapace width, which was unsurprising, as mature male Tanners are much larger than females. This meant that I would only choose one of these effects to put in the full model. Since I had a Bernoulli-distributed response variable (Hematodinium infection status) and random effects (both year and location), I chose to build a generalized linear mixed model (GLMM).
I had some struggles, as the model was too big to run directly using Laplace approximation. I scaled all my continuous variables, and then built a series of Laplace models. Each contained one fixed effect plus my two random effects (note: fixed effect = variable that isn’t a random effect or the response variable). This gave me a rough idea of how important each variable was. I then built two nearly-full models using the update() function. One included all variables except carapace width and missing legs status, and the other included all except sex and missing legs status. This was to test whether sex or carapace width was a better variable to include. Missing legs status was also excluded, as the model just wouldn’t run with it included. When I took the AIC of each model, the model that included carapace width was far better, and so was treated as my new “full model”.
Once I had this “full model”, I used MuMIn::dredge() to test all possible combinations of fixed effects and compare them using AIC. This produced 2 models with weights > 0.001, which I averaged by model weight. Of these 2 models, the better one didn’t include the effect of day, while the worse one did.
Unfortunately, when I tested my model using the Hosmer-Lemeshow test, it failed, indicating that there’s variance within the data that isn’t accounted for in the model. In other words, we’re missing at least one important predictor variable, and the conclusions of this model should be taken with caution. With that said, let’s talk about each of the variables in our model!
Black Mat Code
Included in both models, but not judged to be significant (p-value = 0.54). However, when I examined the data a bit more, I saw that despite sampling ~800 crab with Black Mat, zero crab with both Black Mat and Hematodinium were observed in this survey. Unlike many other surveys, Black Mat status and Hematodinium infection status are tracked in separate columns on the data sheet, so this is unlikely to be a data entry artifact. To me, this indicates that despite all our data, we just didn’t have enough to draw decent conclusions on Black Mat and Hematodinium, as both diseases are somewhat rare.
However, this is extremely intriguing to me. There could certainly be some exclusion effect going on between the two infections. Should do more research into fungal infections impeding dinoflagellate infections and vice-versa!
It sure looks like there’s a strong link to molt time (p-value < 2x10^-16). Fresh-shell crabs have a really low infection rate, it then spikes for new-shell crabs, and then declines as shell age increases. This seems to correspond to initial infection at molt -> infection becomes visible after a period of time -> crab death (or perhaps fighting off infection?).
Really interestingly, missing legs status didn’t seem to matter in our model. In fact, we left it out of our full model because it was the second-least-important variable. This seems to indicate some specific vulnerability associated with molting that isn’t present when the carapace is otherwise compromised
Crab in shallower waters were more likely to be infected (p-value < 2x10^-16). Due to the long time between initial infection and symptoms, this likely indicates movement into shallower waters post-infection, rather than increased infection rates in shallow waters.
This has implications for fishery optimization - if fishing deeper helps avoid bitter crab, boats would probably prefer this. It also raises the question of why this is happening. It could be either habitat preference on the part of the crab or the parasite. Plenty of parasites affect crab behavior, and this could be one of them. On the other hand, it could be simply infected crab trying to deal with infection by moving to areas where something (temperature, food abundance, etc) differs.
Importantly, these are pot surveys, so we really aren’t getting the full breadth of crab width here. But based on this, larger crab were more likely to be infected (p-value < 2x10^-16). Unsure of why this is, but definitely interesting! Maybe sexual maturity means more energy gets devoted to reproduction?
Included in the average model, but likely unimportant - not included in the model with 75% of the weight, and p-value = 0.85.
Importantly, this survey only took place from June to October, so there’s a large chunk of the year that’s not included in this analysis. There certainly could still be an effect of time of year, and our surveys just didn’t occur during enough seasons!
The model was missing at least 1 important source of variation. At first glance, I think it’s most likely to be one of three things - temperature, salinity, or predator abundance. Unfortunately, none of these are tracked by the SE Alaska pot survey. However, the NOAA EBS trawl survey does track temperature and predator abundance, and salinity is much less likely to vary substantially on the eastern Bering Sea than within the fjords of SE Alaska.
Also this really seems like something that could be developed into a thesis chapter - definitely want to start heading that way!
Next step 1: Try to re-run model on NOAA EBS data, see if that locates missing variation
Next step 2: Look more into the potential exclusion of Hematodinium by Black Mat or vice-versa!
Next step 3: Start working on developing into thesis chapter