Completed GO-MWU for new analyses

Speeding Things Up

As of my last post (yesterday), I was trying to figure out a way to rapidly obtain large numbers of GO IDs. Last time, I used Sam’s shell script by calling it inside a Jupyter notebook. However, Sam’s script took an extremely long time to run. A rough ballpark: for my 2 analyses (each with ~120,000 accession IDs), it would take a total of about 8 days on my local machine. That’s a huge pipeline bottleneck, and so we found an alternative.

New Script for Uniprot to GO

I created an R script for obtaining GO terms from accession IDs. Downside: it requires you to manually download the SwissProt database - with all GO terms included - from https://www.uniprot.org/uniprot/. Upside: it’s much, much, much faster. It ran in a few minutes, which by my calculation, is a bit faster than 8 days.

After getting GO terms, you need to remove duplicate lines prior to inputting to GO-MWU. So after using the R script (if you choose that path), switch over to the other script and begin running the analyses.

Using Sam’s shell script is probably still preferable if you’re looking at small numbers of genes or have a really fast internet connection. Otherwise, the new R script is likely preferable.

Running GO-MWU

Once I got past that bottleneck, I ran GO-MWU on my two analyses.

Reminder of what those analyses consisted of:

Elevated Day 2 individual libraries vs. Ambient Day 0+2 individual libraries
Ambient Day 0+2+17 libraries + Elevated Day 0 libraries + Lowered Day 0 libraries vs. Elevated Day 2 libraries. Reminder: Day 0 samples were taken while all crabs were still held at ambient temperatures.

Here are my results for each comparison:

Elevated Day 2 vs. Ambient Day 0+2, individual libraries

[Hierarchical clustering tree](https://github.com/afcoyle/hemat_bairdi_transcriptome/blob/main/graphs/GOMWU_output/cbai_transcriptomev2.0/elev2_vs_amb02_indiv.png.

Red = up-regulated, blue = down-regulated (up-regulated = higher at elevated)
Fraction: number of “good candidates” relative to total num of genes in category. A “good gene” is one with an unadjusted p-value <0.05.

Table of reformatted and augmented GO terms

Dissimilarity matrix of GO categories

Results of MWU test

Ambient Day 0+2+17 + Elevated Day 0 + Lowered Day 0 vs. Elevated Day 2

Hierarchical clustering tree

Table of reformatted and augmented GO terms

Dissimilarity matrix of GO categories

Results of MWU test

Quick Conclusions:

It sure looks like the first comparison gave more meaningful results than the second comparison. This could be because we sampled the same crabs in both groups (e.g. a sample of Crab G on Day 0 is part of the Amb Day 0/2/17 + Elev Day 0 + Low Day 0 group, a sample of Crab G on Day 2 is part of the Elevated Day 2 group).

It might be worth redoing this latter analysis with the Elevated Day 0 crabs dropped from the ambient treatment group altogether. Alternately, it may be neat to look at Elevated Day 0 vs. Elevated Day 2 - essentially, a pure examination of the effects of temperature on the same crab