Using NCBI Genomes

Transcriptome 2.1 update…

Last time, I mentioned potentially using Sam’s transcriptomes, created by filtering cbai_transcriptome_v2.0 by taxa (Alveolata vs. non-Alveolata) to produce cbai_transcriptome_v2.1 and hemat_transcriptome_v2.1. However, when I downloaded those two transcriptomes, they had ~230,000 and ~30,000 sequences respectively - far fewer than the ~1.4 million found in cbai_transcriptome_v2.0. Therefore, for the time being, we are setting this alternative route aside. Alright, on to my main update!

Last time…

In my last post, I described our method of choice for obtaining taxa information. The plan was to do the following:

1. Download all Alveolata and Arthropoda nucleotide sequences from the NCBI database, available from the NCBI Taxonomy Browser
2. Obtain a FASTA file of all DEGs for our two meaningful conditions - elev day 2 vs. amb. day 2, and elev day 0 vs. elev day 2 - by cross-referencing transcript IDs and transcriptome 2.0
3. BLASTn those obtained FASTA files twice - once against all Alveolata sequences and again against our Arthropoda sequences
4. Figure out a good e-value to set as our bar for taxa determination
5. Within each set of DEGs, see how many sequences appear as matches for both Alveolata and Arthropoda, and how many failed to match either

All five of the above steps were completed using this script.
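To give a feel for the sequence-extraction step (pulling the DEG sequences out of the transcriptome by transcript ID), here is a minimal Python sketch. It is not the script linked above, just an illustration: the function names, record IDs, and sequences are all made up, and it assumes the transcript ID is the first word of each FASTA header.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts, so emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subset_fasta(lines, wanted_ids):
    """Keep records whose ID (assumed: first word of the header) is wanted."""
    wanted = set(wanted_ids)
    kept = []
    for header, sequence in read_fasta(lines):
        words = header.split()
        if words and words[0] in wanted:
            kept.append((header, sequence))
    return kept


# Made-up records and DEG IDs, for illustration only.
transcriptome = [
    ">tx1 len=8", "ACGTACGT",
    ">tx2 len=4", "GGCC",
    ">tx3 len=6", "TTA", "GCA",
]
deg_ids = ["tx1", "tx3"]
deg_records = subset_fasta(transcriptome, deg_ids)
```

Since `subset_fasta` only needs an iterable of lines, an open file handle for the real transcriptome could be passed in place of the toy list.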

Here are my results for the Alveolata BLAST and for the Arthropoda BLAST.

Here’s what we got.

	Amb. Day 2 vs. Elev. Day 2	Elev. Day 0 vs. Elev. Day 2
Base Sequences	2069	338
Alveolata (unfiltered)	2391	351
Arthropoda (unfiltered)	2736	349
Alveolata (eval <= 10^-4)	569	60
Arthropoda (eval <= 10^-4)	1137	181
Eval: Alveolata > Arthropoda (presumably Arthropoda)	1175	217
Eval: Alveolata < Arthropoda (presumably Alveolata)	741	66

Note: For the last two rows, duplicate transcript IDs were removed (some transcript IDs mapped to multiple genes within a single BLAST). For each duplicated transcript ID, the hit with the higher e-value was retained.
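The collapse-and-compare logic behind those last two rows can be sketched in a few lines of Python. This is a hedged illustration rather than my actual script: the function names and the toy hits are hypothetical, the optional 10^-4 cutoff is applied here before comparing (an assumption), and assigning a transcript with a hit in only one BLAST to that group is also an assumption. Following the note above, the example keeps the higher e-value among duplicates by passing `max`; passing `min` would keep the strongest hit instead.

```python
def collapse_hits(hits, max_evalue=None, pick=max):
    """Reduce (query_id, evalue) pairs to one e-value per query ID.

    Hits above max_evalue are dropped first (no cutoff if None).
    pick=max keeps the higher e-value among duplicates; pick=min the lower.
    """
    by_query = {}
    for query, evalue in hits:
        if max_evalue is not None and evalue > max_evalue:
            continue
        if query in by_query:
            by_query[query] = pick(by_query[query], evalue)
        else:
            by_query[query] = evalue
    return by_query


def assign_taxa(alveolata, arthropoda):
    """Label each query by whichever BLAST gave the lower e-value."""
    labels = {}
    for query in set(alveolata) | set(arthropoda):
        alv = alveolata.get(query)
        art = arthropoda.get(query)
        if art is None:
            labels[query] = "Alveolata"   # assumption: single-BLAST hit wins
        elif alv is None:
            labels[query] = "Arthropoda"  # assumption: single-BLAST hit wins
        elif alv < art:
            labels[query] = "Alveolata"
        elif alv > art:
            labels[query] = "Arthropoda"
        else:
            labels[query] = "tie"
    return labels


# Toy (query_id, evalue) pairs, not real results.
alv_hits = [("tx1", 1e-50), ("tx1", 1e-20), ("tx2", 1e-6), ("tx4", 0.5)]
art_hits = [("tx1", 1e-10), ("tx2", 1e-40), ("tx3", 1e-8)]

alv = collapse_hits(alv_hits, max_evalue=1e-4)
art = collapse_hits(art_hits, max_evalue=1e-4)
labels = assign_taxa(alv, art)
```

In this toy data, `tx4` drops out entirely because its only hit fails the cutoff, which corresponds to the "failed to match either" group.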

As a quick initial scan, this…is pretty good!! Definitely a good guide to which genes are likely Hematodinium and which are C. bairdi. However, it has a downside - it only examines differentially expressed genes. After considering some whole-transcriptome alternatives, we made a discovery - as of January 2021, someone has uploaded a fairly complete genome for Chionoecetes opilio (snow crab, a congener of C. bairdi) to the NCBI database - complete with a separate file of presumed protein sequences! We also located a full genome for a relatively close relative of Hematodinium - Amoebophrya sp., a dinoflagellate that parasitizes other dinoflagellates. Therefore, our next steps are as follows:

1. BLASTx of transcriptome 2.0 against the protein sequences from the C. opilio genome downloaded from NCBI
2. BLASTn of transcriptome 2.0 against the Amoebophrya sp. genome downloaded from NCBI
3. Use those two BLAST results to determine which genes are Hematodinium in origin and which are C. bairdi for all of Transcriptome 2.0
4. When time allows, BLAST all of Transcriptome 2.0 against the whole NCBI database with a taxonomy filter (run this on Mox)
5. Potential alternative step if needed: Take all C. bairdi libraries of uninfected crab, and assemble a C. bairdi transcriptome. Use that to determine which sequences are Hematodinium and which are C. bairdi
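For a rough idea of what the first two searches in the list above involve, here is a small Python sketch that only builds the BLAST+ command lines (it does not run them). All file and database names are placeholders I made up, and the e-value cutoff and thread count are assumptions borrowed from the DEG analysis, not settings taken from the actual Mox scripts.

```python
import shlex


def makeblastdb_cmd(fasta, dbtype, name):
    """Command to build a BLAST database ('prot' or 'nucl') from a FASTA file."""
    if dbtype not in {"prot", "nucl"}:
        raise ValueError("dbtype must be 'prot' or 'nucl'")
    return ["makeblastdb", "-in", fasta, "-dbtype", dbtype, "-out", name]


def blast_cmd(program, query, db, out, evalue="1e-4", threads=8):
    """Command for a blastn/blastx search with tabular (-outfmt 6) output."""
    if program not in {"blastn", "blastx"}:
        raise ValueError("program must be 'blastn' or 'blastx'")
    return [
        program,
        "-query", query,
        "-db", db,
        "-out", out,
        "-outfmt", "6",
        "-evalue", evalue,              # assumed cutoff
        "-num_threads", str(threads),   # assumed thread count
    ]


# Placeholder file names, for illustration only.
commands = [
    makeblastdb_cmd("copi_proteins.faa", "prot", "copi_prot_db"),
    makeblastdb_cmd("amoebophrya_genome.fna", "nucl", "amoeb_nucl_db"),
    blast_cmd("blastx", "transcriptome_v2.0.fasta", "copi_prot_db", "copi_blastx.tab"),
    blast_cmd("blastn", "transcriptome_v2.0.fasta", "amoeb_nucl_db", "amoeb_blastn.tab"),
]

for cmd in commands:
    print(shlex.join(cmd))
```

Each list could be handed to `subprocess.run(cmd, check=True)` on a machine with BLAST+ installed, or the printed lines pasted into a job script.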

Steps 1-2 have already been completed - a script for those, including all data locations, is available here! The Slurm scripts are currently running on Mox. Once they’re complete, it’s time to move on to Step 3!