A Answer Guide

This page contains the answers to the reflection questions asked in the student guide. If you discover a mistake or have a suggestion for additional or alternate reflection questions, please contact us through our Discourse Channel or submit an issue on our GitHub repository.

What information does RStudio show for a .fasta object that we don’t get from a .phyDat object?

The .fasta object also tells us at least some of the names of the samples in the fasta file, as well as the average base composition of all the sequences and how big the fasta file is.

How many omicron sequences are there in the second file (loaded into the object spike.omicron)?

There are four omicron sequences in the second file.

Do you think there is more variability in the omicron sequences than in other variants (alpha, beta, delta, and gamma)? Why or why not?

While adding the omicron sequences to the fasta file increases the number of site patterns in the sequences, this doesn’t necessarily mean the omicron sequences vary more among themselves than the other sequences do. The omicron sequences may just be very different from the other variants.
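The point about site patterns can be checked on a small example. The sketch below is in Python rather than the R used in the student guide, and the sequences and sample names are invented for illustration. It counts distinct alignment columns (site patterns) before and after adding two new samples that are identical to each other but differ from every other sample at the second site.

```python
from collections import Counter

def site_patterns(alignment):
    """Count distinct alignment columns (site patterns).

    `alignment` maps a sample name to its aligned sequence; all sequences
    must have the same length.
    """
    seqs = list(alignment.values())
    if len({len(s) for s in seqs}) > 1:
        raise ValueError("sequences must be aligned to the same length")
    # zip(*seqs) walks the alignment column by column.
    return Counter("".join(col) for col in zip(*seqs))

# Toy data, invented for illustration (not the tutorial's sequences).
base = {
    "ref":   "ACGTACGT",
    "var_a": "ACGTACGA",
    "var_b": "ACGAACGA",
}
# Two new samples that are identical to each other but differ from all
# of the original samples at the second site.
extended = dict(base, new_1="ATGTACGT", new_2="ATGTACGT")

n_base = len(site_patterns(base))
n_extended = len(site_patterns(extended))
print(n_base, n_extended)
```

In this toy case the number of site patterns goes up when the new samples are added, even though the two new samples show no variability between themselves.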

Which variant(s) is/are the most distantly related to the original Wuhan reference sequence?

All the variants other than the alpha variant share the same most recent common ancestor with the original Wuhan reference sequence, so they are all equally distant relatives of the Wuhan reference sequence.

Which variant shares the most recent common ancestor with the omicron variants? What does this mean for determining where the omicron variant came from?

The beta variant shares the most recent common ancestor with the omicron variants. This suggests that the omicron variant probably evolved from an ancestor of the beta branch.

What is the branch length distance between the beta variant and the alpha variant?

The branch length distance between the beta variant and the alpha variant is 13.
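A branch length distance between two tips is the sum of the branch lengths along the path that connects them through their most recent common ancestor. The Python sketch below shows that calculation on a made-up tree; the tip names, node names, and branch lengths are hypothetical and are not the values from the tutorial's tree.

```python
def path_to_root(tree, node):
    """Return {ancestor: distance from `node`} for `node` and its ancestors."""
    dist = {node: 0.0}
    total = 0.0
    while node in tree:  # the root has no entry in `tree`
        parent, length = tree[node]
        total += length
        dist[parent] = total
        node = parent
    return dist

def tip_distance(tree, a, b):
    """Sum the branch lengths on the path between tips `a` and `b`."""
    up_a = path_to_root(tree, a)
    up_b = path_to_root(tree, b)
    # The most recent common ancestor is the shared node closest to `a`.
    shared = [n for n in up_a if n in up_b]
    mrca = min(shared, key=lambda n: up_a[n])
    return up_a[mrca] + up_b[mrca]

# Hypothetical tree, stored as child -> (parent, branch length).
toy_tree = {
    "tip_w": ("n1", 1),
    "tip_x": ("n1", 4),
    "n1": ("root", 2),
    "tip_y": ("root", 5),
}

d_wx = tip_distance(toy_tree, "tip_w", "tip_x")  # 1 + 4
d_xy = tip_distance(toy_tree, "tip_x", "tip_y")  # 4 + 2 + 5
print(d_wx, d_xy)
```

Reading a distance off a plotted tree follows the same steps: trace the path from one tip to the other and add up the lengths of every branch you cross.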

What is the longest branch length in the tree? What does this mean for the number of mutations (compared to the Wuhan reference sequence) seen in that variant versus the others?

The longest branch length is 12, leading to the gamma variant. The gamma variant has a greater number of mutations compared to the Wuhan reference sequence than the others.

Did adding the omicron samples change the branch length distances between the original five samples? What is the branch length distance between the alpha and beta variants now?

Adding the omicron samples did not really change the branch lengths between the original five sequences. The distance between the alpha and beta variants is still 13.

What is the length of the branch connecting the omicron group to the rest of the tree?

The length of the branch connecting the omicron variants to the rest of the tree is 15.

The COVID-19 vaccines were originally designed based on the Wuhan reference sequence of the spike protein. The immune system learns to recognize the spike protein from the vaccine and can identify and destroy any invading SARS-CoV-2 viruses. Using what you have learned about the phylogenetic tree of SARS-CoV-2 variants, can you explain why these initial vaccines were less effective at protecting against the omicron variants than they were against the delta variant?

The distance between any of the omicron variants and the Wuhan reference sequence is much larger than the distance between the delta variant and the Wuhan reference sequence. This tells us the spike protein sequence has mutated more in the omicron variants than in the delta variant, so the omicron spike protein probably looks less like the spike protein the vaccines train the immune system to recognize.

Based on the information from the phyDat objects for the spike protein dataset and the membrane protein dataset, is there the same amount of variation in each protein-coding region?

No, there is much less variation in the membrane protein-coding region than in the spike protein-coding region.

Do you think the spike protein dataset or the membrane protein dataset contains a greater number of phylogenetically-informative sites?

The spike protein dataset contains a greater number of phylogenetically-informative sites.

Is the tree built from the membrane protein data the same as the tree built from the spike protein data?

The tree built from the membrane protein data looks very different from the tree built from the spike protein data. The relationships are much less resolved.

Why is a greater number of phylogenetically-informative sites better for tree building?

A greater number of phylogenetically-informative sites results in more resolution among the taxa in the tree.
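To make "informative" concrete, the Python sketch below counts sites under the common parsimony-informative criterion: a site counts when at least two different bases each appear in at least two sequences. Using this criterion is an assumption, since the student guide may define informative sites differently, and the sequences are invented for illustration.

```python
from collections import Counter

def informative_sites(seqs):
    """Return 0-based indices of parsimony-informative alignment columns."""
    sites = []
    for i, col in enumerate(zip(*seqs)):
        # Count only unambiguous bases; gaps and ambiguity codes are skipped.
        counts = Counter(base for base in col if base in "ACGT")
        shared_states = [b for b, n in counts.items() if n >= 2]
        if len(shared_states) >= 2:
            sites.append(i)
    return sites

# Toy alignment, invented for illustration.
toy = [
    "AACGT",
    "AACGA",
    "ATCTA",
    "ATCTC",
]

sites = informative_sites(toy)
print(sites)
```

Columns where every sequence has the same base, or where only one sequence differs, cannot group samples together, so they do not help resolve the branching order. A dataset with few informative columns therefore tends to give a poorly resolved tree.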