---
jupytext:
  formats: md:myst
  text_representation:
    extension: .md
    format_name: myst
    format_version: 0.13
    jupytext_version: 1.11.5
kernelspec:
  display_name: Python 3
  language: python
  name: python3
---
# Functional annotation
## Required databases
Functional annotation requires two reference databases: a DIAMOND-formatted database for the ortholog search and the EggNOG database for mapping the resulting hits to annotations. The commands below download both using MOSHPIT.

```{code-cell}
mosh annotate fetch-diamond-db \
    --o-db ./cache:diamond_db \
    --verbose
```

```{code-cell}
mosh annotate fetch-eggnog-db \
    --o-db ./cache:eggnog_db \
    --verbose
```
Alternatively, you can build the DIAMOND database yourself with one of the following:
- `mosh annotate build-eggnog-diamond-db` to create a DIAMOND-formatted reference database for the specified taxon.
- `mosh annotate build-custom-diamond-db` to create a DIAMOND-formatted reference database from a FASTA input file.

## EggNOG search using the DIAMOND aligner
We will search the dereplicated MAGs against the EggNOG database using the DIAMOND aligner to identify orthologs, which are then annotated in the next step.
```{code-cell}
mosh annotate search-orthologs-diamond \
    --i-seqs ./cache:mags_derep \
    --i-db ./cache:diamond_db \
    --p-num-cpus 16 \
    --p-db-in-memory \
    --o-eggnog-hits ./cache:eggnog_hits \
    --o-table ./cache:eggnog_ft \
    --o-loci ./cache:eggnog_loci \
    --verbose
```
## Annotate orthologs against the EggNOG database
The orthologs identified in the dereplicated MAGs are annotated against the EggNOG database, which provides functional
information about the genes and gene products present in the MAGs.
```{code-cell}
mosh annotate map-eggnog \
    --i-eggnog-hits ./cache:eggnog_hits \
    --i-db ./cache:eggnog_db \
    --p-num-cpus 16 \
    --p-db-in-memory \
    --o-ortholog-annotations ./cache:eggnog_annotations \
    --verbose
```
## Extract annotations
This step extracts one type of annotation from the EggNOG annotation table and calculates its frequency across all MAGs.
```{note}
The `mosh annotate extract-annotations` method allows us to extract specific types of functional annotations, such as 
**CAZymes**, **KEGG pathways**, **COG categories**, or other functional elements, and calculate their frequency across 
all dereplicated MAGs. 

In this tutorial, we focus on demonstrating the extraction of **CAZymes**.
```
```{code-cell}
mosh annotate extract-annotations \
    --i-ortholog-annotations ./cache:eggnog_annotations \
    --p-annotation caz \
    --p-max-evalue 0.0001 \
    --o-annotation-frequency ./cache:caz_annot_ft \
    --verbose
```
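
To make the idea of "annotation frequency" concrete, here is a minimal sketch in plain Python. It is not how `extract-annotations` is implemented: the row layout, the MAG IDs, the `-` placeholder for missing annotations, and the choice to count each family in a comma-separated entry once are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical annotation rows: (MAG ID, e-value, comma-separated CAZy families).
# A "-" stands for "no CAZy annotation" in this toy example.
rows = [
    ("mag_a", 1e-30, "GH13,CBM48"),
    ("mag_a", 1e-2, "GT2"),   # discarded: e-value above the threshold
    ("mag_a", 1e-10, "GH13"),
    ("mag_b", 1e-50, "GT2"),
    ("mag_b", 1e-8, "-"),     # discarded: no annotation of this type
]


def annotation_frequency(rows, max_evalue):
    """Count how often each annotation occurs in each MAG."""
    counts = defaultdict(Counter)
    for mag, evalue, families in rows:
        if evalue > max_evalue or families == "-":
            continue
        for family in families.split(","):
            counts[mag][family] += 1
    return {mag: dict(counter) for mag, counter in counts.items()}


freq = annotation_frequency(rows, max_evalue=0.0001)
print(freq)
```

The result is a MAG-by-annotation table, which is the shape expected by the table multiplication step that follows.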

## Multiply tables
This step calculates the dot product of the `mags_derep_ft` and `caz_annot_ft` feature tables. Combining the abundance
of each MAG in each sample with the frequency of each annotation (e.g., **CAZymes**) in each MAG gives an estimate of
the total frequency of each annotation in each sample.

```{code-cell}
mosh annotate multiply-tables \
    --i-table1 ./cache:mags_derep_ft \
    --i-table2 ./cache:caz_annot_ft \
    --o-result-table ./cache:caz_ft \
    --verbose
```
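
The dot product described above can be sketched with two toy tables in plain Python. The sample names, MAG names, and all numbers are made up, and the nested-dictionary layout is only an assumption chosen for readability; the real artifacts are feature tables.

```python
# Rows are samples, columns are MAGs: abundance of each MAG in each sample.
mag_abundance = {
    "sample1": {"mag_a": 2.0, "mag_b": 0.0},
    "sample2": {"mag_a": 1.0, "mag_b": 3.0},
}
# Rows are MAGs, columns are annotations: copies of each CAZyme family per MAG.
annotation_counts = {
    "mag_a": {"GH13": 1, "GT2": 4},
    "mag_b": {"GH13": 2, "GT2": 0},
}


def multiply_tables(table1, table2):
    """Dot product: (samples x MAGs) . (MAGs x annotations)."""
    result = {}
    for sample, mags in table1.items():
        row = {}
        for mag, abundance in mags.items():
            # MAGs missing from table2 contribute nothing to the sum.
            for annotation, count in table2.get(mag, {}).items():
                row[annotation] = row.get(annotation, 0.0) + abundance * count
        result[sample] = row
    return result


sample_by_annotation = multiply_tables(mag_abundance, annotation_counts)
print(sample_by_annotation)
```

For `sample2`, the GH13 value is `1.0 * 1 + 3.0 * 2 = 7.0`: each MAG contributes its annotation count weighted by its abundance in that sample.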

## Let's have a look at our CAZymes functional diversity!
We will start by calculating a Bray-Curtis dissimilarity matrix to measure the dissimilarity between each pair of samples, based on
the observed frequency of different CAZyme annotations in each sample.
```{code-cell}
mosh diversity beta \
    --i-table ./cache:caz_ft \
    --p-metric braycurtis \
    --o-distance-matrix ./cache:caz_braycurtis_dist
```
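
For reference, the Bray-Curtis dissimilarity between two samples $u$ and $v$ is $\sum_i |u_i - v_i| / \sum_i (u_i + v_i)$. The following sketch computes it for two toy samples; the sample contents are invented, and raising an error for two empty samples is a choice made here, not necessarily what the `braycurtis` metric does.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two annotation-frequency profiles."""
    features = set(u) | set(v)
    numerator = sum(abs(u.get(f, 0.0) - v.get(f, 0.0)) for f in features)
    denominator = sum(u.get(f, 0.0) + v.get(f, 0.0) for f in features)
    if denominator == 0:
        raise ValueError("Bray-Curtis is undefined for two empty samples")
    return numerator / denominator


# Toy CAZyme frequencies per sample (made-up values).
sample1 = {"GH13": 2.0, "GT2": 8.0}
sample2 = {"GH13": 7.0, "GT2": 4.0}

distance = bray_curtis(sample1, sample2)
print(round(distance, 4))
```

The value ranges from 0 (identical profiles) to 1 (no shared annotations). Here it is `(5 + 4) / (9 + 12) = 9 / 21`.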

Next, we will perform principal coordinate analysis (PCoA) on the resulting Bray-Curtis matrix.
```{code-cell}
mosh diversity pcoa \
    --i-distance-matrix ./cache:caz_braycurtis_dist \
    --o-pcoa ./cache:caz_braycurtis_pcoa
```
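
As a rough illustration of what PCoA does with a distance matrix, the sketch below double-centers the squared distances (Gower centering) and then extracts only the first axis with power iteration. This is a simplification chosen to avoid dependencies, not the algorithm used by the `pcoa` action: real implementations compute a full eigendecomposition, and power iteration assumes that the dominant eigenvalue is positive and that the start vector is not orthogonal to its eigenvector. The toy distance matrix is made up.

```python
import math


def double_center(dist):
    """Gower centering: B = -1/2 * J * D^2 * J."""
    n = len(dist)
    a = [[-0.5 * dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_means = [sum(row) / n for row in a]
    grand_mean = sum(row_means) / n
    return [
        [a[i][j] - row_means[i] - row_means[j] + grand_mean for j in range(n)]
        for i in range(n)
    ]


def first_axis(b, iterations=200):
    """Leading eigenpair of B via power iteration, scaled to coordinates."""
    n = len(b)
    v = [float(i + 1) for i in range(n)]  # arbitrary start vector
    for _ in range(iterations):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            return 0.0, [0.0] * n
        v = [x / norm for x in w]
    bv = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
    eigenvalue = sum(v[i] * bv[i] for i in range(n))
    scale = math.sqrt(max(eigenvalue, 0.0))
    return eigenvalue, [x * scale for x in v]


# Toy distances between three samples that lie on a line at positions 0, 1, 2.
toy_dist = [
    [0.0, 1.0, 2.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
]
eigenvalue, coords = first_axis(double_center(toy_dist))
print(eigenvalue, coords)
```

Because the three toy samples lie on a line, a single axis reproduces their pairwise distances exactly. With real Bray-Curtis data, several axes are needed and each explains only part of the variation.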
Visualization time! Let's plot the PCoA results.
```{code-cell}
mosh emperor plot \
    --i-pcoa ./cache:caz_braycurtis_pcoa \
    --m-metadata-file ./metadata.tsv \
    --o-visualization caz-pcoa.qzv
```

Your visualization should look similar to [this one](https://view.qiime2.org/visualization/?src=https://raw.githubusercontent.com/bokulich-lab/moshpit-docs/main/moshpit_docs/data/bray-curtis-emperor.qzv).
```{tip}
Once your visualization is ready, click on the `Color` tab at the top right and select `scatter:seed` on the first tab 
to color your samples by seed type. Then click on the `Animations` tab and choose `timepoint` as gradient and `seed` 
as trajectory. Now, press play! You should see the progression of samples over time.
```