Quickstart Guide
Installation
The first steps are to clone this repo and then install conda and snakemake. We can then install the Core Genes Extraction (CorGE) library and run tests to make sure everything is properly installed.
git clone git@github.com:Ulthran/ShotgunUnifrac.git
cd ShotgunUnifrac/CorGE
pip install .
cd ../
pytest CorGE/tests/
pytest .tests/
Tip
If you’ve never installed Conda before, you’ll need to add it to your shell’s
path. If you’re running Bash (the most common terminal shell), the following
command will add it to your path: echo 'export
PATH=$PATH:$HOME/miniconda3/bin' > ~/.bashrc
If you see “Tests failed”, file an issue on GitHub.
Setup
We’ll start by creating some dummy inputs to work with using Escherichia coli, Buchnera aphidicola, Cellulomonas gilvus, Dictyoglomus thermophilum, and Methanobrevibacter smithii (a randomly selected group of gut bacteria plus Methanobrevibacter smithii as an outgroup for tree rooting).
echo $'7\n9\n2173' > EX_TXIDS.txt
echo $'GCF_000218545.1\nGCF_000020965.1' > EX_ACCS.txt
This creates two dummy input files
EX_TXIDS.txt
contains species-level taxon ids which CorGE will fetch genes and proteins for from NCBI.EX_ACCS.txt
contains genome accessions which CorGE will fetch from NCBI.
Tip
You can curate genomes/proteins from NCBI using taxon ids and genome accessions but you can also (in the same command) gather local genomes/proteins as well using the --local
flag.
Data curation
To gather the genes and proteins we need for tree building using the files we just created we use CorGE collect_genomes
. Then to filter single copy core genes (SCCGs) from each file and merge nucleotide/amino acid sequences by SCCG we use CorGE extract_genes
.
CorGE collect_genomes --ncbi_species EX_TXIDS.txt --ncbi_accessions EX_ACCS.txt ./
CorGE extract_genes ./
This should create the following directories and files from root
assembly_summary.txt
is downloaded from NCBI to find the best genome accessions for each taxon id.config.yml
is provided to the snakemake pipeline to specify what it should look for and where.nucleotide
is a directory containing all gathered nucleotide-encoded genomes (saved as .fna).protein
is a directory containing all gathered protein-encoded genomes (saved as .faa).outgroup
is a directory containing the nucleotide and protein files for the outgroup (if there is an outgroup).filtered-sequences
is a directory containing each SCCG from each genome (protein-encoded) in their own files.merged-sequences
is a directory containing each SCCG from each genome this time in per-SCCG files.
Tree building
To build the per-SCCG phylogenies and then merge them together we use the snakemake pipeline. Everything should be properly set up from running CorGE so we can just go ahead and run the pipeline.
snakemake all -c --use-conda --conda-prefix .snakemake/
This should create the following directories and files from root
RAxML_outgroupRootedTree.final
is the final consensus tree.aligned-sequences
is a directory containing alignments for the merged-sequences.trees
is a directory containing phylogenies built from each SCCG alignment as well as some intermediates in the merging process.
Tip
--use-conda
causes snakemake to use per-rule defined conda environments while it runs the pipeline. --conda-prefix .snakemake/
tells conda where to put/look for these environments.
Viewing results
The output is RAxML_outgroupRootedTree.final
which can be viewed using any newick-format tree viewer (like ETE Toolkit).
tl;dr
Follow instructions to install anaconda / miniconda and snakemake then
git clone git@github.com:Ulthran/ShotgunUnifrac.git
cd ShotgunUnifrac
echo $'7\n9\n2173' > EX_TXIDS.txt
echo $'GCF_000218545.1\nGCF_000020965.1' > EX_ACCS.txt
cd CorGE
pip install .
cd ..
CorGE collect_genomes --ncbi_species EX_TXIDS.txt --ncbi_accessions EX_ACCS.txt ./
CorGE extract_genes ./
snakemake all -c --use-conda --conda-prefix .snakemake/
You should now have an output called RAxML_outgroupRootedTree.final
.