version: 02 October, 2023

A B O U T   T H I S   D O C U M E N T

This document walks through the steps to process 2bRAD reads de novo, without an existing reference genome.

Be sure to read through what you are doing and follow instructions before copy/pasting code chunks.

Copy the code chunks into your terminal, taking care to change the necessary portions to fit your data/ your directory structure, etc.

First, you will need to replace reckert2017 with your user name and reckert2017@fau.edu with your email throughout this code. Several later chunks also use an $EMAIL variable; set it once now (and add it to your ~/.bashrc to persist across logins):

export EMAIL="reckert2017@fau.edu"


S E T U P


Download all necessary scripts and load modules for processing/analysis

Log in to KoKo

ssh reckert2017@koko-login.hpc.fau.edu

Load necessary modules

Alternatively, you can add these module loads to ~/.bashrc so they load at login (use nano ~/.bashrc)

module load angsd-0.933-gcc-9.2.0-65d64pp
module load bayescan-2.1-gcc-8.3.0-7gakqmd
module load qt-5.15.2-gcc-9.2.0-zi7wcem BayeScEnv/1.1
module load bcftools-1.9-gcc-8.3.0-il4d373
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
module load cdhit-4.8.1-gcc-8.3.0-bcay75d
module load htslib-1.9-gcc-8.3.0-jn7ehrc
module load kraken2-2.1.1-gcc-9.2.0-ocivj3u
module load python-3.7.4-gcc-8.3.0-3tniqr5
module load launcher
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
module load ncbi-toolkit-22_0_0-gcc-9.2.0-jjhd2wa
module load ngsadmix-32-gcc-8.3.0-qbnwmpq
module load ngsRelate/v2
module load R/3.6.1
module load samtools-1.10-gcc-8.3.0-khgksad
module load vcftools-0.1.14-gcc-8.3.0-safy5vc

Download scripts

Put the scripts you need into ~/bin or a similar directory that is on the PATH set in your .bashrc. IF you don’t have a bin directory in your $HOME you can make one now (mkdir ~/bin)

cd ~/bin
svn checkout https://github.com/RyanEckert/Stephanocoenia_FKNMS_PopGen/trunk/scripts/
mv scripts/* .
rm -rf scripts # the checkout directory still contains .svn metadata, so plain rm will not remove it

wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.0.7.1.zip
unzip PGDSpider_2.0.7.1.zip
rm PGDSpider_2.0.7.1.zip

git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd

conda activate 2bRAD # NOTE: this conda environment is created in the cutadapt section below (conda create -n 2bRAD cutadapt); create it first if you have not already

pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .

cd ~/bin

git clone https://bitbucket.org/simongravel/moments.git
cd moments

conda activate 2bRAD

conda install -c bioconda moments

pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .

svn checkout https://github.com/xiaoming-liu/stairway-plot-v2.git

mv stairway-plot-v2.git/trunk/stairway_plot_v2.1.1.zip .
unzip stairway_plot_v2.1.1.zip
rm -r stairway_plot_v2.1.1.zip stairway-plot-v2.git

Make all scripts executable

chmod +x *.sh *.pl *.py

IF you haven’t already, it is useful to add ~/bin to your $PATH.
This way you can easily access your executable scripts without specifying the absolute path to them.
Otherwise skip to “Build working directory”

PATH="$HOME/bin:$PATH";

To permanently add this to your $PATH add to .bashrc use nano text editor

nano ~/.bashrc

Add the following line to your .bashrc (under the PATHS section if you have one): export PATH="$HOME/bin:$PATH";
exit nano with ctrl+x

source ~/.bashrc
echo $PATH

Build working directory

cd
mkdir -p 2bRAD/sint/fknms/rawReads/
cd 2bRAD/sint/fknms/rawReads/


D O W N L O A D   R E A D S


Download and concatenate raw reads from BaseSpace

If you have not previously, download BaseSpaceCLI

wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs

chmod +x ~/bin/bs

Run bs auth below, then go to the URL it prints and confirm authorization by logging in to your BaseSpace account.

bs auth

Making a script to download the reads and merge samples across 2 NovaSeq lanes


echo '#!/bin/bash' > downloadReads.sh
echo 'bs download project --concurrency=high -q -n JA21001 -o .' >> downloadReads.sh
# -n is the project name and -o is the output directory

echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'rmdir SA*' >>downloadReads.sh
echo 'mkdir ../concatReads' >> downloadReads.sh
echo 'cp *.gz ../concatReads' >> downloadReads.sh
echo 'cd ../concatReads' >> downloadReads.sh
echo 'mergeReads.sh -o mergeTemp' >> downloadReads.sh
# -o is the directory to put output files in

echo 'rm *L00*' >> downloadReads.sh
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'gunzip *.gz' >> downloadReads.sh
echo 'rmdir mergeTemp' >> downloadReads.sh

chmod +x downloadReads.sh

launcher_creator.py -b 'srun downloadReads.sh' -n downloadReads -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch downloadReads.slurm


How many reads before filtering?

echo '#!/bin/bash' >rawReads
echo readCounts.sh -e .fastq -o sintRaw >>rawReads

sbatch -o rawReads.o%j -e rawReads.e%j --mail-type=ALL --mail-user=reckert2017@fau.edu rawReads

T R I M M I N G   &   F I L T E R I N G


Trim and demultiplex reads

cd ../concatReads

2bRAD_trim_launch_dedup.pl fastq > trims.sh
launcher_creator.py -j trims.sh -n trims -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB trims.slurm

Check that we have the correct number of trim files (226 in this case)

ls -l *.tr0 | wc -l

mkdir ../trimmedReads
srun mv *.tr0 ../trimmedReads &

zipper.py -f fastq -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm

cd ../trimmedReads

Rename sequence files using sampleRename.py. This script needs a .csv with XXX. Make sure you use the reverse complement of your inline BCs!

srun sampleRename.py -i sampleList -n fk_ -f tr0
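
If you need to double-check the reverse complement of an inline barcode, a quick shell one-liner does it (the barcode AGAC below is just an example):

echo "AGAC" | rev | tr ACGT TGCA # prints GTCT, the reverse complement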

Quality filtering using cutadapt

I can’t get KoKo’s module for cutadapt to work, so we’ll use miniconda instead. Run the chunk below if you don’t have a conda environment set up; otherwise skip to the next chunk

module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
conda config --add channels defaults
conda config --add channels bioconda
conda create -n 2bRAD cutadapt

Trimming read ends with quality below Q15 (and discarding reads shorter than 36 bp) for de novo analysis

source activate 2bRAD

echo '#!/bin/bash' > trimse.sh
echo 'module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj' >> trimse.sh
echo 'source activate 2bRAD' >> trimse.sh
for file in *.tr0; do
echo "cutadapt -q 15,15 -m 36 -o ${file/.tr0/}.trim $file > ${file/.tr0/}.trimlog.txt" >> trimse.sh;
done

I can’t get this to run through launcher, so just run it serially. It takes a while, so consider breaking it up into several jobs.

sbatch -o trimse.o%j -e trimse.e%j --mem=200GB trimse.sh

Do we have the expected number of *.trim files?

conda deactivate
ls -l *.trim | wc -l

How many reads in each sample?

echo '#!/bin/bash' >sintReads
echo readCounts.sh -e trim -o sintFilt >>sintReads
sbatch --mem=200GB sintReads

mkdir ../filteredReads
mv *.trim ../filteredReads

zipper.py -f tr0 -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm

cat sintFiltReadCounts


D E N O V O   R E F E R E N C E


Construct denovo reference for aligning reads

Remove Symbiodiniaceae reads

If you’ve not already, build a bowtie2 reference for the concatenated zoox genomes (this assumes you have already concatenated your Symbiodiniaceae genome assemblies into symbConcatGenome.fasta); otherwise skip ahead to the next chunk

mkdir ~/bin/symGenomes
cd ~/bin/symGenomes
echo "bowtie2-build symbConcatGenome.fasta symbConcatGenome" > bowtie2-build
launcher_creator.py -j bowtie2-build -n bowtie2-build -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
sbatch --mem=200GB bowtie2-build.slurm
module load samtools-1.10-gcc-8.3.0-khgksad
srun samtools faidx symbConcatGenome.fasta

Mapping reads to the concatenated Symbiodiniaceae genome

cd ~/2bRAD/sint/fknms/filteredReads

mkdir symbionts
SYMGENOME=~/bin/symGenomes/symbConcatGenome

2bRAD_bowtie2_launcher.py -g $SYMGENOME -f .trim -n zooxMaps --split -u un -a zoox --aldir symbionts --launcher -e reckert2017@fau.edu

sbatch zooxMaps.slurm

Checking Symbiodiniaceae read mapping rates

>zooxAlignmentRates
for F in `ls *trim`; do
M=`grep -E '^[ATGCN]+$' $F | wc -l | grep -f - zooxMaps.e* -A 4 | tail -1 | perl -pe 's/zooxMaps\.e\d+-|% overall alignment rate//g'` ;
echo "$F.sam $M">>zooxAlignmentRates;
done

# ls *trim | cut -d '.' -f 1 >align1
# grep "% overall" zooxMaps.e* | cut -d ' ' -f 1 >align2
# paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >zooxAlignmentRates
# rm align1 align2

less zooxAlignmentRates

Clean up directory

zipper.py -f .trim -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm

mv *.sam symbionts/

cd ~/2bRAD/sint/fknms/
mv filteredReads/symbionts .

Uniquing reads

‘stacking’ individual trimmed fastq reads:

cd filteredReads
ls *.trim.un | perl -pe 's/^(.+)$/uniquerOne.pl $1 >$1\.uni/' > unique

launcher_creator.py -j unique -n unique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB unique.slurm

Checking there is a .uni for all samples

ls -l *.uni | wc -l

Collecting common tags (major alleles).

Merging uniqued files (set minInd to >10, or >10% of total number of samples, whichever is greater).
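
(For example: this dataset has 226 samples, so 10% of samples ≈ 23, and minInd=30 clears both thresholds.)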

echo 'mergeUniq.pl uni minInd=30 > all.uniq' > allunique

launcher_creator.py -j allunique -n allunique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB allunique.slurm

Discarding tags that have more than 7 observations without reverse-complement

srun awk '!($3>7 && $4==0) && $2!="seq"' all.uniq >all.tab # drops the header row and any tag seen >7 times with zero reverse-complement observations

Creating fasta file out of merged and filtered tags:

srun awk '{print ">"$1"\n"$2}' all.tab >all.fasta

Clustering allowing for up to 3 mismatches (-c 0.91); the most abundant sequence becomes reference
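
For reference: 2bRAD (BcgI) tags are 36 bp, so requiring 33/36 ≈ 0.917 identity corresponds to 3 mismatches; -c 0.91 sits just below that cutoff.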

echo '#!/bin/bash' >cdhit
echo cd-hit-est -i all.fasta -o cdh_alltags.fas -aL 1 -aS 1 -g 1 -c 0.91 -M 0 -T 0 >>cdhit
sbatch --mem=200GB -e cdhit.e%j -o cdhit.o%j cdhit

rm *.uni

Remove contamination

We can remove contamination sequences from our denovo reference with kraken

We need a Kraken database to search against. If you don’t already have one, you need to build one now. Otherwise you can skip to running Kraken

We can download a compiled standard database:

mkdir ~/bin/krakenDB
cd ~/bin/krakenDB

srun wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20210127.tar.gz
echo tar -xvf k2_pluspf_20210127.tar.gz >tar

launcher_creator.py -j tar -n tar -q mediumq7 -t 24:00:00 -e reckert2017@fau.edu

sbatch tar.slurm

Alternatively, we can build a custom database, which can include the Symbiodiniaceae genomes

echo '#!/bin/bash' >krakendb.sh
echo kraken2-build --download-taxonomy --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library archaea --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library bacteria --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library viral --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library human --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library fungi --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library protozoa --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library UniVec_Core --threads 16 --db ~/bin/krakenDB >>krakendb.sh

sbatch --mem=200GB -p longq7 -e krakenDB.e%j -o krakenDB.o%j krakendb.sh

Format and add Symbiodiniaceae genomes to the database

cd ~/bin/symGenomes

# Symbiodinium microadriaticum
sed '/>/ s/$/|kraken:taxid|2951/' Symbiodinium_microadriacticum_genome.scaffold.fasta >S_microadriacticum.fa

# Breviolum minutum
sed '/>/ s/$/|kraken:taxid|2499525/' Breviolum_minutum.v1.0.genome.fa >B_minutum.fa

# Cladocopium goreaui
sed '/>/ s/$/|kraken:taxid|2562237/' Cladocopium_goreaui_Genome.Scaffolds.fasta >C_goreaui.fa

# Durusdinium trenchii
sed '/>/ s/$/|kraken:taxid|1381693/' 102_symbd_genome_scaffold.fa >D_trenchii.fa

echo '#!/bin/bash' >kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/S_microadriacticum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/B_minutum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/C_goreaui.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/D_trenchii.fa --db ~/bin/krakenDB >>kdbAdd

sbatch --mem=200GB -o kdbAdd.o%j -e kdbAdd.e%j kdbAdd

Finally, build the database

echo '#!/bin/bash' >kdbBuild
echo kraken2-build --download-taxonomy --threads 16 --db ~/bin/krakenDB >>kdbBuild
echo kraken2-build --build --db ~/bin/krakenDB >>kdbBuild
sbatch --mem=200GB -o kdbBuild.o%j -e kdbBuild.e%j kdbBuild

Remove potential contamination from reference

cd ~/2bRAD/sint/fknms/filteredReads

echo '#!/bin/bash' >krakenDB
echo kraken2 --db ~/bin/krakenDB cdh_alltags.fas --threads 16 --classified-out cdh_alltags.contam.fa --unclassified-out cdh_alltags.unclass.fa --report krakenDB.report --output krakenDB.out >>krakenDB

sbatch --mem=200GB -o krakenDB.o%j -e krakenDB.e%j krakenDB

Additionally, I use a Kraken database built from the NCBI BLAST nt database to remove additional sequences not mapping to cnidarian taxa.

echo '#!/bin/bash' >krakenNT
echo kraken2 --db ~/bin/krakenNT/nt_db cdh_alltags.unclass.fa --threads 16 --classified-out cdh_alltags.class.fa --unclassified-out sint_denovo.fa --report krakenNT.report --output krakenNT.out >>krakenNT

sbatch --mem=200GB -o krakenNT.o%j -e krakenNT.e%j krakenNT

extract_kraken_reads.py -s cdh_alltags.class.fa -k krakenNT.out -r krakenNT.report -t 6073 --include-children -o sint_denovo.fa --append # taxid 6073 = Cnidaria

Construct denovo genome

With 30 pseudo chromosomes from clean major allele tags

mkdir ../mappedReads
mv sint_denovo.fa ../mappedReads
cd ../mappedReads

concatFasta.pl fasta=sint_denovo.fa num=30
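
As a sanity check, the concatenated reference should now contain exactly 30 pseudo-chromosome headers:

grep -c '>' sint_denovo_cc.fasta # should print 30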

Format pseudo genome

GENOME_FASTA=sint_denovo_cc.fasta

echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build $GENOME_FASTA $GENOME_FASTA >>genomeBuild.sh
echo samtools faidx $GENOME_FASTA >>genomeBuild.sh

sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mem=200GB genomeBuild.sh

M A P P I N G   R E A D S   T O   R E F E R E N C E


Mapping reads to reference and formatting bam files

Map reads to fake genome:

mv ../filteredReads/*.un .
mv ../symbionts . # symbionts was moved up to ~/2bRAD/sint/fknms earlier

GENOME_FASTA=sint_denovo_cc.fasta

# mapping with --local option, enables clipping of mismatching ends (guards against deletions near ends of RAD tags)
2bRAD_bowtie2_launcher.py -f un -g $GENOME_FASTA --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm

Do we have the right number of SAM files?

ls *.sam | wc -l 

Checking alignment rates

ls *un | cut -d '.' -f 1 >align1
grep "% overall" maps.e* | cut -d ' ' -f 1 >align2
>alignmentRates
paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >alignmentRates
rm align1 align2

less alignmentRates

Convert SAM files to BAM files

BAM files will be used for genotyping, population structure, etc.

>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done

launcher_creator.py -j s2b -n s2b -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB s2b.slurm

Do we have enough BAM files?

ls *bam | wc -l  # should be the same number as number of trim files

Clean up directory

zipper.py -a -9 -f sam --launcher -e reckert2017@fau.edu
sbatch zip.slurm

rm *.un

G E N O T Y P I N G


“FUZZY genotyping” with ANGSD - without calling actual genotypes but working with genotype likelihoods at each SNP. Optimal for low-coverage data (<10x).

mkdir ../ANGSD
cd ../ANGSD
mv ../mappedReads/*.bam* .

ls *bam >bamsClones

Assessing base qualities and coverage depth

ANGSD settings:

-minMapQ 20: only highly unique mappings (probability of erroneous mapping <= 1%)
-baq 1: realign around indels (not terribly relevant for 2bRAD reads mapped with --local option)
-maxDepth: highest total depth (sum over all samples) to assess; set to 10x the number of samples (10 x 226 = 2260 here)
-minInd: the minimal number of individuals the site must be genotyped in; reset to 50% of total N at this stage (113 here)

export FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -maxDepth 2260 -minInd 113"
export TODO="-doQsDist 1 -doDepth 1 -doCounts 1 -dumpCounts 2"

echo '#!/bin/bash' >sintDD.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out dd >>sintDD.sh

sbatch --mem=200GB -o sintDD.o%j -e sintDD.e%j --mail-user=reckert2017@fau.edu --mail-type=ALL sintDD.sh

Summarizing results (using Misha Matz’s modification of a script by Matteo Fumagalli)

echo '#!/bin/bash' >RQC.sh
echo Rscript ~/bin/plotQC.R prefix=dd >>RQC.sh
echo gzip -9 dd.counts >>RQC.sh
sbatch -e RQC.e%j -o RQC.o%j --dependency=afterok:460550 --mem=200GB RQC.sh # replace 460550 with the job ID returned by your sintDD.sh submission

Proportion of sites covered at >5X:

cat quality.txt

scp dd.pdf to your laptop to look at the distribution of base quality scores, the fraction of sites in each sample passing coverage thresholds, and the fraction of sites passing genotyping-rate cutoffs. Use these to guide your choices of the -minQ, -minIndDepth, and -minInd filters in subsequent ANGSD runs

Identifying clones and technical replicates

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 192 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo '#!/bin/bash' > sintClones.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out sintClones >>sintClones.sh

sbatch --mem=200GB -o sintClones.o%j -e sintClones.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintClones.sh

Use the IBS matrix to identify clones via hierarchical clustering in R. scp the matrix to your local machine and run the chunk below in R

if (!require("pacman")) install.packages("pacman")

pacman::p_load("dendextend", "ggdendro", "paletteer", "tidyverse") # paletteer is needed for paletteer_d() below

cloneBams = read.csv("../data/stephanocoeniaMetaData.csv") # list of bam files

cloneMa = as.matrix(read.table("../data/snps/clones/sintClones.ibsMat")) # reads in IBS matrix produced by ANGSD 

dimnames(cloneMa) = list(cloneBams[,1],cloneBams[,1])
clonesHc = hclust(as.dist(cloneMa),"ave")

clonePops = cloneBams$region
cloneDepth = cloneBams$depthZone

cloneDend = cloneMa %>% as.dist() %>% hclust(.,"ave") %>% as.dendrogram()
cloneDData = cloneDend %>% dendro_data()

# Making the branches hang shorter so we can easily see clonal groups
cloneDData$segments$yend2 = cloneDData$segments$yend
for(i in 1:nrow(cloneDData$segments)) {
  if (cloneDData$segments$yend2[i] == 0) {
    cloneDData$segments$yend2[i] = (cloneDData$segments$y[i] - 0.01)}}

cloneDendPoints = cloneDData$labels
cloneDendPoints$pop = clonePops[order.dendrogram(cloneDend)]
cloneDendPoints$depth=cloneDepth[order.dendrogram(cloneDend)]
rownames(cloneDendPoints) = cloneDendPoints$label

# Making points at the leaves to place symbols for populations
point = as.vector(NA)
for(i in 1:nrow(cloneDData$segments)) {
  if (cloneDData$segments$yend[i] == 0) {
    point[i] = cloneDData$segments$y[i] - 0.01
  } else {
    point[i] = NA}}

cloneDendPoints$y = point[!is.na(point)]

techReps = c("S066.1", "S066.2", "S066.3", "S162.1", "S162.2", "S162.3", "S205.1", "S205.2", "S205.3")
cloneDendPoints$depth = factor(cloneDendPoints$depth,levels(cloneDendPoints$depth)[c(2,1)])

cloneDendPoints$pop = factor(cloneDendPoints$pop,levels(cloneDendPoints$pop)[c(4, 1, 3, 2)])

flPal = paletteer_d("vapoRwave::jazzCup")[c(2:5)]

cloneDendA = ggplot() +
  geom_segment(data = segment(cloneDData), aes(x = x, y = y, xend = xend, yend = yend2), size = 0.5) +
  geom_point(data = cloneDendPoints, aes(x = x, y = y, fill = pop, shape = depth), size = 4, stroke = 0.25) +
  #scale_fill_brewer(palette = "Dark2", name = "Population") +
  scale_fill_manual(values = flPal, name= "Population")+
  scale_shape_manual(values = c(24, 25), name = "Depth Zone")+
  geom_hline(yintercept = 0.12, color = "red", lty = 5, size = 0.75) + # creating a dashed line to indicate a clonal distance threshold
  geom_text(data = subset(cloneDendPoints, subset = label %in% techReps), aes(x = x, y = (y - .015), label = label), angle = 90) + # spacing technical replicates further from leaf
  geom_text(data = subset(cloneDendPoints, subset = !label %in% techReps), aes(x = x, y = (y - .010), label = label), angle = 90) +
  labs(y = "Genetic distance (1 - IBS)") +
  guides(fill = guide_legend(override.aes = list(shape = 22)))+
  theme_classic()

cloneDend = cloneDendA + theme(
  axis.title.x = element_blank(),
  axis.text.x = element_blank(),
  axis.line.x = element_blank(),
  axis.ticks.x = element_blank(),
  axis.title.y = element_text(size = 12, color = "black", angle = 90),
  axis.text.y = element_text(size = 10, color = "black"),
  axis.line.y = element_line(),
  axis.ticks.y = element_line(),
  panel.grid = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  plot.background = element_blank(),
  legend.key = element_blank(),
  legend.title = element_text(size = 12),
  legend.text = element_text(size = 10),
  legend.position = "bottom")

cloneDend

ggsave("../figures/cloneDend.png", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
ggsave("../figures/cloneDend.eps", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)

Removing clones and re-running ANGSD

mkdir clones
mv sintClones* clones

cat bamsClones | grep -v 'fk_S066.1.trim.un.bt2.bam\|fk_S066.3.trim.un.bt2.bam\|fk_S162.1.trim.un.bt2.bam\|fk_S162.3.trim.un.bt2.bam\|fk_S205.1.trim.un.bt2.bam\|fk_S205.3.trim.un.bt2.bam' >bamsNoClones # keep one replicate (.2) of each clonal group

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 187 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo '#!/bin/bash' > sintNoClones.sh
echo srun angsd -b bamsNoClones -GL 1 $FILTERS $TODO -P 1 -out sintNoClones >> sintNoClones.sh

sbatch --mem=200GB -o sintNoClones.o%j -e sintNoClones.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintNoClones.sh

How many SNPs?

grep "filtering:" sintNoClones.e*

24,670 SNPs

S Y M B I O N T S


cd ~/2bRAD/sint/fknms/
mv mappedReads/symbionts .
cd symbionts

SYMGENOME=~/bin/symGenomes/symbConcatGenome
2bRAD_bowtie2_launcher.py -f zoox -g $SYMGENOME --launcher -e reckert2017@fau.edu

sbatch --mem=100GB maps.slurm

>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done

launcher_creator.py -j s2b -n s2b -t 6:00:00 -N 5 -e reckert2017@fau.edu -q shortq7
sbatch s2b.slurm

Count reads mapping to each chromosome for concatenated Symbiodiniaceae genome

>zooxReads

for i in *.bam; do
echo $i >>zooxReads;
samtools idxstats $i | cut -f 1,3 >>zooxReads;
done

O U T L I E R   D E T E C T I O N


cd ~/2bRAD/sint/fknms/

mkdir bayescan
cd bayescan

srun cp ../ANGSD/sintNoClones.vcf.gz . #Note do not use the renamed version, pgdspider has a tough time parsing the samples into pops when you do
srun gunzip sintNoClones.vcf.gz

You will have to create or scp a text file with your .bam file names and populations

Convert vcf with PGDSpider

echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBsPops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file  (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid

java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintNoClones.vcf -outputfile fkSint.bayescan -spid vcf2bayescan.spid

Launch BayeScan (Takes 12+ hr depending on read count/SNPs)

echo '#!/bin/bash' > sintBayescan.sh
echo bayescan fkSint.bayescan -threads=100 >> sintBayescan.sh

sbatch -e sintBayescan.e%j -o sintBayescan.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayescan.sh

Extract outlier SNPs from BayeScan output

srun removeBayescanOutliers.pl bayescan=fkSint.baye_fst.txt vcf=sintNoClones.vcf FDR=0.05 mode=extract > fkSintVcfOutliers.vcf

Using BayeScEnv we can look for outliers related to depth

echo "32.0 21.1 33.2 26.5 32.8 18.0 23.6 43.9" > depth.txt

cp fkSint.bayescan fkSint.bayeScEnv

echo '#!/bin/bash' > sintBayeScEnv.sh
echo bayescenv fkSint.bayeScEnv -env depth.txt >> sintBayeScEnv.sh

sbatch -e sintBayeScEnv.e%j -o sintBayeScEnv.o%j -p longq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayeScEnv.sh

Extract outlier SNPs from BayeScEnv output

srun removeBayescanOutliers.pl bayescan=fkSint.bayeS_fst.txt vcf=sintNoClones.vcf FDR=0.05 mode=extract > fkSintBScEnvVcfOutliers.vcf


P O P U L A T I O N   S T R U C T U R E


pcangsd

Using pcangsd to discern population structure

cd ~/2bRAD/sint/fknms
mkdir pcangsd
cd pcangsd

cp ../ANGSD/sintNoClones.beagle.gz . # ANGSD -doGlf 2 already writes a gzipped beagle file

source activate 2bRAD

echo '#!/bin/bash' > pcangsd
echo 'pcangsd -b sintNoClones.beagle.gz -o fkSintPcangsd --admix --inbreedSamples --pcadapt --selection' >> pcangsd

launcher_creator.py -j pcangsd -n pcangsd -q shortq7 -t 06:00:00 -e $EMAIL -w 1 -N 1
sbatch pcangsd.slurm

Calculate population structure from genotype likelihoods using NGSadmix for K from 1 to 11. FIRST remove all clones/genotyping replicates! (We did this above.)

mkdir ../ngsAdmix
cp *beagle* ../ngsAdmix

zipper.py -f bam -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm

cd ../ngsAdmix

Create a file with 50 replicate simulations for each value of K 1-11 (num pops + 3)

ngsAdmixLauncher.py -f sintNoClones.beagle.gz --maxK 11 -r 50 -n fkSint --launcher -e reckert2017@fau.edu

sbatch --mem=200GB fkSintNgsAdmix.slurm

Calculating most likely value of K

Next, take the likelihood value from each run of NGSadmix and put them into a file that can be used with Clumpak to calculate the most likely K using the methods of Evanno et al. (2005).
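
For reference, the Evanno method selects the K maximizing DeltaK = mean(|L(K+1) - 2L(K) + L(K-1)|) / sd(L(K)), where L(K) are the log-likelihoods of the 50 replicate runs at each K; CLUMPAK computes this for you from the formatted logfile.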

>fkSintNgsAdmixLogfile
for log in fkSint*.log; do
grep -Po 'like=\K[^ ]+' $log >> fkSintNgsAdmixLogfile;
done

Format for CLUMPAK in R

R

You are now using R in the terminal

logs <- as.data.frame(read.table("fkSintNgsAdmixLogfile"))

#output is organized with 10, 11 preceding 1, 2, 3 etc.
logs$K <- c(rep("10", 50), rep("11", 50), rep("1", 50), rep("2", 50), rep("3", 50),
rep("4", 50), rep("5", 50), rep("6", 50),
    rep("7", 50), rep("8", 50), rep("9", 50))
write.table(logs[, c(2, 1)], "fkSintNgsAdmixLogfile_formatted", row.names = F,
        col.names = F, quote = F)
quit()
# No need to save workspace image [press 'n']
n

Check that your formatted logfile has the appropriate number of entries (11 values of K x 50 runs = 550)

cat fkSintNgsAdmixLogfile_formatted | wc -l

Make copies of the .qopt files (as .Q files) to run Structure Selector on

for file in fkSint*.qopt; do
filename=$(basename -- "$file" .qopt);
cp "$file" "$filename".Q;
done

mkdir fkSintQ
mv fkSint*Q fkSintQ

zip -r fkSintQ.zip fkSintQ

scp .zip and formatted logfile to local machine and upload to CLUMPAK (http://clumpak.tau.ac.il/bestK.html) and structure selector (https://lmme.ac.cn/StructureSelector/index.html)

scp sintNoClones* to local machine for further analyses with R

Running ANGSD within lineages

Since we have 4 distinct lineages, we will pare down SNPs to only those found within all lineages. This helps avoid issues of ascertainment bias.

cd ~/2bRAD/sint/fknms/ANGSD

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"

TODO="-doMajorMinor 1 -doMaf 1 -doGeno 8 -doPost 1 -doGlf 2"

echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 104 -out sintBlueSnps
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 32 -out sintTealSnps
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 24 -out sintGreenSnps
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 12 -out sintYellowSnps" > kSnps

launcher_creator.py -j kSnps -n kSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1

sbatch kSnps.slurm


Filtering down the sites to those found within each lineage with the given ANGSD filtering criteria

for file in sint*Snps.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p shortq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done

srun awk '(++c[$0])==(ARGC-1)' *Snpssites > sitesToDo
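
This awk idiom prints a line once its running count across all input files equals the number of files (ARGC-1), i.e., it keeps sites present in every lineage’s file (assuming no duplicate lines within a file). A toy demonstration with throwaway files f1/f2:

echo -e "a\nb" > f1; echo -e "b\nc" > f2
awk '(++c[$0])==(ARGC-1)' f1 f2 # prints only "b"
rm f1 f2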

mkdir ../filteredSNPS

mv sint*Snps* ../filteredSNPS/
mv kSnps* ../filteredSNPS/


Now index the sites file and run angsd on all samples with only the reduced number of sites using -sites flag

angsd sites index sitesToDo

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"

TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo "angsd -b bamsNoClones -sites sitesToDo -GL 1 -P 1 $FILTERS $TODO -minInd 176 -out sintFiltSnps" > sintFiltSnps

launcher_creator.py -j sintFiltSnps -n sintFiltSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch sintFiltSnps.slurm


mv sintFilt* ../filteredSNPS/


I N B R E E D I N G   A N D   R E L A T E D N E S S

Running ngsRelate on filtered SNPs

cd ~/2bRAD/sint/fknms/ANGSD

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"

TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 3"

echo '#!/bin/bash' > sintFiltRelate.sh
echo srun angsd -b bamsNoClones -GL 1 -sites sitesToDo $FILTERS $TODO -P 1 -minInd 176 -out sintFiltRelate >> sintFiltRelate.sh

sbatch --mem=200GB -o sintFiltRelate.o%j -e sintFiltRelate.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintFiltRelate.sh

mkdir ../ngsRelate
cd ../ngsRelate
mv ~/2bRAD/sint/fknms/ANGSD/*Relate* .

zcat sintFiltRelate.mafs.gz | cut -f5 | sed 1d >freq # column 5 = knownEM allele frequency; sed 1d drops the header

echo '#!/bin/bash' > ngsFiltRelate.sh
echo ngsRelate -g sintFiltRelate.glf.gz -n 220 -f freq  -O filtRelate >> ngsFiltRelate.sh

sbatch -e ngsFiltRelate.e%j -o ngsFiltRelate.o%j --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ngsFiltRelate.sh


S I T E   F R E Q U E N C Y   S P E C T R A

Running SFS calculations within lineage to use for Fst and to run StairwayPlot

cd ~/2bRAD/sint/fknms/ANGSD

## scp your bamsClusters text file (bam file names with their assigned lineage) into this directory

awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters >> blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters >> tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters >> greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters >> yellowBams


## Only filters that do not distort allele frequencies

FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -maxHetFreq 0.5"
TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11 -doGlf 2"

echo "angsd -b blueBams -GL 1 $FILTERS $TODO -P 1 -minInd 104 -out sintBlueSFS
angsd -b tealBams -GL 1 $FILTERS $TODO -P 1 -minInd 32 -out sintTealSFS
angsd -b greenBams -GL 1 $FILTERS $TODO -P 1 -minInd 24 -out sintGreenSFS
angsd -b yellowBams -GL 1 $FILTERS $TODO -P 1 -minInd 12 -out sintYellowSFS" > sfsClusters

launcher_creator.py -j sfsClusters -n sfsClusters -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfsClusters.slurm


for file in sint*SFS.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p longq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done


Now, compile common sites from all lineages using awk

srun awk '(++c[$0])==(ARGC-1)' *SFSsites > sfsSitesToDo

cat sfsSitesToDo | wc -l # this count is the "L" value used in the StairwayPlot blueprints below

# index sites for ANGSD and re-run ANGSD using -sites
angsd sites index sfsSitesToDo

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

TODO="-doSaf 1 -ref $GENOME_FASTA -anc $GENOME_FASTA -doMaf 1 -doMajorMinor 4"

echo "angsd -sites sfsSitesToDo -b blueBams -GL 1 -P 1 $TODO -out sintBlue
angsd -sites sfsSitesToDo -b tealBams -GL 1 -P 1 $TODO -out sintTeal
angsd -sites sfsSitesToDo -b greenBams -GL 1 -P 1 $TODO -out sintGreen
angsd -sites sfsSitesToDo -b yellowBams -GL 1 -P 1 $TODO -out sintYellow" >sfs

launcher_creator.py -j sfs -n sfs -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs.slurm


Once all jobs run:

mkdir ../SFS
mv sfs* ../SFS
mv sintBlue* ../SFS
mv sintTeal* ../SFS
mv sintGreen* ../SFS
mv sintYellow* ../SFS

cd ../SFS

###-- 1d-SFS
echo "realSFS sintBlue.saf.idx >sintBlue.sfs
realSFS sintTeal.saf.idx >sintTeal.sfs
realSFS sintGreen.saf.idx >sintGreen.sfs
realSFS sintYellow.saf.idx >sintYellow.sfs" >realSFS

launcher_creator.py -j realSFS -n realSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 4 -N 1
sbatch realSFS.slurm


###-- 2d-SFS
echo "realSFS sintBlue.saf.idx sintTeal.saf.idx -P 20 > pBlTl.sfs ; realSFS fst index sintBlue.saf.idx sintTeal.saf.idx -sfs pBlTl.sfs -fstout pBlTl
realSFS sintBlue.saf.idx sintGreen.saf.idx -P 20 > pBlGn.sfs ; realSFS fst index sintBlue.saf.idx sintGreen.saf.idx -sfs pBlGn.sfs -fstout pBlGn
realSFS sintBlue.saf.idx sintYellow.saf.idx -P 20 > pBlYl.sfs ; realSFS fst index sintBlue.saf.idx sintYellow.saf.idx -sfs pBlYl.sfs -fstout pBlYl
realSFS sintTeal.saf.idx sintGreen.saf.idx -P 20 > pTlGn.sfs ; realSFS fst index sintTeal.saf.idx sintGreen.saf.idx -sfs pTlGn.sfs -fstout pTlGn
realSFS sintTeal.saf.idx sintYellow.saf.idx -P 20 > pTlYl.sfs ; realSFS fst index sintTeal.saf.idx sintYellow.saf.idx -sfs pTlYl.sfs -fstout pTlYl
realSFS sintGreen.saf.idx sintYellow.saf.idx -P 20 > pGnYl.sfs ; realSFS fst index sintGreen.saf.idx sintYellow.saf.idx -sfs pGnYl.sfs -fstout pGnYl" >2dSFS

launcher_creator.py -j 2dSFS -n 2dSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 20 -N 1
sbatch 2dSFS.slurm

Global Fst between lineages (each realSFS fst stats call prints the unweighted and weighted global Fst for that pair)

echo "realSFS fst stats pBlYl.fst.idx
realSFS fst stats pTlYl.fst.idx
realSFS fst stats pGnYl.fst.idx
realSFS fst stats pBlTl.fst.idx
realSFS fst stats pBlGn.fst.idx
realSFS fst stats pTlGn.fst.idx" >sintKFst

launcher_creator.py -j sintKFst -n sintKFst -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintKFst.slurm


StairwayPlot2

First, we filter Fst outliers from sites:

FILTERS="-minMapQ 20 -minQ 25 -minInd 88 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"

TODO="-doMajorMinor 4 -ref $GENOME_FASTA -doMaf 1 -dosnpstat 1 -doPost 2 -doBcf 1 --ignore-RG 0 -doGeno 11 -doCounts 1"


mkdir ~/2bRAD/sint/fknms/filteredSNPS/bayescan
cd ~/2bRAD/sint/fknms/filteredSNPS/bayescan


Create a file called vcf2bayescan.spid containing this text:

echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bspops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file  (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid


You need a file “bspops.txt” that lists samples in the same order as your bams file, with their assigned populations/lineages

cp ../../ANGSD/bamsClusters ./bspops.txt

cp ../sintFiltSnps.bcf . # sintFiltSnps.bcf was moved into filteredSNPS above

srun java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnps.bcf -outputfile sintFilt.bayescan -spid vcf2bayescan.spid


echo "bayescan sintFilt.bayescan -threads=100" >bayeScan

launcher_creator.py -j bayeScan -n bayeScan -q mediumq7 -t 06:00:00 -e $EMAIL -N 5
sbatch bayeScan.slurm

cut -d" " -f2- sintFilt.baye_fst.txt > sintFst

## remove the header lines preceding the SNP records (our file has 69; adjust to yours)
tail --lines=+70 sintFiltSnps.bcf | cut -f 1,2 | paste --delimiters "\t" - sintFst > sint.baye_fst_pos.txt
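
If your header length differs, find the line number of the #CHROM header line (the tail offset is that line number + 1), assuming the file is plain-text VCF:

grep -n '^#CHROM' sintFiltSnps.bcf # e.g., 69:#CHROM... means use tail --lines=+70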


Identify sites with q-value < 0.5


cp ../../SFS/sfsSitesToDo . # sfsSitesToDo was moved into the SFS directory earlier

awk '$5<0.5 {print $1"\t"$2}' sint.baye_fst_pos.txt > ss_bayeOuts

grep -Fvxf ss_bayeOuts sfsSitesToDo > sitesToDo_filtBaye

cp sitesToDo_filtBaye ../../ANGSD
cd  ../../ANGSD

angsd sites index sitesToDo_filtBaye

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

TODO="-doSaf 1 -anc $GENOME_FASTA -ref $GENOME_FASTA -doMaf 1 -doMajorMinor 4"

echo "angsd -sites sitesToDo_filtBaye -b blueBams -GL 1 -P 1 $TODO -out sintBlueSFSbaye
angsd -sites sitesToDo_filtBaye -b tealBams -GL 1 -P 1 $TODO -out sintTealSFSbaye
angsd -sites sitesToDo_filtBaye -b greenBams -GL 1 -P 1 $TODO -out sintGreenSFSbaye
angsd -sites sitesToDo_filtBaye -b yellowBams -GL 1 -P 1 $TODO -out sintYellowSFSbaye" >sfs_filtBaye

launcher_creator.py -j sfs_filtBaye -n sfs_filtBaye -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs_filtBaye.slurm

mv *SFS* ../SFS
mv *sfs* ../SFS

Generating per-population SFS

cd ../SFS

echo "realSFS sintBlueSFSbaye.saf.idx -fold 1 >sintFiltBlue.sfs
realSFS sintTealSFSbaye.saf.idx -fold 1 >sintFiltTeal.sfs
realSFS sintGreenSFSbaye.saf.idx -fold 1 >sintFiltGreen.sfs
realSFS sintYellowSFSbaye.saf.idx -fold 1 >sintFiltYellow.sfs" >sintFiltSFS

launcher_creator.py -j sintFiltSFS -n sintFiltSFS -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintFiltSFS.slurm


Making blueprint files for StairwayPlot

mkdir swPlot
cd swPlot

# SWplot only wants the mutations, so have to omit first number in SFS file (0 mutations).

# L = cat sitesToDo_filtBaye | wc -l 
# which in our case =1830041
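
To pull the SFS values for a blueprint from a realSFS output (a single space-delimited line in which field 1 is the monomorphic class), something like this works for the blue lineage (fields 2-132 = its 131 folded bins):

cut -d ' ' -f 2-132 sintFiltBlue.sfs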

echo "#input setting
popid: sintBlue # id of the population (no white space)
nseq: 262 # number of sequences (2*n)
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 54968.740090 18444.475031 9443.254125 6460.637387 4497.061230 3519.062453 2895.388432 2422.638318 1956.686134 1717.620993 1378.797384 1416.463385 1227.364278 1021.759822 1073.007228 873.080148 845.752484 800.981800 595.181349 794.836408 571.203766 628.706337 527.980210 429.664844 721.559927 315.050869 481.708978 307.300287 394.221545 394.074502 466.456788 274.440765 363.386447 273.479395 296.832588 320.575804 315.126815 252.266556 213.566298 275.958738 257.535768 185.436902 340.324052 175.413491 179.986660 138.658719 287.299094 47.316965 238.946119 266.961035 111.840623 223.114384 45.279077 165.697377 296.385206 83.585503 214.297286 125.366633 111.661200 219.281119 139.052795 160.386831 131.614256 76.786039 157.239727 159.636010 60.218127 143.194085 99.001725 148.481068 84.514873 126.043678 104.077862 96.209692 141.031878 149.956901 101.599229 56.392862 101.079956 80.280370 135.261593 40.235595 27.424630 121.969309 189.776245 74.226144 50.242446 84.001571 107.542195 31.137798 29.916997 67.201712 131.755256 60.641361 64.148078 116.233605 108.268881 36.805898 51.451547 51.855920 67.043933 99.173703 25.902900 28.275804 101.866575 47.917146 36.437218 48.157593 146.198005 27.417829 34.145564 72.762140 95.264702 6.290856 58.686022 89.500602 76.097667 40.816598 63.178597 39.973295 69.971593 108.340421 51.045114 61.909694 103.159211 33.569557 142.186072 6.764939 112.449083 86.473221 155.056690
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 131 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 65 130 195 260 # number of random break points for each try (separated by white space)
project_dir: sintBlue # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintBlue # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintBlue.blueprint


echo "#input setting
popid: sintTeal # id of the population (no white space)
nseq: 80 # number of sequences (2*n)
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 19027.916655 8774.712038 5079.474815 3679.957335 2677.663655 2157.865160 1778.801580 1440.323144 1319.701029 998.536711 934.144099 871.915368 812.451003 750.938932 623.670542 540.641827 671.476401 526.080209 501.266127 456.155482 447.647628 330.610513 426.093935 326.858819 276.638676 294.294057 347.649095 269.002785 189.682393 303.585744 226.318022 274.625068 201.149769 196.103761 173.848938 191.193373 267.200748 249.900867 248.353637 325.131914
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 40 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 19 39 58 78 # number of random break points for each try (separated by white space)
project_dir: sintTeal # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintTeal # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintTeal.blueprint


echo "#input setting
popid: sintGreen # id of the population (no white space)
nseq: 62 # number of sequences
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 18387.776643 7847.061378 4651.990927 3141.107325 2303.550391 1773.451011 1508.736063 1285.307124 1143.097857 949.314445 859.933600 748.766734 719.856903 655.836831 564.083017 512.608619 517.212306 456.491290 461.208427 387.023802 341.543152 363.322278 311.201884 282.317211 353.999771 232.677877 288.694402 321.652486 380.855557 268.842073 519.097976
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 31 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 15 30 45 60 # number of random break points for each try (separated by white space)
project_dir: sintGreen # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintGreen # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintGreen.blueprint


echo "#input setting
popid: sintYellow # id of the population (no white space)
nseq: 30 # number of sequences
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic 
whether_folded: true # whether the SFS is folded (true or false)
SFS: 16199.629544 7221.873982 4153.983827 2802.707609 2381.014773 1825.956297 1494.476878 1079.971463 1099.834349 768.376639 713.072962 618.778692 676.597541 626.859962 790.416252
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 15 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 7 14 21 28 # number of random break points for each try (separated by white space)
project_dir: sintYellow # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintYellow # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintYellow.blueprint

Create batch files and launch:

for blueprint in *blueprint; do
java -cp $HOME/bin/stairway_plot/stairway_plot_es Stairbuilder $blueprint;
done


# split each .sh script and launch all (second job waiting with dependency on first finishing)

for file in *.sh; do
grep -B 805 "# Step 2: determine number of break points" $file > ${file%.*}1;
grep -A 805 "# Step 2: determine number of break points" $file > ${file%.*}2;
echo '#!/bin/bash' | cat - ${file%.*}2 >temp;
mv temp ${file%.*}2;
launcher_creator.py -j ${file%.*}1 -n ${file%.*}1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch ${file%.*}1.slurm) && sbatch --dependency=afterok:${DEP##* } -e ${file%.*}2.e%j -o ${file%.*}2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%.*}2;
done
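
A note on the dependency line above: sbatch prints 'Submitted batch job <ID>', so ${DEP##* } strips everything up to the last space, leaving just the job ID for --dependency. For example (job.slurm and the ID are illustrative):

DEP=$(sbatch job.slurm) # DEP='Submitted batch job 460550'
echo ${DEP##* } # prints 460550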


After all have run:

mv sint*/*final.summary .


H E T E R O Z Y G O S I T Y


Calculating heterozygosity across all loci (variant/invariant) using ANGSD and an R script from Misha Matz (https://github.com/z0on/2bRAD_denovo)

cd ~/2bRAD/sint/fknms/ANGSD
cp ../SFS/sfsSitesToDo .

angsd sites index sfsSitesToDo

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

FILTERS="-maxHetFreq 0.5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 25 -minQ 30 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 187"
TODO="-ref $GENOME_FASTA -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 32 -doPost 1 -doGlf 2 -doCounts 1 -doMajorMinor 1 -dosnpstat 1 -doMaf 1"

echo "angsd -sites sfsSitesToDo -b bamsNoClones -GL 1 -P 1 $TODO $FILTERS -out sintHet" >angsdHet

launcher_creator.py -j angsdHet -n angsdHet -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch angsdHet.slurm

mkdir ../heterozygosity
cp sintHet* ../heterozygosity/

cd ../heterozygosity

echo heterozygosity_beagle.R sintHet.beagle.gz >filtHet

launcher_creator.py -j filtHet -n filtHet -q mediumq7 -t 24:00:00 -e $EMAIL -N 1
sbatch filtHet.slurm

tail -n 220 filtHet.e* > sintHet # 220 = number of samples; the per-sample heterozygosity estimates are the last lines of the log

N U C L E O T I D E   D I V E R S I T Y

cd ~/2bRAD/sint/fknms/
mkdir theta
cd ANGSD

GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

>kThetas

for POP in yellow green teal blue; do
echo "angsd -b ${POP}Bams -GL 1 -P 1 -sites sitesToDo_filtBaye -anc $GENOME_FASTA -doSaf 1 -out ${POP}Out &&\
realSFS ${POP}Out.saf.idx -fold 1 > ${POP}Out.sfs &&\
realSFS saf2theta ${POP}Out.saf.idx -sfs ${POP}Out.sfs -outname ${POP} -fold 1 &&\
thetaStat do_stat ${POP}.thetas.idx" >> kThetas
done

launcher_creator.py -j kThetas -n kThetas -q shortq7 -t 6:00:00 -e $EMAIL -w 4 -N 1
sbatch kThetas.slurm

mv *theta* ../theta
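
Per-lineage summaries (Watterson’s theta, pairwise theta/pi, Tajima’s D, etc.) land in the .pestPG files written by thetaStat, e.g.:

less ../theta/blue.thetas.idx.pestPG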


M I G R A T I O N   M O D E L I N G

To use BayesAss3 (XXX) we first need to convert our ANGSD output into genotype format

cd ~/2bRAD/sint/fknms
mkdir bayesAss

cd bayesAss
cp ../ANGSD/sintNoClones.bcf .
cp ../bayescan/fkSintBsPops.txt .

echo "# VCF Parser questions
PARSER_FORMAT=VCF

# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBsPops.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true

# Immanc (BayesAss) Writer questions
WRITER_FORMAT=IMMANC

# Specify the locus/locus combination you want to write to the Immanc (BayesAss) file:
IMMANC_WRITER_LOCUS_COMBINATION_QUESTION=
# Specify which data type should be included in the Immanc (BayesAss)) file  (Immanc (BayesAss) can only analyze one data type per file):
IMMANC_WRITER_DATA_TYPE_QUESTION=SNP" >fkSintBA.spid


module load pgdspider-2.1.1.2-gcc-9.2.0-ghxvd4c 

pgdSpider=/opt/ohpc/pub/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/pgdspider-2.1.1.2-ghxvd4c4ieqngkbutakc7x6j4pfkqm5e/bin/PGDSpider2-cli.jar

echo '#!/bin/bash' > pgdSpider.sh
echo "java -Xmx1024m -Xms512m -jar $pgdSpider -inputformat VCF -outputformat IMMANC vcf -inputfile sintNoClones.bcf -outputfile fkSintBayesAss.txt -spid fkSintBA.spid" >>pgdSpider.sh

sbatch -e pgdSpider.e%j -o pgdSpider.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL pgdSpider.sh


## Default params:
#MCMC reps: 1,000,000
#burn in: 100,000
#sampling freq: 100
#delta migration (1): 0.1
#delta allele freq (3): 0.1
#delta inbreeding (4): 0.1

rm sintNoClones.bcf
rm fkSintBsPops.txt

module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2

# Run a test with verbose [-v] output to see the acceptance rates in the terminal (takes a few minutes to compute)
# check the output file [less BATest.o*] and kill the job once you get output with acceptance rates [scancel {yourJobID}]
# After ~10—15 min you should start seeing output in the BATest.o* file

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 100 fkSintBayesAss.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest

#logP(M): -638.54 logL(G): -3494125.80 logL: -3494764.33 % done: [0.00] % accepted: (0.41, 0.01, 0.80, 0.06, 0.76)

# we are looking for 20—60% acceptance, ideally somewhere near 20—30%
# relationships between mixing parameters and acceptance rates are inverse
# defaults are 0.1 (all parameters are scale 0—1)
# increase [-m] increase [-a] and decrease [-f]

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.04 fkSintBayesAss.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# % accepted: (0.26, 0.02, 0.29, 0.15, 0.76)

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.03 fkSintBayesAss.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# % accepted: (0.26, 0.02, 0.29, 0.20, 0.76)

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.025 fkSintBayesAss.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# % accepted: (0.25, 0.01, 0.28, 0.23, 0.76)

Make and launch 10 iterations of BayesAss, each in its own run directory so we can keep all trace files (saved as ‘BA3trace.txt’, which would be overwritten if not in separate directories). We use [-s $RANDOM] (bash’s built-in random number generator) to give each independent run its own start seed

module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2

for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo BA3SNP -v -u -s $RANDOM -i 6500000 -b 3000000 -n 100 -m 0.2 -a 0.6 -f 0.025 -t -o fkSintBARun${i}Out.txt ../fkSintBayesAss.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
sbatch -e BayesAss$i.e%j -o BayesAss$i.o%j -p longq7 --mem=100GB --exclusive --mail-user reckert2017@fau.edu --mail-type=ALL BayesAss$i.sh
cd ..;
done

# after all runs complete copy files to main BayesAss directory
cd ~/2bRAD/sint/fknms/bayesAss

cp run*/*Out.txt .

for i in {01..10}; do 
cp run$i/BA3trace.txt BA3trace$i.txt;
done