version: 12 June, 2024

A B O U T   T H I S   D O C U M E N T

This walks through the steps to process 2bRAD reads without an existing genome.

Be sure to read through what you are doing and follow instructions before copy/pasting code chunks.

Copy the code chunks into your terminal, taking care to change the necessary portions to fit your data/ your directory structure, etc.

First, you will need to replace reckert2017 with your user name and reckert2017@fau.edu with your email throughout this code.
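If you keep these commands in a script, a quick sed pass can make the swap; the filename walkthrough.sh below is hypothetical. Several chunks further down also pass -e $EMAIL to launcher_creator.py, so set that once too.

# hypothetical file holding your copy of these commands
sed -i 's/reckert2017@fau\.edu/you@example.edu/g; s/reckert2017/yourUser/g' walkthrough.sh

# used by launcher_creator.py calls throughout this document
export EMAIL="you@example.edu"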


S E T U P


Download all necessary scripts and load modules for processing/analysis

Login to KoKo

ssh reckert2017@koko-login.hpc.fau.edu

Load necessary modules

Or you can add them to your ~/.bashrc to load at login (use nano ~/.bashrc)

module load angsd-0.933-gcc-9.2.0-65d64pp
module load bayescan-2.1-gcc-8.3.0-7gakqmd
module load qt-5.15.2-gcc-9.2.0-zi7wcem BayeScEnv/1.1
module load bcftools-1.9-gcc-8.3.0-il4d373
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
module load cdhit-4.8.1-gcc-8.3.0-bcay75d
module load htslib-1.9-gcc-8.3.0-jn7ehrc
module load kraken2-2.1.1-gcc-9.2.0-ocivj3u
module load python-3.7.4-gcc-8.3.0-3tniqr5
module load launcher
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
module load ncbi-toolkit-22_0_0-gcc-9.2.0-jjhd2wa
module load ngsadmix-32-gcc-8.3.0-qbnwmpq
module load ngsRelate/v2
module load R/3.6.1
module load samtools-1.10-gcc-8.3.0-khgksad
module load vcftools-0.1.14-gcc-8.3.0-safy5vc

Download scripts

Put the scripts you need into ~/bin or a similar directory that is on the PATH set in your .bashrc. IF you don't have a bin directory in your $HOME you can make one now (mkdir ~/bin)

cd ~/bin
svn checkout https://github.com/RyanEckert/Stephanocoenia_FKNMS_PopGen/trunk/scripts/
mv scripts/* .
rm -rf scripts

wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.0.7.1.zip
unzip PGDSpider_2.0.7.1.zip
rm PGDSpider_2.0.7.1.zip

git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd

# the 2bRAD conda environment is created below in the quality-filtering section;
# if you have not made it yet, create it first (conda create -n 2bRAD cutadapt)
conda activate 2bRAD

pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .

cd ~/bin

git clone https://bitbucket.org/simongravel/moments.git
cd moments

conda activate 2bRAD

conda install -c bioconda moments

pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .

svn checkout https://github.com/xiaoming-liu/stairway-plot-v2.git

mv stairway-plot-v2.git/trunk/stairway_plot_v2.1.1.zip .
unzip stairway_plot_v2.1.1.zip
rm -r stairway_plot_v2.1.1.zip stairway-plot-v2.git

Make all scripts executable

chmod +x *.sh *.pl *.py

IF ~/bin is not already in your $PATH, it is useful to add it.
This way you can easily access your executable scripts without specifying the absolute path to them.
Otherwise, skip to "Build working directory"

PATH="$HOME/bin:$PATH";

To permanently add this to your $PATH add to .bashrc use nano text editor

nano ~/.bashrc

ADD the following text to the file (under the PATHS section, if your .bashrc has one): export PATH="$HOME/bin:$PATH";
exit nano with ctrl+x, saving when prompted

source ~/.bashrc
echo $PATH

Build working directory

cd
mkdir -p 2bRAD/sint/fknms/rawReads/
cd 2bRAD/sint/fknms/rawReads/


D O W N L O A D   R E A D S


Download and concatenate raw reads from BaseSpace

If you have not previously, download BaseSpaceCLI

wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs

chmod +x ~/bin/bs

Go to the website and confirm authorization by logging in to your BaseSpace account.

bs auth

Making a script to download the reads and merge samples across 2 NovaSeq lanes


echo '#!/bin/bash' > downloadReads.sh
echo 'bs download project --concurrency=high -q -n JA21001 -o .' >> downloadReads.sh
# -n is the project name and -o is the output directory

echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'rmdir SA*' >>downloadReads.sh
echo 'mkdir ../concatReads' >> downloadReads.sh
echo 'cp *.gz ../concatReads' >> downloadReads.sh
echo 'cd ../concatReads' >> downloadReads.sh
echo 'mergeReads.sh -o mergeTemp' >> downloadReads.sh
# -o is the directory to put output files in

echo 'rm *L00*' >> downloadReads.sh
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'gunzip *.gz' >> downloadReads.sh
echo 'rmdir mergeTemp' >> downloadReads.sh

chmod +x downloadReads.sh

launcher_creator.py -b 'srun downloadReads.sh' -n downloadReads -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch downloadReads.slurm


How many reads before filtering?

echo '#!/bin/bash' >rawReads
echo readCounts.sh -e .fastq -o sintRaw >>rawReads

sbatch -o rawReads.o%j -e rawReads.e%j --mail-type=ALL --mail-user=reckert2017@fau.edu rawReads
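readCounts.sh is one of the helper scripts downloaded during setup; for a quick manual sanity check without it, a FASTQ read count is just the line count divided by 4 (a minimal sketch, run in the directory with your .fastq files):

for f in *.fastq; do
echo -e "$f\t$(( $(wc -l < $f) / 4 ))";
done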

T R I M M I N G   &   F I L T E R I N G


Trim and demultiplex reads

cd ../concatReads

2bRAD_trim_launch_dedup.pl fastq > trims.sh
launcher_creator.py -j trims.sh -n trims -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB trims.slurm

Check that we have the correct number of trim files (226 in this case)

ls -l *.tr0 | wc -l

mkdir ../trimmedReads
srun mv *.tr0 ../trimmedReads &

zipper.py -f fastq -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm

cd ../trimmedReads

Rename sequence files using sampleRename.py. This script needs a .csv with XXX. Make sure you use the reverse complement of your inline BCs!

srun sampleRename.py -i sampleList -n fk_ -f tr0
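If you need to double-check a barcode's reverse complement, a quick shell one-liner (AGAC is just an example barcode):

echo 'AGAC' | rev | tr 'ACGT' 'TGCA'
# prints GTCT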

Quality filtering using cutadapt

I can’t get KoKo’s module for cutadapt to work, so we’ll do it in miniconda. Run the chunk below if you don’t have a conda env set up; otherwise you can skip to the next chunk

module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
conda config --add channels defaults
conda config --add channels bioconda
conda create -n 2bRAD cutadapt

Removing reads with qualities at ends less than Q15 for de novo analysis

source activate 2bRAD

echo '#!/bin/bash' > trimse.sh
echo 'module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj' >> trimse.sh
echo 'source activate 2bRAD' >> trimse.sh
for file in *.tr0; do
echo "cutadapt -q 15,15 -m 36 -o ${file/.tr0/}.trim $file > ${file/.tr0/}.trimlog.txt" >> trimse.sh;
done

I can’t get it to run through launcher, so just run it serially. It takes a while to run, so consider breaking it up into several jobs.

sbatch -o trimse.o%j -e trimse.e%j --mem=200GB trimse.sh

Do we have the expected number of *.trim files created?

conda deactivate
ls -l *.trim | wc -l

How many reads in each sample?

echo '#!/bin/bash' >sintReads
echo readCounts.sh -e trim -o sintFilt >>sintReads
sbatch --mem=200GB sintReads

mkdir ../filteredReads
mv *.trim ../filteredReads

zipper.py -f tr0 -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm

cat sintFiltReadCounts


D E N O V O   R E F E R E N C E


Construct denovo reference for aligning reads

Remove Symbiodiniaceae reads

If you’ve not already, build a bowtie2 reference for the concatenated zoox genomes; otherwise skip ahead to the next chunk

mkdir ~/bin/symGenomes
cd ~/bin/symGenomes
echo "bowtie2-build symbConcatGenome.fasta symbConcatGenome" > bowtie2-build
launcher_creator.py -j bowtie2-build -n bowtie2-build -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
sbatch --mem=200GB bowtie2-build.slurm
module load samtools-1.10-gcc-8.3.0-khgksad
srun samtools faidx symbConcatGenome.fasta

Mapping reads to the concatenated Symbiodiniaceae genome

cd ~/2bRAD/sint/fknms/filteredReads

mkdir symbionts
SYMGENOME=~/bin/symGenomes/symbConcatGenome

2bRAD_bowtie2_launcher.py -g $SYMGENOME -f .trim -n zooxMaps --split -u un -a zoox --aldir symbionts --launcher -e reckert2017@fau.edu

sbatch zooxMaps.slurm

Checking Symbiodiniaceae read mapping rates

>zooxAlignmentRates
for F in `ls *trim`; do
M=`grep -E '^[ATGCN]+$' $F | wc -l | grep -f - zooxMaps.e* -A 4 | tail -1 | perl -pe 's/zooxMaps\.e\d+-|% overall alignment rate//g'` ;
echo "$F.sam $M">>zooxAlignmentRates;
done

# ls *trim | cut -d '.' -f 1 >align1
# grep "% overall" zooxMaps.e* | cut -d ' ' -f 1 >align2
# paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >zooxAlignmentRates
# rm align1 align2

less zooxAlignmentRates

Clean up directory

zipper.py -f .trim -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm

mv *.sam symbionts/

cd ~/2bRAD/sint/fknms/
mv filteredReads/symbionts .

Uniquing reads

‘stacking’ individual trimmed fastq reads:

cd filteredReads
ls *.trim.un | perl -pe 's/^(.+)$/uniquerOne.pl $1 >$1\.uni/' > unique

launcher_creator.py -j unique -n unique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB unique.slurm

Checking there is a .uni for all samples

ls -l *.uni | wc -l

Collecting common tags (major alleles).

Merging uniqued files (set minInd to >10, or >10% of total number of samples, whichever is greater).

echo 'mergeUniq.pl uni minInd=30 > all.uniq' > allunique

launcher_creator.py -j allunique -n allunique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB allunique.slurm

Discarding tags that have more than 7 observations without reverse-complement (in all.uniq, $3 is a tag's total count and $4 its reverse-complement count)

srun awk '!($3>7 && $4==0) && $2!="seq"' all.uniq >all.tab

Creating fasta file out of merged and filtered tags:

srun awk '{print ">"$1"\n"$2}' all.tab >all.fasta

Clustering allowing for up to 3 mismatches (-c 0.91; for 36-bp 2bRAD tags, 36 x 0.09 ≈ 3 mismatched bases); the most abundant sequence becomes the reference

echo '#!/bin/bash' >cdhit
echo cd-hit-est -i all.fasta -o cdh_alltags.fas -aL 1 -aS 1 -g 1 -c 0.91 -M 0 -T 0 >>cdhit
sbatch --mem=200GB -e cdhit.e%j -o cdhit.o%j cdhit

rm *.uni

Remove contamination

We can remove contamination sequences from our denovo reference with kraken

We need a Kraken database to search against. If you don’t already have one, you need to build one now. Otherwise you can skip to running Kraken

We can download a compiled standard database:

mkdir ~/bin/krakenDB
cd ~/bin/krakenDB

srun wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20210127.tar.gz
echo tar -xvf k2_pluspf_20210127.tar.gz >tar

launcher_creator.py -j tar -n tar -q mediumq7 -t 24:00:00 -e reckert2017@fau.edu

sbatch tar.slurm

Alternatively, we can build a custom database, which can include the Symbiodiniaceae genomes

echo '#!/bin/bash' >krakendb.sh
echo kraken2-build --download-taxonomy --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library archaea --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library bacteria --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library viral --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library human --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library fungi --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library protozoa --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library UniVec_Core --threads 16 --db ~/bin/krakenDB >>krakendb.sh

sbatch --mem=200GB -p longq7 -e krakenDB.e%j -o krakenDB.o%j krakendb.sh

Format and add Symbiodiniaceae genomes to the database

cd ~/bin/symGenomes

# Symbiodinium microadriaticum
sed '/>/ s/$/|kraken:taxid|2951/' Symbiodinium_microadriacticum_genome.scaffold.fasta >S_microadriacticum.fa

# Breviolum minutum
sed '/>/ s/$/|kraken:taxid|2499525/' Breviolum_minutum.v1.0.genome.fa >B_minutum.fa

# Cladocopium goreaui
sed '/>/ s/$/|kraken:taxid|2562237/' Cladocopium_goreaui_Genome.Scaffolds.fasta >C_goreaui.fa

# Durusdinium trenchii
sed '/>/ s/$/|kraken:taxid|1381693/' 102_symbd_genome_scaffold.fa >D_trenchii.fa

echo '#!/bin/bash' >kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/S_microadriacticum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/B_minutum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/C_goreaui.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/D_trenchii.fa --db ~/bin/krakenDB >>kdbAdd

sbatch --mem=200GB -o kdbAdd.o%j -e kdbAdd.e%j kdbAdd

Finally, build the database

echo '#!/bin/bash' >kdbBuild
echo kraken2-build --download-taxonomy --threads 16 --db ~/bin/krakenDB >>kdbBuild
echo kraken2-build --build --db ~/bin/krakenDB >>kdbBuild
sbatch --mem=200GB -o kdbBuild.o%j -e kdbBuild.e%j kdbBuild

Remove potential contamination from reference

cd ~/2bRAD/sint/fknms/filteredReads

echo '#!/bin/bash' >krakenDB
echo kraken2 --db ~/bin/krakenDB cdh_alltags.fas --threads 16 --classified-out cdh_alltags.contam.fa --unclassified-out cdh_alltags.unclass.fa --report krakenDB.report --output krakenDB.out >>krakenDB

sbatch --mem=200GB -o krakenDB.o%j -e krakenDB.e%j krakenDB
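Once the job finishes, krakenDB.report summarizes tags per taxon (standard kraken2 report columns: percent of tags in the clade, clade count, direct count, rank code, NCBI taxid, name). A quick look at the largest classified groups:

sort -k2,2nr krakenDB.report | head -20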

Construct denovo genome

With 30 pseudo chromosomes from clean major allele tags

# rename the Kraken-unclassified (i.e., decontaminated) tags as the denovo reference
mv cdh_alltags.unclass.fa sint_denovo.fa

mkdir ../mappedReads
mv sint_denovo.fa ../mappedReads
cd ../mappedReads

concatFasta.pl fasta=sint_denovo.fa num=30
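A quick sanity check on the result: the concatenated pseudo-genome should contain exactly 30 records.

grep -c '>' sint_denovo_cc.fasta
# expect 30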

Format pseudo genome

GENOME_FASTA=sint_denovo_cc.fasta

echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build $GENOME_FASTA $GENOME_FASTA >>genomeBuild.sh
echo samtools faidx $GENOME_FASTA >>genomeBuild.sh

sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mem=200GB genomeBuild.sh

M A P P I N G   R E A D S   T O   R E F E R E N C E


Mapping reads to reference and formatting bam files

Map reads to the pseudo-genome:

mv ../filteredReads/*.un .
mv ../symbionts . # the symbionts directory was moved up to ~/2bRAD/sint/fknms/ earlier

GENOME_FASTA=sint_denovo_cc.fasta

# mapping with --local option, enables clipping of mismatching ends (guards against deletions near ends of RAD tags)
2bRAD_bowtie2_launcher.py -f un -g $GENOME_FASTA --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm

Do we have the right number of SAM files?

ls *.sam | wc -l 

Checking alignment rates

ls *un | cut -d '.' -f 1 >align1
grep "% overall" maps.e* | cut -d ' ' -f 1 >align2
>alignmentRates
paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >alignmentRates
rm align1 align2

less alignmentRates

Convert SAM files to BAM files

BAM files will be used for genotyping, population structure, etc.

>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done

launcher_creator.py -j s2b -n s2b -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB s2b.slurm

Do we have enough BAM files?

ls *bam | wc -l  # should be the same number as number of trim files

Clean up directory

zipper.py -a -9 -f sam --launcher -e reckert2017@fau.edu
sbatch zip.slurm

rm *.un

G E N O T Y P I N G


“FUZZY genotyping” with ANGSD - without calling actual genotypes but working with genotype likelihoods at each SNP. Optimal for low-coverage data (<10x).

mkdir ../ANGSD3x
cd ../ANGSD3x
mv ../mappedReads/*.bam* .

ls *bam >bamsClones

Assessing base qualities and coverage depth

ANGSD settings:
-minMapQ 20: only highly unique mappings (probability of erroneous mapping <= 1%)
-baq 1: realign around indels (not terribly relevant for 2bRAD reads mapped with the --local option)
-maxDepth: highest total depth (sum over all samples) to assess; set to 10x the number of samples (226 x 10 = 2260)
-minInd: the minimal number of individuals the site must be genotyped in; reset to 50% of total N at this stage (113 of 226)

export FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -maxDepth 2260 -minInd 113"
export TODO="-doQsDist 1 -doDepth 1 -doCounts 1 -dumpCounts 2"

echo '#!/bin/bash' >sintDD.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out dd >>sintDD.sh

sbatch --mem=200GB -o sintDD.o%j -e sintDD.e%j --mail-user=reckert2017@fau.edu --mail-type=ALL sintDD.sh

Summarizing results (using Misha Matz's modification of a script by Matteo Fumagalli)

echo '#!/bin/bash' >RQC.sh
echo Rscript ~/bin/plotQC.R prefix=dd >>RQC.sh
echo gzip -9 dd.counts >>RQC.sh
# replace 460550 with the job ID returned by your sintDD.sh sbatch call
sbatch -e RQC.e%j -o RQC.o%j --dependency=afterok:460550 --mem=200GB RQC.sh

Proportion of sites covered at >5X:

cat quality.txt

scp dd.pdf to your laptop to look at the distribution of base quality scores, the fraction of sites in each sample passing coverage thresholds, and the fraction of sites passing genotyping-rate cutoffs. Use these to guide your choices of the -minQ, -minIndDepth, and -minInd filters in subsequent ANGSD runs

Identifying clones and technical replicates

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 170 -snp_pval 1e-6 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo '#!/bin/bash' > sintClones3x.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out sintClones3x >>sintClones3x.sh

sbatch --mem=200GB -o sintClones3x.o%j -e sintClones3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintClones3x.sh

Use the IBS matrix to identify clones with hierarchical clustering in R. scp it to your local machine and run the chunk below in R

if (!require("pacman")) install.packages("pacman")

pacman::p_load("dendextend", "ggdendro", "paletteer", "tidyverse")

cloneBams = read.csv("../data/stephanocoeniaMetaData.csv") # list of bam files

cloneMa = as.matrix(read.table("../data/snps/clones/sintClones3x.ibsMat")) # reads in IBS matrix produced by ANGSD

dimnames(cloneMa) = list(cloneBams[,1],cloneBams[,1])
clonesHc = hclust(as.dist(cloneMa),"ave")

clonePops = cloneBams$region
cloneDepth = cloneBams$depthZone

cloneDend = cloneMa %>% as.dist() %>% hclust(.,"ave") %>% as.dendrogram()
cloneDData = cloneDend %>% dendro_data()

# Making the branches hang shorter so we can easily see clonal groups
cloneDData$segments$yend2 = cloneDData$segments$yend
for(i in 1:nrow(cloneDData$segments)) {
  if (cloneDData$segments$yend2[i] == 0) {
    cloneDData$segments$yend2[i] = (cloneDData$segments$y[i] - 0.01)}}

cloneDendPoints = cloneDData$labels
cloneDendPoints$pop = clonePops[order.dendrogram(cloneDend)]
cloneDendPoints$depth=cloneDepth[order.dendrogram(cloneDend)]
rownames(cloneDendPoints) = cloneDendPoints$label

# Making points at the leaves to place symbols for populations
point = as.vector(NA)
for(i in 1:nrow(cloneDData$segments)) {
  if (cloneDData$segments$yend[i] == 0) {
    point[i] = cloneDData$segments$y[i] - 0.01
  } else {
    point[i] = NA}}

cloneDendPoints$y = point[!is.na(point)]

techReps = c("S066.1", "S066.2", "S066.3", "S162.1", "S162.2", "S162.3", "S205.1", "S205.2", "S205.3")
cloneDendPoints$depth = factor(cloneDendPoints$depth,levels(cloneDendPoints$depth)[c(2,1)])

cloneDendPoints$pop = factor(cloneDendPoints$pop,levels(cloneDendPoints$pop)[c(4, 1, 3, 2)])

flPal = paletteer_d("vapoRwave::jazzCup")[c(2:5)]

cloneDendA = ggplot() +
  geom_segment(data = segment(cloneDData), aes(x = x, y = y, xend = xend, yend = yend2), size = 0.5) +
  geom_point(data = cloneDendPoints, aes(x = x, y = y, fill = pop, shape = depth), size = 4, stroke = 0.25) +
  #scale_fill_brewer(palette = "Dark2", name = "Population") +
  scale_fill_manual(values = flPal, name= "Population")+
  scale_shape_manual(values = c(24, 25), name = "Depth Zone")+
  geom_hline(yintercept = 0.12, color = "red", lty = 5, size = 0.75) + # creating a dashed line to indicate a clonal distance threshold
  geom_text(data = subset(cloneDendPoints, subset = label %in% techReps), aes(x = x, y = (y - .015), label = label), angle = 90) + # spacing technical replicates further from leaf
  geom_text(data = subset(cloneDendPoints, subset = !label %in% techReps), aes(x = x, y = (y - .010), label = label), angle = 90) +
  labs(y = "Genetic distance (1 - IBS)") +
  guides(fill = guide_legend(override.aes = list(shape = 22)))+
  theme_classic()

cloneDend = cloneDendA + theme(
  axis.title.x = element_blank(),
  axis.text.x = element_blank(),
  axis.line.x = element_blank(),
  axis.ticks.x = element_blank(),
  axis.title.y = element_text(size = 12, color = "black", angle = 90),
  axis.text.y = element_text(size = 10, color = "black"),
  axis.line.y = element_line(),
  axis.ticks.y = element_line(),
  panel.grid = element_blank(),
  panel.border = element_blank(),
  panel.background = element_blank(),
  plot.background = element_blank(),
  legend.key = element_blank(),
  legend.title = element_text(size = 12),
  legend.text = element_text(size = 10),
  legend.position = "bottom")

cloneDend

ggsave("../figures/cloneDend.png", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
ggsave("../figures/cloneDend.eps", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)

Removing clones and re-running ANGSD

mkdir clones
mv sintClones* clones

# keep one representative of each technical-replicate/clone group
cat bamsClones | grep -v 'fk_S066.1.trim.un.bt2.bam\|fk_S066.3.trim.un.bt2.bam\|fk_S162.1.trim.un.bt2.bam\|fk_S162.3.trim.un.bt2.bam\|fk_S205.1.trim.un.bt2.bam\|fk_S205.3.trim.un.bt2.bam' >bamsNoClones
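Quick check: dropping the 6 replicate bams from the original 226 should leave 220 (the same -n used for ngsRelate later).

cat bamsNoClones | wc -l
# expect 220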

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 165 -snp_pval 1e-6 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo '#!/bin/bash' > sintNoClones3x.sh
echo srun angsd -b bamsNoClones -GL 1 $FILTERS $TODO -P 1 -out sintNoClones3x >> sintNoClones3x.sh

sbatch --mem=200GB -o sintNoClones3x.o%j -e sintNoClones3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintNoClones3x.sh

How many SNPs?

grep "filtering:" sintNoClones.e*

18,763 SNPs

P O P U L A T I O N   S T R U C T U R E


pcangsd

Using pcangsd to discern population structure

cd ~/2bRAD/sint/fknms
mkdir pcangsd
cd pcangsd

cp ../ANGSD3x/sintNoClones3x.beagle.gz .


source activate 2bRAD

echo 'pcangsd -b sintNoClones3x.beagle.gz -o fkSintPcangsd3x --admix --inbreedSamples --pcadapt --selection' > pcangsd

launcher_creator.py -j pcangsd -n pcangsd -q shortq7 -t 06:00:00 -e $EMAIL -w 1 -N 1
sbatch pcangsd.slurm
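pcangsd writes (among other outputs) a sample-by-sample covariance matrix, fkSintPcangsd3x.cov. A minimal R sketch for a quick PCA plot after you scp it to your local machine:

C <- as.matrix(read.table("fkSintPcangsd3x.cov")) # samples x samples covariance matrix
e <- eigen(C)
pctVar <- round(100 * e$values[1:2] / sum(e$values), 1) # percent variance on PC1/PC2
plot(e$vectors[, 1], e$vectors[, 2],
  xlab = paste0("PC1 (", pctVar[1], "%)"),
  ylab = paste0("PC2 (", pctVar[2], "%)"))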

Calculate population structure from genotype likelihoods using NGSadmix across a range of K values. FIRST remove all clones/genotyping replicates! (we did this above).

mkdir ../ngsAdmix
cp *beagle* ../ngsAdmix

zipper.py -f bam -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm

cd ../ngsAdmix

Create a file with 50 replicate simulations for each value of K 1-11 (num pops + 3)

ngsAdmixLauncher.py -f sintNoClones3x.beagle.gz --maxK 11 -r 50 -n fkSint3x --launcher -e reckert2017@fau.edu

sbatch --mem=200GB fkSint3xNgsAdmix.slurm

Calculating most likely value of K

Next, take the likelihood value from each run of NGSadmix and put them into a file that can be used with Clumpak to calculate the most likely K using the methods of Evanno et al. (2005).

>fkSint3xNgsAdmixLogfile
for log in fkSint3x*.log; do
grep -Po 'like=\K[^ ]+' $log >> fkSint3xNgsAdmixLogfile;
done

Format for CLUMPAK in R

R

You are now using R in the terminal

logs <- as.data.frame(read.table("fkSint3xNgsAdmixLogfile"))

#output is organized with 10, 11 preceding 1, 2, 3 etc.
logs$K <- c(rep("10", 50), rep("11", 50), rep("1", 50), rep("2", 50), rep("3", 50),
rep("4", 50), rep("5", 50), rep("6", 50),
    rep("7", 50), rep("8", 50), rep("9", 50))
write.table(logs[, c(2, 1)], "fkSint3xNgsAdmixLogfile_formatted", row.names = F,
        col.names = F, quote = F)
quit()
# No need to save workspace image [press 'n']
n

Check that your formatted logfile has the appropriate number of entries (here 550: 11 values of K x 50 replicates)

cat fkSint3xNgsAdmixLogfile_formatted | wc -l

make copies of .qopt files to run structure selector on (.Q files)

for file in fkSint3x*.qopt; do
filename=$(basename -- "$file" .qopt);
cp "$file" "$filename".Q;
done

mkdir fkSint3xQ
mv fkSint3x*Q fkSint3xQ

zip -r fkSint3xQ.zip fkSint3xQ

scp the .zip and the formatted logfile to your local machine and upload them to CLUMPAK (http://clumpak.tau.ac.il/bestK.html) and Structure Selector (https://lmme.ac.cn/StructureSelector/index.html)

scp sintNoClones* to local machine for further analyses with R

Running ANGSD within lineages

Since we have 4 distinct lineages, we will pare down SNPs to only those found within all lineages. This helps avoid issues of ascertainment bias.

cd ~/2bRAD/sint/fknms/ANGSD3x

# created 'bamsClusters' in R, based on Admixture analysis

awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters >> blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters >> tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters >> greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters >> yellowBams

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"

TODO="-doMajorMinor 1 -doMaf 1 -doGeno 8 -doPost 1 -doGlf 2"

echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -out sintBlueSnps
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 30 -out sintTealSnps
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 23 -out sintGreenSnps
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 11 -out sintYellowSnps" > kSnps

launcher_creator.py -j kSnps -n kSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1

sbatch kSnps.slurm


Filtering down the sites to those found within each lineage with the given ANGSD filtering criteria

for file in sint*Snps.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p shortq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done

srun awk '(++c[$0])==(ARGC-1)' *Snpssites > sitesToDo
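This awk idiom counts how many of the input files contain each "chromosome<TAB>position" line and prints a line once its count reaches ARGC-1 (the number of file arguments), i.e., only sites present in every lineage (it assumes no duplicated lines within a file). A toy demo with two hypothetical files:

printf 'chr1\t10\nchr1\t20\n' > a
printf 'chr1\t10\nchr1\t30\n' > b
awk '(++c[$0])==(ARGC-1)' a b
# prints only: chr1	10
rm a b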

mkdir ../filteredSNPS3x

mv sint*Snps* ../filteredSNPS3x/
mv kSnps* ../filteredSNPS3x/


Now index the sites file and run angsd on all samples with only the reduced number of sites using -sites flag

cd ~/2bRAD/sint/fknms/ANGSD3x

angsd sites index sitesToDo

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"

TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo "angsd -b bamsNoClones -sites sitesToDo -GL 1 -P 1 $FILTERS $TODO -minInd 165 -out sintFiltSnps" > sintFiltSnps3x

launcher_creator.py -j sintFiltSnps3x -n sintFiltSnps3x -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch sintFiltSnps3x.slurm


mv sintFilt* ../filteredSNPS3x/


I N B R E E D I N G   A N D   R E L A T E D N E S S

Running ngsRelate on filtered SNPs

cd ~/2bRAD/sint/fknms/ANGSD3x

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"

TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 3"

echo '#!/bin/bash' > sintFiltRelate3x.sh
echo srun angsd -b bamsNoClones -GL 1 -sites sitesToDo $FILTERS $TODO -P 1 -minInd 165 -out sintFiltRelate3x >> sintFiltRelate3x.sh

sbatch --mem=200GB -o sintFiltRelate3x.o%j -e sintFiltRelate3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintFiltRelate3x.sh

mkdir ../ngsRelate3x
cd ../ngsRelate3x

mv ~/2bRAD/sint/fknms/ANGSD3x/*Relate* .

zcat sintFiltRelate3x.mafs.gz | cut -f5 |sed 1d >freq

echo '#!/bin/bash' > ngsFiltRelate3x.sh
echo ngsRelate -g sintFiltRelate3x.glf.gz -n 220 -f freq -z ../ANGSD3x/bamsNoClones -O filtRelate3x >> ngsFiltRelate3x.sh

sbatch -e ngsFiltRelate3x.e%j -o ngsFiltRelate3x.o%j --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ngsFiltRelate3x.sh


ngsF inbreeding

module load ngstools-master-gcc-8.3.0-qcbecbz
module load ngsf-master-gcc-8.3.0-yscrrl3

mkdir ~/2bRAD/sint/fknms/ngsF3x
cd ~/2bRAD/sint/fknms/ngsF3x

cp ../ngsRelate3x/sintFilt* .

N_SITES=$((`zcat sintFiltRelate3x.mafs.gz | wc -l`-1))
srun zcat sintFiltRelate3x.glf.gz | ngsF --n_ind 220 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out fkSint.approx_indF --approx_EM --seed 12345 --init_values r
srun zcat sintFiltRelate3x.glf.gz  | ngsF --n_ind 220 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out fkSintF.indF --init_values fkSint.approx_indF.pars

S I T E   F R E Q U E N C Y   S P E C T R A

Running SFS calculations within lineage to use for Fst and to run StairwayPlot

cd ~/2bRAD/sint/fknms/ANGSD3x

## scp the bamsClusters text file to this directory if needed; these per-lineage
## bam lists are the same ones made in the previous section (skip if they already exist)

awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters > blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters > tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters > greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters > yellowBams


## No filters that distort allele frequencies (no minMaf or snp_pval cutoffs)

FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -maxHetFreq 0.5 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11 -doGlf 2"

echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -out sintBlueSFS3x
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 30 -out sintTealSFS3x
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 23 -out sintGreenSFS3x
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 11 -out sintYellowSFS3x" > sfsClusters3x

launcher_creator.py -j sfsClusters3x -n sfsClusters3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfsClusters3x.slurm


for file in sint*SFS3x.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p longq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done


Now, compile common sites from all lineages using awk

srun awk '(++c[$0])==(ARGC-1)' *SFS3xsites > sfs3xSitesToDo

cat sfs3xSitesToDo | wc -l

# index sites for ANGSD and re-run ANGSD using -sites
angsd sites index sfs3xSitesToDo

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

TODO="-doSaf 1 -ref $GENOME_FASTA -anc $GENOME_FASTA -doMaf 1 -doMajorMinor 4"

echo "angsd -sites sfs3xSitesToDo -b blueBams -GL 1 -P 1 $TODO -out sintBlue3x
angsd -sites sfs3xSitesToDo -b tealBams -GL 1 -P 1 $TODO -out sintTeal3x
angsd -sites sfs3xSitesToDo -b greenBams -GL 1 -P 1 $TODO -out sintGreen3x
angsd -sites sfs3xSitesToDo -b yellowBams -GL 1 -P 1 $TODO -out sintYellow3x" >sfs3x

launcher_creator.py -j sfs3x -n sfs3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs3x.slurm


Once all jobs run:

mkdir ../SFS3x
mv sfs* ../SFS3x
mv sintBlue* ../SFS3x
mv sintTeal* ../SFS3x
mv sintGreen* ../SFS3x
mv sintYellow* ../SFS3x

cd ../SFS3x

###-- 1d-SFS
echo "realSFS sintBlue3x.saf.idx >sintBlue3x.sfs
realSFS sintTeal3x.saf.idx >sintTeal3x.sfs
realSFS sintGreen3x.saf.idx >sintGreen3x.sfs
realSFS sintYellow3x.saf.idx >sintYellow3x.sfs" >realSFS3x

launcher_creator.py -j realSFS3x -n realSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -w 4 -N 1
sbatch realSFS3x.slurm


###-- 2d-SFS
echo "realSFS sintBlue3x.saf.idx sintTeal3x.saf.idx -P 20 > pBlTl3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintTeal3x.saf.idx -sfs pBlTl3x.sfs -fstout pBlTl3x
realSFS sintBlue3x.saf.idx sintGreen3x.saf.idx -P 20 > pBlGn3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintGreen3x.saf.idx -sfs pBlGn3x.sfs -fstout pBlGn3x
realSFS sintBlue3x.saf.idx sintYellow3x.saf.idx -P 20 > pBlYl3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintYellow3x.saf.idx -sfs pBlYl3x.sfs -fstout pBlYl3x
realSFS sintTeal3x.saf.idx sintGreen3x.saf.idx -P 20 > pTlGn3x.sfs ; realSFS fst index sintTeal3x.saf.idx sintGreen3x.saf.idx -sfs pTlGn3x.sfs -fstout pTlGn3x
realSFS sintTeal3x.saf.idx sintYellow3x.saf.idx -P 20 > pTlYl3x.sfs ; realSFS fst index sintTeal3x.saf.idx sintYellow3x.saf.idx -sfs pTlYl3x.sfs -fstout pTlYl3x
realSFS sintGreen3x.saf.idx sintYellow3x.saf.idx -P 20 > pGnYl3x.sfs ; realSFS fst index sintGreen3x.saf.idx sintYellow3x.saf.idx -sfs pGnYl3x.sfs -fstout pGnYl3x" >2dSFS3x

launcher_creator.py -j 2dSFS3x -n 2dSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -w 20 -N 1
sbatch 2dSFS3x.slurm

Global Fst between lineages

> sintKFst3x
realSFS fst stats pBlYl3x.fst.idx >> sintKFst3x
realSFS fst stats pTlYl3x.fst.idx >> sintKFst3x
realSFS fst stats pGnYl3x.fst.idx >> sintKFst3x
realSFS fst stats pBlTl3x.fst.idx >> sintKFst3x
realSFS fst stats pBlGn3x.fst.idx >> sintKFst3x
realSFS fst stats pTlGn3x.fst.idx >> sintKFst3x

echo "Blue
Teal
Green
Blue
Blue
Teal">fstPops1

echo "Yellow
Yellow
Yellow
Teal
Green
Green" >fstPops2

paste fstPops1 fstPops2 sintKFst3x > sintKFst3x.o

echo -e "pop1\tpop2\tfst\tweightedFst" | cat - sintKFst3x.o > sintKFst3x.out

rm fstPops1 fstPops2 sintKFst3x sintKFst3x.o

cat sintKFst3x.out


StairwayPlot2

First, we filter Fst outliers from sites:

cd ~/2bRAD/sint/fknms/ANGSD3x

FILTERS="-minMapQ 20 -minQ 30 -minInd 165 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 4 -ref $GENOME_FASTA -doMaf 1 -dosnpstat 1 -doPost 2 -doBcf 1 --ignore-RG 0 -doGeno 11 -doCounts 1"

echo "angsd -b bamsNoClones -sites sfs3xSitesToDo -GL 1 $FILTERS $TODO -P 1 -out sintFiltSnps3xBayesFst" > sintFiltSnps3xBayesFst
launcher_creator.py -j sintFiltSnps3xBayesFst -n sintFiltSnps3xBayesFst -t 6:00:00 -e $EMAIL -w 1 -q shortq7
sbatch sintFiltSnps3xBayesFst.slurm

mkdir ~/2bRAD/sint/fknms/bayescan3x
cd ~/2bRAD/sint/fknms/bayescan3x


Create a file called vcf2bayescan.spid containing this text:

echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bspops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file  (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid


Need a file "bspops.txt" that lists samples in the same order as the bams file, with their assigned populations/lineages

cp ../ANGSD3x/bamsClusters ./bspops.txt

cp ../ANGSD3x/sintFiltSnps3xBayesFst.bcf .

srun bcftools convert -O v -o sintFiltSnps3xBayesFst.vcf sintFiltSnps3xBayesFst.bcf

srun java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnps3xBayesFst.vcf -outputfile sintFilt3x.bayescan -spid vcf2bayescan.spid

echo '#!/bin/bash' >bayeScan3x.sh
echo "bayescan sintFilt3x.bayescan -threads=100" >>bayeScan3x.sh

sbatch -e bayeScan3x.e%j -o bayeScan3x.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL bayeScan3x.sh

cut -d" " -f2- sintFilt3x.baye_fst.txt > sint3xFst

## strip the VCF header so only the SNP records remain (the 70 is specific to this file's header length)
tail --lines=+70 sintFiltSnps3xBayesFst.vcf | cut -f 1,2 | paste --delimiters "\t" - sint3xFst > sint3x.baye_fst_pos.txt
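The hard-coded 70 assumes a particular VCF header length; before trusting the paste, check your header size and confirm the two paste inputs have matching line counts (a minimal sanity check):

grep -c '^#' sintFiltSnps3xBayesFst.vcf
tail --lines=+70 sintFiltSnps3xBayesFst.vcf | wc -l
cat sint3xFst | wc -l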


Identify outlier sites with q-value < 0.5 and remove them from the sites list


cp ../ANGSD3x/sfs3xSitesToDo .

awk '$5<0.5 {print $1"\t"$2}' sint3x.baye_fst_pos.txt > sint3xBayeOuts

grep -Fvxf sint3xBayeOuts sfs3xSitesToDo > sitesToDo3x_filtBaye

cp sitesToDo3x_filtBaye ../ANGSD3x
cd ../ANGSD3x

angsd sites index sitesToDo3x_filtBaye

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

TODO="-doSaf 1 -anc $GENOME_FASTA -ref $GENOME_FASTA -doMaf 1 -doMajorMinor 4"

echo "angsd -sites sitesToDo3x_filtBaye -b blueBams -GL 1 -P 1 $TODO -out sintBlueSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b tealBams -GL 1 -P 1 $TODO -out sintTealSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b greenBams -GL 1 -P 1 $TODO -out sintGreenSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b yellowBams -GL 1 -P 1 $TODO -out sintYellowSFS3xbaye" >sfs3x_filtBaye

launcher_creator.py -j sfs3x_filtBaye -n sfs3x_filtBaye -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs3x_filtBaye.slurm

mv *SFS* ../SFS3x
mv *sfs* ../SFS3x

Generating per-population SFS

cd ../SFS3x

echo "realSFS sintBlueSFS3xbaye.saf.idx -fold 1 >sint3xFiltBlue.sfs
realSFS sintTealSFS3xbaye.saf.idx -fold 1 >sint3xFiltTeal.sfs
realSFS sintGreenSFS3xbaye.saf.idx -fold 1 >sint3xFiltGreen.sfs
realSFS sintYellowSFS3xbaye.saf.idx -fold 1 >sint3xFiltYellow.sfs" >sintFiltSFS3x

launcher_creator.py -j sintFiltSFS3x -n sintFiltSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintFiltSFS3x.slurm


Making blueprint files for StairwayPlot

mkdir ../swPlot3x
cd ../swPlot3x

# SWplot only wants the mutation bins, so we have to omit the first number in the SFS file (the 0-mutations bin).

# L = cat ../ANGSD3x/sitesToDo3x_filtBaye | wc -l
# which in our case = 1602283
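The SFS line in each blueprint comes from the corresponding realSFS output, dropping the first (0-mutation) bin and keeping bins 1 through nseq/2. A sketch for the blue lineage (nseq = 262, so 131 bins; adjust the field range for the other lineages):

cut -d' ' -f2-132 ../SFS3x/sint3xFiltBlue.sfs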

echo "#input setting
popid: sintBlue # id of the population (no white space)
nseq: 262 # number of sequences (2*n)
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 48412.081585 16072.295629 8136.263471 5566.281044 3884.862859 3000.493939 2490.542066 2044.605947 1677.829824 1462.808464 1153.734320 1210.340336 996.550586 853.912239 871.393691 745.252259 707.343914 644.722151 524.781440 619.940399 453.336102 512.412506 418.589833 345.809706 589.674490 224.318676 425.283171 225.893123 296.762772 325.835638 366.133932 265.969924 263.448203 232.699638 255.827769 255.852748 232.754720 243.526873 148.221226 258.233364 208.669644 144.653762 264.557705 161.092656 138.914035 101.710255 253.695151 29.148930 181.051140 221.657390 97.162859 184.658342 21.433043 121.318353 251.518539 100.839142 157.846619 89.114729 107.951232 187.816747 136.096324 102.504921 119.852484 74.043546 120.152182 118.940635 73.374622 110.083987 103.685855 114.348029 54.475104 95.164183 92.298819 77.948983 105.927514 121.353275 85.980549 43.879627 90.136484 73.330610 122.956757 37.319611 21.488750 103.526304 176.889672 45.691251 30.019737 68.840477 95.416782 24.877401 27.192546 76.326701 94.728179 44.204632 55.961376 107.185398 98.646553 28.767601 41.776037 41.180102 58.361194 90.205366 17.981080 20.352570 93.415816 43.464174 31.050422 35.300437 126.310322 20.320044 27.939147 75.448620 71.715330 3.034916 48.578866 86.238681 51.132361 33.057738 65.638127 27.008721 64.533759 77.421571 40.131592 121.553633 31.729413 50.015735 126.942517 14.357809 92.065024 66.165010 149.756469 
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 131 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 65 130 195 260 # number of random break points for each try (separated by white space)
project_dir: sintBlue # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintBlue # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintBlue3x.blueprint


echo "#input setting
popid: sintTeal # id of the population (no white space)
nseq: 80 # number of sequences (2*n)
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 16583.279005 7587.040220 4303.689077 3107.127953 2274.114237 1798.221276 1496.358048 1173.763214 1103.075623 822.648285 753.066621 730.392423 645.355034 602.195016 506.652940 428.804524 534.251728 450.322840 403.078661 380.013903 342.235477 286.348260 373.893637 243.034076 245.770095 249.644525 277.998715 224.384919 164.261869 253.896033 168.473848 240.951964 159.797919 171.288663 147.072982 148.058161 244.680742 217.024433 217.696620 289.787897
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 40 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 19 39 58 78 # number of random break points for each try (separated by white space)
project_dir: sintTeal # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 12
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintTeal # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintTeal3x.blueprint


echo "#input setting
popid: sintGreen # id of the population (no white space)
nseq: 62 # number of sequences
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 16061.203222 6825.579753 3958.728333 2661.799656 1970.474821 1495.152098 1276.663012 1067.091402 978.810625 796.180817 728.985927 639.560007 622.381059 552.760773 472.054998 428.264832 427.700157 381.403228 396.408490 314.339368 265.966511 300.508523 255.298912 252.674516 282.144474 192.814169 240.587680 273.439733 316.896586 225.258840 456.920758
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 31 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 15 30 45 60 # number of random break points for each try (separated by white space)
project_dir: sintGreen # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 71
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintGreen # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintGreen3x.blueprint


echo "#input setting
popid: sintYellow # id of the population (no white space)
nseq: 30 # number of sequences
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic 
whether_folded: true # whether the SFS is folded (true or false)
SFS: 14098.279848 6216.748731 3541.171487 2370.205010 2002.251792 1516.454204 1284.282892 890.364677 919.754597 657.957085 598.193012 515.892221 576.527408 537.284192 673.801264
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 15 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 7 14 21 28 # number of random break points for each try (separated by white space)
project_dir: sintYellow # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 23
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintYellow # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintYellow3x.blueprint

Create batch files and launch:

for blueprint in *blueprint; do
java -cp $HOME/bin/stairway_plot/stairway_plot_es Stairbuilder $blueprint;
done

# split each .sh script and launch all (second job waiting with dependency on first finishing)

for file in *.sh; do
grep -B 805 "# Step 2: determine number of break points" $file > ${file%.*}1;
grep -A 805 "# Step 2: determine number of break points" $file > ${file%.*}2;
echo '#!/bin/bash' | cat - ${file%.*}2 >temp;
mv temp ${file%.*}2;
launcher_creator.py -j ${file%.*}1 -n ${file%.*}1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch ${file%.*}1.slurm) && sbatch --dependency=afterok:${DEP##* } -e ${file%.*}2.e%j -o ${file%.*}2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%.*}2;
done

After all have run:

mv sint*/*final.summary .


H E T E R O Z Y G O S I T Y


Calculating heterozygosity across all loci (variant and invariant) using ANGSD and an R script from Misha Matz (https://github.com/z0on/2bRAD_denovo)

cd ~/2bRAD/sint/fknms/ANGSD3x
cp ../SFS3x/sfs3xSitesToDo .

angsd sites index sfs3xSitesToDo

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

FILTERS="-maxHetFreq 0.5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 165 -setMinDepthInd 3"
TODO="-ref $GENOME_FASTA -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 32 -doPost 1 -doGlf 2 -doCounts 1 -doMajorMinor 1 -dosnpstat 1 -doMaf 1"

echo "angsd -sites sfs3xSitesToDo -b bamsNoClones -GL 1 -P 1 $TODO $FILTERS -out sintHet3x" >angsdHet3x

launcher_creator.py -j angsdHet3x -n angsdHet3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch angsdHet3x.slurm

mkdir ../heterozygosity3x
cp sintHet* ../heterozygosity3x/

cd ../heterozygosity3x

echo heterozygosity_beagle.R sintHet3x.beagle.gz >filtHet3x

launcher_creator.py -j filtHet3x -n filtHet3x -q shortq7 -t 6:00:00 -e $EMAIL -N 1
sbatch --mem=200GB filtHet3x.slurm

tail -n 220 filtHet3x.e* > sintHet3x

cat sintHet3x

N U C L E O T I D E   D I V E R S I T Y

cd ~/2bRAD/sint/fknms/
mkdir theta3x

cd ANGSD3x

GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

>kThetas3x

for POP in yellow green teal blue; do
echo "angsd -b ${POP}Bams -GL 1 -P 1 -sites sfs3xSitesToDo -anc $GENOME_FASTA -doSaf 1 -out ${POP}3xOut &&\
realSFS ${POP}3xOut.saf.idx -fold 1 > ${POP}3xOut.sfs &&\
realSFS saf2theta ${POP}3xOut.saf.idx -sfs ${POP}3xOut.sfs -outname ${POP}3x -fold 1 &&\
thetaStat do_stat ${POP}3x.thetas.idx" >> kThetas3x
done

launcher_creator.py -j kThetas3x -n kThetas3x -q shortq7 -t 6:00:00 -e $EMAIL -w 4 -N 1
sbatch kThetas3x.slurm

mv *theta* ../theta3x
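thetaStat do_stat writes a .pestPG summary for each lineage; per-site nucleotide diversity (pi) is pairwise theta (tP) divided by the number of sites. A sketch, with column positions assumed from the standard pestPG header (verify yours with head -1):

cd ../theta3x

for POP in yellow green teal blue; do
awk -v p=$POP 'NR>1 {print p, $5/$14}' ${POP}3x.thetas.idx.pestPG;
done
# column 5 = tP (pairwise theta) and column 14 = nSites in the standard pestPG layout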


M I G R A T I O N   M O D E L I N G

cd ~/2bRAD/sint/fknms/ANGSD3x

FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"

TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"

echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -sites sitesToDo -out sintBlueSnps3x" > blueSnps3x

launcher_creator.py -j blueSnps3x -n blueSnps3x -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1

sbatch blueSnps3x.slurm

To use BayesAss3 (XXX) we first need to convert our ANGSD output into genotype format

cd ~/2bRAD/sint/fknms
mkdir BA33x

cd BA33x
cp ../ANGSD3x/blueBams . 
cp ../ANGSD3x/sintBlueSnps3x.bcf .

#need file bspops that lists bams and the sampling site, tab delimited.
cp ../bayescan3x/fkSintBsPops.txt bspops.txt

awk -F ' ' 'NR==FNR{a[$1]; next} ($1 in a) || ($2 in a)' blueBams bspops.txt > fkSintBlueBA3Pops.txt

cat fkSintBlueBA3Pops.txt | wc -l
# 131 — Good!

bcftools view sintBlueSnps3x.bcf > sintBlueSnps3x.vcf

echo "# VCF Parser questions
PARSER_FORMAT=VCF

# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBlueBA3Pops.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true

# Immanc (BayesAss) Writer questions
WRITER_FORMAT=IMMANC

# Specify the locus/locus combination you want to write to the Immanc (BayesAss) file:
IMMANC_WRITER_LOCUS_COMBINATION_QUESTION=
# Specify which data type should be included in the Immanc (BayesAss)) file  (Immanc (BayesAss) can only analyze one data type per file):
IMMANC_WRITER_DATA_TYPE_QUESTION=SNP" >fkSintBA.spid


module load pgdspider-2.1.1.2-gcc-9.2.0-ghxvd4c 

pgdSpider=/opt/ohpc/pub/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/pgdspider-2.1.1.2-ghxvd4c4ieqngkbutakc7x6j4pfkqm5e/bin/PGDSpider2-cli.jar

echo '#!/bin/bash' > pgdSpider.sh
echo "java -Xmx1024m -Xms512m -jar $pgdSpider -inputformat VCF -outputformat IMMANC vcf -inputfile sintBlueSnps3x.vcf -outputfile fkSintBayesAss3x.txt -spid fkSintBA.spid" >>pgdSpider.sh

sbatch -e pgdSpider.e%j -o pgdSpider.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL pgdSpider.sh
## Default params:
#MCMC reps: 1,000,000
#burn in: 100,000
#sampling freq: 100
#delta migration (1): 0.1
#delta allele freq (3): 0.1
#delta inbreeding (4): 0.1

rm sintBlueSnps3x.bcf
rm bspops.txt

module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2

# Run a test with verbose [-v] output to see the acceptance rates in the terminal (takes a few minutes to compute)
# check the output file [less BATest.o*] and kill the job once you get output with acceptance rates [scancel {yourJobID}]
# After ~5 min you should start seeing output in the BATest.o* file

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 fkSintBayesAss3x.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest

#logL: -545609.14 % done: [0.00] % accepted: (0.49, 0.00, 0.91, 0.13, 0.79)

# we are looking for 20—60% acceptance, ideally somewhere nearer 20—30%
# relationships between mixing parameters and acceptance rates are inverse
# defaults are 0.1 (all parameters are scale 0—1)
# increase [-m] increase [-a] and decrease [-f]

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000  -m 0.2 -a 0.6 -f 0.04 fkSintBayesAss3x.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logP(M): -324.29 logL(G): -546136.13 logL: -546460.42 % done: [0.00] % accepted: (0.30, 0.00, 0.34, 0.33, 0.80)

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000  -m 0.25 -a 0.7 -f 0.07 fkSintBayesAss3x.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logP(M): -313.10 logL(G): -545977.13 logL: -546290.23 % done: [0.00] % accepted: (0.24, 0.00, 0.31, 0.19, 0.79)

echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000  -m 0.25 -a 0.85 -f 0.05 fkSintBayesAss3x.txt >> BATest

sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -310.70 logL(G): -545736.54 logL: -546047.24 % done: [0.00] % accepted: (0.24, 0.00, 0.25, 0.27, 0.80)

Make and launch 10 iterations of BayesAss, each in its own run directory so we can keep all trace files (saved as 'BA3trace.txt'; they would overwrite one another if not in separate directories). We use [-s $RANDOM] to give each independent run a random start seed

module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2

for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo BA3SNP -v -u -s $RANDOM -i 30000000 -b 10000000 -n 1000  -m 0.25 -a 0.85 -f 0.05 -t -o fkSintBARun${i}Out.txt ../fkSintBayesAss3x.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
sbatch -e BayesAss$i.e%j -o BayesAss$i.o%j -p longq7 --mem=200GB --exclusive --mail-user reckert2017@fau.edu --mail-type=ALL BayesAss$i.sh
cd ..;
done

# after all runs complete copy files to main BayesAss directory
cd ~/2bRAD/sint/fknms/BA33x

#cp run*/*Out.txt .

for i in {01..10}; do 
cp run$i/BA3trace.txt BA3trace.$i.txt;
done

O U T L I E R   D E T E C T I O N


cd ~/2bRAD/sint/fknms

cd bayescan3x

cp ../ANGSD3x/sintFiltSnps.bcf .

bcftools view sintFiltSnps.bcf > sintFiltSnps.vcf

rm sintFiltSnps.bcf

You will have to create or scp a textfile with your .bam file names and populations

Creating a popfile that sorts bams into lineage_depth

# join lineage assignments (bspops.txt) with site/depth info (fkSintBsPops.txt)
awk -F ' ' 'NR==FNR{a[$1]=$0; next} ($1 in a) {print a[$1]" "$2}' bspops.txt fkSintBsPops.txt > bsPopsNoAdmix.txt
# drop admixed individuals
awk -F ' ' '$2 != "Admixed" {print $1, $2, $3}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt
# keep only the depth zone (the text after the last underscore) in the third field
awk '{sub(/.*_/,"",$3); print}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt
# combine lineage and depth into a single population label
awk -F ' ' '{print $1, $2"_"$3}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt

cat bsPopsNoAdmix.txt
cat bsPopsNoAdmix.txt | wc -l

# looks good!

Remove admixed individuals from vcf

awk -F ' ' '{print $1}' bsPopsNoAdmix.txt > noAdmixedBams

cat noAdmixedBams

bcftools view -S noAdmixedBams sintFiltSnps.vcf > sintFiltSnpsNoAdmixed.vcf

Convert vcf with PGDSpider

echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bsPopsNoAdmix.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file  (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan3x.spid

java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnpsNoAdmixed.vcf -outputfile fkSint3x.bayescan -spid vcf2bayescan3x.spid

Launch BayeScan (Takes 12+ hr depending on read count/SNPs)

echo '#!/bin/bash' > sintBayescan.sh
echo bayescan fkSint3x.bayescan -threads=100 >> sintBayescan.sh

sbatch -e sintBayescan.e%j -o sintBayescan.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayescan.sh

Extract outlier SNPs from BayeScan output

srun removeBayescanOutliers.pl bayescan=fkSint3x.baye_fst.txt vcf=sintFiltSnpsNoAdmixed.vcf FDR=0.05 mode=extract > fkSint3xVcfOutliers.vcf

Using BayeScEnv we can look for outliers related to depth

# Get the average depths for each population (lineage/depth combo)
##1 Blue_Mesophotic       36.2
##2 Teal_Mesophotic       35.2
##3 Teal_Shallow          21.0 
##4 Blue_Shallow          23.7 
##5 Yellow_Shallow        20.2
##6 Green_Shallow         21.9
##7 Green_Mesophotic      31.4

echo "36.2 35.2 21.0 23.7 20.2 21.9 31.4" > depth.txt

cp fkSint3x.bayescan fkSint3x.bayeScEnv

echo '#!/bin/bash' > sintBayeScEnv.sh
echo bayescenv fkSint3x.bayeScEnv -env depth.txt >> sintBayeScEnv.sh

sbatch -e sintBayeScEnv.e%j -o sintBayeScEnv.o%j -p longq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayeScEnv.sh

Extract outlier SNPs from BayeScEnv output

srun removeBayescanOutliers.pl bayescan=fkSint3x.bayeS_fst.txt vcf=sintFiltSnpsNoAdmixed.vcf FDR=0.05 mode=extract > fkSintBScEnvVcfOutliers.vcf


S Y M B I O N T S


cd ~/2bRAD/sint/fknms/
mv mappedReads/symbionts symbionts3x
cd symbionts3x

SYMGENOME=~/bin/symGenomes/symbConcatGenome

export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta

2bRAD_bowtie2_launcher.py -f zoox -g $GENOME_FASTA --split -a coral -u zoox --aldir coral --launcher -e reckert2017@fau.edu

mkdir coral

sbatch --mem=200GB maps.slurm

mkdir unused
mv *.sam unused/
mv *.trim.zoox unused/

2bRAD_bowtie2_launcher.py -f zoox -g $SYMGENOME --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm

>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done

launcher_creator.py -j s2b -n s2b -t 6:00:00 -N 5 -e reckert2017@fau.edu -q shortq7
sbatch s2b.slurm

Count reads mapping to each chromosome for concatenated Symbiodiniaceae genome

>zooxReads

for i in *.bam; do
echo $i >>zooxReads;
samtools idxstats $i | cut -f 1,3 >>zooxReads;
done