This walks through the steps to process 2bRAD reads without an existing genome.
Be sure to read through what you are doing and follow instructions before copy/pasting code chunks.
Copy the code chunks into your terminal, taking care to change the necessary portions to fit your data/ your directory structure, etc.
First, you will need to replace reckert2017
with your
user name and reckert2017@fau.edu
with your email
throughout this code.
Download all necessary scripts and load modules for processing/analysis
ssh reckert2017@koko-login.hpc.fau.edu
or you can add to ~/.bashrc to load at login (use
nano .bashrc
)
module load angsd-0.933-gcc-9.2.0-65d64pp
module load bayescan-2.1-gcc-8.3.0-7gakqmd
module load qt-5.15.2-gcc-9.2.0-zi7wcem BayeScEnv/1.1
module load bcftools-1.9-gcc-8.3.0-il4d373
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
module load cdhit-4.8.1-gcc-8.3.0-bcay75d
module load htslib-1.9-gcc-8.3.0-jn7ehrc
module load kraken2-2.1.1-gcc-9.2.0-ocivj3u
module load python-3.7.4-gcc-8.3.0-3tniqr5
module load launcher
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
module load ncbi-toolkit-22_0_0-gcc-9.2.0-jjhd2wa
module load ngsadmix-32-gcc-8.3.0-qbnwmpq
module load ngsRelate/v2
module load R/3.6.1
module load samtools-1.10-gcc-8.3.0-khgksad
module load vcftools-0.1.14-gcc-8.3.0-safy5vc
Put scripts needed into ~/bin or similar directory that is mapped in
.bashrc path IF you don’t have a bin
directory in your
$HOME
you can make one now (mkdir ~/bin
)
cd ~/bin
svn checkout https://github.com/RyanEckert/Stephanocoenia_FKNMS_PopGen/trunk/scripts/
mv scripts/* .
rm scripts
wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.0.7.1.zip
unzip PGDSpider_2.0.7.1.zip
rm PGDSpider_2.0.7.1.zip
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
conda activate 2bRAD
pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .
cd ~/bin
git clone https://bitbucket.org/simongravel/moments.git
cd moments
conda activate 2bRAD
conda install -c bioconda moments
pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .
svn checkout https://github.com/xiaoming-liu/stairway-plot-v2.git
mv stairway-plot-v2.git/trunk/stairway_plot_v2.1.1.zip .
unzip stairway_plot_v2.1.1.zip
rm -r stairway_plot_v2.1.1.zip stairway-plot-v2.git
Make all scripts executable
chmod +x *.sh *.pl *.py
IF not already, it is useful to add ~/bin
to your
$PATH
This way you can easily access your executable
scripts without specifying the absolute path to them.
Otherwise
skip to “Build working directory”
PATH="$HOME/bin:$PATH";
To permanently add this to your $PATH
add to
.bashrc
use nano
text editor
nano ~/.bashrc
ADD the following text to file under PATHS section if in your
.bashrc: export PATH="$HOME/bin:$PATH";
exit nano with
ctrl+x
source ~/.bashrc
echo $PATH
cd
mkdir 2bRAD/sint/
mkdir 2bRAD/sint/fknms/
mkdir 2bRAD/sint/fknms/rawReads/
cd 2bRAD/sint/fknms/rawReads/
If you have not previously, download BaseSpaceCLI
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
chmod +x ~/bin/bs
Go to the website and confirm authorization by logging in to your basespace acct.
bs auth
Making a script to download the reads and merge samples across 2 NovaSeq lanes
echo '#!/bin/bash' > downloadReads.sh
echo 'bs download project --concurrency=high -q -n JA21001 -o .' >> downloadReads.sh
# -n is the project name and -o is the output directory
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'rmdir SA*' >>downloadReads.sh
echo 'mkdir ../concatReads' >> downloadReads.sh
echo 'cp *.gz ../concatReads' >> downloadReads.sh
echo 'cd ../concatReads' >> downloadReads.sh
echo 'mergeReads.sh -o mergeTemp' >> downloadReads.sh
# -o is the directory to put output files in
echo 'rm *L00*' >> downloadReads.sh
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'gunzip *.gz' >> downloadReads.sh
echo 'rmdir mergeTemp' >> downloadReads.sh
chmod +x downloadReads.sh
launcher_creator.py -b 'srun downloadReads.sh' -n downloadReads -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch downloadReads.slurm
How many reads before filtering?
echo '#!/bin/bash' >rawReads
echo readCounts.sh -e .fastq -o sintRaw >>rawReads
sbatch -o rawReads.o%j -e rawReads.e%j rawReads --mail-type=ALL --mail-user=reckert2017@fau.edu
cd ../concatReads
2bRAD_trim_launch_dedup.pl fastq > trims.sh
launcher_creator.py -j trims.sh -n trims -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB trims.slurm
Check that we have the correct number of trim files (226 in this case)
ls -l *.tr0 | wc -l
mkdir ../trimmedReads
srun mv *.tr0 ../trimmedReads &
zipper.py -f fastq -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm
cd ../trimmedReads
Rename sequence files using sampleRename.py This script needs a .csv with XXX Make sure you use the reverse complement of your inline BCs!
srun sampleRename.py -i sampleList -n fk_ -f tr0
I can’t get KoKo’s module for cutadapt to work, so we’ll do it in miniconda. Run below if you don’t have a conda env. set up, otherwise you can skip to the next chunk
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
conda config --add channels defaults
conda config --add channels bioconda
conda create -n 2bRAD cutadapt
Removing reads with qualities at ends less than Q15 for de novo analysis
source activate 2bRAD
echo '#!/bin/bash' > trimse.sh
echo 'module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj' >> trimse.sh
echo 'source activate 2bRAD' >> trimse.sh
for file in *.tr0; do
echo "cutadapt -q 15,15 -m 36 -o ${file/.tr0/}.trim $file > ${file/.tr0/}.trimlog.txt" >> trimse.sh;
done
I can’t get it to run through launcher so just run it serially, it takes a while to run, so consider breaking up in several jobs.
sbatch -o trimse.o%j -e trimse.e%j --mem=200GB trimse.sh
Do we have expected number of *.trim files created?
conda deactivate
ls -l *.trim | wc -l
How many reads in each sample?
echo '#!/bin/bash' >sintReads
echo readCounts.sh -e trim -o sintFilt >>sintReads
sbatch --mem=200GB sintReads
mkdir ../filteredReads
mv *.trim ../filteredReads
zipper.py -f tr0 -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm
cat sintFiltReadCounts
Construct denovo reference for aligning reads
If you’ve not already, build a bt reference for the concatenated zoox genomes, otherwise skip ahead to the next chunk
mkdir ~/bin/symGenomes
cd ~/bin/symGenomes
echo "bowtie2-build symbConcatGenome.fasta symbConcatGenome" > bowtie2-build
launcher_creator.py -j bowtie2-build -n bowtie2-build -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
sbatch --mem=200GB bowtie2-build.slurm
module load samtools-1.10-gcc-8.3.0-khgksad
srun samtools faidx symbConcatGenome.fasta
Mapping reads to concatenated Symbiodinaceae genome
cd ~/2bRAD/past/sefl/filteredReads
mkdir symbionts
SYMGENOME=~/bin/symGenomes/symbConcatGenome
2bRAD_bowtie2_launcher.py -g $SYMGENOME -f .trim -n zooxMaps --split -u un -a zoox --aldir symbionts --launcher -e reckert2017@fau.edu
sbatch zooxMaps.slurm
Checking Symbiodiniaceae read mapping rates
>zooxAlignmentRates
for F in `ls *trim`; do
M=`grep -E '^[ATGCN]+$' $F | wc -l | grep -f - zooxMaps.e* -A 4 | tail -1 | perl -pe 's/zooxMaps\.e\d+-|% overall alignment rate//g'` ;
echo "$F.sam $M">>zooxAlignmentRates;
done
# ls *trim | cut -d '.' -f 1 >align1
# grep "% overall" zooxMaps.e* | cut -d ' ' -f 1 >align2
# paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >zooxAlignmentRates
# rm align1 align2
less zooxAlignmentRates
Clean up directory
zipper.py -f .trim -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm
mv *.sam symbionts/
cd ~/2bRAD/sint/fknms/
mv filteredReads/symbionts .
‘stacking’ individual trimmed fastq reads:
cd filteredReads
ls *.trim.un | perl -pe 's/^(.+)$/uniquerOne.pl $1 >$1\.uni/' > unique
launcher_creator.py -j unique -n unique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB unique.slurm
Checking there is a .uni for all samples
ls -l *.uni | wc -l
We can remove contamination sequences from our denovo reference with
kraken
We need a Kraken database to search against. If you don’t already have one, you need to build one now. otherwise you can skip to running Kraken
We can download a compiled standard database:
mkdir ~/bin/krakenDB
cd ~/bin/krakenDB
srun wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20210127.tar.gz
echo tar -xvf k2_pluspf_20210127.tar.gz >tar
launcher_creator.py -j tar -n tar -q mediumq7 -t 24:00:00 -e reckert2017@fau.edu
sbatch tar.slurm
echo '#!/bin/bash' >krakendb.sh
echo kraken2-build --download-taxonomy --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library archaea --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library bacteria --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library viral --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library human --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library fungi --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library protozoa --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library UniVec_Core --threads 16 --db ~/bin/krakenDB >>krakendb.sh
sbatch --mem=200GB -p longq7 -e krakenDB.e%j -o krakenDB.o%j krakendb.sh
cd ~/bin/symGenomes
# Symbiodinium microadriaticum
sed '/>/ s/$/|kraken:taxid|2951/' Symbiodinium_microadriacticum_genome.scaffold.fasta >S_microadriacticum.fa
# Breviolum minutum
sed '/>/ s/$/|kraken:taxid|2499525/' Breviolum_minutum.v1.0.genome.fa >B_minutum.fa
# Cladocopium goreaui
sed '/>/ s/$/|kraken:taxid|2562237/' Cladocopium_goreaui_Genome.Scaffolds.fasta >C_goreaui.fa
# Durusdinium trenchii
sed '/>/ s/$/|kraken:taxid|1381693/' 102_symbd_genome_scaffold.fa >D_trenchii.fa
echo '#!/bin/bash' >kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/S_microadriacticum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/B_minutum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/C_goreaui.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/D_trenchii.fa --db ~/bin/krakenDB >>kdbAdd
sbatch --mem=200GB -o kdbAdd.o%j -e kdbAdd.e%j kdbAdd
echo '#!/bin/bash' >kdbBuild
echo kraken2-build --download-taxonomy --threads 16 --db /mnt/beegfs/home/reckert2017/bin/krakenDB >>kdbBuild
echo kraken2-build --build --db ~/bin/krakenDB >>kdbBuild
sbatch --mem=200GB -o kdbBuild.o%j -e kdbBuild.e%j kdbBuild
Remove potential contamination from reference
cd ~/2bRAD/sint/fknms/filteredReads
echo '#!/bin/bash' >krakenDB
echo kraken2 --db ~/bin/krakenDB cdh_alltags.fas --threads 16 --classified-out cdh_alltags.contam.fa --unclassified-out cdh_alltags.unclass.fa --report krakenDB.report --output krakenDB.out >>krakenDB
sbatch --mem=200GB -o krakenDB.o%j -e krakenDB.e%j krakenDB
Additionally, I use a kraken blastNT database to remove additional sequences not mapping to Cnidarian taxa.
echo '#!/bin/bash' >krakenNT
echo kraken2 --db ~/bin/krakenNT/nt_db cdh_alltags.unclass.fa --threads 16 --classified-out cdh_alltags.class.fa --unclassified-out sint_denovo.fa --report krakenNT.report --output krakenNT.out >>krakenNT
sbatch --mem=200GB -o krakenNT.o%j -e krakenNT.e%j krakenNT
extract_kraken_reads.py -s cdh_alltags.class.fa -k krakenNT.out -r krakenNT.report -t 6073 --include-children -o sint_denovo.fa --append
With 30 pseudo chromosomes from clean major allele tags
mkdir ../mappedReads
mv sint_denovo.fa ../mappedReads
cd ../mappedReads
concatFasta.pl fasta=sint_denovo.fa num=30
Format pseudo genome
GENOME_FASTA=sint_denovo_cc.fasta
echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build $GENOME_FASTA $GENOME_FASTA >>genomeBuild.sh
echo samtools faidx $GENOME_FASTA >>genomeBuild.sh
sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mem=200GB genomeBuild.sh
Mapping reads to reference and formatting bam files
Map reads to fake genome:
mv ../filteredReads/*.un .
mv ../filteredReads/symbionts .
GENOME_FASTA=sint_denovo_cc.fasta
# mapping with --local option, enables clipping of mismatching ends (guards against deletions near ends of RAD tags)
2bRAD_bowtie2_launcher.py -f un -g $GENOME_FASTA --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm
Do we have the right number of SAM files?
ls *.sam | wc -l
Checking alignment rates
ls *un | cut -d '.' -f 1 >align1
grep "% overall" maps.e* | cut -d ' ' -f 1 >align2
>alignmentRates
paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >alignmentRates
rm align1 align2
less alignmentRates
BAM files will be used for genotyping, population structure, etc.
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB s2b.slurm
Do we have enough BAM files?
ls *bam | wc -l # should be the same number as number of trim files
Clean up directory
zipper.py -a -9 -f sam --launcher -e reckert2017@fau.edu
sbatch zip.slurm
rm *.un
“FUZZY genotyping” with ANGSD - without calling actual genotypes but working with genotype likelihoods at each SNP. Optimal for low-coverage data (<10x).
mkdir ../ANGSD
cd ../ANGSD
mv ../mappedReads/*.bam* .
ls *bam >bamsClones
ANGSD
settings: -minMapQ 20
: only highly
unique mappings (prob of erroneous mapping =< 1%)
-baq 1
: realign around indels (not terribly relevant for
2bRAD reads mapped with –local option) -maxDepth
: highest
total depth (sum over all samples) to assess; set to 10x number of
samples -minInd
: the minimal number of individuals the site
must be genotyped in. Reset to 50% of total N at this stage.
export FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -maxDepth 2260 -minInd 113"
export TODO="-doQsDist 1 -doDepth 1 -doCounts 1 -dumpCounts 2"
echo '#!/bin/bash' >sintDD.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out dd >>sintDD.sh
sbatch --mem=200GB -o sintDD.o%j -e sintDD.e%j --mail-user=reckert2017@fau.edu --mail-type=ALL sintDD.sh
Summarizing results (using Misha Matz modified script by Matteo Fumagalli)
echo '#!/bin/bash' >RQC.sh
echo Rscript ~/bin/plotQC.R prefix=dd >>RQC.sh
echo gzip -9 dd.counts >>RQC.sh
sbatch -e RQC.e%j -o RQC.o%j --dependency=afterok:460550 --mem=200GB RQC.sh
Proportion of sites covered at >5X:
cat quality.txt
scp
dd.pdf to laptop to look at distribution of base
quality scores, fraction of sites in each sample passing coverage
thresholds and fraction of sites passing genotyping rates cutoffs. Use
these to guide choices of -minQ
, -minIndDepth
and -minInd
filters in subsequent ANGSD
runs
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 192 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > sintClones.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out sintClones >>sintClones.sh
sbatch --mem=200GB -o sintClones.o%j -e sintClones.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintClones.sh
Use ibs matrix to identify clones with hierachial clustering in
R
. scp
to local machine and run chunk below in
R
if (!require("pacman")) install.packages("pacman")
pacman::p_load("dendextend", "ggdendro", "tidyverse")
cloneBams = read.csv("../data/stephanocoeniaMetaData.csv") # list of bam files
cloneMa = as.matrix(read.table("../data/snps/clones/sintClones.ibsMat")) # reads in IBS matrix produced by ANGSD
dimnames(cloneMa) = list(cloneBams[,1],cloneBams[,1])
clonesHc = hclust(as.dist(cloneMa),"ave")
clonePops = cloneBams$region
cloneDepth = cloneBams$depthZone
cloneDend = cloneMa %>% as.dist() %>% hclust(.,"ave") %>% as.dendrogram()
cloneDData = cloneDend %>% dendro_data()
# Making the branches hang shorter so we can easily see clonal groups
cloneDData$segments$yend2 = cloneDData$segments$yend
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend2[i] == 0) {
cloneDData$segments$yend2[i] = (cloneDData$segments$y[i] - 0.01)}}
cloneDendPoints = cloneDData$labels
cloneDendPoints$pop = clonePops[order.dendrogram(cloneDend)]
cloneDendPoints$depth=cloneDepth[order.dendrogram(cloneDend)]
rownames(cloneDendPoints) = cloneDendPoints$label
# Making points at the leaves to place symbols for populations
point = as.vector(NA)
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend[i] == 0) {
point[i] = cloneDData$segments$y[i] - 0.01
} else {
point[i] = NA}}
cloneDendPoints$y = point[!is.na(point)]
techReps = c("S066.1", "S066.2", "S066.3", "S162.1", "S162.2", "S162.3", "S205.1", "S205.2", "S205.3")
cloneDendPoints$depth = factor(cloneDendPoints$depth,levels(cloneDendPoints$depth)[c(2,1)])
cloneDendPoints$pop = factor(cloneDendPoints$pop,levels(cloneDendPoints$pop)[c(4, 1, 3, 2)])
flPal = paletteer_d("vapoRwave::jazzCup")[c(2:5)]
cloneDendA = ggplot() +
geom_segment(data = segment(cloneDData), aes(x = x, y = y, xend = xend, yend = yend2), size = 0.5) +
geom_point(data = cloneDendPoints, aes(x = x, y = y, fill = pop, shape = depth), size = 4, stroke = 0.25) +
#scale_fill_brewer(palette = "Dark2", name = "Population") +
scale_fill_manual(values = flPal, name= "Population")+
scale_shape_manual(values = c(24, 25), name = "Depth Zone")+
geom_hline(yintercept = 0.12, color = "red", lty = 5, size = 0.75) + # creating a dashed line to indicate a clonal distance threshold
geom_text(data = subset(cloneDendPoints, subset = label %in% techReps), aes(x = x, y = (y - .015), label = label), angle = 90) + # spacing technical replicates further from leaf
geom_text(data = subset(cloneDendPoints, subset = !label %in% techReps), aes(x = x, y = (y - .010), label = label), angle = 90) +
labs(y = "Genetic distance (1 - IBS)") +
guides(fill = guide_legend(override.aes = list(shape = 22)))+
theme_classic()
cloneDend = cloneDendA + theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 12, color = "black", angle = 90),
axis.text.y = element_text(size = 10, color = "black"),
axis.line.y = element_line(),
axis.ticks.y = element_line(),
panel.grid = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
legend.key = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
legend.position = "bottom")
cloneDend
ggsave("../figures/cloneDend.png", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
ggsave("../figures/cloneDend.eps", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
mkdir clones
mv sintClones* clones
ls *.bam > bamsNoClones
cat bamsClones | grep -v 'fk_S066.1.trim.un.bt2.bam\|fk_S066.3.trim.un.bt2.bam\|fk_S162.1.trim.un.bt2.bam\|fk_S162.3.trim.un.bt2.bam\|fk_S205.1.trim.un.bt2.bam\|fk_S205.3.trim.un.bt2.bam' >bamsNoClones
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 187 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > sintNoClones.sh
echo srun angsd -b bamsNoClones -GL 1 $FILTERS $TODO -P 1 -out sintNoClones >> sintNoClones.sh
sbatch --mem=200GB -o sintNoClones.o%j -e sintNoClones.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintNoClones.sh
How many SNPs?
grep "filtering:" sintNoClones.e*
24,670 SNPs
cd ~/2bRAD/sint/fknms/
mv mappedReads/symbionts .
cd symbionts
SYMGENOME=~/bin/symGenomes/symbConcatGenome
2bRAD_bowtie2_launcher.py -f zoox -g $SYMGENOME --launcher -e reckert2017@fau.edu
sbatch --mem=100GB maps.slurm
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -t 6:00:00 -N 5 -e reckert2017@fau.edu -q shortq7
sbatch s2b.slurm
Count reads mapping to each chromosome for concatenated Symbiodiniaceae genome
>zooxReads
for i in *.bam; do
echo $i >>zooxReads;
samtools idxstats $i | cut -f 1,3 >>zooxReads;
done
cd ~2bRAD/sint/fknms/
mkdir bayescan
cd bayescan
srun cp ../ANGSD/sintNoClones.vcf.gz . #Note do not use the renamed version, pgdspider has a tough time parsing the samples into pops when you do
srun gunzip sintNoClones.vcf.gz
You will have to create or scp
a textfile with your .bam
file names and populations
Convert vcf with PGDSpider
echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBsPops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid
java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintNoClones.bcf -outputfile fkSint.bayescan -spid vcf2bayescan.spid
Launch BayeScan
(Takes 12+ hr depending on read
count/SNPs)
echo '#!/bin/bash' > sintBayescan.sh
echo bayescan fkSint.bayescan -threads=100 >> sintBayescan.sh
sbatch -e sintBayescan.e%j -o sintBayescan.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayescan.sh
Extract outlier SNPs from BayeScan
output
srun removeBayescanOutliers.pl bayescan=fkSint.baye_fst.txt vcf=sintNoClones.bcf FDR=0.05 mode=extract > fkSintVcfOutliers.vcf
Using BayeScEnv we can look for outliers realted to depth
echo "32.0 21.1 33.2 26.5 32.8 18.0 23.6 43.9" > depth.txt
cp fkSint.bayescan fkSint.bayeScEnv
echo '#!/bin/bash' > sintBayeScEnv.sh
echo bayescenv fkSint.bayeScEnv -env depth.txt >> sintBayeScEnv.sh
sbatch -e sintBayeScEnv.e%j -o sintBayeScEnv.o%j -p longq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayeScEnv.sh
Extract outlier SNPs from BayeScEnv
output
srun removeBayescanOutliers.pl bayescan=fkSint.bayeS_fst.txt vcf=sintNoClones.bcf FDR=0.05 mode=extract > fkSintBScEnvVcfOutliers.vcf
Using pcangsd
to discern population structure
cd ~/2bRAD/sint/fknms
mkdir pcangsd
cd pcangsd
cp ../ANGSD/sintNoClones.beagle .
gzip sintNoClones.beagle
source activate 2bRAD
echo '#!/bin/bash' > pcangsd.sh
echo 'pcangsd -b sintNoClones.beagle.gz -o fkSintPcangsd --admix --inbreedSamples --pcadapt --selection' >> pcangsd
launcher_creator.py -j pcangsd -n pcangsd -q shortq7 -t 06:00:00 -e $EMAIL -w 1 -N 1
sbatch pcangsd.slurm
Calculate population structure from genotype likelihoods using
NGSadmix
for K from 2 to 11 : FIRST remove all
clones/genotyping replicates! (we did this).
mkdir ../ngsAdmix
cp *beagle* ../ngsAdmix
zipper.py -f bam -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm
cd ../ngsAdmix
Create a file with 50 replicate simulations for each value of K 1-11 (num pops + 3)
ngsAdmixLauncher.py -f sintNoClones.beagle.gz --maxK 11 -r 50 -n fkSint --launcher -e reckert2017@fau.edu
sbatch --mem=200GB fkSintNgsAdmix.slurm
Next, take the likelihood value from each run of NGSadmix and put them into a file that can be used with Clumpak to calculate the most likely K using the methods of Evanno et al. (2005).
>fkSintNgsAdmixLogfile
for log in fkSint*.log; do
grep -Po 'like=\K[^ ]+' $log >> fkSintNgsAdmixLogfile;
done
Format for CLUMPAK in R
R
You are now using R in the terminal
logs <- as.data.frame(read.table("fkSintNgsAdmixLogfile"))
#output is organized with 10, 11 preceding 1, 2, 3 etc.
logs$K <- c(rep("10", 50), rep("11", 50), rep("1", 50), rep("2", 50), rep("3", 50),
rep("4", 50), rep("5", 50), rep("6", 50),
rep("7", 50), rep("8", 50), rep("9", 50))
write.table(logs[, c(2, 1)], "fkSintNgsAdmixLogfile_formatted", row.names = F,
col.names = F, quote = F)
quit()
# No need to save workspace image [press 'n']
n
Check that your formatted logfile has the appropriate number of entries
cat fkSintNgsAdmixLogfile_formatted | wc -l
make copies of .qopt files to run structure selector on (.Q files)
for file in fkSint*.qopt; do
filename=$(basename -- "$file" .qopt);
cp "$file" "$filename".Q;
done
mkdir fkSintQ
mv fkSint*Q fkSintQ
zip -r fkSintQ.zip fkSintQ
scp
.zip and formatted logfile to local machine and
upload to CLUMPAK
(http://clumpak.tau.ac.il/bestK.html) and structure
selector (https://lmme.ac.cn/StructureSelector/index.html)
scp
sintNoClones* to local machine for further analyses
with R
Since we have 4 distinct lineages, we will pare down SNPs to only those found within all lineages. This helps avoid issues of ascertainment bias.
cd ~/2bRAD/sint/fknms/ANGSD
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doGeno 8 -doPost 1 -doGlf 2"
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 104 -out sintBlueSnps
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 32 -out sintTealSnps
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 24 -out sintGreenSnps
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 12 -out sintYellowSnps" > kSnps
launcher_creator.py -j kSnps -n kSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch kSnps.slurm
Filtering down the sites to those found within each lineage with the given ANGSD filtering criteria
for file in sint*Snps.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p shortq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done
srun awk '(++c[$0])==(ARGC-1)' *Snpssites > sitesToDo
mkdir filteredSNPS
mv sint*Snps* filteredSNPS/
mv kSnps* filteredSNPS/
Now index the sites file and run angsd on all samples with only the
reduced number of sites using -sites
flag
angsd sites index sitesToDo
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo "angsd -b bamsNoClones -sites sitesToDo -GL 1 -P 1 $FILTERS $TODO -minInd 176 -out sintFiltSnps" > sintFiltSnps
launcher_creator.py -j sintFiltSnps -n sintFiltSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch sintFiltSnps.slurm
mv sintFilt* filteredSNPS/
cd ~/2bRAD/sint/fknms/ANGSD
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 3"
echo '#!/bin/bash' > sintFiltRelate.sh
echo srun angsd -b bamsNoClones -GL 1 -sites sitesToDo $FILTERS $TODO -P 1 -minInd 176 -out sintFiltRelate >> sintFiltRelate.sh
sbatch --mem=200GB -o sintFiltRelate.o%j -e sintFiltRelate.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintFiltRelate.sh
cd ngsRelate
mv ~/2bRAD/sint/fknms/ANGSD/*Relate* .
zcat sintFiltRelate.mafs.gz | cut -f5 |sed 1d >freq
echo '#!/bin/bash' > ngsFiltRelate.sh
echo ngsRelate -g sintFiltRelate.glf.gz -n 220 -f freq -O filtRelate >> ngsFiltRelate.sh
sbatch -e ngsFiltRelate.e%j -o ngsFiltRelate.o%j --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ngsFiltRelate.sh
Running SFS calculations within lineage to use for Fst and to run
StairwayPlot
cd ~/2bRAD/sint/fknms/ANGSD
## scp bamsClusters text file to directory
awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters >> blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters >> tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters >> greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters >> yellowBams
## No filters to distort allelic frequencies
FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -maxHetFreq 0.5"
TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11 -doGlf 2"
echo "angsd -b blueBams -GL 1 $FILTERS $TODO -P 1 -minInd 104 -out sintBlueSFS
angsd -b tealBams -GL 1 $FILTERS $TODO -P 1 -minInd 32 -out sintTealSFS
angsd -b greenBams -GL 1 $FILTERS $TODO -P 1 -minInd 24 -out sintGreenSFS
angsd -b yellowBams -GL 1 $FILTERS $TODO -P 1 -minInd 12 -out sintYellowSFS" > sfsClusters
launcher_creator.py -j sfsClusters -n sfsClusters -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfsClusters.slurm
for file in sint*SFS.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p longq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done
Now, compile common sites from all lineages using awk
srun awk '(++c[$0])==(ARGC-1)' *SFSsites > sfsSitesToDo
cat sfsSitesToDo | wc -l
#index sites for ANGSD and re-run ANGSD using ```-sites```
angsd sites index sfsSitesToDo
export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta
TODO="-doSaf 1 -ref $GENOME_FASTA -anc $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sfsSitesToDo -b blueBams -GL 1 -P 1 $TODO -out sintBlue
angsd -sites sfsSitesToDo -b tealBams -GL 1 -P 1 $TODO -out sintTeal
angsd -sites sfsSitesToDo -b greenBams -GL 1 -P 1 $TODO -out sintGreen
angsd -sites sfsSitesToDo -b yellowBams -GL 1 -P 1 $TODO -out sintYellow" >sfs
launcher_creator.py -j sfs -n sfs -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs.slurm
Once all jobs run:
mkdir ../SFS
mv sfs* ../SFS
mv sintBlue* ../SFS
mv sintTeal* ../SFS
mv sintGreen* ../SFS
mv sintYellow* ../SFS
cd ../SFS
###-- 1d-SFS
echo "realSFS sintBlue.saf.idx >sintBlue.sfs
realSFS sintTeal.saf.idx >sintTeal.sfs
realSFS sintGreen.saf.idx >sintGreen.sfs
realSFS sintYellow.saf.idx >sintYellow.sfs" >realSFS
launcher_creator.py -j realSFS -n realSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 4 -N 1
sbatch realSFS.slurm
###-- 2d-SFS
echo "realSFS sintBlue.saf.idx sintTeal.saf.idx -P 20 > pBlTl.sfs ; realSFS fst index sintBlue.saf.idx sintTeal.saf.idx -sfs pBlTl.sfs -fstout pBlTl
realSFS sintBlue.saf.idx sintGreen.saf.idx -P 20 > pBlGn.sfs ; realSFS fst index sintBlue.saf.idx sintGreen.saf.idx -sfs pBlGn.sfs -fstout pBlGn
realSFS sintBlue.saf.idx sintYellow.saf.idx -P 20 > pBlYl.sfs ; realSFS fst index sintBlue.saf.idx sintYellow.saf.idx -sfs pBlYl.sfs -fstout pBlYl
realSFS sintTeal.saf.idx sintGreen.saf.idx -P 20 > pTlGn.sfs ; realSFS fst index sintTeal.saf.idx sintGreen.saf.idx -sfs pTlGn.sfs -fstout pTlGn
realSFS sintTeal.saf.idx sintYellow.saf.idx -P 20 > pTlYl.sfs ; realSFS fst index sintTeal.saf.idx sintYellow.saf.idx -sfs pTlYl.sfs -fstout pTlYl
realSFS sintGreen.saf.idx sintYellow.saf.idx -P 20 > pGnYl.sfs ; realSFS fst index sintGreen.saf.idx sintYellow.saf.idx -sfs pGnYl.sfs -fstout pGnYl" >2dSFS
launcher_creator.py -j 2dSFS -n 2dSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 20 -N 1
sbatch 2dSFS.slurm
realSFS fst stats pBlYl.fst.idx
realSFS fst stats pTlYl.fst.idx
realSFS fst stats pGnYl.fst.idx
realSFS fst stats pBlTl.fst.idx
realSFS fst stats pBlGn.fst.idx
realSFS fst stats pTlGn.fst.idx
launcher_creator.py -j sintKFst -n sintKFst -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintKFst.slurm
First, we filter Fst outliers from sites:
FILTERS="-minMapQ 20 -minQ 25 -minInd 88 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05"
-sites sitesToDo -GL 1 -P 1 -uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 25 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2 -minInd 187 -out sintFiltSnps
TODO="-doMajorMinor 4 -ref $GENOME_FASTA -doMaf 1 -dosnpstat 1 -doPost 2 -doBcf 1 --ignore-RG 0 -doGeno 11 -doCounts 1"
mkdir ~/2bRAD/sint/fknms/filteredSNPS/bayescan
cd ~/2bRAD/sint/fknms/filteredSNPS/bayescan
Create a file called vcf2bayescan.spid containing this text:
echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bspops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid
Need a file “bspops” that has samples in .bams order with the assigned populations/lineages
cp ../../ANGSD/bamsClusters ./bspops.txt
cp ../ANGSD/sintFiltSnps.bcf .
srun java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnps.bcf -outputfile sintFilt.bayescan -spid vcf2bayescan.spid
echo "bayescan sintFilt.bayescan -threads=100" >bayeScan
launcher_creator.py -j bayeScan -n bayeScan -q mediumq7 -t 06:00:00 -e $EMAIL -N 5
sbatch bayeScan.slurm
cut -d" " -f2- sintFilt.baye_fst.txt > sintFst
## removing all the .bcf data before snps
tail --lines=+70 sintFiltSnps.bcf | cut -f 1,2 | paste --delimiters "\t" - sintFst > sint.baye_fst_pos.txt
Identify sites with q-value < 0.5
cp ../../ANGSD/sfsSitesToDo .
awk '$5<0.5 {print $1"\t"$2}' sint.baye_fst_pos.txt > ss_bayeOuts
grep -Fvxf ss_bayeOuts sfsSitesToDo > sitesToDo_filtBaye
cp sitesToDo_filtBaye ../../ANGSD
cd ../../ANGSD
angsd sites index sitesToDo_filtBaye
export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta
TODO="-doSaf 1 -anc $GENOME_FASTA -ref $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sitesToDo_filtBaye -b blueBams -GL 1 -P 1 $TODO -out sintBlueSFSbaye
angsd -sites sitesToDo_filtBaye -b tealBams -GL 1 -P 1 $TODO -out sintTealSFSbaye
angsd -sites sitesToDo_filtBaye -b greenBams -GL 1 -P 1 $TODO -out sintGreenSFSbaye
angsd -sites sitesToDo_filtBaye -b yellowBams -GL 1 -P 1 $TODO -out sintYellowSFSbaye" >sfs_filtBaye
launcher_creator.py -j sfs_filtBaye -n sfs_filtBaye -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs_filtBaye.slurm
mv *SFS* ../SFS
mv *sfs* ../SFS
Generating per-population SFS
cd ../SFS
echo "realSFS sintBlueSFSbaye.saf.idx -fold 1 >sintFiltBlue.sfs
realSFS sintTealSFSbaye.saf.idx -fold 1 >sintFiltTeal.sfs
realSFS sintGreenSFSbaye.saf.idx -fold 1 >sintFiltGreen.sfs
realSFS sintYellowSFSbaye.saf.idx -fold 1 >sintFiltYellow.sfs" >sintFiltSFS
launcher_creator.py -j sintFiltSFS -n sintFiltSFS -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintFiltSFS.slurm
Making blueprint files for StairwayPlot
mkdir swPlot
cd swPlot
# SWplot only wants the mutations, so have to omit first number in SFS file (0 mutations).
# L = cat sitesToDo_filtBaye | wc -l
# which in our case =1830041
echo "#input setting
popid: sintBlue # id of the population (no white space)
nseq: 262 # number of sequences (2*n)
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whethr the SFS is folded (true or false)
SFS: 54968.740090 18444.475031 9443.254125 6460.637387 4497.061230 3519.062453 2895.388432 2422.638318 1956.686134 1717.620993 1378.797384 1416.463385 1227.364278 1021.759822 1073.007228 873.080148 845.752484 800.981800 595.181349 794.836408 571.203766 628.706337 527.980210 429.664844 721.559927 315.050869 481.708978 307.300287 394.221545 394.074502 466.456788 274.440765 363.386447 273.479395 296.832588 320.575804 315.126815 252.266556 213.566298 275.958738 257.535768 185.436902 340.324052 175.413491 179.986660 138.658719 287.299094 47.316965 238.946119 266.961035 111.840623 223.114384 45.279077 165.697377 296.385206 83.585503 214.297286 125.366633 111.661200 219.281119 139.052795 160.386831 131.614256 76.786039 157.239727 159.636010 60.218127 143.194085 99.001725 148.481068 84.514873 126.043678 104.077862 96.209692 141.031878 149.956901 101.599229 56.392862 101.079956 80.280370 135.261593 40.235595 27.424630 121.969309 189.776245 74.226144 50.242446 84.001571 107.542195 31.137798 29.916997 67.201712 131.755256 60.641361 64.148078 116.233605 108.268881 36.805898 51.451547 51.855920 67.043933 99.173703 25.902900 28.275804 101.866575 47.917146 36.437218 48.157593 146.198005 27.417829 34.145564 72.762140 95.264702 6.290856 58.686022 89.500602 76.097667 40.816598 63.178597 39.973295 69.971593 108.340421 51.045114 61.909694 103.159211 33.569557 142.186072 6.764939 112.449083 86.473221 155.056690
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 131 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 65 130 195 260 # number of random break points for each try (separated by white space)
project_dir: sintBlue # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintBlue # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintBlue.blueprint
echo "#input setting
popid: sintTeal # id of the population (no white space)
nseq: 80 # number of sequences ->
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whethr the SFS is folded (true or false)
SFS: 19027.916655 8774.712038 5079.474815 3679.957335 2677.663655 2157.865160 1778.801580 1440.323144 1319.701029 998.536711 934.144099 871.915368 812.451003 750.938932 623.670542 540.641827 671.476401 526.080209 501.266127 456.155482 447.647628 330.610513 426.093935 326.858819 276.638676 294.294057 347.649095 269.002785 189.682393 303.585744 226.318022 274.625068 201.149769 196.103761 173.848938 191.193373 267.200748 249.900867 248.353637 325.131914
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 80 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 19 39 58 78 # number of random break points for each try (separated by white space)
project_dir: sintTeal # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintTeal # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintTeal.blueprint
echo "#input setting
popid: sintGreen # id of the population (no white space)
nseq: 62 # number of sequences
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whethr the SFS is folded (true or false)
SFS: 18387.776643 7847.061378 4651.990927 3141.107325 2303.550391 1773.451011 1508.736063 1285.307124 1143.097857 949.314445 859.933600 748.766734 719.856903 655.836831 564.083017 512.608619 517.212306 456.491290 461.208427 387.023802 341.543152 363.322278 311.201884 282.317211 353.999771 232.677877 288.694402 321.652486 380.855557 268.842073 519.097976
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 80 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 15 30 45 60 # number of random break points for each try (separated by white space)
project_dir: sintGreen # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintGreen # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintGreen.blueprint
echo "#input setting
popid: sintYellow # id of the population (no white space)
nseq: 30 # number of sequences
L: 1830041 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whethr the SFS is folded (true or false)
SFS: 16199.629544 7221.873982 4153.983827 2802.707609 2381.014773 1825.956297 1494.476878 1079.971463 1099.834349 768.376639 713.072962 618.778692 676.597541 626.859962 790.416252
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 80 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 7 14 21 28 # number of random break points for each try (separated by white space)
project_dir: sintYellow # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintYellow # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintYellow.blueprint
Create batch files and launch:
for blueprint in *blueprint; do
java -cp $HOME/bin/stairway_plot/stairway_plot_es Stairbuilder $blueprint;
done
# split each .sh script and launch all (second job waiting with dependency on first finishing)
for file in *.sh; do
grep -B 805 "# Step 2: determine number of break points" $file > ${file%.*}1;
grep -A 805 "# Step 2: determine number of break points" $file > ${file%.*}2;
echo '#!/bin/bash' | cat - ${file%.*}2 >temp;
mv temp ${file%.*}2;
launcher_creator.py -j ${file%.*}1 -n ${file%.*}1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch ${file%.*}1.slurm) && sbatch --dependency=afterok:${DEP##* } -e ${file%.*}2.e%j -o ${file%.*}2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%.*}2;
done
launcher_creator.py -j sintYellow.blueprint1 -n sintYellow.blueprint1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch sintYellow.blueprint1.slurm) && sbatch --dependency=afterok:${DEP##* } -e sintYellow.blueprint2.e%j -o sintYellow.blueprint2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL sintYellow.blueprint2;
After all have run:
mv sint*/*final.summary .
Calculating Heterozygosity across all loci (variant//invariant) using
ANGSD
and R
script from Misha Matz (https://github.com/z0on/2bRAD_denovo)
cd ~/2bRAD/sint/fknms/ANGSD
cp ../SFS/sfsSitesToDo .
angsd sites index sfsSitesToDo
export GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta
FILTERS="-maxHetFreq 0.5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 25 -minQ 30 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 187"
TODO="-ref $GENOME_FASTA -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 32 -doPost 1 -doGlf 2 -doCounts 1 -doMajorMinor 1 -dosnpstat 1 -doMaf 1"
echo "angsd -sites sfsSitesToDo -b bamsNoClones -GL 1 -P 1 $TODO $FILTERS -out sintHet" >angsdHet
launcher_creator.py -j angsdHet -n angsdHet -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch angsdHet.slurm
mkdir ../heterozygosity
cp sintHet* ../heterozygosity/
cd ../heterozygosity
echo heterozygosity_beagle.R sintHet.beagle.gz >filtHet
launcher_creator.py -j filtHet -n filtHet -q mediumq7 -t 24:00:00 -e $EMAIL -N 1
sbatch filtHet.slurm
tail -n 220 filtHet.e* > sintHet
cd ~/2bRAD/sint/fknms/
mkdir theta
cd ANGSD
GENOME_FASTA=~/2bRAD/sint/fknms/mappedReads/sint_denovo_cc.fasta
>kThetas
for POP in yellow green teal blue; do
echo "angsd -b ${POP}Bams -GL 1 -P 1 -sites sitesToDo_filtBaye -anc $GENOME_FASTA -doSaf 1 -out ${POP}Out &&\
realSFS ${POP}Out.saf.idx -fold 1 > ${POP}Out.sfs &&\
realSFS saf2theta ${POP}Out.saf.idx -sfs ${POP}Out.sfs -outname ${POP} -fold 1 &&\
thetaStat do_stat ${POP}.thetas.idx" >> kThetas
done
launcher_creator.py -j kThetas -n kThetas -q shortq7 -t 6:00:00 -e $EMAIL -w 4 -N 1
sbatch kThetas.slurm
mv *theta* ../theta
To use BayesAss3 (XXX) we first need to convert our ANGSD output into genotype format
cd~/2bRAD/sint/fknms
mkdir bayesAss
cd bayesAss
cp ../ANGSD/sintNoClones.bcf .
cp ../bayescan/fkSintBsPops.txt .
echo "# VCF Parser questions
PARSER_FORMAT=VCF
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBsPops.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true
# Immanc (BayesAss) Writer questions
WRITER_FORMAT=IMMANC
# Specify the locus/locus combination you want to write to the Immanc (BayesAss) file:
IMMANC_WRITER_LOCUS_COMBINATION_QUESTION=
# Specify which data type should be included in the Immanc (BayesAss)) file (Immanc (BayesAss) can only analyze one data type per file):
IMMANC_WRITER_DATA_TYPE_QUESTION=SNP" >fkSintBA.spid
module load pgdspider-2.1.1.2-gcc-9.2.0-ghxvd4c
pgdSpider=/opt/ohpc/pub/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/pgdspider-2.1.1.2-ghxvd4c4ieqngkbutakc7x6j4pfkqm5e/bin/PGDSpider2-cli.jar
echo '#!/bin/bash' > pgdSpider.sh
echo "java -Xmx1024m -Xms512m -jar $pgdSpider -inputformat VCF -outputformat IMMANC vcf -inputfile sintNoClones.bcf -outputfile fkSintBayesAss.txt -spid fkSintBA.spid" >>pgdSpider.sh
sbatch -e pgdSpider.e%j -o pgdSpider.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL pgdSpider.sh
## Default params:
#MCMC reps: 1,000,000
#burn in: 100,000
#sampling freq: 100
#delta migration (1): 0.1
#delta allele freq (3): 0.1
#delta inbreeding (4): 0.1
rm sintNoClones.bcf
rm fkSintBsPops.txt
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
# Run a test with verbose [-v] output to see the acceptance rates in the terminal (takes a few minutes to compute)
# check the output file [less BATest.o*] and kill the job once you get output with acceptance rates [scancel {yourJobID}]
# After ~10—15 min you should start seeing output in the BATest.o* file
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 100 fkSintBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logP(M): -638.54 logL(G): -3494125.80 logL: -3494764.33 % done: [0.00] % accepted: (0.41, 0.01, 0.80, 0.06, 0.76)
# we are looking for 20—60% acceptance, ideally somewhere near 20—30%
# relationships between mixing parameters and acceptance rates are inverse
# defaults are 0.1 (all parameters are scale 0—1)
# increase [-m] increase [-a] and decrease [-f]
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.04 fkSintBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# % accepted: (0.26, 0.02, 0.29, 0.15, 0.76)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.03 fkSintBayesAss.txt >> BATest
# % accepted: (0.26, 0.02, 0.29, 0.20, 0.76)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 5000000 -b 500000 -n 1000 -m 0.2 -a 0.6 -f 0.025 fkSintBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# % accepted: (0.25, 0.01, 0.28, 0.23, 0.76)
Make and launch 10 iterations of BayesAss, each in its own run directory so we can keep all trace files (saved as ‘BA3trace.txt’ and would overwrite if not in separate directories). We are using [-s $RANDOM] to use a random start seed for each independent run
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo BA3SNP -v -u -s $RANDOM -i 6500000 -b 3000000 -n 100 -m 0.2 -a 0.6 -f 0.025 -t -o fkSintBARun${i}Out.txt ../fkSintBayesAss.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
sbatch -e BayesAss$i.e%j -o BayesAss$i.o%j -p longq7 --mem=100GB --exclusive --mail-user reckert2017@fau.edu --mail-type=ALL BayesAss$i.sh
cd ..;
done
# after all runs complete copy files to main BayesAss directory
cd ~/2bRAD/sint/fknms/bayesAss
cp run*/*Out.txt .
for i in {01..10}; do
cp ../run$i/BA3trace.txt BA3trace$i.txt;
done