This walks through the steps to process 2bRAD reads without an existing genome.
Be sure to read through what you are doing and follow the instructions before copy/pasting code chunks.
Copy the code chunks into your terminal, taking care to change the necessary portions to fit your data, directory structure, etc.
First, you will need to replace reckert2017 with your user name and reckert2017@fau.edu with your email throughout this code.
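If you would rather script the swap, something like the sed below works once you have saved the chunks to files; the file glob is just an assumption, so point it at wherever you keep your job scripts (the email is replaced first so the username substitution does not clobber it):
sed -i 's/reckert2017@fau\.edu/yourNetID@fau.edu/g; s/reckert2017/yourNetID/g' ~/bin/*.sh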
Download all necessary scripts and load modules for processing/analysis
ssh reckert2017@koko-login.hpc.fau.edu
Or you can add these to ~/.bashrc to load at login (use nano .bashrc)
module load angsd-0.933-gcc-9.2.0-65d64pp
module load bayescan-2.1-gcc-8.3.0-7gakqmd
module load qt-5.15.2-gcc-9.2.0-zi7wcem BayeScEnv/1.1
module load bcftools-1.9-gcc-8.3.0-il4d373
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
module load cdhit-4.8.1-gcc-8.3.0-bcay75d
module load htslib-1.9-gcc-8.3.0-jn7ehrc
module load kraken2-2.1.1-gcc-9.2.0-ocivj3u
module load python-3.7.4-gcc-8.3.0-3tniqr5
module load launcher
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
module load ncbi-toolkit-22_0_0-gcc-9.2.0-jjhd2wa
module load ngsadmix-32-gcc-8.3.0-qbnwmpq
module load ngsRelate/v2
module load R/3.6.1
module load samtools-1.10-gcc-8.3.0-khgksad
module load vcftools-0.1.14-gcc-8.3.0-safy5vc
Put the needed scripts into ~/bin or a similar directory that is mapped in your .bashrc PATH. IF you don't have a bin directory in your $HOME, you can make one now (mkdir ~/bin)
cd ~/bin
svn checkout https://github.com/RyanEckert/Stephanocoenia_FKNMS_PopGen/trunk/scripts/
mv scripts/* .
rm -r scripts
wget http://www.cmpg.unibe.ch/software/PGDSpider/PGDSpider_2.0.7.1.zip
unzip PGDSpider_2.0.7.1.zip
rm PGDSpider_2.0.7.1.zip
git clone https://github.com/Rosemeis/pcangsd.git
cd pcangsd
conda activate 2bRAD
pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .
cd ~/bin
git clone https://bitbucket.org/simongravel/moments.git
cd moments
conda activate 2bRAD
conda install -c bioconda moments
pip install --user -r requirements.txt
python setup.py build_ext --inplace
pip3 install -e .
svn checkout https://github.com/xiaoming-liu/stairway-plot-v2.git
mv stairway-plot-v2.git/trunk/stairway_plot_v2.1.1.zip .
unzip stairway_plot_v2.1.1.zip
rm -r stairway_plot_v2.1.1.zip stairway-plot-v2.git
Make all scripts executable
chmod +x *.sh *.pl *.py
IF you have not already, it is useful to add ~/bin to your $PATH. This way you can easily access your executable scripts without specifying the absolute path to them. Otherwise skip to “Build working directory”
PATH="$HOME/bin:$PATH";
To permanently add this to your $PATH, add it to .bashrc using the nano text editor:
nano ~/.bashrc
ADD the following line to the file (under the PATHS section, if your .bashrc has one): export PATH="$HOME/bin:$PATH";
exit nano with ctrl+x
source ~/.bashrc
echo $PATH
cd
mkdir -p 2bRAD/fknms/sint/rawReads/
cd 2bRAD/fknms/sint/rawReads/
If you have not previously, download BaseSpaceCLI
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/bin/bs
chmod +x ~/bin/bs
bs auth
Follow the URL printed in the terminal and confirm authorization by logging in to your BaseSpace account.
Making a script to download the reads and merge samples across 2 NovaSeq lanes
echo '#!/bin/bash' > downloadReads.sh
echo 'bs download project --concurrency=high -q -n JA21001 -o .' >> downloadReads.sh
# -n is the project name and -o is the output directory
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'rmdir SA*' >>downloadReads.sh
echo 'mkdir ../concatReads' >> downloadReads.sh
echo 'cp *.gz ../concatReads' >> downloadReads.sh
echo 'cd ../concatReads' >> downloadReads.sh
echo 'mergeReads.sh -o mergeTemp' >> downloadReads.sh
# -o is the directory to put output files in
echo 'rm *L00*' >> downloadReads.sh
echo "find . -name '*.gz' -exec mv {} . \;" >> downloadReads.sh
echo 'gunzip *.gz' >> downloadReads.sh
echo 'rmdir mergeTemp' >> downloadReads.sh
chmod +x downloadReads.sh
launcher_creator.py -b 'srun downloadReads.sh' -n downloadReads -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch downloadReads.slurm
How many reads before filtering?
echo '#!/bin/bash' >rawReads
echo readCounts.sh -e .fastq -o sintRaw >>rawReads
sbatch -o rawReads.o%j -e rawReads.e%j --mail-type=ALL --mail-user=reckert2017@fau.edu rawReads
cd ../concatReads
2bRAD_trim_launch_dedup.pl fastq > trims.sh
launcher_creator.py -j trims.sh -n trims -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB trims.slurm
Check that we have the correct number of trim files (226 in this case)
ls -l *.tr0 | wc -l
mkdir ../trimmedReads
srun mv *.tr0 ../trimmedReads &
zipper.py -f fastq -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm
cd ../trimmedReads
Rename sequence files using sampleRename.py. This script needs a .csv with XXX. Make sure you use the reverse complement of your inline BCs!
srun sampleRename.py -i sampleList -n fk_ -f tr0
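If you need to double-check the reverse complement of a barcode, a quick one-liner (the barcode here is just an example):
echo "AGTC" | rev | tr ACGT TGCA
# GACT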
I can’t get KoKo’s module for cutadapt to work, so we’ll do it in miniconda. Run the chunk below if you don’t have a conda env set up; otherwise you can skip to the next chunk.
module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj
conda config --add channels defaults
conda config --add channels bioconda
conda create -n 2bRAD cutadapt
Trim read ends with quality below Q15 (discarding reads shorter than 36 bp) for de novo analysis
source activate 2bRAD
echo '#!/bin/bash' > trimse.sh
echo 'module load miniconda3-4.6.14-gcc-8.3.0-eenl5dj' >> trimse.sh
echo 'source activate 2bRAD' >> trimse.sh
for file in *.tr0; do
echo "cutadapt -q 15,15 -m 36 -o ${file/.tr0/}.trim $file > ${file/.tr0/}.trimlog.txt" >> trimse.sh;
done
I can’t get it to run through launcher, so just run it serially. It takes a while to run, so consider breaking it up into several jobs.
sbatch -o trimse.o%j -e trimse.e%j --mem=200GB trimse.sh
Do we have the expected number of *.trim files created?
conda deactivate
ls -l *.trim | wc -l
How many reads in each sample?
echo '#!/bin/bash' >sintReads
echo readCounts.sh -e trim -o sintFilt >>sintReads
sbatch --mem=200GB sintReads
mkdir ../filteredReads
mv *.trim ../filteredReads
zipper.py -f tr0 -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm
cat sintFiltReadCounts
Construct de novo reference for aligning reads
If you’ve not already, build a bowtie2 index for the concatenated zoox genomes; otherwise skip ahead to the next chunk
mkdir ~/bin/symGenomes
cd ~/bin/symGenomes
echo "bowtie2-build symbConcatGenome.fasta symbConcatGenome" > bowtie2-build
launcher_creator.py -j bowtie2-build -n bowtie2-build -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
module load bowtie2-2.3.5.1-gcc-8.3.0-63cvhw5
sbatch --mem=200GB bowtie2-build.slurm
module load samtools-1.10-gcc-8.3.0-khgksad
srun samtools faidx symbConcatGenome.fasta
Mapping reads to the concatenated Symbiodiniaceae genome
cd ~/2bRAD/fknms/sint/filteredReads
mkdir symbionts
SYMGENOME=~/bin/symGenomes/symbConcatGenome
2bRAD_bowtie2_launcher.py -g $SYMGENOME -f .trim -n zooxMaps --split -u un -a zoox --aldir symbionts --launcher -e reckert2017@fau.edu
sbatch zooxMaps.slurm
Checking Symbiodiniaceae read mapping rates
>zooxAlignmentRates
for F in `ls *trim`; do
M=`grep -E '^[ATGCN]+$' $F | wc -l | grep -f - zooxMaps.e* -A 4 | tail -1 | perl -pe 's/zooxMaps\.e\d+-|% overall alignment rate//g'` ;
echo "$F.sam $M">>zooxAlignmentRates;
done
# ls *trim | cut -d '.' -f 1 >align1
# grep "% overall" zooxMaps.e* | cut -d ' ' -f 1 >align2
# paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >zooxAlignmentRates
# rm align1 align2
less zooxAlignmentRates
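For a quick summary, the overall mean mapping rate can be pulled from this file (assuming the second column holds the alignment percentage, as produced by the loop above):
awk '{sum += $2; n++} END {if (n) print "mean alignment rate: " sum/n "%"}' zooxAlignmentRates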
Clean up directory
zipper.py -f .trim -a -9 --launcher -e reckert2017@fau.edu
sbatch --mem=200GB zip.slurm
mv *.sam symbionts/
cd ~/2bRAD/fknms/sint/
mv filteredReads/symbionts .
‘stacking’ individual trimmed fastq reads:
cd filteredReads
ls *.trim.un | perl -pe 's/^(.+)$/uniquerOne.pl $1 >$1\.uni/' > unique
launcher_creator.py -j unique -n unique -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB unique.slurm
Checking there is a .uni for all samples
ls -l *.uni | wc -l
We can remove contaminant sequences from our de novo reference with Kraken
We need a Kraken database to search against. If you don’t already have one, you need to build one now; otherwise you can skip to running Kraken
We can download a compiled standard database:
mkdir ~/bin/krakenDB
cd ~/bin/krakenDB
srun wget https://genome-idx.s3.amazonaws.com/kraken/k2_pluspf_20210127.tar.gz
echo tar -xvf k2_pluspf_20210127.tar.gz >tar
launcher_creator.py -j tar -n tar -q mediumq7 -t 24:00:00 -e reckert2017@fau.edu
sbatch tar.slurm
echo '#!/bin/bash' >krakendb.sh
echo kraken2-build --download-taxonomy --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library archaea --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library bacteria --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library viral --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library human --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library fungi --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library protozoa --threads 16 --db ~/bin/krakenDB >>krakendb.sh
echo kraken2-build --download-library UniVec_Core --threads 16 --db ~/bin/krakenDB >>krakendb.sh
sbatch --mem=200GB -p longq7 -e krakenDB.e%j -o krakenDB.o%j krakendb.sh
cd ~/bin/symGenomes
# Symbiodinium microadriaticum
sed '/>/ s/$/|kraken:taxid|2951/' Symbiodinium_microadriacticum_genome.scaffold.fasta >S_microadriacticum.fa
# Breviolum minutum
sed '/>/ s/$/|kraken:taxid|2499525/' Breviolum_minutum.v1.0.genome.fa >B_minutum.fa
# Cladocopium goreaui
sed '/>/ s/$/|kraken:taxid|2562237/' Cladocopium_goreaui_Genome.Scaffolds.fasta >C_goreaui.fa
# Durusdinium trenchii
sed '/>/ s/$/|kraken:taxid|1381693/' 102_symbd_genome_scaffold.fa >D_trenchii.fa
echo '#!/bin/bash' >kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/S_microadriacticum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/B_minutum.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/C_goreaui.fa --db ~/bin/krakenDB >>kdbAdd
echo kraken2-build --add-to-library ~/bin/symGenomes/D_trenchii.fa --db ~/bin/krakenDB >>kdbAdd
sbatch --mem=200GB -o kdbAdd.o%j -e kdbAdd.e%j kdbAdd
echo '#!/bin/bash' >kdbBuild
echo kraken2-build --download-taxonomy --threads 16 --db ~/bin/krakenDB >>kdbBuild
echo kraken2-build --build --db ~/bin/krakenDB >>kdbBuild
sbatch --mem=200GB -o kdbBuild.o%j -e kdbBuild.e%j kdbBuild
Remove potential contamination from reference
cd ~/2bRAD/fknms/sint/filteredReads
echo '#!/bin/bash' >krakenDB
echo kraken2 --db ~/bin/krakenDB cdh_alltags.fas --threads 16 --classified-out cdh_alltags.contam.fa --unclassified-out cdh_alltags.unclass.fa --report krakenDB.report --output krakenDB.out >>krakenDB
sbatch --mem=200GB -o krakenDB.o%j -e krakenDB.e%j krakenDB
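To see which taxa were flagged most often, you can skim the report; in the standard kraken2 report the columns are percent of reads in the clade, clade read count, direct read count, rank code, taxid, and name:
sort -g -r -k1,1 krakenDB.report | head -20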
Build the de novo reference: 30 pseudo-chromosomes from the clean (decontaminated) major allele tags
mkdir ../mappedReads
mv sint_denovo.fa ../mappedReads
cd ../mappedReads
concatFasta.pl fasta=sint_denovo.fa num=30
Format pseudo genome
GENOME_FASTA=sint_denovo_cc.fasta
echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build $GENOME_FASTA $GENOME_FASTA >>genomeBuild.sh
echo samtools faidx $GENOME_FASTA >>genomeBuild.sh
sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mem=200GB genomeBuild.sh
Mapping reads to reference and formatting bam files
Map reads to fake genome:
mv ../filteredReads/*.un .
mv ../symbionts .
GENOME_FASTA=sint_denovo_cc.fasta
# mapping with --local option, enables clipping of mismatching ends (guards against deletions near ends of RAD tags)
2bRAD_bowtie2_launcher.py -f un -g $GENOME_FASTA --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm
Do we have the right number of SAM files?
ls *.sam | wc -l
Checking alignment rates
ls *un | cut -d '.' -f 1 >align1
grep "% overall" maps.e* | cut -d ' ' -f 1 >align2
>alignmentRates
paste <(awk -F' ' '{print $1}' align1) <(awk -F' ' '{print $1}' align2) >alignmentRates
rm align1 align2
less alignmentRates
BAM files will be used for genotyping, population structure, etc.
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -q shortq7 -t 06:00:00 -e reckert2017@fau.edu
sbatch --mem=200GB s2b.slurm
Do we have enough BAM files?
ls *bam | wc -l # should be the same number as number of trim files
Clean up directory
zipper.py -a -9 -f sam --launcher -e reckert2017@fau.edu
sbatch zip.slurm
rm *.un
“Fuzzy genotyping” with ANGSD: rather than calling hard genotypes, we work with genotype likelihoods at each SNP. Optimal for low-coverage data (<10x).
mkdir ../ANGSD3x
cd ../ANGSD3x
mv ../mappedReads/*.bam* .
ls *bam >bamsClones
ANGSD settings:
-minMapQ 20: only highly unique mappings (probability of erroneous mapping ≤ 1%)
-baq 1: realign around indels (not terribly relevant for 2bRAD reads mapped with the --local option)
-maxDepth: highest total depth (summed over all samples) to assess; set to 10x the number of samples
-minInd: the minimal number of individuals a site must be genotyped in; set to 50% of total N at this stage
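These two dataset-dependent values can be sanity-checked against your bams list; the numbers below match this 226-sample dataset:
NIND=$(wc -l < bamsClones) # 226 bams here
echo "maxDepth: $((NIND * 10))" # 10x N = 2260
echo "minInd: $((NIND / 2))" # 50% of N = 113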
export FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -maxDepth 2260 -minInd 113"
export TODO="-doQsDist 1 -doDepth 1 -doCounts 1 -dumpCounts 2"
echo '#!/bin/bash' >sintDD.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out dd >>sintDD.sh
sbatch --mem=200GB -o sintDD.o%j -e sintDD.e%j --mail-user=reckert2017@fau.edu --mail-type=ALL sintDD.sh
Summarizing results (using Misha Matz’s modification of a script by Matteo Fumagalli)
echo '#!/bin/bash' >RQC.sh
echo Rscript ~/bin/plotQC.R prefix=dd >>RQC.sh
echo gzip -9 dd.counts >>RQC.sh
sbatch -e RQC.e%j -o RQC.o%j --dependency=afterok:460550 --mem=200GB RQC.sh # replace 460550 with the job ID of your sintDD job
Proportion of sites covered at >5X:
cat quality.txt
scp dd.pdf to your laptop to look at the distribution of base quality scores, the fraction of sites in each sample passing coverage thresholds, and the fraction of sites passing genotyping-rate cutoffs. Use these to guide your choices of the -minQ, -minIndDepth, and -minInd filters in subsequent ANGSD runs
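For example, from your local machine (the remote path is an assumption, adjust to your directory structure):
scp reckert2017@koko-login.hpc.fau.edu:~/2bRAD/fknms/sint/ANGSD3x/dd.pdf .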
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 170 -snp_pval 1e-6 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > sintClones3x.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out sintClones3x >>sintClones3x.sh
sbatch --mem=200GB -o sintClones3x.o%j -e sintClones3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintClones3x.sh
Use the IBS matrix to identify clones with hierarchical clustering in R. scp it to your local machine and run the chunk below in R
if (!require("pacman")) install.packages("pacman")
pacman::p_load("dendextend", "ggdendro", "paletteer", "tidyverse") # paletteer supplies the palette used below
cloneBams = read.csv("../data/stephanocoeniaMetaData.csv") # list of bam files
cloneMa = as.matrix(read.table("../data/snps/clones/sintClones.ibsMat")) # reads in IBS matrix produced by ANGSD
dimnames(cloneMa) = list(cloneBams[,1],cloneBams[,1])
clonesHc = hclust(as.dist(cloneMa),"ave")
clonePops = cloneBams$region
cloneDepth = cloneBams$depthZone
cloneDend = cloneMa %>% as.dist() %>% hclust(.,"ave") %>% as.dendrogram()
cloneDData = cloneDend %>% dendro_data()
# Making the branches hang shorter so we can easily see clonal groups
cloneDData$segments$yend2 = cloneDData$segments$yend
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend2[i] == 0) {
cloneDData$segments$yend2[i] = (cloneDData$segments$y[i] - 0.01)}}
cloneDendPoints = cloneDData$labels
cloneDendPoints$pop = clonePops[order.dendrogram(cloneDend)]
cloneDendPoints$depth=cloneDepth[order.dendrogram(cloneDend)]
rownames(cloneDendPoints) = cloneDendPoints$label
# Making points at the leaves to place symbols for populations
point = as.vector(NA)
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend[i] == 0) {
point[i] = cloneDData$segments$y[i] - 0.01
} else {
point[i] = NA}}
cloneDendPoints$y = point[!is.na(point)]
techReps = c("S066.1", "S066.2", "S066.3", "S162.1", "S162.2", "S162.3", "S205.1", "S205.2", "S205.3")
cloneDendPoints$depth = factor(cloneDendPoints$depth,levels(cloneDendPoints$depth)[c(2,1)])
cloneDendPoints$pop = factor(cloneDendPoints$pop,levels(cloneDendPoints$pop)[c(4, 1, 3, 2)])
flPal = paletteer_d("vapoRwave::jazzCup")[c(2:5)]
cloneDendA = ggplot() +
geom_segment(data = segment(cloneDData), aes(x = x, y = y, xend = xend, yend = yend2), size = 0.5) +
geom_point(data = cloneDendPoints, aes(x = x, y = y, fill = pop, shape = depth), size = 4, stroke = 0.25) +
#scale_fill_brewer(palette = "Dark2", name = "Population") +
scale_fill_manual(values = flPal, name= "Population")+
scale_shape_manual(values = c(24, 25), name = "Depth Zone")+
geom_hline(yintercept = 0.12, color = "red", lty = 5, size = 0.75) + # creating a dashed line to indicate a clonal distance threshold
geom_text(data = subset(cloneDendPoints, subset = label %in% techReps), aes(x = x, y = (y - .015), label = label), angle = 90) + # spacing technical replicates further from leaf
geom_text(data = subset(cloneDendPoints, subset = !label %in% techReps), aes(x = x, y = (y - .010), label = label), angle = 90) +
labs(y = "Genetic distance (1 - IBS)") +
guides(fill = guide_legend(override.aes = list(shape = 22)))+
theme_classic()
cloneDend = cloneDendA + theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 12, color = "black", angle = 90),
axis.text.y = element_text(size = 10, color = "black"),
axis.line.y = element_line(),
axis.ticks.y = element_line(),
panel.grid = element_blank(),
panel.border = element_blank(),
panel.background = element_blank(),
plot.background = element_blank(),
legend.key = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
legend.position = "bottom")
cloneDend
ggsave("../figures/cloneDend.png", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
ggsave("../figures/cloneDend.eps", plot = cloneDend, height = 8, width = 35, units = "in", dpi = 300)
mkdir clones
mv sintClones* clones
# remove the clones and technical replicates identified in the dendrogram
cat bamsClones | grep -v 'fk_S066.1.trim.un.bt2.bam\|fk_S066.3.trim.un.bt2.bam\|fk_S162.1.trim.un.bt2.bam\|fk_S162.3.trim.un.bt2.bam\|fk_S205.1.trim.un.bt2.bam\|fk_S205.3.trim.un.bt2.bam' >bamsNoClones
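Sanity check the new bams list; 226 samples minus the 6 clone/replicate bams removed above should leave 220:
cat bamsNoClones | wc -l
# 220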
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 165 -snp_pval 1e-6 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > sintNoClones3x.sh
echo srun angsd -b bamsNoClones -GL 1 $FILTERS $TODO -P 1 -out sintNoClones3x >> sintNoClones3x.sh
sbatch --mem=200GB -o sintNoClones3x.o%j -e sintNoClones3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintNoClones3x.sh
How many SNPs?
grep "filtering:" sintNoClones.e*
18,763 SNPs
Using pcangsd to discern population structure
cd ~/2bRAD/fknms/sint
mkdir pcangsd
cd pcangsd
cp ../ANGSD3x/sintNoClones3x.beagle.gz .
source activate 2bRAD
echo 'pcangsd -b sintNoClones3x.beagle.gz -o fkSintPcangsd3x --admix --inbreedSamples --pcadapt --selection' > pcangsd
launcher_creator.py -j pcangsd -n pcangsd -q shortq7 -t 06:00:00 -e $EMAIL -w 1 -N 1
sbatch pcangsd.slurm
Calculate population structure from genotype likelihoods using NGSadmix for a range of K values. FIRST remove all clones/genotyping replicates! (we did this).
mkdir ../ngsAdmix
cp *beagle* ../ngsAdmix
zipper.py -f bam -a -9 --launcher -e reckert2017@fau.edu
sbatch zip.slurm
cd ../ngsAdmix
Create a file with 50 replicate simulations for each value of K 1-11 (num pops + 3)
ngsAdmixLauncher.py -f sintNoClones3x.beagle.gz --maxK 11 -r 50 -n fkSint3x --launcher -e reckert2017@fau.edu
sbatch --mem=200GB fkSint3xNgsAdmix.slurm
Next, take the likelihood value from each run of NGSadmix and put them into a file that can be used with Clumpak to calculate the most likely K using the methods of Evanno et al. (2005).
>fkSint3xNgsAdmixLogfile
for log in fkSint3x*.log; do
grep -Po 'like=\K[^ ]+' $log >> fkSint3xNgsAdmixLogfile;
done
Format for CLUMPAK in R
R
You are now using R in the terminal
logs <- as.data.frame(read.table("fkSint3xNgsAdmixLogfile"))
#output is organized with 10, 11 preceding 1, 2, 3 etc.
logs$K <- c(rep("10", 50), rep("11", 50), rep("1", 50), rep("2", 50), rep("3", 50),
rep("4", 50), rep("5", 50), rep("6", 50),
rep("7", 50), rep("8", 50), rep("9", 50))
write.table(logs[, c(2, 1)], "fkSint3xNgsAdmixLogfile_formatted", row.names = F,
col.names = F, quote = F)
quit()
# No need to save workspace image [press 'n']
n
Check that your formatted logfile has the appropriate number of entries
cat fkSint3xNgsAdmixLogfile_formatted | wc -l # expect 550 (11 K values x 50 replicates)
Make copies of the .qopt files (as .Q files) to run StructureSelector on
for file in fkSint3x*.qopt; do
filename=$(basename -- "$file" .qopt);
cp "$file" "$filename".Q;
done
mkdir fkSint3xQ
mv fkSint3x*Q fkSint3xQ
zip -r fkSint3xQ.zip fkSint3xQ
scp the .zip and the formatted logfile to your local machine and upload them to CLUMPAK (http://clumpak.tau.ac.il/bestK.html) and StructureSelector (https://lmme.ac.cn/StructureSelector/index.html)
scp sintNoClones* to your local machine for further analyses with R
Since we have 4 distinct lineages, we will pare down SNPs to only those found within all lineages. This helps avoid issues of ascertainment bias.
cd ~/2bRAD/fknms/sint/ANGSD3x
# created 'bamsClusters' in R, based on Admixture analysis
# use > (not >>) so rerunning doesn't append duplicate entries
awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters > blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters > tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters > greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters > yellowBams
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doGeno 8 -doPost 1 -doGlf 2"
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -out sintBlueSnps
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 30 -out sintTealSnps
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 23 -out sintGreenSnps
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 11 -out sintYellowSnps" > kSnps
launcher_creator.py -j kSnps -n kSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch kSnps.slurm
Filtering down the sites to those found within each lineage with the given ANGSD filtering criteria
for file in sint*Snps.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p shortq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done
# keep only sites present in every lineage's sites file
srun awk '(++c[$0])==(ARGC-1)' *Snpssites > sitesToDo
mkdir ../filteredSNPS3x
mv sint*Snps* ../filteredSNPS3x/
mv kSnps* ../filteredSNPS3x/
Now index the sites file and run ANGSD on all samples with only the reduced set of sites, using the -sites flag
cd ~/2bRAD/fknms/sint/ANGSD3x
angsd sites index sitesToDo
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo "angsd -b bamsNoClones -sites sitesToDo -GL 1 -P 1 $FILTERS $TODO -minInd 165 -out sintFiltSnps" > sintFiltSnps3x
launcher_creator.py -j sintFiltSnps3x -n sintFiltSnps3x -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch sintFiltSnps3x.slurm
mv sintFilt* ../filteredSNPS3x/
cd ~/2bRAD/fknms/sint/ANGSD3x
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 3"
echo '#!/bin/bash' > sintFiltRelate3x.sh
echo srun angsd -b bamsNoClones -GL 1 -sites sitesToDo $FILTERS $TODO -P 1 -minInd 165 -out sintFiltRelate3x >> sintFiltRelate3x.sh
sbatch --mem=200GB -o sintFiltRelate3x.o%j -e sintFiltRelate3x.e%j -p shortq7 --mail-type=ALL --mail-user=reckert2017@fau.edu sintFiltRelate3x.sh
mkdir ../ngsRelate3x
cd ../ngsRelate3x
mv ~/2bRAD/fknms/sint/ANGSD3x/*Relate* .
zcat sintFiltRelate3x.mafs.gz | cut -f5 | sed 1d >freq # allele frequencies (column 5), header dropped
echo '#!/bin/bash' > ngsFiltRelate3x.sh
echo ngsRelate -g sintFiltRelate3x.glf.gz -n 220 -f freq -z ../ANGSD3x/bamsNoClones -O filtRelate3x >> ngsFiltRelate3x.sh
sbatch -e ngsFiltRelate3x.e%j -o ngsFiltRelate3x.o%j --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ngsFiltRelate3x.sh
module load ngstools-master-gcc-8.3.0-qcbecbz
module load ngsf-master-gcc-8.3.0-yscrrl3
mkdir ~/2bRAD/fknms/sint/ngsF3x
cd ~/2bRAD/fknms/sint/ngsF3x
cp ../ngsRelate3x/sintFilt* .
N_SITES=$((`zcat sintFiltRelate3x.mafs.gz | wc -l`-1))
srun zcat sintFiltRelate3x.glf.gz | ngsF --n_ind 220 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out fkSint.approx_indF --approx_EM --seed 12345 --init_values r
srun zcat sintFiltRelate3x.glf.gz | ngsF --n_ind 220 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out fkSintF.indF --init_values fkSint.approx_indF.pars
Running SFS calculations within each lineage, to use for Fst and to run StairwayPlot
cd ~/2bRAD/fknms/sint/ANGSD3x
## scp bamsClusters text file to directory
# recreate the per-lineage bam lists (again with > so reruns don't duplicate entries)
awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters > blueBams
awk 'BEGIN { FS=" " } $2 == "Teal" { print $1 }' bamsClusters > tealBams
awk 'BEGIN { FS=" " } $2 == "Green" { print $1 }' bamsClusters > greenBams
awk 'BEGIN { FS=" " } $2 == "Yellow" { print $1 }' bamsClusters > yellowBams
## No filters that distort allele frequencies
FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -maxHetFreq 0.5 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11 -doGlf 2"
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -out sintBlueSFS3x
angsd -b tealBams -GL 1 -P 1 $FILTERS $TODO -minInd 30 -out sintTealSFS3x
angsd -b greenBams -GL 1 -P 1 $FILTERS $TODO -minInd 23 -out sintGreenSFS3x
angsd -b yellowBams -GL 1 -P 1 $FILTERS $TODO -minInd 11 -out sintYellowSFS3x" > sfsClusters3x
launcher_creator.py -j sfsClusters3x -n sfsClusters3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfsClusters3x.slurm
for file in sint*SFS3x.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p longq7 --mem=100GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%%.*}.sh;
done
Now, compile common sites from all lineages using awk
srun awk '(++c[$0])==(ARGC-1)' *SFS3xsites > sfs3xSitesToDo
cat sfs3xSitesToDo | wc -l
# index sites for ANGSD and re-run ANGSD using -sites
angsd sites index sfs3xSitesToDo
export GENOME_FASTA=~/2bRAD/fknms/sint/mappedReads/sint_denovo_cc.fasta
TODO="-doSaf 1 -ref $GENOME_FASTA -anc $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sfs3xSitesToDo -b blueBams -GL 1 -P 1 $TODO -out sintBlue3x
angsd -sites sfs3xSitesToDo -b tealBams -GL 1 -P 1 $TODO -out sintTeal3x
angsd -sites sfs3xSitesToDo -b greenBams -GL 1 -P 1 $TODO -out sintGreen3x
angsd -sites sfs3xSitesToDo -b yellowBams -GL 1 -P 1 $TODO -out sintYellow3x" >sfs3x
launcher_creator.py -j sfs3x -n sfs3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs3x.slurm
Once all jobs run:
mkdir ../SFS3x
mv sfs* ../SFS3x
mv sintBlue* ../SFS3x
mv sintTeal* ../SFS3x
mv sintGreen* ../SFS3x
mv sintYellow* ../SFS3x
cd ../SFS3x
###-- 1d-SFS
echo "realSFS sintBlue3x.saf.idx >sintBlue3x.sfs
realSFS sintTeal3x.saf.idx >sintTeal3x.sfs
realSFS sintGreen3x.saf.idx >sintGreen3x.sfs
realSFS sintYellow3x.saf.idx >sintYellow3x.sfs" >realSFS3x
launcher_creator.py -j realSFS3x -n realSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -w 4 -N 1
sbatch realSFS3x.slurm
###-- 2d-SFS
echo "realSFS sintBlue3x.saf.idx sintTeal3x.saf.idx -P 20 > pBlTl3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintTeal3x.saf.idx -sfs pBlTl3x.sfs -fstout pBlTl3x
realSFS sintBlue3x.saf.idx sintGreen3x.saf.idx -P 20 > pBlGn3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintGreen3x.saf.idx -sfs pBlGn3x.sfs -fstout pBlGn3x
realSFS sintBlue3x.saf.idx sintYellow3x.saf.idx -P 20 > pBlYl3x.sfs ; realSFS fst index sintBlue3x.saf.idx sintYellow3x.saf.idx -sfs pBlYl3x.sfs -fstout pBlYl3x
realSFS sintTeal3x.saf.idx sintGreen3x.saf.idx -P 20 > pTlGn3x.sfs ; realSFS fst index sintTeal3x.saf.idx sintGreen3x.saf.idx -sfs pTlGn3x.sfs -fstout pTlGn3x
realSFS sintTeal3x.saf.idx sintYellow3x.saf.idx -P 20 > pTlYl3x.sfs ; realSFS fst index sintTeal3x.saf.idx sintYellow3x.saf.idx -sfs pTlYl3x.sfs -fstout pTlYl3x
realSFS sintGreen3x.saf.idx sintYellow3x.saf.idx -P 20 > pGnYl3x.sfs ; realSFS fst index sintGreen3x.saf.idx sintYellow3x.saf.idx -sfs pGnYl3x.sfs -fstout pGnYl3x" >2dSFS3x
launcher_creator.py -j 2dSFS3x -n 2dSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -w 20 -N 1
sbatch 2dSFS3x.slurm
> sintKFst3x
realSFS fst stats pBlYl3x.fst.idx >> sintKFst3x
realSFS fst stats pTlYl3x.fst.idx >> sintKFst3x
realSFS fst stats pGnYl3x.fst.idx >> sintKFst3x
realSFS fst stats pBlTl3x.fst.idx >> sintKFst3x
realSFS fst stats pBlGn3x.fst.idx >> sintKFst3x
realSFS fst stats pTlGn3x.fst.idx >> sintKFst3x
echo "Blue
Teal
Green
Blue
Blue
Teal">fstPops1
echo "Yellow
Yellow
Yellow
Teal
Green
Green" >fstPops2
paste fstPops1 fstPops2 sintKFst3x > sintKFst3x.o
echo -e "pop1\tpop2\tfst\tweightedFst" | cat - sintKFst3x.o > sintKFst3x.out
rm fstPops1 fstPops2 sintKFst3x sintKFst3x.o
cat sintKFst3x.out
First, we filter Fst outliers from sites:
cd ~/2bRAD/fknms/sint/ANGSD3x
FILTERS="-minMapQ 20 -minQ 30 -minInd 165 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 4 -ref $GENOME_FASTA -doMaf 1 -dosnpstat 1 -doPost 2 -doBcf 1 --ignore-RG 0 -doGeno 11 -doCounts 1"
echo "angsd -b bamsNoClones -sites sfs3xSitesToDo -GL 1 $FILTERS $TODO -P 1 -out sintFiltSnps3xBayesFst" > sintFiltSnps3xBayesFst
launcher_creator.py -j sintFiltSnps3xBayesFst -n sintFiltSnps3xBayesFst -t 6:00:00 -e $EMAIL -w 1 -q shortq7
sbatch sintFiltSnps3xBayesFst.slurm
mkdir ~/2bRAD/fknms/sint/bayescan3x
cd ~/2bRAD/fknms/sint/bayescan3x
Create a file called vcf2bayescan.spid containing this text:
echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bspops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid
Need a file “bspops.txt” that lists samples in the same order as the bams file, with their assigned populations/lineages
cp ../ANGSD3x/bamsClusters ./bspops.txt
cp ../ANGSD3x/sintFiltSnps3xBayesFst.bcf .
srun bcftools convert -O v -o sintFiltSnps3xBayesFst.vcf sintFiltSnps3xBayesFst.bcf
srun java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnps3xBayesFst.vcf -outputfile sintFilt3x.bayescan -spid vcf2bayescan.spid
echo '#!/bin/bash' >bayeScan3x.sh
echo "bayescan sintFilt3x.bayescan -threads=100" >>bayeScan3x.sh
sbatch -e bayeScan3x.e%j -o bayeScan3x.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL bayeScan3x.sh
cut -d" " -f2- sintFilt3x.baye_fst.txt > sint3xFst
## skip the VCF header (first 69 lines here) so positions line up with the Fst table
tail --lines=+70 sintFiltSnps3xBayesFst.vcf | cut -f 1,2 | paste --delimiters "\t" - sint3xFst > sint3x.baye_fst_pos.txt
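The header length varies between VCFs, so if yours is not exactly 69 lines, a header-agnostic version of the same step is safer:
grep -v '^#' sintFiltSnps3xBayesFst.vcf | cut -f 1,2 | paste --delimiters "\t" - sint3xFst > sint3x.baye_fst_pos.txt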
Identify sites with q-value < 0.5
cp ../ANGSD3x/sfs3xSitesToDo .
awk '$5<0.5 {print $1"\t"$2}' sint3x.baye_fst_pos.txt > sint3xBayeOuts
grep -Fvxf sint3xBayeOuts sfs3xSitesToDo > sitesToDo3x_filtBaye # drop the outlier sites from the sites list
cp sitesToDo3x_filtBaye ../ANGSD3x
cd ../ANGSD3x
angsd sites index sitesToDo3x_filtBaye
export GENOME_FASTA=~/2bRAD/fknms/sint/mappedReads/sint_denovo_cc.fasta
TODO="-doSaf 1 -anc $GENOME_FASTA -ref $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sitesToDo3x_filtBaye -b blueBams -GL 1 -P 1 $TODO -out sintBlueSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b tealBams -GL 1 -P 1 $TODO -out sintTealSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b greenBams -GL 1 -P 1 $TODO -out sintGreenSFS3xbaye
angsd -sites sitesToDo3x_filtBaye -b yellowBams -GL 1 -P 1 $TODO -out sintYellowSFS3xbaye" >sfs3x_filtBaye
launcher_creator.py -j sfs3x_filtBaye -n sfs3x_filtBaye -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs3x_filtBaye.slurm
mv *SFS* ../SFS3x
mv *sfs* ../SFS3x
Generating per-population SFS
cd ../SFS3x
echo "realSFS sintBlueSFS3xbaye.saf.idx -fold 1 >sint3xFiltBlue.sfs
realSFS sintTealSFS3xbaye.saf.idx -fold 1 >sint3xFiltTeal.sfs
realSFS sintGreenSFS3xbaye.saf.idx -fold 1 >sint3xFiltGreen.sfs
realSFS sintYellowSFS3xbaye.saf.idx -fold 1 >sint3xFiltYellow.sfs" >sintFiltSFS3x
launcher_creator.py -j sintFiltSFS3x -n sintFiltSFS3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sintFiltSFS3x.slurm
Making blueprint files for StairwayPlot
mkdir ../swPlot3x
cd ../swPlot3x
# SWplot only wants the mutations, so have to omit first number in SFS file (0 mutations).
# L = cat ../ANGSD3x/sitesToDo3x_filtBaye | wc -l
# which in our case =1602283
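The nrand values in the blueprints below follow the Stairway Plot documentation's suggestion of (nseq-2)/4, (nseq-2)/2, 3*(nseq-2)/4, and nseq-2 break points; a quick check for the blue lineage (nseq = 262):
NSEQ=262
echo "$(( (NSEQ-2)/4 )) $(( (NSEQ-2)/2 )) $(( 3*(NSEQ-2)/4 )) $(( NSEQ-2 ))"
# 65 130 195 260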
echo "#input setting
popid: sintBlue # id of the population (no white space)
nseq: 262 # number of sequences (2*n)
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 48412.081585 16072.295629 8136.263471 5566.281044 3884.862859 3000.493939 2490.542066 2044.605947 1677.829824 1462.808464 1153.734320 1210.340336 996.550586 853.912239 871.393691 745.252259 707.343914 644.722151 524.781440 619.940399 453.336102 512.412506 418.589833 345.809706 589.674490 224.318676 425.283171 225.893123 296.762772 325.835638 366.133932 265.969924 263.448203 232.699638 255.827769 255.852748 232.754720 243.526873 148.221226 258.233364 208.669644 144.653762 264.557705 161.092656 138.914035 101.710255 253.695151 29.148930 181.051140 221.657390 97.162859 184.658342 21.433043 121.318353 251.518539 100.839142 157.846619 89.114729 107.951232 187.816747 136.096324 102.504921 119.852484 74.043546 120.152182 118.940635 73.374622 110.083987 103.685855 114.348029 54.475104 95.164183 92.298819 77.948983 105.927514 121.353275 85.980549 43.879627 90.136484 73.330610 122.956757 37.319611 21.488750 103.526304 176.889672 45.691251 30.019737 68.840477 95.416782 24.877401 27.192546 76.326701 94.728179 44.204632 55.961376 107.185398 98.646553 28.767601 41.776037 41.180102 58.361194 90.205366 17.981080 20.352570 93.415816 43.464174 31.050422 35.300437 126.310322 20.320044 27.939147 75.448620 71.715330 3.034916 48.578866 86.238681 51.132361 33.057738 65.638127 27.008721 64.533759 77.421571 40.131592 121.553633 31.729413 50.015735 126.942517 14.357809 92.065024 66.165010 149.756469
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 131 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 65 130 195 260 # number of random break points for each try (separated by white space)
project_dir: sintBlue # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintBlue # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintBlue3x.blueprint
echo "#input setting
popid: sintTeal # id of the population (no white space)
nseq: 80 # number of sequences (2*n)
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 16583.279005 7587.040220 4303.689077 3107.127953 2274.114237 1798.221276 1496.358048 1173.763214 1103.075623 822.648285 753.066621 730.392423 645.355034 602.195016 506.652940 428.804524 534.251728 450.322840 403.078661 380.013903 342.235477 286.348260 373.893637 243.034076 245.770095 249.644525 277.998715 224.384919 164.261869 253.896033 168.473848 240.951964 159.797919 171.288663 147.072982 148.058161 244.680742 217.024433 217.696620 289.787897
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 40 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 19 39 58 78 # number of random break points for each try (separated by white space)
project_dir: sintTeal # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 12
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintTeal # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintTeal3x.blueprint
echo "#input setting
popid: sintGreen # id of the population (no white space)
nseq: 62 # number of sequences
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 16061.203222 6825.579753 3958.728333 2661.799656 1970.474821 1495.152098 1276.663012 1067.091402 978.810625 796.180817 728.985927 639.560007 622.381059 552.760773 472.054998 428.264832 427.700157 381.403228 396.408490 314.339368 265.966511 300.508523 255.298912 252.674516 282.144474 192.814169 240.587680 273.439733 316.896586 225.258840 456.920758
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 31 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 15 30 45 60 # number of random break points for each try (separated by white space)
project_dir: sintGreen # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 71
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintGreen # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintGreen3x.blueprint
echo "#input setting
popid: sintYellow # id of the population (no white space)
nseq: 30 # number of sequences
L: 1602283 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 14098.279848 6216.748731 3541.171487 2370.205010 2002.251792 1516.454204 1284.282892 890.364677 919.754597 657.957085 598.193012 515.892221 576.527408 537.284192 673.801264
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 15 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 7 14 21 28 # number of random break points for each try (separated by white space)
project_dir: sintYellow # project directory
stairway_plot_dir: $HOME/bin/stairway_plot/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 23
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: sintYellow # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >sintYellow3x.blueprint
Create batch files and launch:
for blueprint in *blueprint; do
java -cp $HOME/bin/stairway_plot/stairway_plot_es Stairbuilder $blueprint;
done
# split each .sh script and launch all (second job waiting with dependency on first finishing)
for file in *.sh; do
grep -B 805 "# Step 2: determine number of break points" $file > ${file%.*}1;
grep -A 805 "# Step 2: determine number of break points" $file > ${file%.*}2;
echo '#!/bin/bash' | cat - ${file%.*}2 >temp;
mv temp ${file%.*}2;
launcher_creator.py -j ${file%.*}1 -n ${file%.*}1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch ${file%.*}1.slurm) && sbatch --dependency=afterok:${DEP##* } -e ${file%.*}2.e%j -o ${file%.*}2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%.*}2;
done
After all have run:
mv sint*/*final.summary .
Calculating heterozygosity across all loci (variant and invariant) using ANGSD and an R script from Misha Matz (https://github.com/z0on/2bRAD_denovo)
cd ~/2bRAD/fknms/sint/ANGSD3x
cp ../SFS3x/sfs3xSitesToDo .
angsd sites index sfs3xSitesToDo
export GENOME_FASTA=~/2bRAD/fknms/sint/mappedReads/sint_denovo_cc.fasta
FILTERS="-maxHetFreq 0.5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 165 -setMinDepthInd 3"
TODO="-ref $GENOME_FASTA -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 32 -doPost 1 -doGlf 2 -doCounts 1 -doMajorMinor 1 -dosnpstat 1 -doMaf 1"
echo "angsd -sites sfs3xSitesToDo -b bamsNoClones -GL 1 -P 1 $TODO $FILTERS -out sintHet3x" >angsdHet3x
launcher_creator.py -j angsdHet3x -n angsdHet3x -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch angsdHet3x.slurm
mkdir ../heterozygosity3x
cp sintHet* ../heterozygosity3x/
cd ../heterozygosity3x
echo heterozygosity_beagle.R sintHet3x.beagle.gz >filtHet3x
launcher_creator.py -j filtHet3x -n filtHet3x -q shortq7 -t 6:00:00 -e $EMAIL -N 1
sbatch --mem=200GB filtHet3x.slurm
tail -n 220 filtHet3x.e* > sintHet3x
cat sintHet3x
cd ~/2bRAD/fknms/sint/
mkdir theta3x
cd ANGSD3x
GENOME_FASTA=~/2bRAD/fknms/sint/mappedReads/sint_denovo_cc.fasta
>kThetas3x
for POP in yellow green teal blue; do
echo "angsd -b ${POP}Bams -GL 1 -P 1 -sites sfs3xSitesToDo -anc $GENOME_FASTA -doSaf 1 -out ${POP}3xOut &&\
realSFS ${POP}3xOut.saf.idx -fold 1 > ${POP}3xOut.sfs &&\
realSFS saf2theta ${POP}3xOut.saf.idx -sfs ${POP}3xOut.sfs -outname ${POP}3x -fold 1 &&\
thetaStat do_stat ${POP}3x.thetas.idx" >> kThetas3x
done
launcher_creator.py -j kThetas3x -n kThetas3x -q shortq7 -t 6:00:00 -e $EMAIL -w 4 -N 1
sbatch kThetas3x.slurm
mv *theta* ../theta3x
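thetaStat do_stat writes each population's summary to {pop}3x.thetas.idx.pestPG, with per-chromosome Watterson's theta, pairwise theta, and neutrality statistics; view one with, e.g.:
less ../theta3x/blue3x.thetas.idx.pestPG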
cd ~/2bRAD/fknms/sint/ANGSD3x
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 98 -sites sitesToDo -out sintBlueSnps3x" > blueSnps3x
launcher_creator.py -j blueSnps3x -n blueSnps3x -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch blueSnps3x.slurm
To use BayesAss3 (XXX), we first need to convert our ANGSD output into genotype format
cd ~/2bRAD/fknms/sint
mkdir BA33x
cd BA33x
cp ../ANGSD3x/blueBams .
cp ../ANGSD3x/sintBlueSnps3x.bcf .
# need a file bspops.txt that lists bams and sampling sites, tab delimited
cp ../bayescan3x/fkSintBsPops.txt bspops.txt
awk -F ' ' 'NR==FNR{a[$1]; next} ($1 in a) || ($2 in a)' blueBams bspops.txt > fkSintBlueBA3Pops.txt
cat fkSintBlueBA3Pops.txt | wc -l
# 131 — Good!
bcftools view sintBlueSnps3x.bcf > sintBlueSnps3x.vcf
echo "# VCF Parser questions
PARSER_FORMAT=VCF
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./fkSintBlueBA3Pops.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true
# Immanc (BayesAss) Writer questions
WRITER_FORMAT=IMMANC
# Specify the locus/locus combination you want to write to the Immanc (BayesAss) file:
IMMANC_WRITER_LOCUS_COMBINATION_QUESTION=
# Specify which data type should be included in the Immanc (BayesAss)) file (Immanc (BayesAss) can only analyze one data type per file):
IMMANC_WRITER_DATA_TYPE_QUESTION=SNP" >fkSintBA.spid
module load pgdspider-2.1.1.2-gcc-9.2.0-ghxvd4c
pgdSpider=/opt/ohpc/pub/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/pgdspider-2.1.1.2-ghxvd4c4ieqngkbutakc7x6j4pfkqm5e/bin/PGDSpider2-cli.jar
echo '#!/bin/bash' > pgdSpider.sh
echo "java -Xmx1024m -Xms512m -jar $pgdSpider -inputformat VCF -outputformat IMMANC vcf -inputfile sintBlueSnps3x.vcf -outputfile fkSintBayesAss3x.txt -spid fkSintBA.spid" >>pgdSpider.sh
sbatch -e pgdSpider.e%j -o pgdSpider.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL pgdSpider.sh
## Default params:
#MCMC reps: 1,000,000
#burn in: 100,000
#sampling freq: 100
#delta migration (1): 0.1
#delta allele freq (3): 0.1
#delta inbreeding (4): 0.1
rm sintBlueSnps3x.bcf
rm bspops.txt
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
# Run a test with verbose [-v] output to see the acceptance rates in the terminal (takes a few minutes to compute)
# check the output file [less BATest.o*] and kill the job once you get output with acceptance rates [scancel {yourJobID}]
# After ~5 min you should start seeing output in the BATest.o* file
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 fkSintBayesAss3x.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logL: -545609.14 % done: [0.00] % accepted: (0.49, 0.00, 0.91, 0.13, 0.79)
# we are looking for 20—60% acceptance, ideally somewhere nearer 20—30%
# relationships between mixing parameters and acceptance rates are inverse
# defaults are 0.1 (all parameters are scale 0—1)
# increase [-m] increase [-a] and decrease [-f]
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.2 -a 0.6 -f 0.04 fkSintBayesAss3x.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logP(M): -324.29 logL(G): -546136.13 logL: -546460.42 % done: [0.00] % accepted: (0.30, 0.00, 0.34, 0.33, 0.80)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.25 -a 0.7 -f 0.07 fkSintBayesAss3x.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logP(M): -313.10 logL(G): -545977.13 logL: -546290.23 % done: [0.00] % accepted: (0.24, 0.00, 0.31, 0.19, 0.79)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.25 -a 0.85 -f 0.05 fkSintBayesAss3x.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -310.70 logL(G): -545736.54 logL: -546047.24 % done: [0.00] % accepted: (0.24, 0.00, 0.25, 0.27, 0.80)
Make and launch 10 iterations of BayesAss, each in its own run directory so we can keep all the trace files (each is saved as ‘BA3trace.txt’ and would be overwritten if the runs shared a directory). We use [-s $RANDOM] to give each independent run a random starting seed
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo BA3SNP -v -u -s $RANDOM -i 30000000 -b 10000000 -n 1000 -m 0.25 -a 0.85 -f 0.05 -t -o fkSintBARun${i}Out.txt ../fkSintBayesAss3x.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
sbatch -e BayesAss$i.e%j -o BayesAss$i.o%j -p longq7 --mem=200GB --exclusive --mail-user reckert2017@fau.edu --mail-type=ALL BayesAss$i.sh
cd ..;
done
# after all runs complete copy files to main BayesAss directory
cd ~/2bRAD/fknms/sint/BA33x
#cp run*/*Out.txt .
for i in {01..10}; do
cp run$i/BA3trace.txt BA3trace.$i.txt;
done
cd ~/2bRAD/fknms/sint
cd bayescan3x
cp ../ANGSD3x/sintFiltSnps.bcf .
bcftools view sintFiltSnps.bcf > sintFiltSnps.vcf
rm sintFiltSnps.bcf
You will have to create or scp a text file with your .bam file names and populations
Creating a popfile that sorts bams into lineage_depth
awk -F ' ' 'NR==FNR{a[$1]=$0; next} ($1 in a) {print a[$1]" "$2}' bspops.txt fkSintBsPops.txt > bsPopsNoAdmix.txt # join lineage and site info by bam name
awk -F ' ' '$2 != "Admixed" {print $1, $2, $3}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt # drop admixed individuals
awk '{sub(/.*_/,"",$3); print}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt # keep just the depth zone from the site field
awk -F ' ' '{print $1, $2"_"$3}' bsPopsNoAdmix.txt > tmp && mv tmp bsPopsNoAdmix.txt # combine into lineage_depth labels
cat bsPopsNoAdmix.txt
cat bsPopsNoAdmix.txt | wc -l
# looks good!
Remove admixed individuals from vcf
awk -F ' ' '{print $1}' bsPopsNoAdmix.txt > noAdmixedBams
cat noAdmixedBams
bcftools view -S noAdmixedBams sintFiltSnps.vcf > sintFiltSnpsNoAdmixed.vcf
Convert vcf with PGDSpider
echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bsPopsNoAdmix.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan3x.spid
java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile sintFiltSnpsNoAdmixed.vcf -outputfile fkSint3x.bayescan -spid vcf2bayescan3x.spid
Launch BayeScan (takes 12+ hr depending on read count/SNPs)
echo '#!/bin/bash' > sintBayescan.sh
echo bayescan fkSint3x.bayescan -threads=100 >> sintBayescan.sh
sbatch -e sintBayescan.e%j -o sintBayescan.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayescan.sh
Extract outlier SNPs from BayeScan output
srun removeBayescanOutliers.pl bayescan=fkSint3x.baye_fst.txt vcf=sintFiltSnpsNoAdmixed.vcf FDR=0.05 mode=extract > fkSint3xVcfOutliers.vcf
Using BayeScEnv we can look for outliers related to depth
# Get the average depths for each population (lineage/depth combo)
##1 Blue_Mesophotic 36.2
##2 Teal_Mesophotic 35.2
##3 Teal_Shallow 21.0
##4 Blue_Shallow 23.7
##5 Yellow_Shallow 20.2
##6 Green_Shallow 21.9
##7 Green_Mesophotic 31.4
echo "36.2 35.2 21.0 23.7 20.2 21.9 31.4" > depth.txt
cp fkSint3x.bayescan fkSint3x.bayeScEnv
echo '#!/bin/bash' > sintBayeScEnv.sh
echo bayescenv fkSint3x.bayeScEnv -env depth.txt >> sintBayeScEnv.sh
sbatch -e sintBayeScEnv.e%j -o sintBayeScEnv.o%j -p longq7 --mail-user reckert2017@fau.edu --mail-type=ALL sintBayeScEnv.sh
Extract outlier SNPs from BayeScEnv output
srun removeBayescanOutliers.pl bayescan=fkSint.bayeS_fst.txt vcf=sintFiltSnpsNoAdmixed.vcf FDR=0.05 mode=extract > fkSintBScEnvVcfOutliers.vcf
cd ~/2bRAD/fknms/sint/
mv mappedReads/symbionts symbionts3x
cd symbionts3x
SYMGENOME=~/bin/symGenomes/symbConcatGenome
export GENOME_FASTA=~/2bRAD/fknms/sint/mappedReads/sint_denovo_cc.fasta
2bRAD_bowtie2_launcher.py -f zoox -g $GENOME_FASTA --split -a coral -u zoox --aldir coral --launcher -e reckert2017@fau.edu
mkdir coral
sbatch --mem=200GB maps.slurm
mkdir unused
mv *.sam unused/
mv *.trim.zoox unused/
2bRAD_bowtie2_launcher.py -f zoox -g $SYMGENOME --launcher -e reckert2017@fau.edu
sbatch --mem=200GB maps.slurm
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -t 6:00:00 -N 5 -e reckert2017@fau.edu -q shortq7
sbatch s2b.slurm
Count reads mapping to each chromosome of the concatenated Symbiodiniaceae genome
>zooxReads
for i in *.bam; do
echo $i >>zooxReads;
samtools idxstats $i | cut -f 1,3 >>zooxReads;
done
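For a per-sample total to compare against, summing column 3 of idxstats (mapped reads per contig) gives overall mapped reads without assuming anything about contig names:
for i in *.bam; do
echo -e "$i\t$(samtools idxstats $i | awk '{sum += $3} END {print sum}')";
done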