This walkthrough describes the downstream processing and analysis of 2bRAD reads generated from Orbicella faveolata samples collected across urbanized and reef habitats in southeast Florida. Sequence alignments are based on an existing genome assembly by Young and colleagues: https://doi.org/10.1186/s12864-024-10092-w. For initial processing of 2bRAD reads regardless of species, please see below:
Library prep, bioinformatics, and analysis protocols are credited to the 2bRAD pipeline originally developed by Misha Matz: https://doi.org/10.1038/nmeth.2023, and further refined by Ryan Eckert: https://ryaneckert.github.io/Stephanocoenia_FKNMS_PopGen/code/
mkdir ~/db/symGenomes
cd ~/db/symGenomes
Using Symbiodiniaceae references from NCBI’s most recent genomes: Symbiodinium (NCBI GCA_965279495.1), Breviolum (GCA_965643015.1), Cladocopium (GCA_947184155.2), Durusdinium (GCA_963970005.1)
python concatFasta.py -o symbConcatGenome.fasta -s symbConcatGenome_summary.tsv -m symbConcatGenome_contig_to_fakechr.tsv GCA_965279495.1_pySymTrid1.1_genomic.fna GCA_965643015.1_pyBreMinu3.1_genomic.fna GCA_947184155.2_Cgoreaui_SCF055-01_v2.1_genomic.fna GCA_963970005.1_Durusdinium_trenchii_CCMP2556_genomic.fna
Building bowtie2 index for concatenated symbiont genomes
conda activate 2bRAD
echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build symbConcatGenome.fasta symbConcatGenome >>genomeBuild.sh
echo samtools faidx symbConcatGenome.fasta >>genomeBuild.sh
sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mail-type=ALL --mail-user=studivanms@gmail.com genomeBuild.sh
Building bowtie2 index for O. faveolata genome
conda activate 2bRAD
echo '#!/bin/bash' >genomeBuild.sh
echo bowtie2-build Orbicella_faveolata_gen_17.scaffolds.fa OfaveolataGenome >>genomeBuild.sh
echo samtools faidx Orbicella_faveolata_gen_17.scaffolds.fa >>genomeBuild.sh
sbatch -o genomeBuild.o%j -e genomeBuild.e%j --mail-type=ALL --mail-user=studivanms@gmail.com genomeBuild.sh
conda activate 2bRAD
SYMGENOME=~/db/symGenomes/symbConcatGenome
2bRAD_bowtie2_launcher.py -g $SYMGENOME -f trim -n zooxMaps --split -u un -a zoox --launcher -e studivanms@gmail.com
sbatch zooxMaps.slurm
mkdir ../mappedReads
mkdir ../mappedReads/symbionts
mv *.sam ../mappedReads/symbionts
mv *.zoox ../mappedReads/symbionts
cd ../mappedReads/symbionts
Do we have the correct number of files?
ll *.zoox | wc -l
Counting the mapped symbiont reads
echo '#!/bin/bash' >mappedZooxReads
echo readCounts.sh -e zoox -o Zoox >>mappedZooxReads
sbatch --mem=200GB --mail-type=ALL --mail-user=studivanms@gmail.com mappedZooxReads
scp ZooxReadCounts to local machine
Making script to generate indexed bam files
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -t 6:00:00 -N 5 -e studivanms@gmail.com -q shortq7
sbatch s2b.slurm
Counting the symbiont reads by genera
>ZooxReads
for i in *.bam; do
echo $i >>ZooxReads;
samtools idxstats $i | cut -f 1,3 >>ZooxReads;
done
scp ZooxReads to local machine
Housekeeping for long-term storage
zipper.py -a -9 -f sam --launcher -e studivanms@gmail.com
sbatch zip.slurm
zipper.py -a -9 -f zoox --launcher -e studivanms@gmail.com
sbatch zip.slurm
cd ~/urban/ofav/filteredReads
conda activate 2bRAD
HOSTGENOME=~/db/ofavGenome/OfaveolataGenome
mkdir junk
# mapping with --local option, enables clipping of mismatching ends (guards against deletions near ends of RAD tags)
2bRAD_bowtie2_launcher.py -g $HOSTGENOME -f un --split -u junk -a host --undir junk --launcher -e studivanms@gmail.com
sbatch --mem=200GB maps.slurm
Do we have the correct number of files?
ll *.sam | wc -l
Counting the mapped coral reads
echo '#!/bin/bash' >mappedReads
echo readCounts.sh -e host -o Host >>mappedReads
sbatch --mem=200GB --mail-type=ALL --mail-user=studivanms@gmail.com mappedReads
scp HostReadCounts to local machine
Making script to generate indexed bam files
>s2b
for file in *.sam; do
echo "samtools sort -O bam -o ${file/.sam/}.bam $file && samtools index ${file/.sam/}.bam">>s2b;
done
launcher_creator.py -j s2b -n s2b -q shortq7 -t 06:00:00 -e studivanms@gmail.com
sbatch --mem=200GB s2b.slurm
ll *bam | wc -l
Housekeeping for long-term storage
zipper.py -a -9 -f sam --launcher -e studivanms@gmail.com
sbatch zip.slurm
zipper.py -a -9 -f un --launcher -e studivanms@gmail.com
sbatch zip.slurm
zipper.py -a -9 -f trim --launcher -e studivanms@gmail.com
sbatch zip.slurm
zipper.py -a -9 -f host --launcher -e studivanms@gmail.com
sbatch zip.slurm
mv *.trim.gz ../../trimmedReads
mv *.sam.gz ../mappedReads
mv *.host.gz ../mappedReads
mv *.bam ../mappedReads
mv *.bai ../mappedReads
cd junk
zipper.py -a -9 -f junk --launcher -e studivanms@gmail.com
sbatch zip.slurm
Doesn’t call actual genotypes, but instead generates genotype likelihoods at each SNP. Optimal for low-coverage data (<10x) like 2bRAD.
Copy all .bam files to a new directory ‘ANGSD’
mkdir ../ANGSD
cd ../ANGSD
mv ../mappedReads/*.bam* .
ls *bam >bamsClones
Assessing base qualities and coverage depth -minMapQ 20:
only highly unique mappings (prob of erroneous mapping =< 1%)
-baq 1: realign around indels (not terribly relevant for
2bRAD reads mapped with –local option) -maxDepth: highest
total depth (sum over all samples) to assess; set to 10x number of
samples -minInd: the minimal number of individuals the site
must be genotyped in. Set to 50% of total N at this stage.
export FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -maxDepth 1030 -minInd 52"
export TODO="-doQsDist 1 -doDepth 1 -doCounts 1 -dumpCounts 2"
echo '#!/bin/bash' >ofavDD.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out dd -nThreads 20 >>ofavDD.sh
sbatch --mem=200GB -o ofavDD.o%j -e ofavDD.e%j --mail-user=studivanms@gmail.com --mail-type=ALL --partition=shortq7 ofavDD.sh
Summarizing results
module load R/3.6.1
echo '#!/bin/bash' >RQC.sh
echo Rscript ~/bin/plotQC.R prefix=dd >>RQC.sh
echo gzip -9 dd.counts >>RQC.sh
sbatch -e RQC.e%j -o RQC.o%j --dependency=afterok:460550 --mem=200GB --mail-type=ALL --mail-user=studivanms@gmail.com RQC.sh
Proportion of sites covered at >5X:
cat quality.txt
scp dd.pdf to local machine to look at distribution of base quality
scores, fraction of sites in each sample passing coverage thresholds and
fraction of sites passing genotyping rates cutoffs. Use these to guide
choices of -minQ, -minIndDepth and
-minInd filters in subsequent ANGSD
runs.
Set the -minInd flag to 75% of the number of samples you
have
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 77 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > ofavClones.sh
echo angsd -b bamsClones -GL 1 $FILTERS $TODO -P 1 -out ofavClones >>ofavClones.sh
sbatch --mem=200GB -o radClones.o%j -e radClones.e%j --mail-type=ALL --mail-user=studivanms@gmail.com --partition=shortq7 radClones.sh
scp .ibsMat and .bcf files to local machine Use ibs matrix to
identify clones with hierachial clustering in R.
scp to local machine and run chunk below in
R
pacman::p_load("dendextend", "ggdendro", "tidyverse")
cloneBams = read.csv("../../data/ofav/ofavMetadata.csv") # list of bam files
cloneMa = as.matrix(read.table("../../data/ofav/ANGSD/clones/ofavClones.ibsMat")) # reads in IBS matrix produced by ANGSD
dimnames(cloneMa) = list(cloneBams[,2],cloneBams[,2])
clonesHc = hclust(as.dist(cloneMa),"ave")
clonePops = cloneBams$region
cloneSite = cloneBams$site
cloneDend = cloneMa %>% as.dist() %>% hclust(.,"ave") %>% as.dendrogram()
cloneDData = cloneDend %>% dendro_data()
# Making the branches hang shorter so we can easily see clonal groups
cloneDData$segments$yend2 = cloneDData$segments$yend
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend2[i] == 0) {
cloneDData$segments$yend2[i] = (cloneDData$segments$y[i] - 0.01)}}
cloneDendPoints = cloneDData$labels
cloneDendPoints$region = clonePops[order.dendrogram(cloneDend)]
cloneDendPoints$site=cloneSite[order.dendrogram(cloneDend)]
rownames(cloneDendPoints) = cloneDendPoints$label
cloneDendPoints$region = as.factor(cloneDendPoints$region)
cloneDendPoints$site = as.factor(cloneDendPoints$site)
# Making points at the leaves to place symbols for regions
point = as.vector(NA)
for(i in 1:nrow(cloneDData$segments)) {
if (cloneDData$segments$yend[i] == 0) {
point[i] = cloneDData$segments$y[i] - 0.01
} else {
point[i] = NA}}
cloneDendPoints$y = point[!is.na(point)]
techReps = c("urban_080", "urban_080_2", "urban_080_3", "urban_289", "urban_289_2")
cloneDendPoints$site = factor(cloneDendPoints$site,levels(cloneDendPoints$site)[c(3, 10, 4, 2, 12, 8, 11, 9, 5, 7, 6, 1, 13)])
cloneDendPoints$region = factor(cloneDendPoints$region,levels(cloneDendPoints$region)[c(2, 3, 4, 5, 1)])
Pal <- c("#E69F00","#009E73","#D55E00","#CC79A7","#E41A1C","#377EB8","#FF7F00","#A65628","#FC8D62","#8DA0CB","#E78AC3","#A6D854","#FFD92F")
cloneDendA = ggplot() +
geom_segment(data = segment(cloneDData), aes(x = x, y = y, xend = xend, yend = yend2), size = 0.5) +
geom_point(data = cloneDendPoints, aes(x = x, y = y, fill = site, shape = region), size = 4, stroke = 0.25) +
# scale_fill_brewer(palette = "Dark2", name = "Site") +
scale_fill_manual(values = Pal, name= "Site")+
scale_shape_manual(values = c(21, 22, 23, 24, 25), name = "Region")+
geom_hline(yintercept = 0.12, color = "red", lty = 5, size = 0.75) + # creating a dashed line to indicate a clonal distance threshold
geom_text(data = subset(cloneDendPoints, subset = label %in% techReps), aes(x = x, y = (y - .045), label = label), angle = 90) + # spacing technical replicates further from leaf
geom_text(data = subset(cloneDendPoints, subset = !label %in% techReps), aes(x = x, y = (y - .025), label = label), angle = 90) +
labs(y = "Genetic distance (1 - IBS)") +
guides(fill = guide_legend(override.aes = list(shape = 22)))+
theme_classic()
cloneDend = cloneDendA + theme(
axis.title.x = element_blank(),
axis.text.x = element_blank(),
axis.line.x = element_blank(),
axis.ticks.x = element_blank(),
axis.title.y = element_text(size = 12, color = "black", angle = 90),
axis.text.y = element_text(size = 10, color = "black"),
axis.line.y = element_line(),
axis.ticks.y = element_line(),
panel.grid = element_blank(),
panel.border = element_blank(),
panel.background = element_rect(fill = "white", colour = NA),
plot.background = element_rect(fill = "white", colour = NA),
legend.key = element_blank(),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10),
legend.position = "bottom")
cloneDend
ggsave("../../figures/ofav/cloneDend.pdf", plot = cloneDend, height = 8, width = 24, units = "in", dpi = 300)
ggsave("../../figures/ofav/cloneDend.png", plot = cloneDend, height = 8, width = 24, units = "in", dpi = 300)
Set the -minInd flag to 75% of the number of samples you
have remaining after removal of clones
mkdir clones
mv ofavClones* clones
ll *.bam > bamsNoClones
cat bamsClones | grep -v 'OFAV_urban_359.trim.un.bt2.bam\|OFAV_nmfs_035.trim.un.bt2.bam\|OFAV_nmfs_069.trim.un.bt2.bam\|OFAV_nmfs_041.trim.un.bt2.bam\|OFAV_urban_080_2.trim.un.bt2.bam\|OFAV_urban_080_3.trim.un.bt2.bam\|OFAV_urban_289_2.trim.un.bt2.bam\|OFAV_urban_081.trim.un.bt2.bam\|OFAV_urban_415.trim.un.bt2.bam' >bamsNoClones
module load angsd-0.933-gcc-9.2.0-65d64pp
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -minInd 70 -snp_pval 1e-6 -minMaf 0.05"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo '#!/bin/bash' > ofavNoClones.sh
echo angsd -b bamsNoClones -GL 1 $FILTERS $TODO -P 1 -out ofavNoClones >> ofavNoClones.sh
sbatch --mem=200GB -o radNoClones.o%j -e radNoClones.e%j -p shortq7 --mail-type=ALL --mail-user=studivanms@gmail.com radNoClones.sh
How many SNPs do we have?
grep "filtering:" ofavNoClones.e*
Number of sites retained after filtering: 20323
Some housekeeping
zipper.py -a -9 -f bam --launcher -e studivanms@gmail.com
sbatch zip.slurm
mkdir ../pcangsd
cd pcangsd
cp ../ANGSD/ofavNoClones.beagle.gz .
conda activate pcangsd
echo '#!/bin/bash' > pcangsd.sh
echo 'pcangsd -b ofavNoClones.beagle.gz -o ofavPcangsd --admix --inbreed-samples --pcadapt --selection' >> pcangsd
launcher_creator.py -j pcangsd -n pcangsd -q shortq7 -t 06:00:00 -e $EMAIL -w 1 -N 1
sbatch pcangsd.slurm
Calculating population structure from genotype likelihoods using NGSadmix (part of ANGSD) Simulating (50x) different values of K (minimum of 1, maximum of the number of your sites + 3)
mkdir ../ngsAdmix
cp *beagle* ../ngsAdmix
cd ../ngsAdmix
ngsAdmixLauncher.py -f ofavNoClones.beagle.gz --maxK 16 -r 50 -n ofav --launcher -e studivanms@gmail.com
sbatch --mem=200GB ofavNgsAdmix.slurm
Taking the likelihood value from each run of NGSadmix and put them into a file that can be used with Clumpak to calculate the most likely K using the methods of Evanno et al. (2005).
>ofavNgsAdmixLogfile
for log in ofav*.log; do
grep -Po 'like=\K[^ ]+' $log >> ofavNgsAdmixLogfile;
done
Start R in command line
R
logs <- as.data.frame(read.table("ofavNgsAdmixLogfile"))
# output is organized with 10, 11 preceding 1, 2, 3 etc.
logs$K <- c(rep("10", 50), rep("11", 50), rep("12", 50), rep("13", 50), rep("14", 50), rep("15", 50), rep("16", 50), rep("1", 50), rep("2", 50), rep("3", 50), rep("4", 50), rep("5", 50), rep("6", 50), rep("7", 50), rep("8", 50), rep("9", 50))
write.table(logs[, c(2, 1)], "ofavNgsAdmixLogfile_formatted", row.names = F, col.names = F, quote = F)
quit()
# No need to save workspace image [press 'n']
n
Check that the formatted log file has the correct number of entries
cat ofavNgsAdmixLogfile_formatted | wc -l
800, so yes (K=16 * 50 simulations each)
Make copies of .qopt files to run structure selector on (.Q files)
for file in ofav*.qopt; do
filename=$(basename -- "$file" .qopt);
cp "$file" "$filename".Q;
done
mkdir ofavQ
mv ofav*Q ofavQ
zip -r ofavQ.zip ofavQ
scp .zip, formatted logfile, and ofavNoClones* to local machine
Create a PopMap file from bamsNoClones: a two-column, tab-delimited text file with sampleID and populationID Reorder output Q files and PopMap file based on specific site order
library(tidyverse)
popfile <- read.table(file = "../../data/ofav/ngsAdmix/ofavPopfile.txt", header = FALSE, col.names=c("sample", "site"), stringsAsFactors=FALSE)
site_order <- c(
"EmeraldReef",
"RainbowReef",
"FisherIsland",
"CoralCityCamera",
"StarIsland",
"MacArthurNorth",
"SouthCanyonReef",
"PillarsReef",
"FIUBiscayneBay",
"GracelandReef",
"FTL4Reef",
"BC1Reef",
"T328Reef"
)
site_factor <- factor(popfile$site, levels = site_order)
ord <- order(site_factor)
popfile_sorted <- popfile[ord, ]
write.table(popfile_sorted, "../../data/ofav/ngsAdmix/ofavPopfile_sorted.txt",
quote=FALSE, row.names=FALSE, col.names=FALSE, sep="\t")
# Unzips, loops through, and reorders each .Q file based on popfile_sorted
unzip("../../data/ofav/ngsAdmix/ofavQ.zip", exdir = "../../data/ofav/ngsAdmix/")
q_dir <- "../../data/ofav/ngsAdmix/ofavQ" # where .Q files now live
out_dir <- "../../data/ofav/ngsAdmix/ofavQ_sorted" # new folder for reordered files
dir.create(out_dir, showWarnings = FALSE)
q_files <- list.files(q_dir, pattern="\\.Q$|\\.qopt$", full.names=TRUE)
length(q_files) # should be 800
for (file in q_files) {
q <- read.table(file, header=FALSE, colClasses = "character")
# must have same number of rows
stopifnot(nrow(q) == nrow(popfile_sorted))
q_sorted <- q[ord, ]
out_file <- file.path(out_dir, basename(file))
write.table(
q_sorted,
out_file,
quote = FALSE,
row.names = FALSE,
col.names = FALSE
)
}
# Cleanup
unlink(q_dir, recursive = TRUE) # Delete the uncompressed ofavQ directory
# Zip the ofavQ_sorted directory
out_dir_norm <- normalizePath(out_dir)
parent_dir <- dirname(out_dir_norm)
folder_name <- basename(out_dir_norm)
zipfile_path <- file.path(parent_dir, "ofavQ_sorted.zip")
cmd <- sprintf(
'cd "%s" && zip -r "%s" "%s"',
parent_dir,
basename(zipfile_path),
folder_name
)
cat("Running command:\n", cmd, "\n")
system(cmd)
if (file.exists(zipfile_path)) {
unlink(out_dir_norm, recursive = TRUE)
} # Delete the uncompressed ofavQ_sorted directory
Upload logfile to CLUMPAK as a Log Probability table file (http://clumpak.tau.ac.il/bestK.html) And .zip to structure selector as an ADMIXTURE file (https://lmme.qdio.ac.cn/stable/StructureSelector/index.html)
If Clumpak fails due to all simulations of K=1 having zero standard deviation (all models having the same log likelihood), your populations may be acting as a single unit (panmixia) We can run the Evanno method another way to determine if K=1 is the most likely scenario, and can evaluate potential cryptic genetic lineages
# Load libraries
library(ggplot2)
library(dplyr)
library(patchwork)
# -------------------------
# 1. Read space-delimited input file
# -------------------------
data_file <- "../../data/ofav/ngsAdmix/ofavNgsAdmixLogfile_formatted" # replace with your file
df <- read.table(data_file, header = FALSE, sep = "",
col.names = c("K", "LogLikelihood"), stringsAsFactors = FALSE)
# -------------------------
# 2. Clean columns and convert to numeric
# -------------------------
df$K <- as.numeric(trimws(df$K))
df$LogLikelihood <- as.numeric(trimws(df$LogLikelihood))
df <- df[!is.na(df$K) & !is.na(df$LogLikelihood), ]
if(nrow(df) == 0) stop("No valid data found in the file. Check file formatting.")
# -------------------------
# 3. Summarize mean and SD for each K
# -------------------------
summary_df <- df %>%
group_by(K) %>%
summarise(MeanLnP = mean(LogLikelihood, na.rm = TRUE),
SD = sd(LogLikelihood, na.rm = TRUE)) %>%
arrange(K)
# -------------------------
# 4. Calculate ΔK robustly
# -------------------------
L <- summary_df$MeanLnP
n <- nrow(summary_df)
deltaK <- rep(NA, n)
if(n >= 3) {
for(i in 2:(n-1)) {
Li_minus1 <- L[i-1]; Li <- L[i]; Li_plus1 <- L[i+1]; SDi <- summary_df$SD[i]
if(all(is.finite(c(Li_minus1, Li, Li_plus1, SDi))) && SDi != 0) {
deltaK[i] <- abs(Li_plus1 - 2*Li + Li_minus1) / SDi
}
}
}
summary_df$deltaK <- deltaK
# -------------------------
# 5. Identify all ΔK peaks
# -------------------------
is_peak <- rep(FALSE, n)
if(n >= 3 && any(!is.na(deltaK))) {
for(i in 2:(n-1)) {
if(!is.na(deltaK[i])) {
left <- deltaK[i-1]; right <- deltaK[i+1]
if((is.na(left) || deltaK[i] > left) && (is.na(right) || deltaK[i] > right)) {
is_peak[i] <- TRUE
}
}
}
}
peaks_df <- summary_df[is_peak, ]
# Identify global best K (largest ΔK)
global_best <- if(nrow(peaks_df) > 0) peaks_df[which.max(peaks_df$deltaK), ] else NULL
# -------------------------
# 6. Create plots
# -------------------------
# ΔK plot
if(!all(is.na(summary_df$deltaK))) {
p_deltaK <- ggplot(summary_df, aes(x = K, y = deltaK)) +
geom_line(color = "steelblue") +
geom_point() +
theme_minimal() +
labs(title = "ΔK Plot (Evanno Method)",
x = "K",
y = expression(Delta*K)) +
coord_cartesian(clip = "off") # prevent label clipping
if(nrow(peaks_df) > 0) {
p_deltaK <- p_deltaK +
geom_point(data = peaks_df, aes(x = K, y = deltaK), color = "red", size = 3) +
geom_text(data = peaks_df, aes(x = K, y = deltaK, label = paste("K =", K)),
vjust = -0.5, color = "red", fontface = "bold", clip = "off")
}
if(!is.null(global_best)) {
p_deltaK <- p_deltaK +
geom_point(data = global_best, aes(x = K, y = deltaK), color = "darkred", size = 5) +
geom_text(data = global_best, aes(x = K, y = deltaK, label = paste("Best K =", K)),
vjust = -1, color = "darkred", fontface = "bold", clip = "off")
}
} else {
p_deltaK <- NULL
}
# Mean LogLikelihood plot
p_mean <- ggplot(summary_df, aes(x = K, y = MeanLnP)) +
geom_line(color = "darkgreen") +
geom_point(size = 3) +
theme_minimal() +
labs(title = "Mean Log Likelihood vs K",
x = "K",
y = "Mean Log Likelihood") +
coord_cartesian(clip = "off")
# Combine plots vertically
if(!is.null(p_deltaK)) {
p_combined <- p_deltaK / p_mean + plot_layout(heights = c(1,1))
} else {
p_combined <- p_mean
}
# -------------------------
# 7. Save combined plot
# -------------------------
output_file <- "../../figures/ofav/DeltaK_and_MeanLnP_plot.png"
ggsave(output_file, plot = p_combined, width = 6, height = 8, dpi = 300)
# -------------------------
# 8. Print peak info
# -------------------------
cat("Plot saved as", output_file, "\n")
if(!is.null(global_best)) {
cat("ΔK peaks at K =", paste(peaks_df$K, collapse = ", "), "\n")
cat("ΔK values:", paste(round(peaks_df$deltaK, 2), collapse = ", "), "\n")
cat("Global best K =", global_best$K, "with ΔK =", round(global_best$deltaK, 2), "\n")
} else if(nrow(summary_df) > 0) {
cat("ΔK could not be calculated; showing Mean Log Likelihood plot.\n")
} else {
cat("No valid data to plot.\n")
}
While K=1 (panmixia) appears to be the most likely scenario based on model likelihoods, the Evanno method suggests K=3, so there may be cryptic lineages
K=1 K=2
K=3
K=4
Since there are three possible cryptic lineages, we need to re-genotype samples using ANGSD using only SNPs found across all lineages. This helps avoid issues of ascertainment bias.
cd ~/urban/ofav/ANGSD
# created 'bamsClusters' in ofavR.Rmd based on Admixture analysis
awk 'BEGIN { FS=" " } $2 == "Blue" { print $1 }' bamsClusters >> blueBams
awk 'BEGIN { FS=" " } $2 == "Orange" { print $1 }' bamsClusters >> orangeBams
awk 'BEGIN { FS=" " } $2 == "Purple" { print $1 }' bamsClusters >> purpleBams
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doGeno 8 -doPost 1 -doGlf 2"
# Remember to set -minInd to 75% of the sample size for each cluster
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 40 -out ofavBlueSnps
angsd -b orangeBams -GL 1 -P 1 $FILTERS $TODO -minInd 21 -out ofavOrangeSnps
angsd -b purpleBams -GL 1 -P 1 $FILTERS $TODO -minInd 3 -out ofavPurpleSnps" > kSnps
launcher_creator.py -j kSnps -n kSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch kSnps.slurm
Filtering down the sites to those found within each lineage with the given ANGSD filtering criteria
for file in ofav*Snps.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p shortq7 --mem=100GB --mail-user $EMAIL --mail-type=ALL ${file%%.*}.sh;
done
# For gattaca:
for file in ofav*Snps.geno.gz; do
base="${file%%.*}"
echo "Processing $file → ${base}sites"
zcat "$file" | awk '{print $1"\t"$2}' > "${base}sites"
done
# The Purple lineage has so few samples and high divergence to the other clusters, so filtering to SNPs found across all results in 20 sites
# Dropping Purple from this filtering step
mv ofavPurpleSnpssites ../ofavPurpleSnpssites
srun awk '(++c[$0])==(ARGC-1)' *Snpssites > sitesToDo
# 6528
mkdir ../filteredSNPS
mv ofav*Snps* ../filteredSNPS/
mv kSnps* ../filteredSNPS/
Now index the sites file and run angsd on all samples with only the
reduced number of sites using -sites flag
angsd sites index sitesToDo
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
# Remember to set -minInd to 75% of your total sample size
echo "angsd -b bamsNoClones -sites sitesToDo -GL 1 -P 1 $FILTERS $TODO -minInd 70 -out ofavFiltSnps" > ofavFiltSnps
launcher_creator.py -j ofavFiltSnps -n ofavFiltSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch ofavFiltSnps.slurm
mv ofavFilt* ../filteredSNPS/
cd ../filteredSNPS/
How many SNPs do we have found across all lineages?
grep "filtering:" ofavFiltSnps.e*
Number of sites retained after filtering: 5323
cd ~/urban/ofav/ANGSD
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 3"
# Remember to set -minInd to 75% of your total sample size
echo '#!/bin/bash' > ofavFiltRelate.sh
echo angsd -b bamsNoClones -GL 1 -sites sitesToDo $FILTERS $TODO -P 1 -minInd 70 -out ofavFiltRelate >> ofavFiltRelate.sh
sbatch --mem=200GB -o ofavFiltRelate.o%j -e ofavFiltRelate.e%j -p shortq7 --mail-type=ALL --mail-user=$EMAIL ofavFiltRelate.sh
mkdir ../ngsRelate
mv *Relate* ../ngsRelate
cd ../ngsRelate
zcat ofavFiltRelate.mafs.gz | cut -f5 |sed 1d >freq
# Set -n to total sample size
echo '#!/bin/bash' > ngsFiltRelate.sh
echo ngsRelate -g ofavFiltRelate.glf.gz -n 94 -f freq -z ../ANGSD/bamsNoClones -O filtRelate >> ngsFiltRelate.sh
sbatch -e ngsFiltRelate.e%j -o ngsFiltRelate.o%j --mem=200GB --mail-user $EMAIL --mail-type=ALL ngsFiltRelate.sh
module load ngstools-master-gcc-8.3.0-qcbecbz
module load ngsf-master-gcc-8.3.0-yscrrl3
mkdir ~/urban/ofav/ngsF
cd ~/urban/ofav/ngsF
cp ../ngsRelate/ofavFilt* .
# Set --n_ind to total sample size
N_SITES=$((`zcat ofavFiltRelate.mafs.gz | wc -l`-1))
srun zcat ofavFiltRelate.glf.gz | ngsF --n_ind 94 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out ofav.approx_indF --approx_EM --seed 12345 --init_values r
srun zcat ofavFiltRelate.glf.gz | ngsF --n_ind 94 --n_sites $N_SITES --glf - --min_epsilon 0.001 --out ofavF.indF --init_values ofav.approx_indF.pars
/bin/ngsTools/ngsF
Running SFS calculations within lineage to use for Fst and to run
StairwayPlot
cd ~/urban/ofav/ANGSD
## No filters to distort allelic frequencies
FILTERS="-uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -maxHetFreq 0.5 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -dosnpstat 1 -doPost 2 -doGeno 11 -doGlf 2"
# Remember to set -minInd to 75% of the sample size for each cluster
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 40 -out ofavBlueSFS
angsd -b orangeBams -GL 1 -P 1 $FILTERS $TODO -minInd 21 -out ofavOrangeSFS
angsd -b purpleBams -GL 1 -P 1 $FILTERS $TODO -minInd 3 -out ofavPurpleSFS" > sfsClusters
launcher_creator.py -j sfsClusters -n sfsClusters -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfsClusters.slurm
for file in ofav*SFS.geno.gz; do
echo '#!/bin/bash' > ${file%%.*}.sh;
echo "zcat $file | awk '{print \$1\"\t\"\$2}' > ${file%%.*}sites" >> ${file%%.*}.sh;
sbatch -e ${file%%.*}.e%j -o ${file%%.*}.o%j -p longq7 --mem=100GB --mail-user $EMAIL --mail-type=ALL ${file%%.*}.sh;
done
# For gattaca
for file in ofav*SFS.geno.gz; do
base="${file%%.*}"
echo "Processing $file → ${base}sites"
zcat "$file" | awk '{print $1"\t"$2}' > "${base}sites"
done
Now, compile common sites from all lineages using awk
srun awk '(++c[$0])==(ARGC-1)' *SFSsites > sfsSitesToDo
cat sfsSitesToDo | wc -l
# 66103
# index sites for ANGSD and re-run ANGSD using ```-sites```
angsd sites index sfsSitesToDo
export GENOME_FASTA=~/db/ofavGenome/Orbicella_faveolata_gen_17.scaffolds.fa
TODO="-doSaf 1 -ref $GENOME_FASTA -anc $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sfsSitesToDo -b blueBams -GL 1 -P 1 $TODO -out ofavBlue
angsd -sites sfsSitesToDo -b orangeBams -GL 1 -P 1 $TODO -out ofavOrange
angsd -sites sfsSitesToDo -b purpleBams -GL 1 -P 1 $TODO -out ofavPurple" >sfs
launcher_creator.py -j sfs -n sfs -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs.slurm
Once all jobs run:
mkdir ../SFS
mv sfs* ../SFS
mv ofavBlue* ../SFS
mv ofavOrange* ../SFS
mv ofavPurple* ../SFS
cd ../SFS
###-- 1d-SFS
echo "realSFS ofavBlue.saf.idx >ofavBlue.sfs
realSFS ofavOrange.saf.idx >ofavOrange.sfs
realSFS ofavPurple.saf.idx >ofavPurple.sfs" >realSFS
launcher_creator.py -j realSFS -n realSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 4 -N 1
sbatch realSFS.slurm
###-- 2d-SFS
echo "realSFS ofavBlue.saf.idx ofavOrange.saf.idx -P 20 > pBlOr.sfs ; realSFS fst index ofavBlue.saf.idx ofavOrange.saf.idx -sfs pBlOr.sfs -fstout pBlOr
realSFS ofavBlue.saf.idx ofavPurple.saf.idx -P 20 > pBlPr.sfs ; realSFS fst index ofavBlue.saf.idx ofavPurple.saf.idx -sfs pBlPr.sfs -fstout pBlPr
realSFS ofavOrange.saf.idx ofavPurple.saf.idx -P 20 > pOrPr.sfs ; realSFS fst index ofavOrange.saf.idx ofavPurple.saf.idx -sfs pOrPr.sfs -fstout pOrPr" >2dSFS
launcher_creator.py -j 2dSFS -n 2dSFS -q shortq7 -t 06:00:00 -e $EMAIL -w 20 -N 1
sbatch 2dSFS.slurm
> ofavKFst
realSFS fst stats pBlOr.fst.idx >> ofavKFst
realSFS fst stats pBlPr.fst.idx >> ofavKFst
realSFS fst stats pOrPr.fst.idx >> ofavKFst
echo "Blue
Blue
Orange">fstPops1
echo "Orange
Purple
Purple" >fstPops2
paste fstPops1 fstPops2 ofavKFst > ofavKFst.o
echo -e "pop1\tpop2\tfst\tweightedFst" | cat - ofavKFst.o > ofavKFst.out
rm fstPops1 fstPops2 ofavKFst ofavKFst.o
cat ofavKFst.out
First, we filter Fst outliers from sites:
cd ~/urban/ofav/ANGSD
# Remember to set -minInd to 75% of your total sample size
FILTERS="-minMapQ 20 -minQ 30 -minInd 70 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 4 -ref $GENOME_FASTA -doMaf 1 -dosnpstat 1 -doPost 2 -doBcf 1 --ignore-RG 0 -doGeno 11 -doCounts 1"
echo "angsd -b bamsNoClones -sites ../SFS/sfsSitesToDo -GL 1 $FILTERS $TODO -P 1 -out ofavFiltSnpsBayesFst" > ofavFiltSnpsBayesFst
launcher_creator.py -j ofavFiltSnpsBayesFst -n ofavFiltSnpsBayesFst -t 6:00:00 -e $EMAIL -w 1 -q shortq7
sbatch ofavFiltSnpsBayesFst.slurm
mkdir ../bayescan
cd ../bayescan
Create a file called vcf2bayescan.spid containing this text:
echo "############
# VCF Parser questions
PARSER_FORMAT=VCF
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=false
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./bspops.txt
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# GESTE / BayeScan Writer questions
WRITER_FORMAT=GESTE_BAYE_SCAN
# Specify which data type should be included in the GESTE / BayeScan file (GESTE / BayeScan can only analyze one data type per file):
GESTE_BAYE_SCAN_WRITER_DATA_TYPE_QUESTION=SNP
############" >vcf2bayescan.spid
Need a file “bspops” that has samples in .bams order with the assigned populations/lineages
cp ../ANGSD/bamsClusters ./bspops.txt
cp ../ANGSD/ofavFiltSnpsBayesFst.bcf .
conda activate bowtie2
srun bcftools convert -O v -o ofavFiltSnpsBayesFst.vcf ofavFiltSnpsBayesFst.bcf
conda deactivate
srun java -Xmx1024m -Xms512m -jar ~/bin/PGDSpider_2.0.7.1/PGDSpider2-cli.jar -inputfile ofavFiltSnpsBayesFst.vcf -outputfile ofavFilt.bayescan -spid vcf2bayescan.spid
conda activate pgdspider
# For gattaca
PGDSpider2-cli \
-inputfile ofavFiltSnpsBayesFst.vcf \
-outputfile ofavFilt.bayescan \
-spid vcf2bayescan.spid
echo '#!/bin/bash' >bayeScan.sh
echo "bayescan2 ofavFilt.bayescan -threads=100" >>bayeScan.sh
sbatch -e bayeScan.e%j -o bayeScan.o%j -p mediumq7 --mail-user $EMAIL --mail-type=ALL bayeScan.sh
cut -d" " -f2- ofavFilt.baye_fst.txt > ofavFst
## removing all the .bcf data before snps
tail --lines=+94 ofavFiltSnpsBayesFst.vcf | cut -f 1,2 | paste --delimiters "\t" - ofavFst > ofav.baye_fst_pos.txt
Identify sites with q-value < 0.5
cp ../ANGSD/sfsSitesToDo .
awk '$5<0.5 {print $1"\t"$2}' ofav.baye_fst_pos.txt > ofavBayeOuts
grep -Fvxf ofavBayeOuts sfsSitesToDo > sitesToDo_filtBaye
cp sitesToDo_filtBaye ../ANGSD
cd ../ANGSD
angsd sites index sitesToDo_filtBaye
export GENOME_FASTA=~/db/ofavGenome/Orbicella_faveolata_gen_17.scaffolds.fa
TODO="-doSaf 1 -anc $GENOME_FASTA -ref $GENOME_FASTA -doMaf 1 -doMajorMinor 4"
echo "angsd -sites sitesToDo_filtBaye -b blueBams -GL 1 -P 1 $TODO -out ofavBlueSFSbaye
angsd -sites sitesToDo_filtBaye -b orangeBams -GL 1 -P 1 $TODO -out ofavOrangeSFSbaye
angsd -sites sitesToDo_filtBaye -b purpleBams -GL 1 -P 1 $TODO -out ofavPurpleSFSbaye" >sfs_filtBaye
launcher_creator.py -j sfs_filtBaye -n sfs_filtBaye -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch sfs_filtBaye.slurm
mv *SFS* ../SFS
mv *sfs* ../SFS
Generating per-population SFS
cd ../SFS
echo "realSFS ofavBlueSFSbaye.saf.idx -fold 1 >ofavFiltBlue.sfs
realSFS ofavOrangeSFSbaye.saf.idx -fold 1 >ofavFiltOrange.sfs
realSFS ofavPurpleSFSbaye.saf.idx -fold 1 >ofavFiltPurple.sfs" >ofavFiltSFS
launcher_creator.py -j ofavFiltSFS -n ofavFiltSFS -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch ofavFiltSFS.slurm
Making blueprint files for StairwayPlot
mkdir ../swPlot
cd ../swPlot
# SWplot only wants the mutations, so have to omit first number in SFS file (0 mutations).
# L = cat ../ANGSD/sitesToDo_filtBaye | wc -l
# which in our case =65540
echo "#input setting
popid: ofavBlue # id of the population (no white space)
nseq: 216 # number of sequences (2*n)
L: 65540 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 359.384186 176.850916 79.085634 56.122558 52.988270 25.213086 29.269262 6.897390 0.088362 6.035835 6.189239 1.689968 4.810179 2.838537 0.507948 6.285491 0.287215 4.095548 1.842047 0.406760 4.856281 0.000178 3.032708 0.040124 0.000000 0.000000 2.772030 0.000000 0.000000 0.000000 5.300901 0.000073 0.000000 1.896316 0.000000 4.034110 0.000000 0.000000 0.000000 0.161838 3.349950 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000005 1.000074 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 54 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 54 108 162 216 # number of random break points for each try (separated by white space)
project_dir: ofavBlue # project directory
stairway_plot_dir: $HOME/bin/stairway_plot_v2.1.2/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 6
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: ofavBlue # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >ofavBlue.blueprint
echo "#input setting
popid: ofavOrange # id of the population (no white space)
nseq: 116 # number of sequences (2*n)
L: 65540 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 554.425265 199.560921 98.343868 77.942108 35.099337 36.244939 18.319782 5.658205 19.041567 0.000000 0.000060 1.938488 7.950208 0.068295 3.433731 0.000000 0.000000 5.218967 0.000000 0.000000 3.936680 0.000082 4.467446 0.000080 0.000000 1.679113 1.660934 2.325276 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 29 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 29 58 87 116 # number of random break points for each try (separated by white space)
project_dir: ofavOrange # project directory
stairway_plot_dir: $HOME/bin/stairway_plot_v2.1.2/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 12
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: ofavOrange # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >ofavOrange.blueprint
echo "#input setting
popid: ofavPurple # id of the population (no white space)
nseq: 16 # number of sequences (2*n)
L: 65540 # total number of observed nucleic sites, including polymorphic and monomorphic
whether_folded: true # whether the SFS is folded (true or false)
SFS: 462.902489 195.427348 56.194712 43.441765 0.000000 0.000000 0.000000 0.000000
#smallest_size_of_SFS_bin_used_for_estimation: 1 # default is 1; to ignore singletons, uncomment this line and change this number to 2
#largest_size_of_SFS_bin_used_for_estimation: 4 # default is nseq/2 for folded SFS
pct_training: 0.67 # percentage of sites for training
nrand: 4 8 12 16 # number of random break points for each try (separated by white space)
project_dir: ofavPurple # project directory
stairway_plot_dir: $HOME/bin/stairway_plot_v2.1.2/stairway_plot_es # directory to the stairway plot files
ninput: 200 # number of input files to be created for each estimation
#random_seed: 71
#output setting
mu: 2e-8 # assumed mutation rate per site per generation
year_per_generation: 15 # assumed generation time (in years)
#plot setting
plot_title: ofavPurple # title of the plot
xrange: 0.1,10000 # Time (1k year) range; format: xmin,xmax; "0,0" for default
yrange: 0,0 # Ne (1k individual) range; format: xmin,xmax; "0,0" for default
xspacing: 2 # X axis spacing
yspacing: 2 # Y axis spacing
fontsize: 12 # Font size" >ofavPurple.blueprint
Create batch files and launch:
for blueprint in *blueprint; do
java -cp $HOME/bin/stairway_plot_v2.1.2/stairway_plot_es Stairbuilder $blueprint;
done
# split each .sh script and launch all (second job waiting with dependency on first finishing)
for file in *.sh; do
grep -B 805 "# Step 2: determine number of break points" $file > ${file%.*}1;
grep -A 805 "# Step 2: determine number of break points" $file > ${file%.*}2;
echo '#!/bin/bash' | cat - ${file%.*}2 >temp;
mv temp ${file%.*}2;
launcher_creator.py -j ${file%.*}1 -n ${file%.*}1 -q shortq7 -t 06:00:00 -e $EMAIL -N 5;
DEP=$(sbatch ${file%.*}1.slurm) && sbatch --dependency=afterok:${DEP##* } -e ${file%.*}2.e%j -o ${file%.*}2.o%j -p shortq7 --mem=200GB --mail-user reckert2017@fau.edu --mail-type=ALL ${file%.*}2;
done
After all have run:
mv ofav*/*final.summary .
Calculating Heterozygosity across all loci (variant//invariant) using
ANGSD and R script from Misha Matz (https://github.com/z0on/2bRAD_denovo)
cd ~/urban/ofav/ANGSD
cp ../SFS/sfsSitesToDo .
angsd sites index sfsSitesToDo
export GENOME_FASTA=~/db/ofavGenome/Orbicella_faveolata_gen_17.scaffolds.fa
# Remember to set -minInd to 75% of your total sample size
FILTERS="-maxHetFreq 0.5 -uniqueOnly 1 -remove_bads 1 -skipTriallelic 1 -minMapQ 30 -minQ 35 -doHWE 1 -sb_pval 1e-5 -hetbias_pval 1e-5 -minInd 70 -setMinDepthInd 3"
TODO="-ref $GENOME_FASTA -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 32 -doPost 1 -doGlf 2 -doCounts 1 -doMajorMinor 1 -dosnpstat 1 -doMaf 1"
echo "angsd -sites sfsSitesToDo -b bamsNoClones -GL 1 -P 1 $TODO $FILTERS -out ofavHet" >angsdHet
launcher_creator.py -j angsdHet -n angsdHet -q shortq7 -t 06:00:00 -e $EMAIL -N 1
sbatch angsdHet.slurm
mkdir ../heterozygosity
cp ofavHet* ../heterozygosity/
cd ../heterozygosity
echo heterozygosity_beagle.R ofavHet.beagle.gz >filtHet
launcher_creator.py -j filtHet -n filtHet -q shortq7 -t 6:00:00 -e $EMAIL -N 1
sbatch --mem=200GB filtHet.slurm
tail -n 94 filtHet.e* > ofavHet
cat ofavHet
cd ~/urban/ofav/
mkdir theta
cd ANGSD
GENOME_FASTA=~/db/ofavGenome/Orbicella_faveolata_gen_17.scaffolds.fa
>kThetas
for POP in purple orange blue; do
echo "angsd -b ${POP}Bams -GL 1 -P 1 -sites sfsSitesToDo -anc $GENOME_FASTA -doSaf 1 -out ${POP}Out &&\
realSFS ${POP}Out.saf.idx -fold 1 > ${POP}Out.sfs &&\
realSFS saf2theta ${POP}Out.saf.idx -sfs ${POP}Out.sfs -outname ${POP} -fold 1 &&\
thetaStat do_stat ${POP}.thetas.idx" >> kThetas
done
launcher_creator.py -j kThetas -n kThetas -q shortq7 -t 6:00:00 -e $EMAIL -w 4 -N 1
sbatch kThetas.slurm
mv *theta* ../theta
cd ~/urban/ofav/ANGSD
FILTERS="-uniqueOnly 1 -remove_bads 1 -minMapQ 20 -minQ 30 -dosnpstat 1 -doHWE 1 -hwe_pval 1e-5 -sb_pval 1e-5 -hetbias_pval 1e-5 -skipTriallelic 1 -snp_pval 1e-5 -minMaf 0.05 -setMinDepthInd 3"
TODO="-doMajorMinor 1 -doMaf 1 -doCounts 1 -makeMatrix 1 -doIBS 1 -doCov 1 -doGeno 8 -doBcf 1 -doPost 1 -doGlf 2"
echo "angsd -b blueBams -GL 1 -P 1 $FILTERS $TODO -minInd 40 -sites sitesToDo -out ofavBlueSnps" > blueSnps
launcher_creator.py -j blueSnps -n blueSnps -q shortq7 -t 06:00:00 -e $EMAIL -w 2 -N 1
sbatch blueSnps.slurm
To use BayesAss3 we first need to convert our ANGSD output into genotype format
cd ~/urban/ofav
mkdir BA3
cd BA3
cp ../ANGSD/blueBams .
cp ../ANGSD/ofavBlueSnps.bcf .
# need file bspops that lists bams and the sampling site, tab delimited.
cut -d',' -f1,4 ../ofavMetadata.csv | tr ',' '\t' > bspops.txt
awk -F ' ' 'NR==FNR{a[$1]; next} ($1 in a) || ($2 in a)' blueBams bspops.txt > ofavBlueBA3Pops.txt
cat ofavBlueBA3Pops.txt | wc -l
# 54 — Good!
bcftools view ofavBlueSnps.bcf > ofavBlueSnps.vcf
echo "# VCF Parser questions
PARSER_FORMAT=VCF
# Only output SNPs with a phred-scaled quality of at least:
VCF_PARSER_QUAL_QUESTION=
# Select population definition file:
VCF_PARSER_POP_FILE_QUESTION=./ofavBlueBA3Pops.txt
# What is the ploidy of the data?
VCF_PARSER_PLOIDY_QUESTION=DIPLOID
# Do you want to include a file with population definitions?
VCF_PARSER_POP_QUESTION=true
# Output genotypes as missing if the phred-scale genotype quality is below:
VCF_PARSER_GTQUAL_QUESTION=
# Do you want to include non-polymorphic SNPs?
VCF_PARSER_MONOMORPHIC_QUESTION=false
# Only output following individuals (ind1, ind2, ind4, ...):
VCF_PARSER_IND_QUESTION=
# Only input following regions (refSeqName:start:end, multiple regions: whitespace separated):
VCF_PARSER_REGION_QUESTION=
# Output genotypes as missing if the read depth of a position for the sample is below:
VCF_PARSER_READ_QUESTION=
# Take most likely genotype if "PL" or "GL" is given in the genotype field?
VCF_PARSER_PL_QUESTION=true
# Do you want to exclude loci with only missing data?
VCF_PARSER_EXC_MISSING_LOCI_QUESTION=true
# Immanc (BayesAss) Writer questions
WRITER_FORMAT=IMMANC
# Specify the locus/locus combination you want to write to the Immanc (BayesAss) file:
IMMANC_WRITER_LOCUS_COMBINATION_QUESTION=
# Specify which data type should be included in the Immanc (BayesAss)) file (Immanc (BayesAss) can only analyze one data type per file):
IMMANC_WRITER_DATA_TYPE_QUESTION=SNP" >ofavBA.spid
module load pgdspider-2.1.1.2-gcc-9.2.0-ghxvd4c
pgdSpider=/opt/ohpc/pub/spack/opt/spack/linux-centos7-x86_64/gcc-9.2.0/pgdspider-2.1.1.2-ghxvd4c4ieqngkbutakc7x6j4pfkqm5e/bin/PGDSpider2-cli.jar
echo '#!/bin/bash' > pgdSpider.sh
echo "java -Xmx1024m -Xms512m -jar $pgdSpider -inputformat VCF -outputformat IMMANC vcf -inputfile ofavBlueSnps.vcf -outputfile fkSintBayesAss.txt -spid fkSintBA.spid" >>pgdSpider.sh
sbatch -e pgdSpider.e%j -o pgdSpider.o%j -p mediumq7 --mail-user reckert2017@fau.edu --mail-type=ALL pgdSpider.sh
conda activate pgdspider
# For gattaca
echo "PGDSpider2-cli \
-inputfile ofavBlueSnps.vcf \
-outputfile ofavBayesAss.txt \
-spid ofavBA.spid" > BAprep
## Default params:
#MCMC reps: 1,000,000
#burn in: 100,000
#sampling freq: 100
#delta migration (1): 0.1
#delta allele freq (3): 0.1
#delta inbreeding (4): 0.1
rm ofavBlueSnps.bcf
rm bspops.txt
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
# Run a test with verbose [-v] output to see the acceptance rates in the terminal (takes a few minutes to compute)
# check the output file [less BATest.o*] and kill the job once you get output with acceptance rates [scancel {yourJobID}]
# After ~5 min you should start seeing output in the BATest.o* file
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 ofavBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
#logL: -545609.14 % done: [0.00] % accepted: (0.49, 0.00, 0.91, 0.13, 0.79)
# we are looking for 20—60% acceptance, ideally somewhere nearer 20—30%
# relationships between mixing parameters and acceptance rates are inverse
# defaults are 0.1 (all parameters are scale 0—1)
# increase [-m] increase [-a] and decrease [-f]
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.2 -a 0.6 -f 0.04 ofavBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -120.99 logL(G): -237963.30 logL: -238084.29 % done: [0.00] % accepted: (0.49, 0.00, 0.55, 0.14, 0.73)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.25 -a 0.7 -f 0.07 ofavBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -125.49 logL(G): -240078.35 logL: -240203.84 % done: [0.00] % accepted: (0.46, 0.00, 0.51, 0.08, 0.72)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.25 -a 0.85 -f 0.05 ofavBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -118.09 logL(G): -239038.86 logL: -239156.95 % done: [0.00] % accepted: (0.46, 0.00, 0.46, 0.11, 0.72)
echo '#!/bin/bash' > BATest
echo BA3SNP -v -i 30000000 -b 10000000 -n 1000 -m 0.5 -a 0.95 -f 0.02 ofavBayesAss.txt >> BATest
sbatch -e BATest.e%j -o BATest.o%j -p shortq7 --mail-user reckert2017@fau.edu --mail-type=ALL BATest
# logP(M): -127.36 logL(G): -238075.42 logL: -238202.78 % done: [0.00] % accepted: (0.42, 0.00, 0.43, 0.22, 0.73)
Make and launch 10 iterations of BayesAss, each in its own run directory so we can keep all trace files (saved as ‘BA3trace.txt’ and would overwrite if not in separate directories). We are using [-s $RANDOM] to use a random start seed for each independent run
module load gcc-9.2.0-gcc-8.3.0-ebpgkrt gsl-2.5-gcc-9.2.0-i6lf4jb netlib-lapack-3.9.1-gcc-9.2.0-gcqg2b2 BayesAss/3.0.4.2
for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo BA3SNP -v -u -s $RANDOM -i 30000000 -b 10000000 -n 1000 -m 0.5 -a 0.95 -f 0.02 -t -o ofavBARun${i}Out.txt ../ofavBayesAss.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
sbatch -e BayesAss$i.e%j -o BayesAss$i.o%j -p longq7 --mem=200GB --exclusive --mail-user $EMAIL --mail-type=ALL BayesAss$i.sh
cd ..;
done
# For gattaca
for i in {01..10}; do
echo '#!/bin/bash' > BayesAss$i.sh
echo ~/bin/BA3/BA3SNP -v -u -s $RANDOM -i 30000000 -b 10000000 -n 1000 -m 0.5 -a 0.95 -f 0.02 -t -o ofavBARun${i}Out.txt ../ofavBayesAss.txt >> BayesAss$i.sh;
mkdir run$i;
mv BayesAss$i.sh run$i;
cd run$i;
nohup bash BayesAss$i.sh > BayesAss$i.log 2>&1 &
cd ..;
done
# after all runs complete copy files to main BayesAss directory
cd ~/urban/ofav/BA3
#cp run*/*Out.txt .
for i in {01..10}; do
cp run$i/BA3trace.txt BA3trace.$i.txt;
done