My first variant calling workflow

Eurioste


06-30-2017, 08:50 AM

Hello, I'm currently learning how to process NGS data using the Galaxy platform. This is my first time working with NGS data, and I'm overwhelmed by the abundance of variant calling workflows and available tools. I have a molecular biology background and I'm learning this on my own through online courses, so I would like some feedback in case I'm making mistakes. While I can code in Python, I want to build this workflow in Galaxy as part of a course.

For learning purposes, I was given raw FASTQ reads from an Illumina MiSeq, sequenced as paired ends to 125 bp in length. The data are targeted re-sequencing data for a father, mother and child trio. I need to create a workflow to identify polymorphic sites in all three individuals.

I started a workflow based on the references below:

http://folk.uio.no/jonkl/StuffForMBV-INFx410/Articles/AAltmann.pdf

https://www.biomedcentral.com/content/supplementary/1756-0500-7-314-S1.pdf

My current, incomplete attempt is available at the link below. Some steps from the references were skipped for the sake of simplicity. I'm doing my best to understand what each step really does and why it is used. You can import the workflow into Galaxy for a better view:

https://usegalaxy.org/u/eurioste/w/variant-calling-on-trio

Briefly, the paired-end reads had 10 bp trimmed from the 3' end (based on the FASTQ report; this step is not in the workflow), resulting in high-quality reads of about 140 bp. The paired reads for each individual were aligned to the reference human_g1k_v37 with BWA-MEM, with separate read group information assigned to each individual. The resulting alignment BAM for each individual was pre-processed with Picard: sorting, removal of ambiguous reads and duplicates, and updating of mate-pair information. I'm omitting indel realignment and base quality recalibration on purpose. The resulting three BAMs could be used for variant calling, but now I have some questions.
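For anyone following along outside Galaxy, here is a rough sketch of how the per-individual alignment step with its own read group might look on the command line. It only builds and prints the `bwa mem` commands and does not run them. The FASTQ file names, the reference file name `human_g1k_v37.fasta` and the library name `lib1` are placeholders I made up for illustration, not the actual files from the workflow.

```python
import shlex

def read_group(sample, platform="ILLUMINA", library="lib1"):
    """Build an @RG string for bwa mem -R (library name is a placeholder)."""
    fields = ["@RG", f"ID:{sample}", f"SM:{sample}",
              f"LB:{library}", f"PL:{platform}"]
    # Join with a literal backslash-t; bwa mem turns "\t" into real tabs.
    return "\\t".join(fields)

def bwa_mem_command(sample, reference, fastq_r1, fastq_r2):
    """Return the argument list for aligning one individual's read pairs."""
    return ["bwa", "mem", "-R", read_group(sample),
            reference, fastq_r1, fastq_r2]

TRIO = ["father", "mother", "child"]

# Hypothetical file names: one pair of FASTQ files per individual.
commands = {
    s: bwa_mem_command(s, "human_g1k_v37.fasta",
                       f"{s}_R1.fastq", f"{s}_R2.fastq")
    for s in TRIO
}

for sample, cmd in commands.items():
    print(shlex.join(cmd))
```

The point of the sketch is only that each individual gets a distinct `ID` and `SM` value, which is what lets downstream tools tell the three samples apart.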

I'm expected to count the number of variants of different types above a certain quality threshold.
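To make the counting task concrete, here is a minimal sketch in plain Python of what I understand it to mean: read VCF records, keep those with QUAL at or above a threshold, and tally each ALT allele by type. The classification by REF/ALT length, the threshold of 30 and the example records are my own assumptions for illustration, not output from the workflow.

```python
from collections import Counter

def classify(ref, alt):
    """Classify one ALT allele by comparing its length with REF."""
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    if len(ref) < len(alt):
        return "insertion"
    return "deletion"

def count_variants(vcf_lines, min_qual=30.0):
    """Count ALT alleles per type for records with QUAL >= min_qual."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        ref, alts, qual = fields[3], fields[4], fields[5]
        if qual == "." or float(qual) < min_qual:
            continue  # missing or low quality
        for alt in alts.split(","):
            if alt in (".", "*") or alt.startswith("<"):
                continue  # no call, spanning deletion, or symbolic allele
            counts[classify(ref, alt)] += 1
    return counts

# Made-up records: CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO
EXAMPLE = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t50\t.\t.",
    "1\t200\t.\tAT\tA\t45\t.\t.",
    "1\t300\t.\tC\tCGG\t10\t.\t.",
    "1\t400\t.\tG\tGA,T\t60\t.\t.",
]

result = count_variants(EXAMPLE, min_qual=30.0)
print(dict(result))
```

Multi-allelic records are counted once per ALT allele here; counting once per site would be an equally reasonable reading of the task.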

I'm unsure whether it was a good choice to align the data for each individual separately. Is it correct to do variant calling on each individual separately? Can I still merge these BAM files with Picard and then do variant calling, and will they retain the correct alignment information? Or should I merge the read information before the alignment? Could these choices alter the results of the workflow? I've read about converting FASTQ to SAM/BAM and merging the files into an unmapped BAM before alignment and subsequent pre-processing. Do I really need to do that?
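One check that seems relevant to the merging question is whether the `@RG` header lines of a merged BAM still map each read group to its own sample. The sketch below parses header text such as what `samtools view -H` prints. The header lines shown are invented for illustration and are not taken from my files.

```python
def samples_by_read_group(header_lines):
    """Map each read group ID to its SM (sample) tag from SAM header lines."""
    mapping = {}
    for line in header_lines:
        if not line.startswith("@RG"):
            continue  # only read group lines matter here
        tags = dict(field.split(":", 1)
                    for field in line.rstrip("\n").split("\t")[1:])
        mapping[tags["ID"]] = tags.get("SM")
    return mapping

# Invented header of a merged BAM, for illustration only.
HEADER = [
    "@HD\tVN:1.6\tSO:coordinate",
    "@RG\tID:father\tSM:father\tPL:ILLUMINA",
    "@RG\tID:mother\tSM:mother\tPL:ILLUMINA",
    "@RG\tID:child\tSM:child\tPL:ILLUMINA",
]

rg_to_sample = samples_by_read_group(HEADER)
print(rg_to_sample)
```

If all three samples show up with distinct `SM` values, the individuals should remain distinguishable after merging, at least as far as the header is concerned.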

Is my workflow actually producing useful data? Please let me know if I'm making a mistake; I'm not sure whether what I did is right. Please explain things in some detail, because I'm still unfamiliar with NGS data processing.

Thanks in advance

Eduardo