Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • LeonDK
    replied
    Originally posted by dpryan View Post
    Welcome to the NGS side of bioinfo!

    a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
    b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
    c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

    Anyway, it sounds like you're on the right track!
    Thanks! So far, I find it a bit of a strange world, with a lot of 'best guesses'!

    Regarding 'trimming', I have been told, that this is not necessary, since the aligner will simply 'skip' adapters/bad quality bases, apparently using something called 'soft-clipping'?

    Leave a comment:


  • dpryan
    replied
    Welcome to the NGS side of bioinfo!

    a) Sounds good to me, in fact that's what I'd do (well, it essentially is what I do, though I work mostly on mice).
    b) It shouldn't much matter. It's certainly simpler to just use the primary assembly. If you choose the top level assembly then make sure to remove any chromosome with "_hap_" in the name. Not doing that will artificially inflate the number of multimappers.
    c) Don't forget to trim adapters from the fastq files. You can quality trim a bit too, but don't be too aggressive there.

    Anyway, it sounds like you're on the right track!

    Leave a comment:


  • LeonDK
    replied
    Originally posted by dpryan View Post
    The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

    BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).
    First: Thank you so much for your input - I really appreciate your time and hope that others, which later on potentially will end up reading this via google, will also benefit!

    Next: Having recently finished my Bioinformatics PhD in statistical methodology and vaccine development, the world of NGS is new to me!

    Finally: I am using the STAR aligner (STAR_2.4.0h1) and I have build genome indices using GRCh38 [1] as described in [2]. I have then aligned my human 101bp Illumina paired-end reads to this build, using options "ENCODE standard options for long RNA-seq" in [2] section 3.2.1 and then finally I am trying to use featureCounts (Version 1.4.6) to generate a count matrix. Based on this I aim at performing DEG analysis.

    Questions:
    a) Is this approach valid/recommendable or have I missed something?
    b) Given the above scenario, would you recommend using the 'primary' or the 'top level' assembly?
    c) Any other input?

    Cheers again!

    1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
    2. https://github.com/alexdobin/STAR/bl...STARmanual.pdf

    Leave a comment:


  • dpryan
    replied
    The primary assembly contains only contigs assembled as chromosomes. The top level assembly contains that plus contigs that either (A) are known to belong to a specific chromosome but we're not sure where yet (unplaced contigs) or (B) are believed to be human but we're not even sure what chromosome to align them to (unlocalized contigs). If you've ever seen things like chr17_random, then that's from the first group. For humans, the top level assembly also contains many of the alternate haplotypes (i.e., full length chromosomes with all but a smallish region hard masked, where the non-masked region is commonly variant in the population). Newer versions of BWA can use these to increase the quality of mapping (to my knowledge, BWA is the only aligner that can natively use these at the moment...and this is all as of the last couple weeks). Presumably more aligners will add this feature in, since it really is the proper way to do it.

    BTW, with RNAseq you can get rid of the *hap* files (actually, you should do this unless you're using BWA).

    Leave a comment:


  • LeonDK
    replied
    Originally posted by shi View Post
    There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

    You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

    Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.
    I will try to re-do the mapping based on the 'top level' assembly, as suggested by dpryan and then I will re-run featureCounts with updated parameters as suggested by you. I will return with the results, once they are ready!

    Cheers

    Leave a comment:


  • LeonDK
    replied
    Originally posted by dpryan View Post
    BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).
    What is the difference between 'primary' [1] and 'top level' [2] assembly?

    1. ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz
    2. ftp://ftp.ensembl.org/pub/release-78...toplevel.fa.gz

    Leave a comment:


  • shi
    replied
    There are also a lot of reads that were not assigned due to the paired end distance issue ("Unassigned_FragementLength 13813566"). I would suggest you to allow a bigger paired end distance, eg -D 10000, since your data is RNA-seq and many read pairs span an intron that has a length much greater than the nominal fragment/template length.

    You also got a lot of multi-mapping reads (Unassigned_MultiMapping 13953741). This number seems too high to me, given that you have paired end reads and most of them are expected to be mapped uniquely.

    Also note that featureCounts can match chromosome names 'chr1' and '1' automatically.

    Leave a comment:


  • LeonDK
    replied
    Originally posted by dpryan View Post
    The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".
    Thank you!

    Leave a comment:


  • dpryan
    replied
    The TruSeq kit is dUTP based, so "reverse strand" is correct. It's stranded either way, the only question is which read in a pair denotes the strand. The most common kits (of which TruSeq is one of the most common) all need "-s 2".

    Leave a comment:


  • LeonDK
    replied
    Originally posted by dpryan View Post
    If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

    Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.
    If I use "-s 2", which is the option for "reverse stranded", the mapping dramatically increases to ~50%. But the kit is called "TruSeq stranded", not "reverse stranded"?

    Leave a comment:


  • dpryan
    replied
    If you got them both from Ensembl then they'll be compatible (it's often the case that people will try to mix and match between Ensembl and UCSC...which doesn't work). BTW, you'll only get the "extra things" in the top level assembly (rather than the primary).

    Have you looked at these alignments in IGV or a similar browser? I also wonder if you'll get better results with "-s 2", which is more common than "-s 1". I have no clue what's up with the Unassigned_FragementLength alignments, but that's the next thing to look into.

    Leave a comment:


  • LeonDK
    replied
    Originally posted by dpryan View Post
    The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

    BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.
    I'm using this annotation file:
    ftp://ftp.ensembl.org/pub/release-78...Ch38.78.gtf.gz

    And this genome:
    ftp://ftp.ensembl.org/pub/release-78...assembly.fa.gz

    The first column of the annotation file contain 1..22+MT/X/Y and other things, such as e.g. 'CHR_HG142_HG150_NOVEL_TEST' and the header of my SAM file contain:
    cat mySAMPLE/Aligned.out.sam | grep '@SQ'
    @SQ SN:10 LN:133797422
    @SQ SN:11 LN:135086622
    @SQ SN:12 LN:133275309
    @SQ SN:13 LN:114364328
    @SQ SN:14 LN:107043718
    @SQ SN:15 LN:101991189
    @SQ SN:16 LN:90338345
    @SQ SN:17 LN:83257441
    @SQ SN:18 LN:80373285
    @SQ SN:19 LN:58617616
    @SQ SN:1 LN:248956422
    @SQ SN:20 LN:64444167
    @SQ SN:21 LN:46709983
    @SQ SN:22 LN:50818468
    @SQ SN:2 LN:242193529
    @SQ SN:3 LN:198295559
    @SQ SN:4 LN:190214555
    @SQ SN:5 LN:181538259
    @SQ SN:6 LN:170805979
    @SQ SN:7 LN:159345973
    @SQ SN:8 LN:145138636
    @SQ SN:9 LN:138394717
    @SQ SN:MT LN:16569
    @SQ SN:X LN:156040895
    @SQ SN:Y LN:57227415

    Leave a comment:


  • dpryan
    replied
    Originally posted by LeonDK View Post
    How?

    What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?
    The question is whether chromosome 1 is called "chr1" or just "1". There are also a number of things in the human genome besides the standard 22 chromosomes +M/X/Y. I count 194 total chromosomes/scaffolds in the most recent release (this is after excluding the various "haplotype" chromosomes). These extra "things" are unplaced and/or unlocalized contigs (i.e., they likely contain human DNA, but we still don't know where to put them in the genome).

    BTW, to just see if the names match, just compare the first column of your annotation file and the header of the BAM file.

    Leave a comment:


  • LeonDK
    replied
    Originally posted by GenoMax View Post
    Obvious thing to check - if chromosome names in your sam file match the annotation.
    How?
    Originally posted by GenoMax View Post
    There are 270 "chromosomes" since there are entries of things like (besides the standard chromosomes 1,2,3 etc) in the GTF file.
    What other 'things' exist in the human genome besides the 'standard' 22 + X/Y chromosomes?

    Leave a comment:


  • LeonDK
    replied
    Originally posted by shi View Post
    featureCounts also outputs assignment stat in a .summary file. Could you show the content of that file?
    Status mySAMPLE/Aligned.out.sam
    Assigned 696763
    Unassigned_Ambiguity 11448
    Unassigned_MultiMapping 13953741
    Unassigned_NoFeatures 17772725
    Unassigned_Unmapped 0
    Unassigned_MappingQuality 0
    Unassigned_FragementLength 13813566
    Unassigned_Chimera 0
    Unassigned_Secondary 0
    Unassigned_Nonjunction 0
    Unassigned_Duplicate 0

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Quality Control Essentials for Next-Generation Sequencing Workflows
    by seqadmin




    Like all molecular biology applications, next-generation sequencing (NGS) workflows require diligent quality control (QC) measures to ensure accurate and reproducible results. Proper QC begins at nucleic acid extraction and continues all the way through to data analysis. This article outlines the key QC steps in an NGS workflow, along with the commonly used tools and techniques.

    Nucleic Acid Quality Control
    Preparing for NGS starts with isolating the...
    02-10-2025, 01:58 PM
  • seqadmin
    An Introduction to the Technologies Transforming Precision Medicine
    by seqadmin


    In recent years, precision medicine has become a major focus for researchers and healthcare professionals. This approach offers personalized treatment and wellness plans by utilizing insights from each person's unique biology and lifestyle to deliver more effective care. Its advancement relies on innovative technologies that enable a deeper understanding of individual variability. In a joint documentary with our colleagues at Biocompare, we examined the foundational principles of precision...
    01-27-2025, 07:46 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 02-07-2025, 09:30 AM
0 responses
72 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-05-2025, 10:34 AM
0 responses
113 views
0 likes
Last Post seqadmin  
Started by seqadmin, 02-03-2025, 09:07 AM
0 responses
87 views
0 likes
Last Post seqadmin  
Started by seqadmin, 01-31-2025, 08:31 AM
0 responses
48 views
0 likes
Last Post seqadmin  
Working...
X