Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • zorgster
    replied
    Originally posted by Brian Bushnell View Post
    Normally, I use the defaults But for 30bp ribosomal reads, you could add "maxindel=10" (just a random small number I picked). Searching for long indels (which BBMap does by default) is not necessary when aligning to ribosomes (which as far as I know are never spliced); [...]

    [...] with any mismatches, or quality-trim to Q30 prior to mapping, etc. These are almost never good ideas! They are typically devised by biologists on the assumption that "My data has variable quality, and is annotated with its actual quality. Therefore, if I throw away low-quality data, my results will be strictly better."
    I was here for adapter clipping. But thought I'd add something to this old post:

    1. https://www.ncbi.nlm.nih.gov/books/NBK21729/ (Molecular Cell Biology) summarises:
    - "Like pre-mRNAs, the primary transcripts produced from pre-rRNA and tRNA genes undergo extensive processing." - some of the processes are described in the linked chapter. Some things to be aware of if aligning to rRNA...

    2. Low quality reads... I have just such reads to back this up... what does Q2 mean here? The reads line up - but the sequence was still read where the quality was #/Q2. To me that looks like a valid read. [Read ID anonymised to SRR888888]

    @SRR888888.19.1 19 length=76
    CCCCCATGGAGCACAGGCAGACAGAAGCCCCCGCCCCAGCTCTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGC
    +
    =<7<=<AABCBBABB3?+<A<B>@A###################################################
    @SRR888888.20.1 20 length=76
    CCCCATGGAGCACAGGCAGACAGAAGTCCCCGCCCCAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCT
    +
    @@@DDBDDF?BBF?FE=EE1CFDCFHBC<CEFDFA@@F>AGDG@FEHE>EEH>=CCED?CCCC<@@@@CCACCCA@
    @SRR888888.29.1 29 length=76
    CAGCTGTGTGGCCTCAAGCCAGCCTTCCGCTCCTTGAAGCTGGTCTCCACACAGTGCTGGTTCCGTCACCCCCTCC
    +
    @@@FFFDE<DFHBAGGEGIHIB@GIIIGIIAHIGIIIGIGHGCHIIIIICFH;FHGGIII@C=?AEHBBCBCCCCC


    Oliver
    [biologist and programmer ]

    Leave a comment:


  • GenoMax
    replied
    If you have a multi-core machine, BBMap will be fast. Use the threads=N option to start with available threads.

    Leave a comment:


  • guilhem
    replied
    Thanks for all of your advices!
    I did not know about BBMap software, thank you!
    Is it faster as Hisat2? I have billions of read to map, although I think I will try for now to restrict my mapping to the transcriptome (especially for Eukarotic genome) -- full genome is very slow.

    Leave a comment:


  • Brian Bushnell
    replied
    Originally posted by GenoMax View Post
    @Brian may have a suggestion about parameters to use with BBMap.
    Normally, I use the defaults But for 30bp ribosomal reads, you could add "maxindel=10" (just a random small number I picked). Searching for long indels (which BBMap does by default) is not necessary when aligning to ribosomes (which as far as I know are never spliced); it decreases both speed and sensitivity. BBMap does have a "perfectmode" flag which allows only perfect alignments, but I do not really think it is appropriate in this case (or most situations, especially those involving quantification).

    There are a lot of papers written by people who do not fully understand all aspects of what they are doing - who can, these days, in any paper that is not purely theoretical? Often people try to make choices they think are safer and more conservative, overriding the suggested defaults, to minimize risk of a paper being rejected because something was hard to describe or explain. Particularly, in bioinformatics, it is common for people to throw out all reads with any mismatches, or quality-trim to Q30 prior to mapping, etc. These are almost never good ideas! They are typically devised by biologists on the assumption that "My data has variable quality, and is annotated with its actual quality. Therefore, if I throw away low-quality data, my results will be strictly better."

    This is absolutely wrong, as it relies on a lot of implicit assumptions (that quality is unrelated to sequence, that quality scores are correct, that trimming low-quality bases yields better mapping, that differences between a read and the reference are due to errors, etc) which may seem obvious, but are false.

    I am not trying to slam biologists here - they are experts in their field. It's just important to understand that being an expert in biology does not make one also an expert in statistics, or photonics, or any of the other numerous areas that go in to bioinformatics. So, bioinformatics papers written, reviewed, and published solely by biologists will often have subtle errors in the non-biological part of the methodology - as in this case.

    Leave a comment:


  • GenoMax
    replied
    You should add BBMap alignment as well. I wonder what fraction of your reads would be straight alignment and what fraction would have a splice site, with just 30 nt to work with. @Brian may have a suggestion about parameters to use with BBMap.

    Leave a comment:


  • guilhem
    replied
    Thanks Brian.
    I am not very familiar with NGS data analysis so I tried to apply the exact protocol described in the original paper: Ribosome profiling is a technique to track the translation pausing (Ingolia 2009). In fact, we freeze the translation at a t time and digest the uncover messenger RNA. Then, we obtained only footprint of the ribosome -- part of the messenger covers by the ribosome. These footprints are sequenced and I use the SRA data from these sequencing.
    In the original method introduced by Ingolia et al. 2009, they clipped the adapter, mapped to the genome assembly and they keep only reads with a perfect match (retains only NM tag = 0).

    I am not very familiar with NGS data so, I tried to respect closely the original protocol. I have just switched to hisat2 since I found bowtie2 and tophat rather slow.

    Leave a comment:


  • Brian Bushnell
    replied
    If you need perfect mapping, then absolutely, adapter-trimming is crucial. In general, requiring perfect mapping will incur sequence-dependent bias (as sequencing error rates are sequence-dependent), but that's more of an issue with long reads and may not matter with 30bp reads. Still, it also might matter since ribosomal sequences are typically low-diversity which makes them especially susceptible to sequence-dependent errors.

    So... why are you requiring perfect mapping?

    Leave a comment:


  • guilhem
    replied
    After clipping the read length is around 30nt. This is ribo-seq data (ribosomal footprint: RNA-seq covered by ribosome).

    Leave a comment:


  • GenoMax
    replied
    What was the original read length (if post clip is 30 bp)? Is this miRNA data?

    Leave a comment:


  • guilhem
    replied
    My adapter is CTGTAGGCACCATCAAT -- quite long I think. My reads are about 30 nt after clipping. And I need to perfect mapping (no mismatch) so I think clipping is necessary here. I am trying the software you adviced me, thanks GenoMax!

    Leave a comment:


  • GenoMax
    replied
    If the adapter contamination is short/minimal then the aligner should be able to manage but if you know you have short inserts/adapter dimers etc then it would be best to trim independently. I like to pass all data through a trimming program. If there is no contamination then only thing invested is a bit of time.
    Last edited by GenoMax; 02-19-2016, 09:10 AM.

    Leave a comment:


  • guilhem
    replied
    But if I do not clip the adapters, mapping will be biased by the adapter sequence, won't it?

    Leave a comment:


  • GenoMax
    replied
    You don't have to trim but if you need clean sequence files then a pass through the trimming program would keep that data available.

    Leave a comment:


  • guilhem
    replied
    Thanks a lot for the link. So, it seems that I would need to clip in an additional step before mapping with hisat2. I am gonna use BBDuk.sh thanks GenoMax!

    Leave a comment:


  • GenoMax
    replied
    Soft-clipping won't actually remove data. In that sense it is not the same thing as clipping adapter sequences using a dedicated trimming program.

    Leave a comment:

Latest Articles

Collapse

  • seqadmin
    Latest Developments in Precision Medicine
    by seqadmin



    Technological advances have led to drastic improvements in the field of precision medicine, enabling more personalized approaches to treatment. This article explores four leading groups that are overcoming many of the challenges of genomic profiling and precision medicine through their innovative platforms and technologies.

    Somatic Genomics
    “We have such a tremendous amount of genetic diversity that exists within each of us, and not just between us as individuals,”...
    05-24-2024, 01:16 PM
  • seqadmin
    Recent Advances in Sequencing Analysis Tools
    by seqadmin


    The sequencing world is rapidly changing due to declining costs, enhanced accuracies, and the advent of newer, cutting-edge instruments. Equally important to these developments are improvements in sequencing analysis, a process that converts vast amounts of raw data into a comprehensible and meaningful form. This complex task requires expertise and the right analysis tools. In this article, we highlight the progress and innovation in sequencing analysis by reviewing several of the...
    05-06-2024, 07:48 AM

ad_right_rmr

Collapse

News

Collapse

Topics Statistics Last Post
Started by seqadmin, 05-24-2024, 07:15 AM
0 responses
16 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 10:28 AM
0 responses
18 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-23-2024, 07:35 AM
0 responses
22 views
0 likes
Last Post seqadmin  
Started by seqadmin, 05-22-2024, 02:06 PM
0 responses
11 views
0 likes
Last Post seqadmin  
Working...
X