Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sort/create list of locations with highest # of reads?

    Hello,

    Hopefully there is a simple way to do this that I am unaware of.

    I have fastq and bam files (100bp paired end Illumina). In the alignment files there are lots of areas with one or a few reads spread throughout the genome.

    What I would like to do is either sort results to visualize, for example in Integrative Genomics Viewer, or generate a list that will show in descending order the locations that have the greatest number of reads or the greatest coverage. This would most likely be within a window, say 50bp.

    For example:

    chr21:33,031,597 has 58 reads
    chr21:33,031,200 has 56 reads
    chr21:33,025,000 has 42 reads

    and so on.

    Any advice or guidance on different ways I might achieve this is very appreciated.

    Thank you

  • #2
    Check out samtools ...

    $ samtools depth

    Usage: samtools depth [options] in1.bam [in2.bam [...]]
    Options:
    -b <bed> list of positions or regions
    -f <list> list of input BAM filenames, one per line [null]
    -l <int> minQLen
    -q <int> base quality threshold
    -Q <int> mapping quality threshold
    -r <chr:from-to> region

    Example:
    $ samtools depth -r chr1:1-100000 1221792.bam | head
    chr1 10033 1
    chr1 10034 1
    chr1 10035 1
    chr1 10036 1
    chr1 10037 1
    chr1 10038 1
    chr1 10039 1
    chr1 10040 1
    chr1 10041 1
    chr1 10042 1


    Use "awk" to filter on column 3 and "sort" on column 3 and "tail" it.
    Like this ...
    samtools depth 1221792.bam | awk '{if ($3>300) print $0}' | sort -k3n | tail
    Last edited by Richard Finney; 06-06-2014, 06:43 AM.

    Comment


    • #3
      Thanks so much Richard,

      So if I use:

      samtools depth 1221792.bam | awk '{if ($3>50) print $0}' | sort -k3n | tail

      for example, this will give me a list of all locations with more than 50 reads, correct?

      But no 'window,' or in other words, it will likely print the individual locations for each block of reads, rather than one result for each bin or chunk of 50 bases or something like that. Meaning, at chr1 10038 +/- 25bp, there are max 50 reads.

      Certainly a start, thank you for the help.

      Comment


      • #4
        Code:
        samtools view -F 4 sorted.bam | cut -f 3,4 | uniq -c | sort -nr | head -n 20 > most_common.txt
        That'll give you the top 20 most common start sites for your reads.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Recent Developments in Metagenomics
          by seqadmin





          Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
          09-23-2024, 06:35 AM
        • seqadmin
          Understanding Genetic Influence on Infectious Disease
          by seqadmin




          During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

          Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
          09-09-2024, 10:59 AM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, 10-02-2024, 04:51 AM
        0 responses
        13 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 10-01-2024, 07:10 AM
        0 responses
        21 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 09-30-2024, 08:33 AM
        0 responses
        25 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 09-26-2024, 12:57 PM
        0 responses
        18 views
        0 likes
        Last Post seqadmin  
        Working...
        X