Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • Sort/create list of locations with highest # of reads?

    Hello,

    Hopefully there is a simple way to do this that I am unaware of.

    I have fastq and bam files (100bp paired end Illumina). In the alignment files there are lots of areas with one or a few reads spread throughout the genome.

    What I would like to do is either sort results to visualize, for example in Integrative Genomics Viewer, or generate a list that will show in descending order the locations that have the greatest number of reads or the greatest coverage. This would most likely be within a window, say 50bp.

    For example:

    chr21:33,031,597 has 58 reads
    chr21:33,031,200 has 56 reads
    chr21:33,025,000 has 42 reads

    and so on.

    Any advice or guidance on different ways I might achieve this is very appreciated.

    Thank you

  • #2
    Check out samtools ...

    $ samtools depth

    Usage: samtools depth [options] in1.bam [in2.bam [...]]
    Options:
    -b <bed> list of positions or regions
    -f <list> list of input BAM filenames, one per line [null]
    -l <int> minQLen
    -q <int> base quality threshold
    -Q <int> mapping quality threshold
    -r <chr:from-to> region

    Example:
    $ samtools depth -r chr1:1-100000 1221792.bam | head
    chr1 10033 1
    chr1 10034 1
    chr1 10035 1
    chr1 10036 1
    chr1 10037 1
    chr1 10038 1
    chr1 10039 1
    chr1 10040 1
    chr1 10041 1
    chr1 10042 1


    Use "awk" to filter on column 3 and "sort" on column 3 and "tail" it.
    Like this ...
    samtools depth 1221792.bam | awk '{if ($3>300) print $0}' | sort -k3n | tail
    Last edited by Richard Finney; 06-06-2014, 06:43 AM.

    Comment


    • #3
      Thanks so much Richard,

      So if I use:

      samtools depth 1221792.bam | awk '{if ($3>50) print $0}' | sort -k3n | tail

      for example, this will give me a list of all locations with more than 50 reads, correct?

      But no 'window,' or in other words, it will likely print the individual locations for each block of reads, rather than one result for each bin or chunk of 50 bases or something like that. Meaning, at chr1 10038 +/- 25bp, there are max 50 reads.

      Certainly a start, thank you for the help.

      Comment


      • #4
        Code:
        samtools view -F 4 sorted.bam | cut -f 3,4 | uniq -c | sort -nr | head -n 20 > most_common.txt
        That'll give you the top 20 most common start sites for your reads.

        Comment

        Latest Articles

        Collapse

        • seqadmin
          Essential Discoveries and Tools in Epitranscriptomics
          by seqadmin




          The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
          04-22-2024, 07:01 AM
        • seqadmin
          Current Approaches to Protein Sequencing
          by seqadmin


          Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
          04-04-2024, 04:25 PM

        ad_right_rmr

        Collapse

        News

        Collapse

        Topics Statistics Last Post
        Started by seqadmin, Yesterday, 08:47 AM
        0 responses
        14 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-11-2024, 12:08 PM
        0 responses
        60 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 10:19 PM
        0 responses
        60 views
        0 likes
        Last Post seqadmin  
        Started by seqadmin, 04-10-2024, 09:21 AM
        0 responses
        54 views
        0 likes
        Last Post seqadmin  
        Working...
        X