  • Length distribution range in a FASTA file

    Hi folks,

    I need to calculate the length distribution of the sequences in a FASTA file: for example, how many sequences are shorter than 100 bp, how many are between 101 and 300 bp long, how many are between 301 and 500 bp, and so on. Is there a script or tool for this job?

    Thanks in advance!

  • #2
    If you index it with "samtools faidx" the resulting ".fai" file will be a text file containing the length of each of the sequences (among other information). You could then plot the distribution in R with whatever binning strategy you want.
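For anyone without samtools at hand, the name and length that end up in the first two columns of the ".fai" file can also be computed straight from the FASTA file. A minimal sketch with awk, assuming a plain (uncompressed) FASTA file; the function name fasta_lengths and the demo file demo.fa are made up for illustration:

```shell
# Print "name<TAB>length" for every record in a FASTA file.
# Sequence lines are summed per record, so wrapped sequences work too.
fasta_lengths() {
  awk '
    /^>/ {
      if (seen) print name "\t" len   # flush the previous record
      name = substr($1, 2)            # header up to the first whitespace, without ">"
      len = 0
      seen = 1
      next
    }
    { gsub(/[ \t\r]/, ""); len += length($0) }   # ignore stray whitespace and CR
    END { if (seen) print name "\t" len }
  ' "$1"
}

# Tiny made-up example file (hypothetical data, not from the thread).
printf '>seq1 demo\nACGTACGT\nACG\n>seq2\nAC\n' > demo.fa
fasta_lengths demo.fa
```

The output is a two-column table (sequence name, length) that can be binned or plotted with whatever tool you prefer.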

    • #3
      Dear dpryan,

      I have generated the ".fai" file and extracted the lengths of the genes I wanted (a two-column file: the first column holds the gene names, the second the gene lengths, as shown below):

      gene_120397 43056
      gene_240653 224380
      gene_150423 68254
      gene_143456 10090
      gene_141140 15291
      gene_253613 3088

      Could you please suggest some R code for plotting this distribution? I am not very familiar with plotting in R.

      Thank you!!!
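The ranges from the original question can also be counted directly from a two-column file like the one above, before any plotting. A rough sketch, assuming the length is in column 2 and that each range includes its upper edge; bin_lengths, genes.txt, and the comma-separated edge list are hypothetical names and formats chosen for this example, and the edges used in the demo are arbitrary:

```shell
# Count sequences per length range.
# $1: file with the length in column 2 (e.g. "name length")
# $2: comma-separated upper bin edges, inclusive (hypothetical parameter format)
bin_lengths() {
  awk -v edges="$2" '
    BEGIN { n = split(edges, e, ",") }
    {
      len = $2 + 0
      for (i = 1; i <= n; i++) if (len <= e[i] + 0) break
      count[i]++                      # i == n + 1 means "above the last edge"
    }
    END {
      lo = 0
      for (i = 1; i <= n; i++) {
        printf "%d-%d\t%d\n", lo, e[i], count[i]
        lo = e[i] + 1
      }
      printf ">%d\t%d\n", e[n], count[n + 1]
    }
  ' "$1"
}

# The six genes quoted in the post above.
cat > genes.txt <<'EOF'
gene_120397 43056
gene_240653 224380
gene_150423 68254
gene_143456 10090
gene_141140 15291
gene_253613 3088
EOF

bin_lengths genes.txt 5000,20000,100000
```

For these six genes the counts come out as 1, 2, 2, and 1 for the ranges 0-5000, 5001-20000, 20001-100000, and >100000. Passing 100,300,500 instead gives the ranges asked for in the first post.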

      • #4
        mydata <- read.table("inputfile.txt")
        plot(mydata)

        • #5
          FastQC Length Distribution

          Hello.

          I believe FastQC reports this information, but not for a FASTA file.

          If anyone has used FastQC to look at the length distribution: under what conditions does it consider the distribution a pass or a fail?

          Any advice?

          • #6
            Originally posted by arcolombo698
            Hello.

            I believe FastQC reports this information, but not for a FASTA file.

            If anyone has used FastQC to look at the length distribution: under what conditions does it consider the distribution a pass or a fail?

            Any advice?

            From the documentation:

            Warning

            This module will raise a warning if all sequences are not the same length.

            Failure

            This module will raise an error if any of the sequences have zero length.

            • #7
              Hi @antoza,
              Did you find a way to solve your problem? If you did, could you share your experience?
              I am facing the same task right now.
              Thanks.

              • #8
                FastQC does plot this information, so you can see the length distribution visually. This is a quick and easy approach.

                If you use samtools instead, you can plot the lengths in R.

                • #9
                  The BBMap package has a couple of programs for this purpose:

                  stats.sh in=file.fasta shist=shist.txt
                  (only works on fasta input)

                  readlength.sh in=file.fasta out=hist.txt

                  (works on fasta, fastq, or sam)

                  The way they display output is a little different, but both are easy to plot.

                  • #10
                    Yeah, that will work. Thanks, @Brian Bushnell.

                    • #11
                      Hi @arcolombo698,
                      Could you be a little more specific?
                      How can I do this with FastQC when my input file is FASTA?

                      • #12
                        Hey @arcolombo698,
                        I used samtools and got the .fai file. To plot the length distribution in R, I renamed my gene.fa.fai file to gene.txt and then used the command suggested by @rnaeye:

                        mydata <- read.table("gene.txt")
                        plot(mydata)

                        and got this error:

                        Code:
                        > mydata <- read.table("gene.txt")
                        > plot(mydata)
                        Error: cannot allocate vector of size 156.2 Gb
                        I am new to R, so could you explain what went wrong?
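A note on the file layout may help here. A ".fai" file has five tab-separated columns (name, length, offset, bases per line, bytes per line), so read.table() returns a five-column table, and plot(mydata) tries to plot the whole table rather than a histogram of the lengths. That is probably related to the memory error, although nothing in the thread confirms the cause. In R, calling hist() on the length column alone would be the usual route. Another option is to shrink the data to a small table of bin counts before it ever reaches R. A minimal sketch, assuming fixed-width bins; fai_histogram, demo.fai, and the 1000-base default width are hypothetical choices, and the demo rows are made up:

```shell
# Summarise column 2 (sequence length) of a .fai file into fixed-width bins.
# $1: .fai file, $2: bin width in bases (hypothetical default: 1000)
# Output columns: bin start, bin end, number of sequences in the bin.
fai_histogram() {
  awk -F '\t' -v w="${2:-1000}" '
    { b = int($2 / w); count[b]++; if (b > max) max = b }
    END {
      for (b = 0; b <= max; b++)
        printf "%d\t%d\t%d\n", b * w, (b + 1) * w - 1, count[b]
    }
  ' "$1"
}

# Made-up five-column index; only the second column matters here.
printf 'geneA\t250\t7\t60\t61\ngeneB\t1200\t269\t60\t61\ngeneC\t1999\t1497\t60\t61\ngeneD\t3000\t3537\t60\t61\n' > demo.fai
fai_histogram demo.fai 1000
```

The result has one row per bin, however many sequences the index holds, so plotting it as a bar chart takes very little memory.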
