Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I shorten a genome sequence to secure my workflow is properly functioning?

    I am Moritz from the University Heidelberg in Germany.

    For my bachelor thesis I have 20 large (25-30 GB) genome files (.txt.gz) by patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I have got to try out several approaches.

    Steps included are:

    Alignment (BWA (Transform sai and sam)) against hg19.fasta
    Transform (samtols)
    Dedupe
    The problem I have is that in order to try out my bpipe workflow, I have to take a whole sequence of 30 GB and start from the beginning. That takes a lot of time. So my questions are:

    How can I shorten one file?

    Where can I find a short sequence that I can use to test my pipeline?

  • #2
    You can use the suggestions in this thread to sample reads from your sequence files for testing small subsets: http://seqanswers.com/forums/showthread.php?t=16505

    Comment


    • #3
      If you want to have a little number of reads (suppose you have a .fastq file) in your "test sample", an easy way is :
      head -10000 your_big_file.fastq > sample_test_2500_reads.fastq
      One read = 4 rows in .fastq....so, in order to have (for example!!) 2500 reads in your "test sample", you have to set -10000 in head option.

      Changing this option, you can change the number of reads.

      This is a fisrt easy way.

      Comment


      • #4
        Like I wrote, I have several files which end on .txt.gz
        one example:
        D1n67_GDNA_12s00128-1-2_Hose_lane00128_1_2_sequence.txt.gz
        Is that an answer to the file format? (I'm totally new to Genetics)
        So can I take my file, gunzip it and then run your line?

        Comment


        • #5
          Yes, you can gunzip it and then do what mattia suggested.

          Keep in mind that the first lines of many fastq files will NOT have sequence data that is of similar quality to the majority of the file. If you solely want to make the pipeline work that may not be a problem (unless the reads are all "NNN...NNN"s and you're trying to align them), but if you want to see reads that are representative of the file as a whole then you should look in the thread that GenoMax linked too.

          Comment


          • #6
            Now I am stuck with this failure:

            Can anybody help me out if this? This is my Bpipe pipeline:
            REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
            PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
            PLATYPUS_HOME="~/bin/Platypus_0.1.9"
            STAMPY_HOME="~/bin/stampy-1.0.17"
            BWA_HOME="/home/trr/bwa-0.7.5a"

            seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads.txt.gz"

            //readgroup information:
            //rg_id="lane711s003155"
            rg_id="lane712s006433"
            rg_lb="nextera"
            rg_pl="ILLUMINA"
            rg_pu="flowcell-barcode.lane"
            rg_sm="GDNA"

            //##############################################################

            //Alignment

            //##############################################################
            //BWA
            @Transform("sai")
            align_bwa = {
            exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
            }

            @Transform("sam")
            sampe_bwa = {
            exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
            }

            //##############################################################

            //SAM to BAM

            //##############################################################

            //@Transform("bam")
            sort = {
            exec "samtools view -bSu $input | samtools sort - $output"
            }

            //##############################################################

            //Remove Duplicates

            //##############################################################

            @Filter("dedupe")
            dedupe = {
            exec """
            java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
            METRICS_FILE=out.metrics
            REMOVE_DUPLICATES=true
            ASSUME_SORTED=true
            VALIDATION_STRINGENCY=LENIENT
            INPUT=$input
            OUTPUT=$output
            """
            }

            //##############################################################

            //Run Pipeline

            //##############################################################

            Bpipe.run {
            "sample_test_10k_reads*" * [align_bwa] + sampe_bwa + sort + dedupe
            }
            and I run it with
            Code:
            bpipe run sample_test.pipe sample_test_10k_reads.txt.gz
            Last edited by moritz1; 07-12-2013, 08:40 AM.

            Comment


            • #7
              I may be wrong as I haven't used this, but it looks like your pipeline wants a $seq1 and a $seq2, but you've only supplied a $seq1. Is that possibly the problem?

              When I have issues with alignment commands, I try to do it with the bare necessities to make it run and then add in more complicated aspects of the command once I can make sure I'm typing in the basic command correctly.

              Comment


              • #8
                Oh yeah thanks a lot, I think that should be it. I'll edit later when I can try it out!

                Comment


                • #9
                  Now I am getting this error, although I have changed the input to two files:



                  Code:
                  REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
                  PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
                  PLATYPUS_HOME="~/bin/Platypus_0.1.9"
                  STAMPY_HOME="~/bin/stampy-1.0.17"
                  BWA_HOME="/home/trr/bwa-0.7.5a"
                  
                  seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads1.txt.gz"
                  seq2="/genedata/sample_test_10k_reads/sample_test_10k_reads2.txt.gz"
                   
                  //readgroup information:
                  rg_id="lane712s006433"
                  rg_lb="nextera"
                  rg_pl="ILLUMINA"
                  rg_pu="flowcell-barcode.lane"
                  rg_sm="GDNA"
                   
                  //#############################################################
                  //Alignment
                  //##############################################################
                  //BWA
                  @Transform("sai")
                  align_bwa = {
                        exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
                  }
                   
                  @Transform("sam")
                  sampe_bwa = {
                        exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
                  }
                   
                  //##############################################################
                  //SAM to BAM
                  //##############################################################
                   
                  //@Transform("bam")
                  sort = {
                          exec "samtools view -bSu $input  | samtools sort - $output"
                  }
                   
                  //##############################################################
                   
                  //Remove Duplicates
                   
                  //##############################################################
                   
                  @Filter("dedupe")
                  dedupe = {
                          exec """
                             java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                                             MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                                             METRICS_FILE=out.metrics
                                             REMOVE_DUPLICATES=true
                                             ASSUME_SORTED=true  
                                             VALIDATION_STRINGENCY=LENIENT
                                             INPUT=$input
                                             OUTPUT=$output
                         """
                  }
                   
                  //##############################################################
                  //Run Pipeline
                  //##############################################################
                   
                  Bpipe.run {
                      "sample_test_10k_reads%" * [align_bwa] + sampe_bwa + sort + dedupe
                  }
                  and I run it with:
                  Code:
                  bpipe run sample_test_pipeline.pipe sample_test_10k_reads*

                  Comment


                  • #10
                    Shouldn't that one line read:

                    Code:
                    exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE [B]$inputs1.sai $inputs2.sai[/B] $seq1 $seq2 > $output"

                    Comment


                    • #11
                      Yes, I got it working. Thanks a lot.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Non-Coding RNA Research and Technologies
                        by seqadmin


                        Non-coding RNAs (ncRNAs) do not code for proteins but play important roles in numerous cellular processes including gene silencing, developmental pathways, and more. There are numerous types including microRNA (miRNA), long ncRNA (lncRNA), circular RNA (circRNA), and more. In this article, we discuss innovative ncRNA research and explore recent technological advancements that improve the study of ncRNAs.

                        [Article Coming Soon!]...
                        Today, 08:07 AM
                      • seqadmin
                        Recent Developments in Metagenomics
                        by seqadmin





                        Metagenomics has improved the way researchers study microorganisms across diverse environments. Historically, studying microorganisms relied on culturing them in the lab, a method that limits the investigation of many species since most are unculturable1. Metagenomics overcomes these issues by allowing the study of microorganisms regardless of their ability to be cultured or the environments they inhabit. Over time, the field has evolved, especially with the advent...
                        09-23-2024, 06:35 AM
                      • seqadmin
                        Understanding Genetic Influence on Infectious Disease
                        by seqadmin




                        During the COVID-19 pandemic, scientists observed that while some individuals experienced severe illness when infected with SARS-CoV-2, others were barely affected. These disparities left researchers and clinicians wondering what causes the wide variations in response to viral infections and what role genetics plays.

                        Jean-Laurent Casanova, M.D., Ph.D., Professor at Rockefeller University, is a leading expert in this crossover between genetics and infectious...
                        09-09-2024, 10:59 AM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, 10-02-2024, 04:51 AM
                      0 responses
                      13 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 10-01-2024, 07:10 AM
                      0 responses
                      23 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 09-30-2024, 08:33 AM
                      1 response
                      29 views
                      0 likes
                      Last Post EmiTom
                      by EmiTom
                       
                      Started by seqadmin, 09-26-2024, 12:57 PM
                      0 responses
                      19 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X