Seqanswers Leaderboard Ad

Collapse

Announcement

Collapse
No announcement yet.
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

  • How do I shorten a genome sequence to secure my workflow is properly functioning?

    I am Moritz from the University Heidelberg in Germany.

    For my bachelor thesis I have 20 large (25-30 GB) genome files (.txt.gz) by patients with hepatocellular carcinoma. I have Bpipe installed on my Ubuntu server, which I have got to try out several approaches.

    Steps included are:

    Alignment (BWA (Transform sai and sam)) against hg19.fasta
    Transform (samtols)
    Dedupe
    The problem I have is that in order to try out my bpipe workflow, I have to take a whole sequence of 30 GB and start from the beginning. That takes a lot of time. So my questions are:

    How can I shorten one file?

    Where can I find a short sequence that I can use to test my pipeline?

  • #2
    You can use the suggestions in this thread to sample reads from your sequence files for testing small subsets: http://seqanswers.com/forums/showthread.php?t=16505

    Comment


    • #3
      If you want to have a little number of reads (suppose you have a .fastq file) in your "test sample", an easy way is :
      head -10000 your_big_file.fastq > sample_test_2500_reads.fastq
      One read = 4 rows in .fastq....so, in order to have (for example!!) 2500 reads in your "test sample", you have to set -10000 in head option.

      Changing this option, you can change the number of reads.

      This is a fisrt easy way.

      Comment


      • #4
        Like I wrote, I have several files which end on .txt.gz
        one example:
        D1n67_GDNA_12s00128-1-2_Hose_lane00128_1_2_sequence.txt.gz
        Is that an answer to the file format? (I'm totally new to Genetics)
        So can I take my file, gunzip it and then run your line?

        Comment


        • #5
          Yes, you can gunzip it and then do what mattia suggested.

          Keep in mind that the first lines of many fastq files will NOT have sequence data that is of similar quality to the majority of the file. If you solely want to make the pipeline work that may not be a problem (unless the reads are all "NNN...NNN"s and you're trying to align them), but if you want to see reads that are representative of the file as a whole then you should look in the thread that GenoMax linked too.

          Comment


          • #6
            Now I am stuck with this failure:

            Can anybody help me out if this? This is my Bpipe pipeline:
            REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
            PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
            PLATYPUS_HOME="~/bin/Platypus_0.1.9"
            STAMPY_HOME="~/bin/stampy-1.0.17"
            BWA_HOME="/home/trr/bwa-0.7.5a"

            seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads.txt.gz"

            //readgroup information:
            //rg_id="lane711s003155"
            rg_id="lane712s006433"
            rg_lb="nextera"
            rg_pl="ILLUMINA"
            rg_pu="flowcell-barcode.lane"
            rg_sm="GDNA"

            //##############################################################

            //Alignment

            //##############################################################
            //BWA
            @Transform("sai")
            align_bwa = {
            exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
            }

            @Transform("sam")
            sampe_bwa = {
            exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
            }

            //##############################################################

            //SAM to BAM

            //##############################################################

            //@Transform("bam")
            sort = {
            exec "samtools view -bSu $input | samtools sort - $output"
            }

            //##############################################################

            //Remove Duplicates

            //##############################################################

            @Filter("dedupe")
            dedupe = {
            exec """
            java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
            MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
            METRICS_FILE=out.metrics
            REMOVE_DUPLICATES=true
            ASSUME_SORTED=true
            VALIDATION_STRINGENCY=LENIENT
            INPUT=$input
            OUTPUT=$output
            """
            }

            //##############################################################

            //Run Pipeline

            //##############################################################

            Bpipe.run {
            "sample_test_10k_reads*" * [align_bwa] + sampe_bwa + sort + dedupe
            }
            and I run it with
            Code:
            bpipe run sample_test.pipe sample_test_10k_reads.txt.gz
            Last edited by moritz1; 07-12-2013, 08:40 AM.

            Comment


            • #7
              I may be wrong as I haven't used this, but it looks like your pipeline wants a $seq1 and a $seq2, but you've only supplied a $seq1. Is that possibly the problem?

              When I have issues with alignment commands, I try to do it with the bare necessities to make it run and then add in more complicated aspects of the command once I can make sure I'm typing in the basic command correctly.

              Comment


              • #8
                Oh yeah thanks a lot, I think that should be it. I'll edit later when I can try it out!

                Comment


                • #9
                  Now I am getting this error, although I have changed the input to two files:



                  Code:
                  REFERENCE="/genedata/human_genome_GRCh37/hg19.fa"
                  PICARD_HOME="/home/trr/picard-tools-1.93/picard-tools-1.93"
                  PLATYPUS_HOME="~/bin/Platypus_0.1.9"
                  STAMPY_HOME="~/bin/stampy-1.0.17"
                  BWA_HOME="/home/trr/bwa-0.7.5a"
                  
                  seq1="/genedata/sample_test_10k_reads/sample_test_10k_reads1.txt.gz"
                  seq2="/genedata/sample_test_10k_reads/sample_test_10k_reads2.txt.gz"
                   
                  //readgroup information:
                  rg_id="lane712s006433"
                  rg_lb="nextera"
                  rg_pl="ILLUMINA"
                  rg_pu="flowcell-barcode.lane"
                  rg_sm="GDNA"
                   
                  //#############################################################
                  //Alignment
                  //##############################################################
                  //BWA
                  @Transform("sai")
                  align_bwa = {
                        exec "$BWA_HOME/bwa aln -t 4 -q 10 $REFERENCE $input > $output"
                  }
                   
                  @Transform("sam")
                  sampe_bwa = {
                        exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE $inputs.sai $seq1 $seq2 > $output"
                  }
                   
                  //##############################################################
                  //SAM to BAM
                  //##############################################################
                   
                  //@Transform("bam")
                  sort = {
                          exec "samtools view -bSu $input  | samtools sort - $output"
                  }
                   
                  //##############################################################
                   
                  //Remove Duplicates
                   
                  //##############################################################
                   
                  @Filter("dedupe")
                  dedupe = {
                          exec """
                             java -Xmx1g -jar $PICARD_HOME/MarkDuplicates.jar
                                             MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=1000
                                             METRICS_FILE=out.metrics
                                             REMOVE_DUPLICATES=true
                                             ASSUME_SORTED=true  
                                             VALIDATION_STRINGENCY=LENIENT
                                             INPUT=$input
                                             OUTPUT=$output
                         """
                  }
                   
                  //##############################################################
                  //Run Pipeline
                  //##############################################################
                   
                  Bpipe.run {
                      "sample_test_10k_reads%" * [align_bwa] + sampe_bwa + sort + dedupe
                  }
                  and I run it with:
                  Code:
                  bpipe run sample_test_pipeline.pipe sample_test_10k_reads*

                  Comment


                  • #10
                    Shouldn't that one line read:

                    Code:
                    exec "$BWA_HOME/bwa sampe -P -r '@RG\tID:$rg_id\tLB:$rg_lb\tPL:$rg_pl\tPU:$rg_pu\tSM:$rg_sm' $REFERENCE [B]$inputs1.sai $inputs2.sai[/B] $seq1 $seq2 > $output"

                    Comment


                    • #11
                      Yes, I got it working. Thanks a lot.

                      Comment

                      Latest Articles

                      Collapse

                      • seqadmin
                        Essential Discoveries and Tools in Epitranscriptomics
                        by seqadmin




                        The field of epigenetics has traditionally concentrated more on DNA and how changes like methylation and phosphorylation of histones impact gene expression and regulation. However, our increased understanding of RNA modifications and their importance in cellular processes has led to a rise in epitranscriptomics research. “Epitranscriptomics brings together the concepts of epigenetics and gene expression,” explained Adrien Leger, PhD, Principal Research Scientist...
                        04-22-2024, 07:01 AM
                      • seqadmin
                        Current Approaches to Protein Sequencing
                        by seqadmin


                        Proteins are often described as the workhorses of the cell, and identifying their sequences is key to understanding their role in biological processes and disease. Currently, the most common technique used to determine protein sequences is mass spectrometry. While still a valuable tool, mass spectrometry faces several limitations and requires a highly experienced scientist familiar with the equipment to operate it. Additionally, other proteomic methods, like affinity assays, are constrained...
                        04-04-2024, 04:25 PM

                      ad_right_rmr

                      Collapse

                      News

                      Collapse

                      Topics Statistics Last Post
                      Started by seqadmin, Yesterday, 08:47 AM
                      0 responses
                      16 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-11-2024, 12:08 PM
                      0 responses
                      60 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 10:19 PM
                      0 responses
                      60 views
                      0 likes
                      Last Post seqadmin  
                      Started by seqadmin, 04-10-2024, 09:21 AM
                      0 responses
                      54 views
                      0 likes
                      Last Post seqadmin  
                      Working...
                      X