  • Split SFF file by Adaptors

    Hi All,

    I have been assigned the task of splitting an SFF file into a number of adapter-specific SFF files.
    If I read the SFF file into R as an SFFContainer, the reads slot looks like this:

    ----------------------------------------------------------------
    A QualityScaledDNAStringSet instance containing:

    A DNAStringSet instance of length 377894
    width seq names
    [1] 211 TCAGAAGAGGATTCGATCTCG...GCCAAGCACACAGGGATAGG G2FU2:4:10
    [2] 80 TCAGAAGAGGATTCGATTATA...TTCTCTCTCACAAGTTACAC G2FU2:4:47
    [3] 46 TCAGAAGAGGATTCGTCTGCT...GTTGTCTTCTCTAAAATGCT G2FU2:4:49
    [4] 180 TCAGTAAGGAGAACGATAGGC...GCCAAGGCAGACAGGGATAG G2FU2:5:15
    [5] 133 TCAGCTAAGGTAACGATCTGA...TGTGTACATATCATGAGAGT G2FU2:5:16
    [6] 65 TCAGCTAAGGTAACGATATTT...GTCATTCAAATGTCAAGTGA G2FU2:5:48
    [7] 72 TCAGCTAAGGTAACGATGATC...TTAAGAAGTAAAATATAATA G2FU2:7:47
    [8] 36 TCAGTAAGGAGAACGATTAGGTAACTTAATAAAAAT G2FU2:8:47
    [9] 50 TCAGCAAGGTAACGTTGATAT...ACTGAGATACTTATCTTATT G2FU2:8:49
    ... ... ...
    [377886] 296 TCAGTAAGGAGAACGATCTTT...GCACAGACGGGAAGGTAGAG G2FU2:1146:1271
    [377887] 292 TCAGTAAGGAGAACGATGACT...CAGCAGCACAGAGGCGAGAG G2FU2:1146:1272
    [377888] 191 TCAGTAAGGAGAACGATACTC...CAAGGCACACAGGGGATAGG G2FU2:1147:1252
    [377889] 287 TCAGCAGAAGGAACGATGATC...AGAGCGAGCAAGCAGACAGG G2FU2:1147:1254
    [377890] 292 TCAGTAAGGAGAACGATATCG...CTACTCGAGGAGACAGGTAG G2FU2:1147:1258
    [377891] 281 TCAGCAGAAGGAACGATCGTC...GCGAAGGCAGCACAGGAGTA G2FU2:1147:1262
    [377892] 274 TCAGCTAAGGTAACGATCAAA...CCGATGCCCATAGAGTGCAG G2FU2:1147:1269
    [377893] 283 TCAGCTAAGGTAACGATGACT...CAAGGCACACAGGGAGTAGG G2FU2:1147:1271
    [377894] 301 TCAGCTAAGGTAACGATATTC...AGACACGGAGGTAGAGTGTA G2FU2:1147:1274

    A PhredQuality instance of length 377894
    width seq names
    [1] 211 AAAAA:>;>382(16549@00...+4.4&+++*11,,0%**33. G2FU2:4:10
    [2] 80 @7==B=@@@>?>37<7714:8...-(*-***-**--(*-(*--/ G2FU2:4:47
    [3] 46 A3225.000/13-21/00---...**&-**-&--***--&**-1 G2FU2:4:49
    [4] 180 BBCCCC>B>BCC>BBBBBC>C....-&++/+0...1235,33// G2FU2:5:15
    [5] 133 >3300,0+1(--(01110000...*********-*1**-**-*- G2FU2:5:16
    [6] 65 >59::585:28<2;9456:<....-*-**(*--%***-.(*-*2 G2FU2:5:48
    [7] 72 ;222313/3-00(01/0*--*...*%-(*-(***--%--*-(-- G2FU2:7:47
    [8] 36 @7==>A:>9>>>7<757.21,0/-//%-(+/224)2 G2FU2:8:47
    [9] 50 ;000-*&-&--(,-*&**--0...0***-----*-&-*-*&**. G2FU2:8:49
    ... ... ...
    [377886] 296 B@@>::/929552<@::188)...+1..,,,**-%++/+***** G2FU2:1146:1271
    [377887] 292 EEEEDD?C?CDD>CCCCCCCC...***,/1****,,.&****** G2FU2:1146:1272
    [377888] 191 DDDCBC>C>[email protected];2:28?A::;BEE.?;84, G2FU2:1147:1252
    [377889] 287 @668@CCC9C?C?DDDECCCE...0//,0****++,,*****1- G2FU2:1147:1254
    [377890] 292 DDDEDD@D>[email protected],,,012,1,,,,4&+++ G2FU2:1147:1258
    [377891] 281 @?AAA@@@:A:><>@@>>>;:...3++.&++/41++++3.1/++ G2FU2:1147:1262
    [377892] 274 CCCCCCC?C:>>7A;=@<<;,...8&++7758,+**+++****1 G2FU2:1147:1269
    [377893] 283 AAAAA>A9A57<14;66.24=...-,5;46,,,+,8;/+++34, G2FU2:1147:1271
    [377894] 301 @@@>=9<6<*+02657631+2...*+*++11,+*%*.******1 G2FU2:1147:1274
    ------------------------------------------------------------------

    The adapters look like this (96 adapters in total):
    AdaptName AdaptSeq
    1 IonXpress_001 CTAAGGTAAC
    2 IonXpress_002 TAAGGAGAAC
    3 IonXpress_003 AAGAGGATTC
    4 IonXpress_004 TACCAAGATC
    5 IonXpress_005 CAGAAGGAAC
    6 IonXpress_006 CTGCAAGTTC
    7 IonXpress_007 TTCGTGATTC
    8 IonXpress_008 TTCCGATAAC
    9 IonXpress_009 TGAGCGGAAC
    10 IonXpress_010 CTGACCGAAC
    ............................

    Could anyone please tell me what "adapter specific SFF files" means?
    In order to classify each read by adapter, should I align all the adapters against each sequence, something similar to the following?


    TCAGTACTGAGCTACAGTACACGATGCGTCCAGGAACCATCGGATGGCAATCG - sequence
    TCGTATGCCG (scan all positions until the end) - (m=2, i=1, d=1)
    TCGTATGCC - (m=2, i=1, d=0)
    TCGTATGC - (m=2, i=1, d=0)
    TCGTATG - (m=1, i=1, d=0)
    TCGTAT - (m=1, i=1, d=0)
    TCGTA - (m=1, i=1, d=0)
    TCGT - (m=1, i=0, d=0)
    TCG - (m=1, i=0, d=0) -> match!

    Or should I find the specific adapter for each read using the functions on page 15?


    Should I trim down each sequence from clipAdapterLeft position to clipAdapterRight position before any alignment or any other work?

    Thank you very much in advance.

    Best,

    Heidi

  • #2
    Heidi,

    Looking at your read data, you can see that all of the reads start with the four bases "TCAG", which is known as the 'keytag'; this tells the software that the read came from a library fragment (as opposed to a control fragment; control fragments are not in your data set). Following the keytag is the Multiplex ID (MID) sequence, which corresponds to one of your 96 adapter sequences. "Adapter specific SFF files" means parsing the reads in your input file, identifying their MID tag and sorting them into new output SFF files according to their MID.
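The layout described above (keytag, then MID, then the insert) can be sketched in a few lines. This is a minimal, language-neutral illustration in Python, not the Roche or Bioconductor implementation; the `classify` function is a hypothetical helper, the barcodes are the first five from the list in the first post, and the reads are prefixes copied from the same post. It assumes the MID starts immediately after the keytag and must match exactly.

```python
KEY = "TCAG"  # keytag shared by all library reads

# First five of the 96 barcodes, as listed in the first post.
BARCODES = {
    "IonXpress_001": "CTAAGGTAAC",
    "IonXpress_002": "TAAGGAGAAC",
    "IonXpress_003": "AAGAGGATTC",
    "IonXpress_004": "TACCAAGATC",
    "IonXpress_005": "CAGAAGGAAC",
}


def classify(read, key=KEY, barcodes=BARCODES):
    """Return the name of the barcode that directly follows the key, or None."""
    if not read.startswith(key):
        return None
    rest = read[len(key):]
    for name, seq in barcodes.items():
        if rest.startswith(seq):
            return name
    return None


# Read prefixes copied from the listing in the first post.
reads = {
    "G2FU2:4:10": "TCAGAAGAGGATTCGATCTCG",
    "G2FU2:5:15": "TCAGTAAGGAGAACGATAGGC",
    "G2FU2:5:16": "TCAGCTAAGGTAACGATCTGA",
    "G2FU2:8:49": "TCAGCAAGGTAACGTTGATAT",
}

assignments = {name: classify(seq) for name, seq in reads.items()}
for name, barcode in assignments.items():
    print(name, barcode)
```

Because the MID sits at a fixed offset, there is no need to scan every position of the read; only the ten bases after the keytag are compared. Read G2FU2:8:49 does not match any of these five barcodes exactly, so it comes back as `None` here.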

    There are a handful of tools available to split SFF files by barcode, but I recommend getting the Roche/454 software. It is available for free, but you have to submit a request through their website. Specifically, the program you want to use is called 'sfffile'. This tool can (among other things) read an SFF file and a MID configuration file and output a set of MID (adapter) specific SFF files. Judging by the names of your MID tags (IonXpress_nnn), it would appear that this data was generated on a Life Technologies Ion Torrent instrument, not a Roche/454, but no matter: the SFF format is the same and Roche's software should be able to split the reads. However, since the MID tag set is not the default Roche/454 tag set, you will need to create a custom MIDConfig.parse file for use with the sfffile program. There are instructions for doing this in the documentation which accompanies the software, and you can use the default MIDConfig.parse file as a template.

    Good luck.

    • #3
      Thanks, kmcarr, for your explanation. I have a better understanding of SFF files now.
      I am more familiar with R and Bioconductor, so I am trying a Bioconductor package.
      I was able to read the big SFF file in as one big SFFContainer with the function readSFF from the Bioconductor package R453Plus1Toolbox. I also found the indexes needed to split the big SFFContainer. For example, the indexes for the first small SFFContainer are (5,11,17,22,25,29,31,33,37).
      I couldn't figure out how to extract reads (5,11,17,22,25,29,31,33,37) from the big SFFContainer, construct a small SFFContainer from them and save it as an SFF file. Could anyone please help me with this part of the work?
      Thank you very much in advance.

      Heidi
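The subsetting step asked about above comes down to applying the same index list to every per-read field, so that names, sequences and qualities stay aligned. The following is a minimal sketch of that idea in Python, not the R453Plus1Toolbox API: `Reads` is a hypothetical stand-in for a read container, and the names, sequence prefixes and quality prefixes are copied from the first five reads in the first post.

```python
from dataclasses import dataclass


@dataclass
class Reads:
    """Hypothetical stand-in for a read container: three parallel lists."""
    names: list
    seqs: list
    quals: list

    def subset(self, one_based_indexes):
        """Return a new Reads holding only the reads at the given 1-based positions."""
        idx = [i - 1 for i in one_based_indexes]  # R counts from 1, Python from 0
        return Reads(
            names=[self.names[i] for i in idx],
            seqs=[self.seqs[i] for i in idx],
            quals=[self.quals[i] for i in idx],
        )


# Prefixes of the first five reads shown in the first post.
all_reads = Reads(
    names=["G2FU2:4:10", "G2FU2:4:47", "G2FU2:4:49", "G2FU2:5:15", "G2FU2:5:16"],
    seqs=["TCAGAAGAGGATTCG", "TCAGAAGAGGATTCG", "TCAGAAGAGGATTCG",
          "TCAGTAAGGAGAACG", "TCAGCTAAGGTAACG"],
    quals=["AAAAA:>;>382(16", "@7==B=@@@>?>37<", "A3225.000/13-21",
           "BBCCCC>B>BCC>BB", ">3300,0+1(--(01"],
)

picked = all_reads.subset([1, 4, 5])
print(picked.names)
```

In a real SFF file each read also carries flowgram values, flow indexes and clipping points, which would need to be subset with the same index list before writing the result out.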

      • #4
        Dear kmcarr,

        I think the reason I didn't pay much attention to your post last time is that it seemed I wouldn't need to write any programs. The project I am working on is a kind of exam to test my programming skills.
        I used R to identify the reads which can be assigned to a specific adapter, so for each adapter I have a list of read names. I have 377894 reads in total, but for most adapters there are only tens of reads, sometimes even fewer.
        Is this the usual case, or do you think I have probably made a mistake?

        Thank you very much.

        Heidi
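One way to sanity-check counts like these is to tally assignments per barcode, including an explicit "unassigned" bucket, and see where the reads go. The sketch below is a hedged illustration in Python, not a definitive method: `hamming` and `assign` are hypothetical helpers, the one-mismatch tolerance is an assumption, only the first five barcodes from the first post are used, and the reads are prefixes of the first nine reads printed there. Nine reads say nothing definitive about the full file; the point is the shape of the check.

```python
from collections import Counter

KEY = "TCAG"
BARCODES = {
    "IonXpress_001": "CTAAGGTAAC",
    "IonXpress_002": "TAAGGAGAAC",
    "IonXpress_003": "AAGAGGATTC",
    "IonXpress_004": "TACCAAGATC",
    "IonXpress_005": "CAGAAGGAAC",
}


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def assign(read, max_mismatches=1):
    """Name of the unique closest barcode right after the key, else 'unassigned'."""
    if not read.startswith(KEY):
        return "unassigned"
    scored = []
    for name, seq in BARCODES.items():
        window = read[len(KEY):len(KEY) + len(seq)]
        if len(window) == len(seq):
            scored.append((hamming(window, seq), name))
    scored.sort()
    if not scored or scored[0][0] > max_mismatches:
        return "unassigned"
    if len(scored) > 1 and scored[1][0] == scored[0][0]:
        return "unassigned"  # two barcodes equally close: ambiguous
    return scored[0][1]


# Prefixes of the first nine reads printed in the first post.
READS = [
    "TCAGAAGAGGATTCGATCTCG",
    "TCAGAAGAGGATTCGATTATA",
    "TCAGAAGAGGATTCGTCTGCT",
    "TCAGTAAGGAGAACGATAGGC",
    "TCAGCTAAGGTAACGATCTGA",
    "TCAGCTAAGGTAACGATATTT",
    "TCAGCTAAGGTAACGATGATC",
    "TCAGTAAGGAGAACGATTAGG",
    "TCAGCAAGGTAACGTTGATAT",
]

counts = Counter(assign(r) for r in READS)
for name, n in sorted(counts.items()):
    print(name, n)
```

If the unassigned bucket dominates on the full file, it may be worth checking that the barcode is being compared at the position right after the keytag rather than searched for elsewhere in the read. Note that a Hamming comparison only tolerates substitutions; a read with an insertion or deletion inside the barcode would need an edit-distance comparison instead.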
