Seqanswers Leaderboard Ad

**logicthief** · 02-07-2014, 04:55 PM

ps I think the adapter has been removed, otherwise there will be much more soft clipped reads. Also the soft-clipping length shouldn't be that long if it is caused by adapter.

**GenoMax** · 02-08-2014, 04:11 AM

Have you looked at the quality score distribution of the reads? Was that normal?

**logicthief** · 02-08-2014, 12:28 PM

Originally posted by GenoMax View Post

Have you looked at the quality score distribution of the reads? Was that normal?

Overall distribution of quality score looks okay to me

The distribution of quality for the reads with >200 soft clipping does look a little weird, though.

**GenoMax** · 02-08-2014, 04:47 PM

Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.

**logicthief** · 02-09-2014, 01:21 PM

Originally posted by GenoMax View Post

Did this run have a phiX spike? What was the % of the spike, if it was there? Was this a V2 or V3 chemistry run and what was the cluster concentration?

It is possible that because of the small target region you have a large number of reads with similar (same) sequence (what was the average fold coverage?). If this was the case (and if the flowcell was loaded with high concentration of library) then the read qualities likely suffered because of overall reduced complexity of sequence across the clusters.

Have you looked at the soft clipped reads ignoring the quality values to see if they align well across the length? They actually may.

Thanks for your reply.

1. I'm checking the sequencing chemistry, we used phiX spike and V2 chemistry.

2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.

**GenoMax** · 02-09-2014, 03:43 PM

Originally posted by logicthief View Post

2. low complexity is an issue here, the average coverage is about 1200x, and we put multiple samples onto one lane. For another data set, we put even more (48) samples onto one MiSeq lane, and observed higher percentage of reads with >200 soft clipped bases. Do you think it's major reason?

If the multiple samples are from the same region, that is not going to help with the low nucleotide diversity problem. You may want to consider deliberately under-loading the library and/or adding a phiX spike, if you happen to re-run. Are these amplicons or pull-down libraries?

3. I converted FASTQ to FASTA and did mapping again, this time, more reads (149,574) have >200 soft clipped bases.

Can you blat the soft clipped reads against the reference to check that there is nothing strange going on (chimeras)?

**logicthief** · 02-11-2014, 10:18 PM

Hi GenoMax,

Thanks for your reply.

Our library is prepared from amplicon.

I ran blat with default settings on the reads with more than 200 soft clipped reads. The first 18 columns of output are summarized by R. It seems the mapping rate is relatively low in these reads, any clue?

1.matches - Number of matching bases that aren't repeats.
2.misMatches - Number of bases that don't match.
3.repMatches - Number of matching bases that are part of repeats.
4.nCount - Number of 'N' bases.
5.qNumInsert - Number of inserts in query.
6.qBaseInsert - Number of bases inserted into query.
7.tNumInsert - Number of inserts in target.
8.tBaseInsert - Number of bases inserted into target.
9.strand - defined as + (forward) or - (reverse) for query strand. In mouse, a second '+' or '-' indecates genomic strand.
10.qName - Query sequence name.
11.qSize - Query sequence size.
12.qStart - Alignment start position in query.
13.qEnd - Alignment end position in query.
14.tName - Target sequence name.
15.tSize - Target sequence size.
16.tStart - Alignment start position in query.
17.tEnd - Alignment end position in query.
18.blockCount - Number of blocks in the alignment.

HTML Code:

       V1               V2               V3          V4
 Min.   : 30.00   Min.   : 0.000   Min.   :0   Min.   :  0.00000
 1st Qu.: 54.00   1st Qu.: 2.000   1st Qu.:0   1st Qu.:  0.00000
 Median : 86.00   Median : 5.000   Median :0   Median :  0.00000
 Mean   : 95.91   Mean   : 6.035   Mean   :0   Mean   :  0.01775
 3rd Qu.:126.00   3rd Qu.: 9.000   3rd Qu.:0   3rd Qu.:  0.00000
 Max.   :272.00   Max.   :25.000   Max.   :0   Max.   :186.00000

       V5               V6               V7               V8
 Min.   :0.0000   Min.   :-40.00   Min.   : 0.000   Min.   :      0
 1st Qu.:0.0000   1st Qu.:  0.00   1st Qu.: 0.000   1st Qu.:      0
 Median :1.0000   Median :  1.00   Median : 1.000   Median :      2
 Mean   :0.7156   Mean   : 12.89   Mean   : 1.015   Mean   :   4387
 3rd Qu.:1.0000   3rd Qu.: 12.00   3rd Qu.: 2.000   3rd Qu.:     37
 Max.   :9.0000   Max.   :211.00   Max.   :18.000   Max.   :1292032

 V9                                                   V10
 -:888733   M01558:18:000000000-A4GLM:1:1104:22686:11369:   2178
 +:889831   M01558:18:000000000-A4GLM:1:1103:13877:23360:   1731
            M01558:18:000000000-A4GLM:1:2106:9363:8478  :   1722
            M01558:18:000000000-A4GLM:1:2110:21183:12995:   1689
            M01558:18:000000000-A4GLM:1:2105:27871:14537:   1600
            M01558:18:000000000-A4GLM:1:1104:18639:7680 :   1530
            (Other)                                     :1768114
      V11             V12              V13             V14
 Min.   :231.0   Min.   :  0.00   Min.   : 30.0   1      : 121115
 1st Qu.:252.0   1st Qu.:  5.00   1st Qu.:121.0   2      : 105838
 Median :253.0   Median : 38.00   Median :189.0   17     : 103302
 Mean   :252.4   Mean   : 61.82   Mean   :176.7   7      :  96979
 3rd Qu.:253.0   3rd Qu.:110.00   3rd Qu.:244.0   3      :  93220
 Max.   :253.0   Max.   :223.00   Max.   :253.0   12     :  86681
                                                  (Other):1171429
      V15                 V16                 V17                 V18
 Min.   :     4262   Min.   :        2   Min.   :      122   Min.   : 1.000
 1st Qu.: 90354753   1st Qu.: 30317200   1st Qu.: 30319283   1st Qu.: 1.000
 Median :141213431   Median : 56869278   Median : 56875718   Median : 2.000
 Mean   :142625131   Mean   : 70776742   Mean   : 70781231   Mean   : 2.191
 3rd Qu.:180915260   3rd Qu.:102962065   3rd Qu.:102965836   3rd Qu.: 3.000
 Max.   :249250621   Max.   :249234810   Max.   :249234945   Max.   :21.000

**Dario1984** · 09-10-2014, 10:00 PM

I have found this problem. I ran BLAT on the segments of reads which are soft-clipped and they map perfectly or almost perfectly to another chromosome.

An example is

TAGTGGAAATTCTTGGCCGCTTCTGAGCTGCTGATGGCAGTGTCTTGAGGGAGAGGCCTGGGAGAGGTTTCGTTATAACTGTGCAGAAAGAGTGCACAGTG

which has 57 matches followed by 44 soft-clipped bases. The 57 matches were mapped to chromosome 1, whereas the 44 other bases map perfectly to chromosome 2.

It's not a problem with adapters, but it seems BWA-MEM misses some easy alignments. The few other reads in the same region do not have that mapping characteristic. They map to that region from start to end.

**lh3** · 09-11-2014, 05:19 PM

For this particular read, the bwa-mem (0.7.10) alignment is:

1 0 1 25189065 60 57M44S
1 2048 2 169676055 60 51H50M

This is a chimeric alignment.

Topics	Statistics	Last Post
Ancient Viral Sequences in Human Brain Linked to Psychiatric Disorders by seqadmin Started by seqadmin, Today, 07:35 AM	0 responses 5 views 0 likes	Last Post by seqadmin Today, 07:35 AM
New Milestone for COSMIC with Extensive Cancer Mutation Data by seqadmin Started by seqadmin, Yesterday, 02:06 PM	0 responses 8 views 0 likes	Last Post by seqadmin Yesterday, 02:06 PM
The Role of Spliceosomes in RNA Splicing and Genome Evolution by seqadmin Started by seqadmin, 05-14-2024, 07:03 AM	0 responses 27 views 0 likes	Last Post by seqadmin 05-14-2024, 07:03 AM
A Closer Look at the Enigmatic Genomes of Oikopleura dioica by seqadmin Started by seqadmin, 05-10-2024, 06:35 AM	0 responses 47 views 0 likes	Last Post by seqadmin 05-10-2024, 06:35 AM

Seqanswers Leaderboard Ad

Announcement

High Percentage of Soft Clipping, BWA-MEM, MiSeq

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Comment

Latest Articles

ad_right_rmr

News