Hello

I am having a very strange bug in a code I am writing and when I tried to go step by step and look where the problem comes from I found this :

I have paired end reads, I do the alignment with bowtie2, no limit for the read sizes, I create my sam files, my bam and use samtools to sort and index, until here I have no problem

Then I wanted to look at the insert size in a quick way so I did

samtools view bamfile | cut -f 9 | sort -n | unique -c

and I have this distribution (this is just the header, it contains also high positive values

1 -1015 1 -1014 1 -1002 7 -1000 17 -999 9 -998 14 -997 21 -996 18 -995 13 -994

The bam file looks also Ok

M01783:49:000000000-A8MLU:1:1101:15827:12859 81 1 10003 1 76M X 155260048 0 ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:23A52 YT:Z:UP M01783:49:000000000-A8MLU:1:2104:27042:8289 81 1 10003 1 76M X 155260048 0 ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:2C73 YT:Z:UP M01783:49:000000000-A8MLU:1:2110:13157:18810 81 1 10003 1 76M X 155260048 0 ACCCGAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC FC@7,C8GGGGGFFCC,,,<CDD@6<8FFE@CDFFFAGGFFF@C<6,FGGGECGGGGFFFCGGGCGGGGGF<ACC@ AS:i:-3 XS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:4T71 YT:Z:UP M01783:49:000000000-A8MLU:1:1102:5613:15723 133 1 10004 0 * = 1000 0 GGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGGTTAGGG CCCCCGGGGCFEGGGGGGGGGECFFGGEE

I was expecting to have the same data using pysam, so adapted this to my code

def isPaired(bamfile, alignments=1000): '''check if a *bamfile* contains paired end data The method reads at most the first *alignments* and returns True if any of the alignments are paired. ''' samfile = pysam.Samfile(bamfile) n = 0 for read in samfile: if read.is_paired: break n += 1 if n == alignments: break samfile.close() return n != alignments def estimateInsertSizeDistribution(bamfile, alignments=1000): ''' estimate insert size from first alignments in bam file. returns mean and stddev of insert sizes. ''' assert isPaired(bamfile), \ 'can only estimate insert size from' \ 'paired bam files' samfile = pysam.Samfile(bamfile) # only get positive to avoid double counting inserts = np.array( [read.tlen for read in samfile.head(alignments) if read.is_proper_pair and read.tlen > 0]) return inserts

This is supposed to return these positive values that I saw with samtools command but I got only 0 (zeros) as insert size, even if I remove the condition for reads to be in proper pair, I still have zeros as insert size (9th column)

I also took a look at the reads that pysam generated, and I was surprised to get zeros at the 9th column everywhere.

Do you think it is a pysam bug ? or just something I am missing ?

and here you can compare the output from pysam where you can see clearly that the 9th column is 0 instead of 1 for the first read

>>> for read in samfile.head(2): ... print read ... M01783:49:000000000-A8MLU:1:1101:15827:12859 81 0 10002 1 [(0, 76)] 22 155260047 76 ACCCTAACCCTAACCCTAACCCTGACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC :,,<<,,F@E66,FEC,<CD8EC,@CCF?FE<AECC6CFE8E,GGEGEFGGGFECGGGGGEGGGGGFGGGFCCCCC [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '23A52'), ('YT', 'UP')] M01783:49:000000000-A8MLU:1:2104:27042:8289 81 0 10002 1 [(0, 76)] 22 155260047 76 ACGCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCCTAACCC 68,67CE8,6,FE,,6<EC<F@FFEGD<8@CGGFGFC,8FD;,@CFFEFFCF<CFGFE<@C<DC<FC,<GGCCCCC [('AS', -3), ('XS', -3), ('XN', 0), ('XM', 1), ('XO', 0), ('XG', 0), ('NM', 1), ('MD', '2C73'), ('YT', 'UP')]

You're right !!

Thanks a lot, that worked just fine

I appreciate