CanDrA is a program that predicts the effect of missense mutations. The missense data from TCGA needs a little processing before it matches the input file format that CanDrA requires. Below is an example of processing TCGA data with Python's pandas.

Parsing a TCGA somatic mutation file to generate a CanDrA input file

This example uses the COAD (colon adenocarcinoma) data.
The file is in MAF (Mutation Annotation Format).

import pandas as pd
from pandas import Series, DataFrame
import numpy as np

print( "pandas: %s"%pd.__version__ )
print( "numpy: %s"%np.__version__ )

pandas: 0.17.0
numpy: 1.10.1

df = pd.read_table("hgsc.bcm.edu__Illumina_Genome_Analyzer_DNA_Sequencing_level2.maf")
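A small sketch of the same loading step on made-up data, in case the MAF file starts with comment lines or contains non-numeric chromosomes such as X and Y. The file content, the gene names and the variable names maf_text and maf are hypothetical. read_csv with sep="\t" behaves like read_table; comment="#" and the dtype for Chrom are assumptions about what a given file may need, not something this particular file is known to require.

```python
import io
import pandas as pd

# Hypothetical miniature MAF content; a real file has many more columns.
maf_text = (
    "#version 2.4\n"
    "Hugo_Symbol\tChrom\tStart_Position\tStrand\n"
    "GENE_A\t19\t58861779\t+\n"
    "GENE_B\tX\t12345\t+\n"
)

# sep="\t" reads tab-delimited text, comment="#" skips lines starting with '#',
# and dtype={"Chrom": str} keeps chromosome names such as "19" and "X" as strings.
maf = pd.read_csv(io.StringIO(maf_text), sep="\t", comment="#", dtype={"Chrom": str})

print(maf.shape)
print(maf["Chrom"].tolist())
```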

df.columns

Index([u'Hugo_Symbol', u'Entrez_Gene_Id', u'Center', u'Ncbi_Build', u'Chrom',
       u'Start_Position', u'End_Position', u'Strand',
       u'Variant_Classification', u'Variant_Type', u'Reference_Allele',
       u'Tumor_Seq_Allele1', u'Tumor_Seq_Allele2', u'Dbsnp_Rs',
       u'Dbsnp_Val_Status', u'Tumor_Sample_Barcode',
       u'Matched_Norm_Sample_Barcode', u'Match_Norm_Seq_Allele1',
       u'Match_Norm_Seq_Allele2', u'Tumor_Validation_Allele1',
       u'Tumor_Validation_Allele2', u'Match_Norm_Validation_Allele1',
       u'Match_Norm_Validation_Allele2', u'Verification_Status',
       u'Validation_Status', u'Mutation_Status', u'Sequencing_Phase',
       u'Sequence_Source', u'Validation_Method', u'Score', u'Bam_File',
       u'Sequencer', u'Tumor_Sample_UUID', u'Matched_Norm_Sample_UUID',
       u'File_Name', u'Archive_Name', u'Line_Number'],
      dtype='object')
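The column list includes Variant_Classification, which can be used to keep only missense records. The sketch below runs on a made-up frame (df_demo is hypothetical). It assumes the file labels such records as "Missense_Mutation", the value used in the MAF format. Whether the file used in this post is already restricted to missense mutations is not something the column list tells us.

```python
import pandas as pd

# Hypothetical rows standing in for a few MAF records.
df_demo = pd.DataFrame({
    "Variant_Classification": ["Missense_Mutation", "Silent",
                               "Missense_Mutation", "Frame_Shift_Del"],
    "Variant_Type": ["SNP", "SNP", "SNP", "DEL"],
    "Chrom": ["19", "19", "1", "2"],
})

# Boolean mask: True where the record is annotated as a missense mutation.
is_missense = df_demo["Variant_Classification"] == "Missense_Mutation"
missense = df_demo[is_missense]

print(missense)
```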

The CanDrA page describes the input format as follows:

A input file should be in a tab-delimited format. Columns of an input file are

chromosome number
genomic_coordinate
ref_allele
mutated_allele
strand.

More please refer to demo_input.txt in the package
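A quick way to spot rows that would not fit this five-column layout is sketched below. The helper find_invalid_rows and the demo frame are hypothetical. The rule that both alleles must be a single A/C/G/T base is my assumption for single-base missense substitutions; the CanDrA page quoted above does not state it.

```python
import pandas as pd

VALID_BASES = {"A", "C", "G", "T"}

def find_invalid_rows(frame):
    """Return rows whose ref_allele or mutated_allele is not a single A/C/G/T base."""
    ok = frame["ref_allele"].isin(VALID_BASES) & frame["mutated_allele"].isin(VALID_BASES)
    return frame[~ok]

# Made-up rows: the second has a '-' allele, the third a two-base allele.
demo = pd.DataFrame({
    "chromosome": ["19", "19", "1"],
    "genomic_coordinate": [58861779, 58862886, 12345],
    "ref_allele": ["G", "-", "C"],
    "mutated_allele": ["A", "T", "CT"],
    "strand": ["+", "+", "+"],
})

bad = find_invalid_rows(demo)
print(bad)
```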

# Select only the columns needed for the CanDrA input file format
sdf = df[ ['Chrom', 'Start_Position', 'Reference_Allele', 'Tumor_Seq_Allele2', 'Strand'] ]

sdf[:5]

# Rename the columns to shorter names
sdf.columns = ['chrom', 'pos', 'ref', 'mut', 'strand']
sdf[:5]

sdf.replace('-', np.nan, inplace=True)

C:\Users\dwlee\Anaconda\envs\py27\lib\site-packages\pandas\core\common.py:449: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
C:\Users\dwlee\Anaconda\envs\py27\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':

sdf.dropna(how='any', inplace=True)

C:\Users\dwlee\Anaconda\envs\py27\lib\site-packages\ipykernel\__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
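The SettingWithCopyWarning appears because sdf was created as a slice of df and is then modified in place. In MAF, a '-' allele marks an insertion or deletion, so replacing it with NaN and dropping those rows leaves only substitutions. A minimal sketch of the same two steps without in-place modification, on made-up data (df_demo and sdf_demo are hypothetical names), might look like this. It takes an explicit copy and reassigns the results instead of using inplace=True.

```python
import numpy as np
import pandas as pd

# Hypothetical rows; the second one has '-' as its reference allele.
df_demo = pd.DataFrame({
    "Chrom": ["19", "19", "1"],
    "Start_Position": [58861779, 58862886, 12345],
    "Reference_Allele": ["G", "-", "C"],
    "Tumor_Seq_Allele2": ["A", "T", "T"],
    "Strand": ["+", "+", "+"],
    "Center": ["x", "x", "x"],
})

cols = ["Chrom", "Start_Position", "Reference_Allele", "Tumor_Seq_Allele2", "Strand"]

# .copy() makes sdf_demo an independent DataFrame rather than a slice of df_demo.
sdf_demo = df_demo[cols].copy()
sdf_demo.columns = ["chrom", "pos", "ref", "mut", "strand"]

# Reassign the result instead of passing inplace=True.
sdf_demo = sdf_demo.replace("-", np.nan).dropna(how="any")

print(sdf_demo)
```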

# Confirm that no missing values remain in any column
sdf.isnull().any()

chrom     False
pos       False
ref       False
mut       False
strand    False
dtype: bool
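A note on checking for missing values: NaN never compares equal to anything, including itself, so an equality test against np.nan reports False even when values are missing. isnull() is the reliable check. The series s below is a made-up example.

```python
import numpy as np
import pandas as pd

# A made-up column with one missing value.
s = pd.Series(["G", np.nan, "C"])

# Equality with NaN is False for every element, so this cannot find the gap.
print((s == np.nan).any())

# isnull() marks the missing element, so this does find it.
print(s.isnull().any())
```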

sdf.to_csv("candra_input_tcga_coad.txt", sep='\t', index=False)
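To see what the written file looks like, the sketch below writes a made-up frame (demo is hypothetical) to an in-memory buffer with the same to_csv arguments and reads it back. Note that to_csv writes a header row by default. Whether CanDrA expects one is not stated in the quote above, so it is worth comparing against demo_input.txt; passing header=False would omit it.

```python
import io
import pandas as pd

# Made-up rows in the renamed five-column layout.
demo = pd.DataFrame({
    "chrom": ["19", "19"],
    "pos": [58861779, 58862886],
    "ref": ["G", "G"],
    "mut": ["A", "A"],
    "strand": ["+", "+"],
})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
demo.to_csv(buf, sep="\t", index=False)
text = buf.getvalue()
print(text)

# Read the text back to confirm it parses as five tab-delimited columns.
back = pd.read_csv(io.StringIO(text), sep="\t")
print(back.shape)
```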

Output of the first sdf[:5] (original column names):

	Chrom	Start_Position	Reference_Allele	Tumor_Seq_Allele2	Strand
0	19	58861779	G	A	+
1	19	58862886	G	A	+
2	19	58863691	C	T	+
3	19	58863691	C	T	+
4	19	58863717	C	T	+

Output of the second sdf[:5] (after renaming the columns):

	chrom	pos	ref	mut	strand
0	19	58861779	G	A	+
1	19	58862886	G	A	+
2	19	58863691	C	T	+
3	19	58863691	C	T	+
4	19	58863717	C	T	+
