
CanDrA is a program that predicts the effect of missense mutations. Getting the missense records in TCGA data into the input file format that CanDrA requires takes a little processing. Below is an example of processing TCGA data with Python's pandas.

Parsing a TCGA somatic mutation file to generate a CanDrA input file


In [1]:
import pandas as pd
from pandas import Series, DataFrame
import numpy as np
In [2]:
print( "pandas: %s"%pd.__version__ )
print( "numpy: %s"%np.__version__ )
pandas: 0.17.0
numpy: 1.10.1
In [3]:
df = pd.read_table("hgsc.bcm.edu__Illumina_Genome_Analyzer_DNA_Sequencing_level2.maf")
In [4]:
df.columns
Out[4]:
Index([u'Hugo_Symbol', u'Entrez_Gene_Id', u'Center', u'Ncbi_Build', u'Chrom',
       u'Start_Position', u'End_Position', u'Strand',
       u'Variant_Classification', u'Variant_Type', u'Reference_Allele',
       u'Tumor_Seq_Allele1', u'Tumor_Seq_Allele2', u'Dbsnp_Rs',
       u'Dbsnp_Val_Status', u'Tumor_Sample_Barcode',
       u'Matched_Norm_Sample_Barcode', u'Match_Norm_Seq_Allele1',
       u'Match_Norm_Seq_Allele2', u'Tumor_Validation_Allele1',
       u'Tumor_Validation_Allele2', u'Match_Norm_Validation_Allele1',
       u'Match_Norm_Validation_Allele2', u'Verification_Status',
       u'Validation_Status', u'Mutation_Status', u'Sequencing_Phase',
       u'Sequence_Source', u'Validation_Method', u'Score', u'Bam_File',
       u'Sequencer', u'Tumor_Sample_UUID', u'Matched_Norm_Sample_UUID',
       u'File_Name', u'Archive_Name', u'Line_Number'],
      dtype='object')
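
The column list above includes Variant_Classification and Variant_Type, which could be used to narrow the table to missense changes before building the CanDrA input. The sketch below is an assumption about a useful pre-filter, not a step taken from this notebook: the function name `select_missense_snvs` and the toy rows are hypothetical, while 'Missense_Mutation' and 'SNP' are standard MAF values.

```python
import pandas as pd

# Toy MAF-like frame; the rows are made up for illustration only.
maf = pd.DataFrame({
    'Hugo_Symbol': ['GENE1', 'GENE2', 'GENE3'],
    'Variant_Classification': ['Missense_Mutation', 'Silent', 'Missense_Mutation'],
    'Variant_Type': ['SNP', 'SNP', 'DNP'],
})


def select_missense_snvs(frame):
    """Keep rows that are single-nucleotide missense mutations (hypothetical helper)."""
    is_missense = frame['Variant_Classification'] == 'Missense_Mutation'
    is_snv = frame['Variant_Type'] == 'SNP'
    # .copy() makes the result independent of the input frame
    return frame[is_missense & is_snv].copy()


missense = select_missense_snvs(maf)
print(missense)
```

Whether CanDrA itself skips non-missense rows is not checked here, so this filter is optional.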

The CanDrA page describes the input format as follows:

A input file should be in a tab-delimited format. Columns of an input file are

  1. chromosome number
  2. genomic_coordinate
  3. ref_allele
  4. mutated_allele
  5. strand.

More please refer to demo_input.txt in the package
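
To make the five-column layout concrete, here is a minimal sketch of a line checker for a tab-delimited row. The function name `is_valid_candra_row` is hypothetical, and the rules that alleles are single bases and that strand is '+' or '-' are assumptions that have not been checked against demo_input.txt.

```python
def is_valid_candra_row(line):
    """Return True if a tab-delimited line has the five expected fields."""
    fields = line.rstrip('\n').split('\t')
    if len(fields) != 5:
        return False
    chrom, pos, ref, mut, strand = fields
    return (
        chrom != ''
        and pos.isdigit()                     # genomic coordinate
        and ref in ('A', 'C', 'G', 'T')       # assumption: single-base alleles only
        and mut in ('A', 'C', 'G', 'T')
        and strand in ('+', '-')              # assumption: strand is '+' or '-'
    )


ok = is_valid_candra_row('19\t58861779\tG\tA\t+')   # same values as a row shown in this notebook
bad = is_valid_candra_row('19\t58861779\t-\tA\t+')  # '-' is not a single base
print(ok)   # True
print(bad)  # False
```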


In [5]:
# Select only the columns needed for the CanDrA input file format
sdf = df[ ['Chrom', 'Start_Position', 'Reference_Allele', 'Tumor_Seq_Allele2', 'Strand'] ]
In [6]:
sdf[:5]
Out[6]:
   Chrom  Start_Position  Reference_Allele  Tumor_Seq_Allele2  Strand
0     19        58861779                 G                  A       +
1     19        58862886                 G                  A       +
2     19        58863691                 C                  T       +
3     19        58863691                 C                  T       +
4     19        58863717                 C                  T       +
In [7]:
# Rename the columns to shorter names
sdf.columns = ['chrom', 'pos', 'ref', 'mut', 'strand']
sdf[:5]
Out[7]:
   chrom       pos  ref  mut  strand
0     19  58861779    G    A       +
1     19  58862886    G    A       +
2     19  58863691    C    T       +
3     19  58863691    C    T       +
4     19  58863717    C    T       +
In [8]:
# Treat '-' (no allele) as missing. Reassigning the result, rather than using
# inplace=True on a frame sliced from df, avoids SettingWithCopyWarning.
sdf = sdf.replace('-', np.nan)
C:\Users\dwlee\Anaconda\envs\py27\lib\site-packages\pandas\core\common.py:449: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
  mask = arr == x
In [9]:
# Drop every row that has a missing value in any column
sdf = sdf.dropna(how='any')
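
One way to keep a column subset independent of the original frame, so that later clean-up steps do not trigger SettingWithCopyWarning, is to call .copy() right after selecting the columns. The following is a minimal sketch on a toy frame; the second row, with '-' as the reference allele, is made up to show the effect of the clean-up.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the MAF table; only the first and third rows mirror the notebook output.
full = pd.DataFrame({
    'Chrom': ['19', '19', '19'],
    'Start_Position': [58861779, 58862886, 58863691],
    'Reference_Allele': ['G', '-', 'C'],
    'Tumor_Seq_Allele2': ['A', 'A', 'T'],
    'Strand': ['+', '+', '+'],
    'Center': ['x', 'x', 'x'],
})

cols = ['Chrom', 'Start_Position', 'Reference_Allele', 'Tumor_Seq_Allele2', 'Strand']
subset = full[cols].copy()  # explicit copy, no longer tied to `full`
subset.columns = ['chrom', 'pos', 'ref', 'mut', 'strand']

# Mark '-' as missing, then drop incomplete rows; `full` is left untouched.
cleaned = subset.replace('-', np.nan).dropna(how='any')
print(cleaned)
```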
In [10]:
# NaN never compares equal to itself, so test with isnull() rather than == np.nan
sdf.isnull().any()
Out[10]:
chrom     False
pos       False
ref       False
mut       False
strand    False
dtype: bool
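
When checking for leftover missing values, keep in mind that NaN never compares equal to anything, including itself, so an equality test against np.nan reports False even when NaN is present. A small sketch with a made-up Series shows the difference from isnull():

```python
import numpy as np
import pandas as pd

s = pd.Series([58861779.0, np.nan, 58863691.0])

eq_check = bool((s == np.nan).any())   # equality with NaN is always False
null_check = bool(s.isnull().any())    # isnull() detects the missing value

print(eq_check)    # False
print(null_check)  # True
```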
In [11]:
sdf.to_csv("candra_input_tcga_coad.txt", sep='\t', index=False)
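
As a sanity check, the written file can be read back to confirm that it is tab-delimited with the five expected columns. This is a sketch on a two-row toy frame written to a temporary path; the file name `candra_input_demo.txt` is hypothetical.

```python
import os
import tempfile

import pandas as pd

frame = pd.DataFrame(
    {
        'chrom': ['19', '19'],
        'pos': [58861779, 58862886],
        'ref': ['G', 'G'],
        'mut': ['A', 'A'],
        'strand': ['+', '+'],
    },
    columns=['chrom', 'pos', 'ref', 'mut', 'strand'],
)

path = os.path.join(tempfile.mkdtemp(), 'candra_input_demo.txt')
frame.to_csv(path, sep='\t', index=False)

# Read chrom as text so that values such as 'X' or 'Y' would survive the round trip.
reloaded = pd.read_csv(path, sep='\t', dtype={'chrom': str})
print(reloaded.shape)
print(list(reloaded.columns))
```

Whether CanDrA expects a header row has not been verified here against demo_input.txt. If it does not, passing header=False to to_csv would omit it.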

