Data Filtering

Last updated: August 31st, 2020

This script filters a test data set against reference data and writes the matching rows to a file.

  1. load the reference data
  2. load the test data
  3. filter the test data against the reference data
  4. write the filtered test data to a file

In [1]:
import numpy as np
import math
import pandas as pd

The following functions pull the data out of each tab-separated text file.

  1. extract_data_from_filter: returns the list of gene symbols/names in a reference file (the "Reference")
  2. extract_data_from_test: returns five lists from a test file: gene names, gene descriptions, the fold changes (log2[FC]) relative to M1 and to M2, and the p-values
In [5]:
def extract_data_from_filter(filename):        ## reference data file name as a string (e.g. 'yourfilename.txt')
    symbols = []                               ## collects the gene symbols
    with open(filename, 'r') as infile:        ## 'r' means reading mode; the file is closed automatically on exit
        infile.readline()                      ## skip the header line
        for line in infile:                    ## each iteration yields one whole row as a single string
            if line.strip():                   ## skip blank lines
                line = line.strip("\n '")      ## remove newlines, spaces, and apostrophes from both ends
                line = line.split("\t")        ## split the row into tab-separated columns
                symbols.append(line[0])        ## the first column holds the gene symbol
    return symbols                             ## list of gene symbols (strings)

def extract_data_from_test(filename):
    fc_m1based = []
    fc_m2based = []
    name = []
    descriptions = []
    pvalue = []
    with open(filename, 'r') as infile:
        infile.readline()                      ## skip the header line
        for line in infile:
            if line.strip():                   ## skip blank lines
                line = line.strip("\n '")
                line = line.split("\t")
                name.append(line[0])                 # gene name
                descriptions.append(line[7])         # gene description
                fc_m1based.append(float(line[5]))    # logFC(M1M2/M1)
                fc_m2based.append(float(line[6]))    # logFC(M1M2/M2)
                pvalue.append(float(line[4]))        # p-value

    return name, descriptions, fc_m1based, fc_m2based, pvalue     ## lists of str, str, float, float, float
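The same column extraction can also be sketched with pandas, which is already imported above. This is a minimal sketch, not the notebook's method: `load_test_table` and the output column names are hypothetical, the sample rows are invented for illustration, and the column positions (0, 4, 5, 6, 7) are taken from extract_data_from_test.

```python
import io
import pandas as pd

def load_test_table(source):
    """Read a tab-separated test table and keep the five columns used above.

    `source` may be a file name or any file-like object.
    """
    # usecols selects columns by position; read_csv keeps them in file order.
    df = pd.read_csv(source, sep="\t", usecols=[0, 4, 5, 6, 7])
    # Hypothetical column names, assigned by position.
    df.columns = ["name", "pvalue", "fc_m1based", "fc_m2based", "description"]
    return df

# Invented sample with the assumed eight-column layout (illustration only).
sample = (
    "symbol\tc1\tc2\tc3\tpvalue\tfc_m1\tfc_m2\tdescription\n"
    "GeneA\t0\t0\t0\t0.01\t1.5\t-0.5\thypothetical gene A\n"
    "GeneB\t0\t0\t0\t0.03\t-2.0\t0.25\thypothetical gene B\n"
)
demo = load_test_table(io.StringIO(sample))
print(demo)
```

With a real file, the call would be load_test_table('mRNAseqM1M2_P0_05.txt'), assuming that file follows the layout described above.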
In [6]:
ref_symbols_m1 = extract_data_from_filter('filter1.txt')
ref_symbols_m2 = extract_data_from_filter('filter2.txt')
In [7]:
name, descriptions, fc_m1based, fc_m2based, pvalue = extract_data_from_test('mRNAseqM1M2_P0_05.txt')
In [12]:
n = len(name)
m1 = len(ref_symbols_m1)
m2 = len(ref_symbols_m2)

## M1 specific genes: write every test row whose name matches a reference symbol
with open('m1_specific_genes_remaining_in_M1M2.txt', 'w') as file:
    file.write('List of M1 specific genes remaining in M1M2 \n')
    file.write('Gene Code, logFC(M1M2/M1), logFC(M1M2/M2), descriptions, p-value \n')
    for i in range(m1):
        ref_name = ref_symbols_m1[i]
        for j in range(n):
            if ref_name == name[j]:
                file.write('{}, {}, {}, {}, {} \n'.format(name[j],fc_m1based[j],fc_m2based[j],descriptions[j],pvalue[j]))

## M2 specific genes: same procedure with the second reference list
with open('m2_specific_genes_remaining_in_M1M2.txt', 'w') as file:
    file.write('List of M2 specific genes remaining in M1M2 \n')
    file.write('Gene Code, logFC(M1M2/M1), logFC(M1M2/M2), descriptions, p-value \n')
    for i in range(m2):
        ref_name = ref_symbols_m2[i]
        for j in range(n):
            if ref_name == name[j]:
                file.write('{}, {}, {}, {}, {} \n'.format(name[j],fc_m1based[j],fc_m2based[j],descriptions[j],pvalue[j]))
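The nested loops above compare every reference symbol with every test name. A minimal sketch of a faster lookup builds a set of reference symbols first. `matching_indices` is a hypothetical helper, not part of the notebook, and the sketch assumes two changes in behavior are acceptable: rows come out in test-file order rather than reference order, and a reference symbol listed twice does not produce duplicate rows.

```python
def matching_indices(reference_symbols, names):
    """Return the positions in `names` whose value appears in `reference_symbols`."""
    wanted = set(reference_symbols)          # set membership is a constant-time check on average
    return [j for j, gene in enumerate(names) if gene in wanted]

# Invented symbols, for illustration only.
reference = ["GeneB", "GeneA", "GeneZ"]
test_names = ["GeneA", "GeneC", "GeneB"]

hits = matching_indices(reference, test_names)
print(hits)  # positions of GeneA and GeneB in test_names
```

The returned indices can be used exactly like `j` in the loops above to look up fc_m1based[j], fc_m2based[j], descriptions[j], and pvalue[j].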

Comparison

  1. each gene in the test data is matched against the reference list, and genes without a match are filtered out
  2. the matching genes are written to an output file and returned as data frames
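The two steps above can be sketched with pandas on in-memory data. This is an illustrative sketch under stated assumptions: the gene symbols, descriptions, and column names are invented, and the output goes to an in-memory buffer instead of a file on disk.

```python
import io
import pandas as pd

# Invented test table and reference list (illustration only).
test = pd.DataFrame({
    "symbol": ["GeneA", "GeneB", "GeneC"],
    "description": ["hypothetical gene A", "hypothetical gene B", "hypothetical gene C"],
})
reference = ["GeneC", "GeneA", "GeneZ"]

# Step 1: keep only the rows whose symbol appears in the reference list.
filtered = test[test["symbol"].isin(reference)].reset_index(drop=True)

# Step 2: store the result; a file name could be passed instead of the buffer.
buffer = io.StringIO()
filtered.to_csv(buffer, index=False)
print(filtered)
```

Series.isin keeps the rows in test-table order, so the result lists GeneA before GeneC even though the reference list names them the other way round.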

Data description

  • 'filter1.txt' contains the genes upregulated during the polarization of cells from M0 to M1 (classical activation); the genes in this file therefore represent M1 specific genes

  • 'filter2.txt' contains the genes upregulated during the polarization of cells from M0 to M2 (alternative activation); the genes in this file therefore represent M2 specific genes

  • 'testDown.txt' contains the genes downregulated during a particular process

  • 'testUp.txt' contains the genes upregulated during a particular process

Purpose of this process

We will determine whether this "particular process" induces the M1 phenotype or the M2 phenotype.

In [ ]:
def tester(testFile, RefFile, DescriptionOfTest): ## inputs are strings WITHOUT the '.txt' extension, e.g. tester('testDown','filter1','M1_Specific_Gene_testDown')

    Ref = RefFile+'.txt'

    # Extract gene codes from the reference text file
    FilterDataSet = extract_data_from_filter(Ref)   ## reference genes (e.g. M1 specific upregulated genes)

    # Create dictionary
    data1 = {'Gene Code - Filter':FilterDataSet}

    # Create pandas DataFrame
    df1 = pd.DataFrame(data1)

    ### Pull out the genes that are down- or upregulated in the mRNA-seq data obtained and collected by Chris and Rob
    ## extract_data_from_test returns five lists (name, descriptions, fc_m1based, fc_m2based, pvalue); only the first two are used here.
    ## Assumption: the test file has the same column layout as the file read by extract_data_from_test above.
    symbolstest, dscrpttest, _, _, _ = extract_data_from_test(testFile+'.txt')

    x = len(symbolstest)   # number of rows in the test data: ten elements give x = 10

    # Create dictionary for raw data
    data2 = {'Gene Code - Raw Data':symbolstest,
             'Gene Description - Raw Data':dscrpttest}

    # Create pandas DataFrame
    df2 = pd.DataFrame(data2)

    n2 = len(FilterDataSet)

    ## Against test data
    SpecificCode   = []
    SpecificDscrpt = []
    for k in range(n2):            # range(2) yields 0 and 1, which can be used as indices
        for l in range(x):
            if FilterDataSet[k] == symbolstest[l]:    ## compare each reference gene with each test gene; non-matching pairs are skipped
                SpecificCode.append(symbolstest[l])   ## test genes that also appear in the reference list
                SpecificDscrpt.append(dscrpttest[l])

    with open(DescriptionOfTest+'.txt','w') as file:
        file.write('Gene Code Collection indicating upregulated '+DescriptionOfTest+' within the unknown process  \n')
        file.write('mRNA_Symbol, description \n')
        for num in range(len(SpecificCode)):
            file.write('{}, {} \n'.format(SpecificCode[num],SpecificDscrpt[num]))

    data3 = {'Gene Code - Filtered by M1 specific gene comparison':SpecificCode,
             'Description': SpecificDscrpt}

    df3 = pd.DataFrame(data3)

    return df1, df2, df3
In [ ]:
M1Specific1, TestDown, FilteredDataDown = tester('testDown','filter1','M1_Specific_Gene_testDown')
In [ ]:
# Data Table from reference gene code 
M1Specific1
In [ ]:
# Data Table from Tested data set
TestDown
In [ ]:
# Filtered Data 
## The number of entries will be compared with the other outcomes
FilteredDataDown
In [ ]:
M1Specific2, TestUp, FilteredDataUp = tester('testUp','filter1','M1_Specific_Gene_testUp')
In [ ]:
TestUp
In [ ]:
FilteredDataUp
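One way to compare the number of entries across outcomes is to count how many test genes fall in each reference list. The sketch below is an assumption about how such a comparison might look, not the notebook's method: `count_overlap` is a hypothetical helper and the gene symbols are invented. Raw counts are only a rough indication, since the two reference lists may differ in size.

```python
def count_overlap(reference_symbols, test_symbols):
    """Number of distinct symbols present in both lists."""
    return len(set(reference_symbols) & set(test_symbols))

# Invented symbols, for illustration only.
m1_reference = ["GeneA", "GeneB", "GeneC"]
m2_reference = ["GeneD", "GeneE"]
test_up = ["GeneA", "GeneB", "GeneD", "GeneX"]

m1_hits = count_overlap(m1_reference, test_up)
m2_hits = count_overlap(m2_reference, test_up)
print("M1 specific hits:", m1_hits, "| M2 specific hits:", m2_hits)
```

With the data frames above, the same idea amounts to comparing len(FilteredDataUp) with the length of the corresponding result obtained from 'filter2'.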