Data Filtering

Last updated: April 24th, 2020

This script filters a test data set against a reference data set.

  1. Load the reference data
  2. Load the test data
  3. Filter the test data based on the reference data
  4. Store the filtered test data in a file


In [ ]:
import numpy as np
import math
import pandas as pd

The following functions pull the data out of each text file.

  1. extract_data_from_filter: returns the list of gene symbols/names, which is stored as the "Reference"
  2. extract_data_from_test: returns three lists: "Fold-change (log2[FC])", "gene symbols", and "gene description"
In [ ]:
def extract_data_from_filter(filename):        ## Reference data file name as a string (e.g. 'yourfilename.txt')
    infile = open(filename, 'r')               ## Open the file: 'r' means read mode and 'w' means write mode
    infile.readline()                          ## Skip the first line
    symbols = []                               ## Empty list that collects the gene symbols
    for line in infile:                        ## Iterating over the file yields one whole row at a time ("hello world", not "hello" and then "world")
        if line.strip():                       ## Skip blank lines: a row that is empty after stripping is falsy
            line = line.strip("\n ' '")        ## Precaution: strip newlines, spaces, and single quotes from both ends of the row
            line = line.split("\t")            ## Split the row into columns on tabs
            symbol = line[0]                   ## Take the first column as the symbol (this file format appears to have only one column)
            symbols.append(symbol)             ## Add the symbol to the list
    infile.close()                             ## Close the file
    return symbols                             ## Return the list of symbols

def extract_data_from_test(filename):
    infile = open(filename, 'r')
    numbers1 = []
    symbols = []
    descriptions = []
    for line in infile:
        if line.strip():
            line = line.strip("\n ' '")
            line = line.split("\t")
            number1 = float(line[0])          ## The fold change is read as a string, so convert it to float to treat it as a numerical value
            symbol = line[2]                  ## The text file has many columns; after splitting, the integer in [] is the index of the column of interest
            description = line[5]
            numbers1.append(number1)          ## Store each value; without these appends the returned lists would stay empty
            symbols.append(symbol)
            descriptions.append(description)
    infile.close()                            ## Close the file
    return numbers1, symbols, descriptions    ## Lists of floats, strings, and strings
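
As a hedged alternative, the same parsing could be sketched with the standard-library csv module and a file handle, which avoids manual stripping and splitting. The function name read_test_rows and the sample rows are hypothetical; only the column positions (0, 2, 5) come from extract_data_from_test, and the real files may look different.

```python
import csv
import io

def read_test_rows(handle):
    """Return (fold_changes, symbols, descriptions) from a tab-separated file handle."""
    fold_changes, symbols, descriptions = [], [], []
    # QUOTE_NONE treats quote characters as ordinary text
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        fold_changes.append(float(row[0]))  # column 0: log2 fold change
        symbols.append(row[2])              # column 2: gene symbol
        descriptions.append(row[5])         # column 5: gene description
    return fold_changes, symbols, descriptions

# Made-up rows for illustration only (placeholder gene names and descriptions)
sample = io.StringIO(
    "1.5\tx\tGENE_A\tx\tx\tplaceholder description A\n"
    "\n"
    "-2.0\tx\tGENE_B\tx\tx\tplaceholder description B\n"
)
fold_changes, symbols, descriptions = read_test_rows(sample)
print(fold_changes, symbols, descriptions)
```

With a real file, the handle would come from something like `with open('testDown.txt', newline='') as handle:`, which also closes the file automatically.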


  1. The test data goes through a matching process that filters out irrelevant data.
  2. The matched entries are stored as output.
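
The two steps above can be sketched with a set lookup. This is an illustrative alternative under stated assumptions, not the notebook's own method: the helper names match_against_reference and write_matches are hypothetical, and the gene names are placeholders. Unlike a pairwise comparison of every reference entry with every test entry, this version keeps the order of the test data and keeps each test row at most once.

```python
import io

def match_against_reference(reference_symbols, test_symbols, test_descriptions):
    """Keep the (symbol, description) pairs whose symbol is in the reference list."""
    reference = set(reference_symbols)  # set membership checks are fast
    return [
        (symbol, description)
        for symbol, description in zip(test_symbols, test_descriptions)
        if symbol in reference
    ]

def write_matches(handle, matches):
    """Write a header line followed by one 'symbol, description' line per match."""
    handle.write("mRNA_Symbol, description\n")
    for symbol, description in matches:
        handle.write("{}, {}\n".format(symbol, description))

# Placeholder data for illustration only
reference_symbols = ["GENE_A", "GENE_C"]
test_symbols = ["GENE_B", "GENE_A", "GENE_D"]
test_descriptions = ["description B", "description A", "description D"]

matches = match_against_reference(reference_symbols, test_symbols, test_descriptions)
output = io.StringIO()  # stands in for a file opened in write mode
write_matches(output, matches)
print(output.getvalue())
```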

Data description

  • 'filter1.txt' contains the genes upregulated during the polarization of cells from M0 to M1 (classical activation). The genes in this file therefore represent M1-specific genes.

  • 'filter2.txt' contains the genes upregulated during the polarization of cells from M0 to M2 (alternative activation). The genes in this file therefore represent M2-specific genes.

  • 'testDown.txt' contains the genes downregulated during a particular process.

  • 'testUp.txt' contains the genes upregulated during a particular process.

Purpose of this process

We will determine whether this "particular process" induces the M1 phenotype or the M2 phenotype.
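
One way to make this comparison concrete is to count how many test genes appear in each reference list. The sketch below is only an illustration: the helper count_overlap and the gene names are hypothetical, and reading a larger overlap as a hint toward that phenotype is an assumption, not a rule stated in this notebook.

```python
def count_overlap(test_symbols, reference_symbols):
    """Count the distinct gene symbols shared by the two lists."""
    return len(set(test_symbols) & set(reference_symbols))

# Placeholder gene lists for illustration only
m1_reference = ["GENE_A", "GENE_B", "GENE_C"]   # stands in for filter1.txt
m2_reference = ["GENE_D", "GENE_E"]             # stands in for filter2.txt
upregulated_test = ["GENE_A", "GENE_C", "GENE_E", "GENE_F"]  # stands in for testUp.txt

m1_hits = count_overlap(upregulated_test, m1_reference)
m2_hits = count_overlap(upregulated_test, m2_reference)
print("M1 overlap:", m1_hits, "M2 overlap:", m2_hits)
```

Raw counts depend on the size of each reference list, so dividing by the list length may give a fairer comparison when the lists differ a lot in size.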

In [ ]:
def tester(testFile,RefFile,DescriptionOfTest): ## Inputs are strings without the '.txt' extension, e.g. tester('testDown','filter1','M1_Specific_Gene_testDown')
    Ref = RefFile+'.txt'
    # Extract gene codes from the reference text file
    FilterDataSet   = extract_data_from_filter(Ref)   ## Reference genes (e.g. M1-specific upregulated genes)
    # Create dictionary
    data1 = {'Gene Code - Filter':FilterDataSet}
    # Create pandas DataFrame
    df1 = pd.DataFrame(data1)
    ### Pull out the genes that are down- or upregulated in the mRNA-seq data obtained and collected by Chris and Rob
    [logFCtest, symbolstest, dscrpttest] = extract_data_from_test(testFile+'.txt') ## These names do not have to match the names in the function's return statement
    x = len(logFCtest)   # Size of the list: if there are ten elements, then x = 10
    # Create dictionary for raw data
    data2 = {'Gene Code - Raw Data':symbolstest,
             'Gene Description - Raw Data':dscrpttest}
    # Create pandas DataFrame
    df2 = pd.DataFrame(data2)
    n2 = len(FilterDataSet)
    ## Against test data
    SpecificCode   = []
    SpecificDscrpt = []
    for k in np.arange(n2):        # np.arange(2) gives [0, 1], and these elements can be used as indices
        for l in np.arange(x):
            if FilterDataSet[k] == symbolstest[l]:    ## Compare each reference gene with each test gene; non-matching pairs are skipped
                SpecificCode.append(symbolstest[l])   ## e.g. genes downregulated during M1->M2 polarization are M1 specific
                SpecificDscrpt.append(dscrpttest[l])  ## Keep the matching description so both lists stay the same length
    file = open(DescriptionOfTest+'.txt','w')
    file.write('Gene Code Collection indicating upregulated '+DescriptionOfTest+' within the unknown process  \n')
    file.write('mRNA_Symbol, description \n')
    for num in np.arange(len(SpecificCode)):
        file.write('{}, {} \n'.format(SpecificCode[num],SpecificDscrpt[num]))
    file.close()                   ## Close the output file so that everything is written to disk
    data3 = {'Gene Code - Filtered by M1 specific gene comparison':SpecificCode,
             'Description': SpecificDscrpt}
    df3 = pd.DataFrame(data3)
    return df1, df2, df3
In [ ]:
M1Specific1, TestDown, FilteredDataDown = tester('testDown','filter1','M1_Specific_Gene_testDown')
In [ ]:
# Data Table from reference gene code 
In [ ]:
# Data Table from Tested data set
In [ ]:
# Filtered Data
## The number of entries will be compared with other outcomes
In [ ]:
M1Specific2, TestUp, FilteredDataUp = tester('testUp','filter1','M1_Specific_Gene_testUp')