cvasl.seperated module¶

This module contains functions for processing CSV and TSV files into the correct formats.

cvasl.seperated.avg_k_folds(frame)¶: This function takes a dataframe of k-fold results, formatted as in our experiments with derived datasets, and returns an averaged dataframe.

cvasl.seperated.bin_dataset(dataframe, column, num_bins=4, graph=False)¶

This function creates an additional column in which a continuous variable is binned into 2 or 4 parts.

Parameters:

dataframe (pandas.DataFrame) – dataframe variable
column (str) – column name, written in single quotes
num_bins (int) – number of bins, either 2 or 4
graph (bool) – if True, produces a graph of the split

Returns:

dataframe with additional column

Return type:

pandas.DataFrame
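
As a rough illustration of the binning idea, the sketch below splits a continuous column into 2 or 4 groups with `pandas.qcut`. Whether `bin_dataset` uses quantile-based or equal-width bins is not stated here, so the quantile choice, the helper name `bin_column_sketch`, and the `binned` column name are all assumptions.

```python
import pandas as pd


def bin_column_sketch(dataframe, column, num_bins=4):
    """Hypothetical stand-in: add a quantile-binned copy of `column`."""
    if num_bins not in (2, 4):
        raise ValueError("num_bins must be 2 or 4")
    out = dataframe.copy()
    # labels=False yields integer bin codes 0..num_bins-1
    out["binned"] = pd.qcut(out[column], q=num_bins, labels=False)
    return out


ages = pd.DataFrame({"age": [21, 34, 45, 52, 63, 70, 78, 85]})
binned = bin_column_sketch(ages, "age", num_bins=4)
```

The input dataframe is copied, so the original stays unchanged.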

cvasl.seperated.check_identical_columns(tsv_path, header=0)¶

This function takes the path to a folder and returns the columns that are exactly duplicated, in both name and values, across all files.

needs more

cvasl.seperated.check_sex_dimorph_expectations(dataframe)¶

This function checks that, as expected, men have larger brains than women in a given dataframe.

Parameters: dataframe (DataFrame) – dataframe in the cvasl standard format for patient MRI data
Returns: dataframe, or zero, with side effect of printed information
Return type: DataFrame or int

cvasl.seperated.concat_double_header(dataframe_dub)¶

This function concatenates the two headers of a dataframe.

Parameters: dataframe_dub (pandas.DataFrame) – dataframe with double header
Returns: dataframe with a single header
Return type: pandas.DataFrame
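
A minimal sketch of flattening a double header, assuming the two header rows are held as a pandas `MultiIndex` and are joined with an underscore. The separator, the helper name `concat_header_sketch`, and the example labels are assumptions, not taken from cvasl.

```python
import pandas as pd


def concat_header_sketch(dataframe_dub, sep="_"):
    """Hypothetical stand-in: flatten a two-level column header."""
    out = dataframe_dub.copy()
    # Each column label is a (top, bottom) tuple; join the two parts.
    out.columns = [f"{top}{sep}{bottom}" for top, bottom in out.columns]
    return out


double = pd.DataFrame(
    [[1.0, 2.0], [3.0, 4.0]],
    columns=pd.MultiIndex.from_tuples([("GM", "vol"), ("WM", "vol")]),
)
single = concat_header_sketch(double)
```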

cvasl.seperated.derived_function(column, a, b, c)¶

This function allows you to derive a projected value for any parameter, based on a polynomial of age versus the parameter, provided your data is in dataframe format.

Parameters:

column (pandas.core.series.Series) – column of a pandas dataframe
a (float) – first coefficient
b (float) – second coefficient
c (float) – final term in polynomial

Returns:

series

Return type:

Series
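
The parameter list suggests a second-degree polynomial with `a` as the leading coefficient and `c` as the constant term. Under that assumption, the projection could look like the sketch below. The helper name `derived_sketch` and the example coefficients are hypothetical.

```python
import pandas as pd


def derived_sketch(column, a, b, c):
    """Hypothetical stand-in: evaluate a*x**2 + b*x + c element-wise."""
    return a * column**2 + b * column + c


age = pd.Series([20.0, 40.0, 60.0])
projected = derived_sketch(age, a=0.01, b=-0.5, c=100.0)
```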

cvasl.seperated.drop_columns_folder(directory, list_droppables)¶

This function works through the CSV files in a folder and drops unnamed columns and other unwanted columns. The stripped files are then available in a new folder called ‘stripped’.

Parameters: directory (str) – directory where the CSV files are located
Returns: dataframes without unnamed columns
Return type: list

cvasl.seperated.drop_y(df)¶: This is meant as a pseudo-helper function for pandas columns when they are merged. It drops columns whose names end in y.
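
For context, `DataFrame.merge` appends `_x` and `_y` to overlapping column names by default, which is presumably where the trailing y comes from. The sketch below is one possible reading of the description. The helper name `drop_y_sketch` and the example frames are assumptions.

```python
import pandas as pd


def drop_y_sketch(df):
    """Hypothetical stand-in: drop columns whose names end in 'y'."""
    to_drop = [name for name in df.columns if str(name).endswith("y")]
    return df.drop(columns=to_drop)


left = pd.DataFrame({"id": [1, 2], "age": [60, 70]})
right = pd.DataFrame({"id": [1, 2], "age": [60, 70]})
merged = left.merge(right, on="id")  # columns: id, age_x, age_y
trimmed = drop_y_sketch(merged)
```

A rule this broad would also drop any unrelated column whose name happens to end in y, so it is best applied right after a merge.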

cvasl.seperated.find_original_y_values(polynomial, output_value)¶

Finds the original y-values of a second or third degree polynomial given its coefficients and an output value.

Parameters:

polynomial (tuple) – coefficients of the polynomial in the form (a, b, c)
output_value (list) – output of the polynomial when a list of y-values is given

Returns:

pile, a list of original y-values corresponding to the output value

Return type:

list
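
One way to invert a second-degree polynomial is to move the output to the left-hand side and find the roots with `numpy.roots`. The sketch below covers only the quadratic case and returns every real root. Which root cvasl keeps, and how it handles the third-degree case, is not stated here, so the helper `invert_quadratic_sketch` is purely illustrative.

```python
import numpy as np


def invert_quadratic_sketch(polynomial, output_values):
    """Hypothetical stand-in: real y solving a*y**2 + b*y + c = output."""
    a, b, c = polynomial
    pile = []
    for value in output_values:
        # Move the output to the left-hand side and find the roots.
        roots = np.roots([a, b, c - value])
        real_roots = sorted(float(r.real) for r in roots if abs(r.imag) < 1e-9)
        pile.append(real_roots)
    return pile


# y**2 - 4 = 5 has the solutions y = -3 and y = 3
solutions = invert_quadratic_sketch((1.0, 0.0, -4.0), [5.0])
```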

cvasl.seperated.find_outliers_by_list(dataframe, column_list, number_sd)¶

This function finds outliers, defined as values more than a given number of standard deviations (number_sd) from the mean, in a list of specific columns, then returns the corresponding rows of the dataframe.

Parameters:

dataframe (DataFrame) – whole dataframe of the dataset
column_list (list) – list of relevant columns
number_sd (float) – number of standard deviations

Returns:

dataframe of outliers

Return type:

DataFrame
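
A minimal sketch of the standard-deviation rule described above, assuming a row counts as an outlier when any listed column is out of range and that the sample standard deviation is used. Both choices, the helper name `outliers_sketch`, and the `gm_vol` column are assumptions.

```python
import pandas as pd


def outliers_sketch(dataframe, column_list, number_sd):
    """Hypothetical stand-in: rows beyond number_sd SDs in any listed column."""
    mask = pd.Series(False, index=dataframe.index)
    for column in column_list:
        mean = dataframe[column].mean()
        sd = dataframe[column].std()  # sample standard deviation (ddof=1)
        mask |= (dataframe[column] - mean).abs() > number_sd * sd
    return dataframe[mask]


data = pd.DataFrame(
    {"gm_vol": [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0, 10.0, 50.0]}
)
found = outliers_sketch(data, ["gm_vol"], number_sd=2)
```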

cvasl.seperated.folder_chain_out_columns(datasets_folder, columns, output_folder)¶

This function works through the CSV files in a folder, at any level of nesting, and drops unwanted columns. The resulting files are then available in the specified output folder.

Parameters:

datasets_folder (str) – directory where the CSV files are located
columns (list) – list of columns as strings
output_folder (str) – directory where newly made csvs are sent

Returns:

None

Return type:

None
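
The recursive walk can be sketched with the standard library alone. The version below mirrors the sub-folder layout in the output folder, which is an assumption, as are the helper name `strip_columns_sketch` and the example file names.

```python
import csv
import tempfile
from pathlib import Path


def strip_columns_sketch(datasets_folder, columns, output_folder):
    """Hypothetical stand-in: copy every CSV below a folder minus `columns`."""
    source, target = Path(datasets_folder), Path(output_folder)
    for csv_path in sorted(source.rglob("*.csv")):  # any nesting depth
        with csv_path.open(newline="") as handle:
            reader = csv.DictReader(handle)
            kept = [name for name in reader.fieldnames if name not in columns]
            rows = [{name: row[name] for name in kept} for row in reader]
        out_path = target / csv_path.relative_to(source)
        out_path.parent.mkdir(parents=True, exist_ok=True)
        with out_path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=kept)
            writer.writeheader()
            writer.writerows(rows)


# Small demonstration in throwaway directories
base = Path(tempfile.mkdtemp())
(base / "in" / "site_a").mkdir(parents=True)
(base / "in" / "site_a" / "scan.csv").write_text("id,age,notes\n1,60,x\n")
strip_columns_sketch(base / "in", ["notes"], base / "out")
result_text = (base / "out" / "site_a" / "scan.csv").read_text()
```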

cvasl.seperated.generate_transformation_matrix(polynomial1, polynomial2)¶

Generates a matrix that transforms one polynomial into another.

Parameters:

polynomial1 (Sequence) – coefficients of the polynomial in form (a1, b1, …)
polynomial2 (Sequence) – coefficients of the polynomial in form (a2, b2, …)

Returns:

m, an array

Return type:

ndarray

cvasl.seperated.make_log_file(file_name, list_of_columns)¶

This function recodes columns of a CSV file into their log values.

Parameters: file_name (str) – CSV file with variables as columns
Returns: dataframe
Return type: DataFrame

cvasl.seperated.make_log_folder(directory, list_of_columns)¶

This function takes the CSV files in a specified directory that contain the listed columns and produces CSV files in which a new column holds the log of the old one. The new files are produced as a side effect.

Parameters: directory (str) – directory where the CSV files are located
Returns: dataframes with the log columns
Return type: list

cvasl.seperated.plot_2on2_df(dataframe1, dataframe2, special_column, color1='purple', color2='orange')¶

This function creates an artifact from two datasets with comparable variables by graphing those variables against a variable of interest.

Parameters:

dataframe1 (pandas.DataFrame) – dataframe variable
dataframe2 (pandas.DataFrame) – dataframe variable
special_column (str) – string of column you want to graph against

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.polyfit_and_show(dataframe, special_column_name, other_column_name, degree_poly, color1='purple')¶

This function fits a polynomial to two columns. It returns the coefficients of the fitted polynomial (degree 1, 2, or 3, as set by degree_poly) and also creates a graph as a side effect.

Parameters:

dataframe (pandas.DataFrame) – dataframe variable
special_column_name (str) – string of column you want to graph against
other_column_name (str) – string of column you want to graph
degree_poly (int) – degree of the polynomial; 1, 2, or 3 only
color1 (str) – string of color for graphing

Returns:

coefficients

Return type:

ndarray
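
The fitting step can be illustrated with `numpy.polyfit`, which returns an ndarray of coefficients ordered from the highest power down. Whether cvasl calls `numpy.polyfit` internally is not confirmed here. The helper `polyfit_sketch` leaves out the plotting side effect, and the column names are made up for the example.

```python
import numpy as np
import pandas as pd


def polyfit_sketch(dataframe, special_column_name, other_column_name, degree_poly):
    """Hypothetical stand-in: fit other ~ polynomial(special), no plotting."""
    if degree_poly not in (1, 2, 3):
        raise ValueError("degree_poly must be 1, 2 or 3")
    x = dataframe[special_column_name].to_numpy(dtype=float)
    y = dataframe[other_column_name].to_numpy(dtype=float)
    # np.polyfit returns coefficients from the highest power down
    return np.polyfit(x, y, degree_poly)


frame = pd.DataFrame({"age": [20, 30, 40, 50, 60]})
frame["gm_vol"] = 0.5 * frame["age"] ** 2 - 3.0 * frame["age"] + 7.0
coefficients = polyfit_sketch(frame, "age", "gm_vol", degree_poly=2)
```

Because the example data is generated from an exact quadratic, the fit should recover roughly 0.5, -3.0 and 7.0.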

cvasl.seperated.polyfit_second_degree_to_df(dataframe, special_column_name, other_column_names)¶

This function fits second-degree polynomials for the listed columns against a special column, in our case age. It returns the coefficients in a dataframe.

Parameters:

dataframe (pandas.DataFrame) – dataframe variable
special_column_name (str) – column name, usually age
other_column_names (list) – columns you want to get poly coefficients on

Returns:

coefficients

Return type:

ndarray

cvasl.seperated.preprocess(folder, file_extension, outcome_folder, log_cols=[], plus_one_log_columns=[])¶: Given a directory, this function searches all subdirectories for files with the noted file extension. Copies of the files are processed as specified, meaning the specified columns are converted to their log, or to the log of the value plus one, and then put in the outcome folder.
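
The two column transforms can be sketched as follows, assuming the natural logarithm and in-place replacement of each column. The log base, the overwrite behaviour, the helper name `log_columns_sketch`, and the column names are all assumptions. The folder search and file copying are left out.

```python
import numpy as np
import pandas as pd


def log_columns_sketch(dataframe, log_cols=(), plus_one_log_columns=()):
    """Hypothetical stand-in: natural log and log(x + 1) on chosen columns."""
    out = dataframe.copy()
    for column in log_cols:
        out[column] = np.log(out[column])
    for column in plus_one_log_columns:
        out[column] = np.log1p(out[column])  # log(1 + x), safe at x == 0
    return out


raw = pd.DataFrame({"wmh_vol": [1.0, np.e], "wmh_count": [0.0, np.e - 1.0]})
logged = log_columns_sketch(raw, ["wmh_vol"], ["wmh_count"])
```

The plus-one variant is the usual choice for columns that can contain zeros, where a plain log would give negative infinity.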

cvasl.seperated.pull_off_unnamed_column(unclean, extra_columns=[])¶: This function takes a dataframe and, if there are columns with the string “Unnamed”, drops them. It also drops the extra columns you input.
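
Columns such as `Unnamed: 0` typically appear when a CSV that was saved with its index is read back in. A minimal sketch of the clean-up, with the hypothetical helper name `drop_unnamed_sketch` and invented example data:

```python
import pandas as pd


def drop_unnamed_sketch(unclean, extra_columns=()):
    """Hypothetical stand-in: drop 'Unnamed' columns plus any extras."""
    unnamed = [name for name in unclean.columns if "Unnamed" in str(name)]
    return unclean.drop(columns=unnamed + list(extra_columns))


messy = pd.DataFrame({"Unnamed: 0": [0, 1], "age": [61, 72], "site": ["a", "b"]})
clean = drop_unnamed_sketch(messy, extra_columns=["site"])
```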

cvasl.seperated.recode_sex(whole_dataframe, string_for_sex)¶

This function recodes sex into a new column if there are two possible values. It maintains numerical order but changes the values to 0 and 1. The new column is called ‘sex_encoded’. Note that sex should be encoded in numbers, i.e. ints or floats.

Parameters:

whole_dataframe (pandas.DataFrame) – dataframe variable
string_for_sex (str) – column name, written in single quotes

Returns:

dataframe with sex encoded column

Return type:

pandas.DataFrame
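
The recoding rule (keep numerical order, map to 0 and 1) can be sketched as below. Raising an error when there are not exactly two values is an assumption, as is the helper name `recode_sex_sketch`. Only the `sex_encoded` column name comes from the description above.

```python
import pandas as pd


def recode_sex_sketch(whole_dataframe, string_for_sex):
    """Hypothetical stand-in: map two numeric sex codes to 0 and 1."""
    values = sorted(whole_dataframe[string_for_sex].dropna().unique())
    if len(values) != 2:
        raise ValueError("expected exactly two distinct values")
    out = whole_dataframe.copy()
    # The smaller original code becomes 0, the larger becomes 1.
    out["sex_encoded"] = out[string_for_sex].map({values[0]: 0, values[1]: 1})
    return out


patients = pd.DataFrame({"sex": [1, 2, 2, 1]})
encoded = recode_sex_sketch(patients, "sex")
```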

cvasl.seperated.recode_sex_folder(directory)¶

This function takes the CSV files in a specified directory that contain a sex column and, if there are two possible values, produces CSV files with a new recoded column. It maintains numerical order but changes the values to 0 and 1. The column is called changed. Note that sex should be encoded in numbers, i.e. ints, for many functions. The new files are produced as a side effect.

Parameters: directory (str) – directory where the CSV files are located
Returns: dataframes with sex encoded correctly
Return type: list

cvasl.seperated.recode_sex_to_numeric(df)¶: When we need to flip the sex back to numbers from the suggested format, this function turns females to 1 and males to 0.

cvasl.seperated.relate_columns_graphs(dataframe, special_column_name, saver=False)¶

This function makes a scatter plot of all columns.

Parameters:

dataframe (pandas.DataFrame) – dataframe variable
special_column_name (str) – string of column you want to graph against
saver (bool) – bool to indicate if graph pngs should be saved

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.relate_columns_graphs_numeric(dataframe, special_column_name, saver=False)¶

This function makes a scatter plot of all columns that are numeric.

Parameters:

dataframe (pandas.DataFrame) – dataframe variable
special_column_name (str) – string of column you want to graph against
saver (bool) – bool to indicate if graph pngs should be saved

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.relate_columns_graphs_two_dfs(dataframe1, dataframe2, special_column_name, other_column_name, color1='purple', color2='orange')¶

This is a helper for a function that makes a scatter plot of all columns that two dataframes have in common.

Parameters:

dataframe1 (pandas.DataFrame) – dataframe variable
dataframe2 (pandas.DataFrame) – dataframe variable
special_column_name (str) – string of column you want to graph against
other_column_name (str) – string of column you want to graph

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.static_bin_age(dataframe)¶: This function applies static binning by age decade to a dataframe.
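
Binning by decade can be done with integer division. The sketch below assumes an `age` column and writes the result to a column named `age_decade`. Both names, and the helper `decade_bin_sketch`, are assumptions about a format that is not documented here.

```python
import pandas as pd


def decade_bin_sketch(dataframe, age_column="age"):
    """Hypothetical stand-in: label each row with the decade of its age."""
    out = dataframe.copy()
    # Integer division maps 60..69 to 60, 70..79 to 70, and so on.
    out["age_decade"] = (out[age_column] // 10 * 10).astype(int)
    return out


people = pd.DataFrame({"age": [59.9, 60.0, 67.5, 81.2]})
decades = decade_bin_sketch(people)
```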

cvasl.seperated.stratified_cat_and_cont_categories_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, cat_category='sex', cont_category='age', splits=5, test_size_p=0.2, printed=False)¶

This function takes a scikit-learn model and creates a dataframe based on (stratified) k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with models and then the training data from the model. This is a twist on StratifiedShuffleSplit that allows stratification on both a categorical and a continuous variable. Note that the categorical variable should already be converted into integers before this function is run. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.

Parameters:

model_name (str) – name of model
model_file_name (str) – name of file where specific model will be stored
scikit_model (str) – name of scikit-learn model
our_ml_matrix (pandas.DataFrame) – dataframe to work over
our_x (dataframe) – X, or feature columns, for machine learning
our_y (pandas.core.series.Series) – y, or label column, for machine learning
cat_category (str) – categorical variable column to stratify on, e.g. sex
cont_category (str) – continuous variable column to stratify on, e.g. age
splits (int) – number of folds desired
test_size_p (float) – proportion of the data to put into the test set
printed (bool) – if True, prints information on the folds

Returns:

dataframe, y dataframe, and models

Return type:

tuple
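
One common way to stratify on a categorical and a continuous variable at once is to bin the continuous variable and combine both into a single label per row. The sketch below builds such a label with pandas. The number of bins, the use of quantile bins, and the helper name `stratification_key_sketch` are assumptions, not the documented cvasl behaviour.

```python
import pandas as pd


def stratification_key_sketch(frame, cat_category="sex", cont_category="age", bins=4):
    """Hypothetical stand-in: one label per (category, binned continuous) pair."""
    # Bin the continuous variable into quantile groups coded 0..bins-1
    cont_bin = pd.qcut(frame[cont_category], q=bins, labels=False)
    # Combine both pieces into a single string label per row
    return frame[cat_category].astype(str) + "_" + cont_bin.astype(str)


cohort = pd.DataFrame(
    {"sex": [0, 1, 0, 1, 0, 1, 0, 1], "age": [21, 25, 38, 41, 55, 58, 72, 79]}
)
key = stratification_key_sketch(cohort, bins=2)
```

A label like this could then be passed as the `y` argument of `StratifiedShuffleSplit.split` in scikit-learn, so that each fold preserves the joint distribution of both variables.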

cvasl.seperated.stratified_one_category_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, category='sex', splits=5, test_size_p=0.2, printed=False)¶

This function takes a scikit-learn model and creates a dataframe based on k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with models and then the training data from the model. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.

Parameters:

model_name (str) – name of model
model_file_name (str) – name of file where specific model will be stored
scikit_model (str) – name of scikit-learn model
our_ml_matrix (pandas.DataFrame) – dataframe to work over
our_x (dataframe) – X, or feature columns, for machine learning
our_y (pandas.core.series.Series) – y, or label column, for machine learning
category (str) – categorical variable (column) to be stratified on, e.g. sex
splits (int) – number of folds desired
test_size_p (float) – proportion of the data to put into the test set
printed (bool) – if True, prints information on the folds

Returns:

dataframe, y dataframe, and models

Return type:

tuple