cvasl.seperated module

Copyright 2023 Netherlands eScience Center and the Amsterdam University Medical Center. Licensed under the Apache License, version 2.0. See LICENSE for details.

This module contains functions for processing CSV and TSV files into the correct formats.

cvasl.seperated.avg_k_folds(frame)

This function takes a dataframe of k-fold results, formatted as in our experiments with derived datasets, and returns an averaged dataframe.

cvasl.seperated.bin_dataset(dataframe, column, num_bins=4, graph=False)

This function creates an additional column in which a continuous variable is binned into 2 or 4 parts.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • column (str) – column name written in single quotes

  • num_bins (int) – number of bins, either 2 or 4

  • graph (bool) – if True, produces a graph of the split

Returns:

dataframe with additional column

Return type:

pandas.DataFrame
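
The binning described above can be sketched with pandas. This is a minimal illustration, not the cvasl implementation: the helper name bin_column_sketch, the new column name "binned", and the use of quantile-based bins via pandas.qcut are all assumptions.

```python
import pandas as pd


def bin_column_sketch(dataframe, column, num_bins=4):
    """Return a copy of dataframe with an extra quantile-bin column.

    Hypothetical helper: quantile-based binning and the column name
    "binned" are assumptions, not taken from cvasl.
    """
    if num_bins not in (2, 4):
        raise ValueError("num_bins must be 2 or 4")
    binned = dataframe.copy()
    # labels=False yields integer bin codes 0 .. num_bins - 1
    binned["binned"] = pd.qcut(binned[column], q=num_bins, labels=False)
    return binned


example = pd.DataFrame({"age": [21, 34, 45, 52, 63, 70, 78, 85]})
result = bin_column_sketch(example, "age", num_bins=4)
```

Working on a copy leaves the input dataframe unchanged, which keeps the sketch free of side effects.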

cvasl.seperated.check_identical_columns(tsv_path, header=0)

Given the path to a folder, this function returns the columns that are exactly duplicated, in both name and values, across all files.

needs more

cvasl.seperated.check_sex_dimorph_expectations(dataframe)

This function checks that, as expected, men have larger brains than women in a given dataframe.

Parameters:

dataframe (DataFrame) – dataframe with cvasl standard for patient MRI data

Returns:

dataframe, or zero, with side effect of printed information

Return type:

DataFrame or int

cvasl.seperated.concat_double_header(dataframe_dub)

This function concatenates the two headers of a dataframe.

Parameters:

dataframe_dub (pandas.DataFrame) – dataframe with double header

Returns:

dataframe with a single header

Return type:

pandas.DataFrame
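
As an illustration of joining two header rows into one, the sketch below flattens a two-level pandas column index. It assumes the double header is stored as a MultiIndex and that the levels are joined with an underscore. Both choices, and the name flatten_double_header, are hypothetical rather than taken from cvasl.

```python
import pandas as pd


def flatten_double_header(dataframe_dub, separator="_"):
    """Join the two header levels of each column into a single string."""
    flat = dataframe_dub.copy()
    # Iterating over a MultiIndex yields one tuple of level values per column
    flat.columns = [
        separator.join(str(level) for level in column)
        for column in flat.columns
    ]
    return flat


columns = pd.MultiIndex.from_tuples([("gm", "vol"), ("wm", "vol")])
double = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]], columns=columns)
single = flatten_double_header(double)
```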

cvasl.seperated.derived_function(column, a, b, c)

This function allows you to derive a projected value for any parameter based on a polynomial for age versus the parameter, given that your data is in a dataframe format.

Parameters:
Returns:

series

Return type:

Series
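
A projection of this kind can be sketched as evaluating a polynomial on a pandas Series. The example assumes a second-degree polynomial with coefficients ordered from the highest power down (a, b, c); the source does not state the ordering, and the helper name is hypothetical.

```python
import pandas as pd


def project_from_polynomial(column, a, b, c):
    """Evaluate a*x**2 + b*x + c for every value x in a Series.

    Assumption: a is the quadratic coefficient, b the linear one,
    and c the constant term.
    """
    return a * column**2 + b * column + c


ages = pd.Series([20.0, 40.0, 60.0], name="age")
projected = project_from_polynomial(ages, a=0.01, b=-0.5, c=100.0)
```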

cvasl.seperated.drop_columns_folder(directory, list_droppables)

This function works through the CSV files in a folder and drops unnamed columns and other unwanted columns. The resulting files are then available in a new folder called ‘stripped’.

Parameters:

directory (str) – directory where the CSV files are located

Returns:

dataframes without unnamed columns

Return type:

list

cvasl.seperated.drop_y(df)

This is meant as a pseudo-helper function for pandas columns when they are merged. It drops columns that end in y.
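
For context, pandas adds the suffixes "_x" and "_y" by default when a merge produces overlapping column names. The sketch below assumes "end in y" refers to that "_y" suffix. The helper name drop_y_sketch is hypothetical, and the real function may match column names differently.

```python
import pandas as pd


def drop_y_sketch(df):
    """Keep only the columns that do not carry the '_y' merge suffix."""
    keep = [column for column in df.columns if not str(column).endswith("_y")]
    return df[keep]


left = pd.DataFrame({"id": [1, 2], "age": [30, 40]})
right = pd.DataFrame({"id": [1, 2], "age": [30, 40], "site": ["a", "b"]})
# Overlapping "age" columns become "age_x" and "age_y" after the merge
merged = left.merge(right, on="id")
cleaned = drop_y_sketch(merged)
```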

cvasl.seperated.find_original_y_values(polynomial, output_value)

Finds the original y-values of a second or third degree polynomial given its coefficients and an output value.

Parameters:
  • polynomial (tuple) – coefficients of the polynomial in the form (a, b, c)

  • output_value (list) – output of polynomial when a list of y are given

Returns:

pile, a list of original y-values corresponding to the output value

Return type:

list
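
One way to recover the inputs of a polynomial from a known output is to subtract the output from the constant term and solve for the roots. The sketch below does this with numpy.roots for a single scalar output and keeps only the real solutions. The helper name, the scalar (rather than list) argument, and the coefficient order (highest power first) are assumptions.

```python
import numpy as np


def invert_polynomial(polynomial, output_value):
    """Return the real inputs at which the polynomial equals output_value."""
    coefficients = np.array(polynomial, dtype=float)
    # Rewrite p(y) = output_value as p(y) - output_value = 0
    coefficients[-1] -= output_value
    roots = np.roots(coefficients)
    # Discard complex roots, which have no counterpart on the real axis
    real_roots = roots[np.isclose(roots.imag, 0.0)].real
    return sorted(real_roots.tolist())


# y**2 - 3*y + 2 = 6 has the solutions y = -1 and y = 4
solutions = invert_polynomial((1, -3, 2), 6)
```

A second-degree polynomial can give up to two real solutions and a third-degree one up to three, so a caller usually needs extra knowledge (for example a plausible range) to pick the intended value.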

cvasl.seperated.find_outliers_by_list(dataframe, column_list, number_sd)

This function finds the outliers, defined as anything outside a given number of standard deviations (number_sd) from the mean, for a list of specific columns, then returns these rows of the dataframe.

Parameters:
  • dataframe (DataFrame) – whole dataframe on dataset

  • column_list (list) – list of relevant columns

  • number_sd (float) – number of standard deviations

Returns:

dataframe of outliers

Return type:

DataFrame
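
The outlier rule described above can be sketched as follows. The sketch flags a row when any listed column lies more than number_sd standard deviations from that column's mean. Using the sample standard deviation (the pandas default) and combining columns with "any" are assumptions, as is the helper name.

```python
import pandas as pd


def outliers_by_columns(dataframe, column_list, number_sd):
    """Return rows where any listed column is beyond number_sd deviations."""
    mask = pd.Series(False, index=dataframe.index)
    for column in column_list:
        mean = dataframe[column].mean()
        sd = dataframe[column].std()  # sample standard deviation (ddof=1)
        mask |= (dataframe[column] - mean).abs() > number_sd * sd
    return dataframe[mask]


measurements = pd.DataFrame(
    {"volume": [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]}
)
outliers = outliers_by_columns(measurements, ["volume"], number_sd=2)
```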

cvasl.seperated.folder_chain_out_columns(datasets_folder, columns, output_folder)

This function works through the CSV files in a folder, at any folder level inside it, and drops the unwanted columns. The resulting files are then available in the specified output folder.

Parameters:
  • datasets_folder (str) – directory where the CSV files are located

  • columns (list) – list of columns as strings

  • output_folder (str) – directory where newly made csvs are sent

Returns:

None

Return type:

None

cvasl.seperated.generate_transformation_matrix(polynomial1, polynomial2)

Generates a matrix that transforms one polynomial into another.

Parameters:
  • polynomial1 (Sequence) – coefficients of the polynomial in form (a1, b1, …)

  • polynomial2 (Sequence) – coefficients of the polynomial in form (a2, b2, …)

Returns:

m, an array

Return type:

ndarray

cvasl.seperated.make_log_file(file_name, list_of_columns)

This function recodes columns of a CSV file into their log values.

Parameters:

file_name (str) – csv with variables as columns

Returns:

dataframe

Return type:

dataframe

cvasl.seperated.make_log_folder(directory, list_of_columns)

This function recodes columns in the CSV files that have such a column in a specified directory, writing CSVs with a new column that is the log of the old one. The new files are produced as a side effect.

Parameters:

directory (str) – directory where the CSV files are located

Returns:

dataframes with sex encoded correctly

Return type:

list

cvasl.seperated.plot_2on2_df(dataframe1, dataframe2, special_column, color1='purple', color2='orange')

This function is meant to create an artifact of two datasets with comparable variables in terms of graphing the variables against a variable of interest

Parameters:
  • dataframe1 (pandas.DataFrame) – dataframe variable

  • dataframe2 (pandas.DataFrame) – dataframe variable

  • special_column (str) – string of column you want to graph against

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.polyfit_and_show(dataframe, special_column_name, other_column_name, degree_poly, color1='purple')

This function fits a polynomial to two columns. It returns the coefficients of the fitted polynomial and also creates a graph as a side effect.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • special_column_name (str) – string of column you want to graph against

  • other_column_name (str) – string of column you want to graph

  • degree_poly (int) – degree of the polynomial: 1, 2 or 3 only

  • color1 (str) – string of color for graphing

Returns:

coefficients

Return type:

ndarray
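
The fitting step can be sketched with numpy.polyfit, which returns coefficients ordered from the highest power down. The sketch leaves out the graph that the documented function produces as a side effect. The helper name and the example column names are hypothetical.

```python
import numpy as np
import pandas as pd


def fit_polynomial(dataframe, special_column_name, other_column_name, degree_poly):
    """Fit other_column as a polynomial of special_column; return coefficients."""
    if degree_poly not in (1, 2, 3):
        raise ValueError("degree_poly must be 1, 2 or 3")
    x = dataframe[special_column_name].to_numpy(dtype=float)
    y = dataframe[other_column_name].to_numpy(dtype=float)
    return np.polyfit(x, y, degree_poly)


data = pd.DataFrame({"age": [0, 1, 2, 3, 4]})
# Exact quadratic relation, so the fit should recover 2, -3 and 5
data["volume"] = 2 * data["age"] ** 2 - 3 * data["age"] + 5
coefficients = fit_polynomial(data, "age", "volume", degree_poly=2)
```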

cvasl.seperated.polyfit_second_degree_to_df(dataframe, special_column_name, other_column_names)

This function creates polynomials for columns, as compared to a special column, in our case age. It returns the coefficients in a dataframe.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • special_column_name (str) – column name, usually age

  • other_column_names (list) – columns you want to get poly coefficients on

Returns:

coefficients

Return type:

ndarray

cvasl.seperated.preprocess(folder, file_extension, outcome_folder, log_cols=[], plus_one_log_columns=[])

Given a directory, this function searches all subdirectories for files with the noted file extension. Copies of the files are processed as specified: the specified columns are converted to their log, or to the log of the value plus one, and the results are then placed in the outcome folder.
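
The two column transforms mentioned above (log, and plus one then log) can be sketched on a single dataframe. The folder search and file copying are left out. The sketch assumes natural logarithms and assumes that the columns are overwritten in a copy; neither detail is confirmed by the source, and the helper and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def add_log_columns(dataframe, log_cols=(), plus_one_log_columns=()):
    """Return a copy with the listed columns replaced by log or log(1 + x)."""
    logged = dataframe.copy()
    for column in log_cols:
        logged[column] = np.log(logged[column])
    for column in plus_one_log_columns:
        # log1p computes log(1 + x), which stays finite for zero values
        logged[column] = np.log1p(logged[column])
    return logged


raw = pd.DataFrame({"volume": [1.0, np.e], "count": [0.0, np.e - 1.0]})
processed = add_log_columns(
    raw, log_cols=["volume"], plus_one_log_columns=["count"]
)
```

The plus-one variant is the usual choice for columns that can contain zeros, since the plain log of zero is not finite.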

cvasl.seperated.pull_off_unnamed_column(unclean, extra_columns=[])

This function takes a dataframe and drops any columns whose names contain the string “Unnamed”. It also drops the extra columns you pass in.
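
A minimal sketch of this clean-up is shown below. Columns named "Unnamed: 0" and similar typically appear when a CSV that was saved with its index is read back with pandas. The helper name and the example data are hypothetical.

```python
import pandas as pd


def drop_unnamed_columns(unclean, extra_columns=()):
    """Drop columns whose name contains 'Unnamed', plus any extra columns."""
    unnamed = [c for c in unclean.columns if "Unnamed" in str(c)]
    return unclean.drop(columns=unnamed + list(extra_columns))


raw = pd.DataFrame(
    {"Unnamed: 0": [0, 1], "age": [30, 40], "site": ["a", "b"]}
)
cleaned = drop_unnamed_columns(raw, extra_columns=["site"])
```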

cvasl.seperated.recode_sex(whole_dataframe, string_for_sex)

This function recodes sex into a new column if there are two possible values. It maintains numerical order but changes the values to 0 and 1. The new column is called ‘sex_encoded’. Note sex should be encoded in numbers i.e. ints or floats

Parameters:
  • whole_dataframe (pandas.DataFrame) – dataframe variable

  • string_for_sex (str) – column name written in single quotes

Returns:

dataframe with sex encoded column

Return type:

pandas.DataFrame
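
The recoding can be sketched as mapping the smaller of the two numeric values to 0 and the larger to 1, which preserves numerical order. The helper name is hypothetical, and raising an error when the column does not hold exactly two values is an assumption about behaviour the source does not describe.

```python
import pandas as pd


def encode_two_values(whole_dataframe, string_for_sex):
    """Add a 'sex_encoded' column holding 0 for the lower value, 1 for the higher."""
    values = sorted(whole_dataframe[string_for_sex].dropna().unique())
    if len(values) != 2:
        # Assumption: the real function may handle this case differently
        raise ValueError("expected exactly two distinct values")
    encoded = whole_dataframe.copy()
    mapping = {values[0]: 0, values[1]: 1}
    encoded["sex_encoded"] = encoded[string_for_sex].map(mapping)
    return encoded


people = pd.DataFrame({"sex": [1, 2, 2, 1]})
recoded = encode_two_values(people, "sex")
```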

cvasl.seperated.recode_sex_folder(directory)

This function recodes sex in the CSV files that have such a column in a specified directory, writing CSVs with a new column if there are two possible values. It maintains numerical order but changes the values to 0 and 1. The column is called changed. Note that sex should be encoded in numbers, i.e. ints, for many functions. The new files are produced as a side effect.

Parameters:

directory (str) – directory where the CSV files are located

Returns:

dataframes with sex encoded correctly

Return type:

list

cvasl.seperated.recode_sex_to_numeric(df)

When we need to flip the sex back to numbers from the suggested format, this function will turn females to 1 and males to 0.

cvasl.seperated.relate_columns_graphs(dataframe, special_column_name, saver=False)

This function makes a scatter plot of all columns.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • special_column_name (str) – string of column you want to graph against

  • saver (bool) – bool to indicate if graph pngs should be saved

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.relate_columns_graphs_numeric(dataframe, special_column_name, saver=False)

This function makes a scatter plot of all columns that are numeric.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • special_column_name (str) – string of column you want to graph against

  • saver (str) – string to indicate if graph pngs should be saved

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.relate_columns_graphs_two_dfs(dataframe1, dataframe2, special_column_name, other_column_name, color1='purple', color2='orange')

This function is meant to be a helper function for one that makes a scatter plot of all columns that two dataframes have in common

Parameters:
  • dataframe1 (pandas.DataFrame) – dataframe variable

  • dataframe2 (pandas.DataFrame) – dataframe variable

  • special_column_name (str) – str of column you graph against

  • other_column_name (str) – string of column you want to graph

Returns:

no return, makes artifact

Return type:

None.

cvasl.seperated.static_bin_age(dataframe)

This function applies static binning by age decade to a dataframe.
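
Binning by decade can be sketched with integer division. The column names "age" and "age_decade" and the helper name are hypothetical; the source does not say what the real function calls its input or output columns.

```python
import pandas as pd


def bin_age_by_decade(dataframe, age_column="age"):
    """Return a copy with an extra column holding the start of each age decade."""
    binned = dataframe.copy()
    # Floor division maps 20-29.99 to 20, 30-39.99 to 30, and so on
    binned["age_decade"] = (binned[age_column] // 10 * 10).astype(int)
    return binned


subjects = pd.DataFrame({"age": [23.5, 38.0, 61.2, 70.0]})
decades = bin_age_by_decade(subjects)
```

Unlike quantile bins, these edges do not depend on the data, so the same age always falls in the same bin across datasets.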

cvasl.seperated.stratified_cat_and_cont_categories_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, cat_category='sex', cont_category='age', splits=5, test_size_p=0.2, printed=False)

This takes a scikit-learn model and creates a dataframe based on (stratified) k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with models and then the training data from the model. This is a twist on StratifiedShuffleSplit that allows stratification on both a categorical and a continuous variable. Note that the categorical variable should already be converted into integers before this function is run. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.

Parameters:
  • model_name (str) – name of model

  • model_file_name (str) – name of file where specific model will be stored

  • scikit_model (str) – name of scikit-learn model

  • our_ml_matrix (pandas.DataFrame) – dataframe to work over

  • our_x (dataframe) – X or features column for machine learning

  • our_y (pandas.Series) – y or label column for machine learning

  • cat_category (str) – categorical variable column to stratify on, e.g. sex

  • cont_category (str) – continuous variable column to stratify on, e.g. age

  • splits (int) – number of folds desired

  • test_size_p (float) – percent to put into test

  • printed (bool) – printed information on folds option

Returns:

dataframe, y dataframe, and models

Return type:

tuple
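
One common way to stratify on a categorical and a continuous variable at once is to bin the continuous variable and combine the bin with the category into a single label. The sketch below builds such a label with pandas only. Whether cvasl builds its strata this way is an assumption, and the helper name, the number of bins, and the label format are hypothetical.

```python
import pandas as pd


def combined_strata(dataframe, cat_category="sex", cont_category="age", cont_bins=4):
    """Return one stratum label per row, e.g. '1_0' for category 1, age bin 0."""
    # Quantile-bin the continuous variable so each bin holds similar counts
    cont_binned = pd.qcut(dataframe[cont_category], q=cont_bins, labels=False)
    return dataframe[cat_category].astype(str) + "_" + cont_binned.astype(str)


matrix = pd.DataFrame(
    {
        "sex": [0, 1, 0, 1, 0, 1, 0, 1],
        "age": [21, 25, 38, 41, 55, 59, 72, 80],
    }
)
strata = combined_strata(matrix, cont_bins=2)
```

A label like this could then be passed as the y argument of the split method of scikit-learn's StratifiedShuffleSplit, created with a fixed random_state, so that each fold keeps the joint distribution of both variables.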

cvasl.seperated.stratified_one_category_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, category='sex', splits=5, test_size_p=0.2, printed=False)

This takes a scikit-learn model and creates a dataframe based on k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with models and then the training data from the model. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.

Parameters:
  • model_name (str) – name of model

  • model_file_name (str) – name of file where specific model will be stored

  • scikit_model (str) – name of scikit-learn model

  • our_ml_matrix (pandas.DataFrame) – dataframe to work over

  • our_x (dataframe) – X or features column for machine learning

  • our_y (pandas.Series) – y or label column for machine learning

  • category (str) – categorical variable (column) to be stratified on, e.g. sex

  • splits (int) – number of folds desired

  • test_size_p (float) – percent to put into test

  • printed (bool) – printed information on folds option

Returns:

dataframe, y dataframe, and models

Return type:

tuple