cvasl.seperated module
Copyright 2023 Netherlands eScience Center and the Amsterdam University Medical Center. Licensed under the Apache License, version 2.0. See LICENSE for details.
This module contains functions for converting CSV and TSV files into the correct formats.
- cvasl.seperated.avg_k_folds(frame)
This function takes a dataframe of k-fold results, formatted as in our experiments with derived datasets, and returns an averaged dataframe.
- cvasl.seperated.bin_dataset(dataframe, column, num_bins=4, graph=False)
This function creates an additional column in which a continuous variable is binned into 2 or 4 parts.
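As an illustration of this kind of binning, here is a minimal sketch in plain pandas. It is not the cvasl implementation: the helper name `bin_column`, the `_binned` suffix, and the choice of quantile bins (rather than equal-width bins) are all assumptions made for the example.

```python
import pandas as pd


def bin_column(dataframe, column, num_bins=4):
    """Return a copy with a '<column>_binned' column of quantile bin codes.

    Hypothetical helper: quantile binning is an assumption, not
    necessarily what cvasl does internally.
    """
    if num_bins not in (2, 4):
        raise ValueError("num_bins must be 2 or 4")
    out = dataframe.copy()
    # qcut splits at quantiles; labels=False yields integer codes 0..num_bins-1
    out[column + "_binned"] = pd.qcut(out[column], q=num_bins, labels=False)
    return out


df = pd.DataFrame({"age": [21, 34, 45, 52, 60, 67, 73, 88]})
binned = bin_column(df, "age", num_bins=4)
print(binned)
```

With eight distinct ages and four quantile bins, each bin receives two rows.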
- cvasl.seperated.check_identical_columns(tsv_path, header=0)
This function takes the path to a folder and returns the columns that are exactly duplicated, in both name and values, across all files.
needs more
- cvasl.seperated.check_sex_dimorph_expectations(dataframe)
This function checks that, as expected, men have larger brains than women in a given dataframe.
- cvasl.seperated.concat_double_header(dataframe_dub)
This function concatenates the two headers of a dataframe.
- Parameters:
dataframe_dub (pandas.DataFrame) – dataframe with double header
- Returns:
dataframe with a single header
- Return type:
pandas.DataFrame
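To make the idea of a double header concrete, the sketch below flattens a two-level pandas column index into single names. The helper name `flatten_double_header` and the underscore separator are assumptions for illustration; the separator cvasl actually uses is not stated here.

```python
import pandas as pd


def flatten_double_header(dataframe_dub, sep="_"):
    """Join the two levels of a two-row column header into single names.

    Hypothetical helper; the separator is an assumption.
    """
    out = dataframe_dub.copy()
    # Each column of a MultiIndex is a tuple such as ("gm", "vol")
    out.columns = [sep.join(str(level) for level in col) for col in out.columns]
    return out


columns = pd.MultiIndex.from_tuples([("gm", "vol"), ("wm", "vol")])
double = pd.DataFrame([[1.0, 2.0]], columns=columns)
single = flatten_double_header(double)
print(list(single.columns))
```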
- cvasl.seperated.derived_function(column, a, b, c)
This function derives a projected value for any parameter from a polynomial of age versus that parameter, given that your data is in a dataframe format.
- Parameters:
column (pandas.core.series.Series) – pandas dataframe variable column
a (float) – first coefficient
b (float) – second coefficient
c (float) – final term in polynomial
- Returns:
series
- Return type:
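The projection described for derived_function can be sketched as follows, assuming the polynomial has the second-degree form a·x² + b·x + c. That form is inferred from the parameter descriptions (two coefficients and a final term) and is not confirmed by the source; the helper name and the coefficient values are made up for the example.

```python
import pandas as pd


def project_from_polynomial(column, a, b, c):
    """Evaluate a*x**2 + b*x + c element-wise on a pandas Series.

    Hypothetical helper; the quadratic form is an assumption.
    """
    return a * column**2 + b * column + c


ages = pd.Series([20.0, 40.0, 60.0])
# Illustrative coefficients only, not fitted to any real data
projected = project_from_polynomial(ages, a=0.01, b=-1.0, c=100.0)
print(projected)
```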
- cvasl.seperated.drop_columns_folder(directory, list_droppables)
This function works on the CSV files in a folder. It drops unnamed columns and other unwanted columns; the resulting files are then available in a new folder called ‘stripped’.
- cvasl.seperated.drop_y(df)
This is meant as a pseudo-helper function for pandas columns after they are merged. It drops columns whose names end in y.
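For context, pandas appends `_x` and `_y` to overlapping column names when two dataframes are merged. The sketch below shows one way to drop the right-hand duplicates afterwards. It assumes the columns of interest end in `_y` specifically; the helper name `drop_y_columns` is hypothetical.

```python
import pandas as pd


def drop_y_columns(df):
    """Drop columns whose names end in '_y' (pandas' default right-hand merge suffix).

    Hypothetical helper; matching on '_y' is an assumption.
    """
    return df.drop(columns=[c for c in df.columns if str(c).endswith("_y")])


left = pd.DataFrame({"id": [1, 2], "age": [30, 40]})
right = pd.DataFrame({"id": [1, 2], "age": [30, 40], "sex": [0, 1]})
merged = left.merge(right, on="id")  # columns: id, age_x, age_y, sex
cleaned = drop_y_columns(merged)
print(list(cleaned.columns))
```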
- cvasl.seperated.find_original_y_values(polynomial, output_value)
Finds the original y-values of a second or third degree polynomial given its coefficients and an output value.
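One way to invert a polynomial for a given output is to subtract the output from the constant term and find the real roots. The sketch below does this with `numpy.roots`. It assumes coefficients are ordered from the highest degree down (the NumPy convention), which the source does not specify; the helper name is hypothetical.

```python
import numpy as np


def solve_polynomial_for_output(polynomial, output_value):
    """Return the real inputs at which the polynomial equals output_value.

    Hypothetical helper; coefficient order (highest degree first) is an assumption.
    """
    coeffs = np.array(polynomial, dtype=float)
    coeffs[-1] -= output_value  # solve p(x) - output_value = 0
    roots = np.roots(coeffs)
    # Keep only roots whose imaginary part is numerically zero
    return np.sort(roots[np.isclose(roots.imag, 0.0)].real)


# x**2 - 4 = 5 has the real solutions x = -3 and x = 3
inputs = solve_polynomial_for_output([1.0, 0.0, -4.0], 5.0)
print(inputs)
```

A second or third degree polynomial can have several real solutions, so the result is an array rather than a single value.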
- cvasl.seperated.find_outliers_by_list(dataframe, column_list, number_sd)
This function finds outliers, defined as values more than a given number of standard deviations (number_sd) from the mean, in a list of specific columns, then returns these rows of the dataframe.
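The standard-deviation rule described above can be sketched in plain pandas as follows. This is not the cvasl code: the helper name is hypothetical, and the use of the sample standard deviation (pandas' default, `ddof=1`) and of "any listed column" as the row criterion are assumptions.

```python
import pandas as pd


def rows_outside_sd(dataframe, column_list, number_sd):
    """Return rows where any listed column lies more than number_sd SDs from its mean.

    Hypothetical helper; uses the sample standard deviation (ddof=1).
    """
    subset = dataframe[column_list]
    z = (subset - subset.mean()) / subset.std()
    return dataframe[(z.abs() > number_sd).any(axis=1)]


df = pd.DataFrame(
    {"volume": [10.0, 11.0, 9.0, 10.0, 10.0, 11.0, 9.0, 10.0, 10.0, 30.0]}
)
outliers = rows_outside_sd(df, ["volume"], number_sd=2)
print(outliers)
```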
- cvasl.seperated.folder_chain_out_columns(datasets_folder, columns, output_folder)
This function works on the CSV files in a folder, at any folder level inside it. It drops the unwanted columns; the resulting files are then available in a new folder, as specified.
- cvasl.seperated.generate_transformation_matrix(polynomial1, polynomial2)
Generates a matrix that transforms one polynomial into another.
- Parameters:
polynomial1 (Sequence) – coefficients of the polynomial in form (a1, b1, …)
polynomial2 (Sequence) – coefficients of the polynomial in form (a2, b2, …)
- Returns:
m, an array
- Return type:
ndarray
- cvasl.seperated.make_log_file(file_name, list_of_columns)
This function recodes columns of a CSV file into their log values.
- Parameters:
file_name (str) – csv with variables as columns
- Returns:
dataframe
- Return type:
dataframe
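A minimal sketch of log-recoding selected CSV columns is shown below. It assumes the natural logarithm and that the listed columns are overwritten in place; neither detail is stated in the source. The helper name `log_columns` is hypothetical, and an in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np
import pandas as pd


def log_columns(csv_source, list_of_columns):
    """Read a CSV and replace the listed columns with their natural log.

    Hypothetical helper; natural log and in-place replacement are assumptions.
    """
    frame = pd.read_csv(csv_source)
    frame[list_of_columns] = np.log(frame[list_of_columns])
    return frame


csv_text = io.StringIO("participant,volume\nsub-01,1.0\nsub-02,10.0\n")
logged = log_columns(csv_text, ["volume"])
print(logged)
```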
- cvasl.seperated.make_log_folder(directory, list_of_columns)
This function takes the CSV files in a specified directory that contain such a column and turns them into CSV files with a new column that is the log of the old one. The new files are produced as a side effect.
- cvasl.seperated.plot_2on2_df(dataframe1, dataframe2, special_column, color1='purple', color2='orange')
This function creates an artifact for two datasets with comparable variables, graphing the variables against a variable of interest.
- Parameters:
dataframe1 (pandas.DataFrame) – dataframe variable
dataframe2 (pandas.DataFrame) – dataframe variable
special_column (str) – string of column you want to graph against
- Returns:
no return, makes artifact
- Return type:
None.
- cvasl.seperated.polyfit_and_show(dataframe, special_column_name, other_column_name, degree_poly, color1='purple')
This function fits a polynomial to two columns. It returns the coefficients of a 2nd degree polynomial and also creates a graph as a side effect.
- Parameters:
- Returns:
coefficients
- Return type:
- cvasl.seperated.polyfit_second_degree_to_df(dataframe, special_column_name, other_column_names)
This function fits polynomials for columns against a special column, in our case age. It returns the coefficients in a dataframe.
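The per-column fitting described above can be sketched with `numpy.polyfit`. The helper name `second_degree_fits` and the layout of the returned dataframe (one row per column, with coefficient columns `a`, `b`, `c`) are assumptions for illustration and may differ from what cvasl returns.

```python
import numpy as np
import pandas as pd


def second_degree_fits(dataframe, special_column_name, other_column_names):
    """Fit y = a*x**2 + b*x + c for each listed column against the special column.

    Hypothetical helper; the output layout is an assumption.
    """
    rows = []
    for name in other_column_names:
        # polyfit returns coefficients from the highest degree down
        a, b, c = np.polyfit(dataframe[special_column_name], dataframe[name], 2)
        rows.append({"column": name, "a": a, "b": b, "c": c})
    return pd.DataFrame(rows)


age = np.array([20.0, 30.0, 40.0, 50.0, 60.0])
# Synthetic data that follows an exact quadratic, for demonstration only
df = pd.DataFrame({"age": age, "gm_vol": 0.5 * age**2 - 3.0 * age + 700.0})
coefficients = second_degree_fits(df, "age", ["gm_vol"])
print(coefficients)
```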
- cvasl.seperated.preprocess(folder, file_extension, outcome_folder, log_cols=[], plus_one_log_columns=[])
Given a directory, this function searches all subdirectories for files with the noted file extension. Copies of the files are processed as specified: the specified columns are turned into their log, or into +1 and then log, and the results are put in the outcome folder.
- cvasl.seperated.pull_off_unnamed_column(unclean, extra_columns=[])
This function takes a dataframe and drops any columns whose names contain the string “Unnamed”. It also drops the extra columns you input.
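Columns named like `Unnamed: 0` typically appear when a CSV that was saved with its index is read back in. A minimal sketch of the cleanup, with the hypothetical helper name `drop_unnamed`, might look like this:

```python
import pandas as pd


def drop_unnamed(unclean, extra_columns=()):
    """Drop columns whose name contains 'Unnamed', plus any extra columns given.

    Hypothetical helper, not the cvasl implementation.
    """
    unnamed = [c for c in unclean.columns if "Unnamed" in str(c)]
    return unclean.drop(columns=unnamed + list(extra_columns))


df = pd.DataFrame({"Unnamed: 0": [0, 1], "age": [30, 40], "site": ["a", "b"]})
clean = drop_unnamed(df, extra_columns=["site"])
print(list(clean.columns))
```

The sketch uses an empty tuple as the default rather than a list, which avoids sharing one mutable default object between calls.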
- cvasl.seperated.recode_sex(whole_dataframe, string_for_sex)
This function recodes sex into a new column if there are two possible values. It maintains numerical order but changes the values to 0 and 1. The new column is called ‘sex_encoded’. Note that sex should be encoded as numbers, i.e. ints or floats.
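The recoding rule described above (two values, numerical order kept, mapped to 0 and 1) can be sketched as follows. The helper name `encode_two_values` is hypothetical, and leaving the dataframe without a new column when there are not exactly two values is an assumption about the edge case.

```python
import pandas as pd


def encode_two_values(whole_dataframe, string_for_sex):
    """Add 'sex_encoded' with the lower original value as 0 and the higher as 1.

    Hypothetical helper; behaviour for other than two values is an assumption.
    """
    out = whole_dataframe.copy()
    values = sorted(out[string_for_sex].dropna().unique())
    if len(values) == 2:
        # Lower value maps to 0 and higher to 1, so numerical order is kept
        out["sex_encoded"] = out[string_for_sex].map({values[0]: 0, values[1]: 1})
    return out


df = pd.DataFrame({"sex": [1, 2, 2, 1]})
encoded = encode_two_values(df, "sex")
print(encoded)
```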
- cvasl.seperated.recode_sex_folder(directory)
This function recodes sex in the CSV files in a specified directory that contain such a column, producing CSV files with a new column if there are two possible values. It maintains numerical order but changes the values to 0 and 1. The column is called changed. Note that sex should be encoded as numbers, i.e. ints, for many functions. The new files are produced as a side effect.
- cvasl.seperated.recode_sex_to_numeric(df)
When sex needs to be converted back to numbers from the suggested format, this function turns females into 1 and males into 0.
- cvasl.seperated.relate_columns_graphs(dataframe, special_column_name, saver=False)
This function makes a scatter plot of all columns.
- Parameters:
dataframe (pandas.DataFrame) – dataframe variable
special_column_name (str) – string of column you want to graph against
saver (bool) – bool to indicate if graph pngs should be saved
- Returns:
no return, makes artifact
- Return type:
None.
- cvasl.seperated.relate_columns_graphs_numeric(dataframe, special_column_name, saver=False)
This function makes a scatter plot of all columns that are numeric.
- cvasl.seperated.relate_columns_graphs_two_dfs(dataframe1, dataframe2, special_column_name, other_column_name, color1='purple', color2='orange')
This function is meant as a helper for a function that makes a scatter plot of all columns that two dataframes have in common.
- cvasl.seperated.static_bin_age(dataframe)
This function applies static binning by age decade to a dataframe.
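Static binning by decade can be sketched with integer division. The helper name, the input column name `age`, and the output column name `decade` are assumptions made for the example, not names taken from cvasl.

```python
import pandas as pd


def bin_age_by_decade(dataframe, age_column="age"):
    """Add a 'decade' column holding the lower bound of each age decade.

    Hypothetical helper; column names are assumptions.
    """
    out = dataframe.copy()
    # Floor division by 10 drops the last digit; multiplying restores the scale
    out["decade"] = (out[age_column] // 10 * 10).astype(int)
    return out


df = pd.DataFrame({"age": [23, 30, 39.5, 71]})
decades = bin_age_by_decade(df)
print(decades)
```

Unlike quantile binning, these bin edges are fixed in advance and do not depend on the distribution of ages in the data.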
- cvasl.seperated.stratified_cat_and_cont_categories_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, cat_category='sex', cont_category='age', splits=5, test_size_p=0.2, printed=False)
This takes a scikit-learn coded model and creates a dataframe based on (stratified) k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with the models and then the training data from the model. This is a twist on Stratified Shuffle Split that allows stratification on both a categorical and a continuous variable. Note that the categorical variable should already be converted into integers before this function is run. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.
- Parameters:
model_name (str) – name of model
model_file_name (str) – name of file where specific model will be stored
scikit_model (str) – name of scikit-learn model
our_ml_matrix (pandas.DataFrame) – dataframe to work over
our_x (dataframe) – X or features column for machine learning
our_y (pandas.core.series.Series) – y or label column for machine learning
cat_category (str) – categorical variable column to stratify on, e.g. sex
cont_category (str) – continuous variable column to stratify on, e.g. age
splits (int) – number of folds desired
test_size_p (float) – percent to put into test
printed (bool) – printed information on folds option
- Returns:
dataframe, y dataframe, and models
- Return type:
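One common way to stratify on a categorical and a continuous variable at once is to bin the continuous variable and combine the bin with the category into a single label. The sketch below builds such a label in pandas. It is not the cvasl implementation: the helper name `combined_strata`, the use of quantile bins, and the number of bins are assumptions.

```python
import pandas as pd


def combined_strata(frame, cat_category="sex", cont_category="age", bins=4):
    """Label each row with its category plus the quantile bin of the continuous variable.

    Hypothetical helper; quantile binning and bins=4 are assumptions.
    """
    cont_bin = pd.qcut(frame[cont_category], q=bins, labels=False)
    return frame[cat_category].astype(str) + "_" + cont_bin.astype(str)


df = pd.DataFrame(
    {
        "sex": [0, 1, 0, 1, 0, 1, 0, 1],
        "age": [21, 24, 35, 38, 52, 55, 70, 74],
    }
)
strata = combined_strata(df)
print(strata.tolist())
```

A label like this can be passed as the `y` argument of the `split` method of scikit-learn's `StratifiedShuffleSplit`, with `random_state` fixed for reproducibility. Each stratum needs at least two members for the split to succeed, so a real dataset would need more rows per stratum than this toy example.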
- cvasl.seperated.stratified_one_category_shuffle_split(model_name, model_file_name, scikit_model, our_ml_matrix, our_x, our_y, category='sex', splits=5, test_size_p=0.2, printed=False)
This takes a scikit-learn coded model and creates a dataframe based on k-folds of results on our_ml_matrix and its X component. It returns a dataframe of fold results and raw y_test versus y_pred, as well as a tuple with the models and then the training data from the model. The random state in the StratifiedShuffleSplit is set, so the results should be reproducible.
- Parameters:
model_name (str) – name of model
model_file_name (str) – name of file where specific model will be stored
scikit_model (str) – name of scikit-learn model
our_ml_matrix (pandas.DataFrame) – dataframe to work over
our_x (dataframe) – X or features column for machine learning
our_y (pandas.core.series.Series) – y or label column for machine learning
category (str) – categorical variable (column) to be stratified on, e.g. sex
splits (int) – number of folds desired
test_size_p (float) – percent to put into test
printed (bool) – printed information on folds option
- Returns:
dataframe, y dataframe, and models
- Return type: