cvasl.harmony module

Copyright 2023 Netherlands eScience Center and the Amsterdam University Medical Center. Licensed under the Apache License, version 2.0. See LICENSE for details.

This file contains functions for processing CSV and TSV files as they relate to specific common harmonization algorithms. Most separated-values processing is in the seperated module; however, this module has been made so it can be called in environments compatible with common harmonization algorithms, which often require older versions of Python, pandas, and NumPy than were usual in 2023.

cvasl.harmony.compare_harm_multi_site_violins(unharmonized_df, harmonized_df, feature_list, batch_column='site')

Create a violin plot on multisite harmonization by features.

cvasl.harmony.compare_harm_one_site_violins(unharmonized_df, harmonized_df, feature_list, chosen_feature='sex')

Create a violin plot on single site harmonization by features, split on a binary feature of choice which defaults to sex.

cvasl.harmony.increment_keys(input_dict, chosen_value=1)

This function increments every key in a dictionary by a chosen value.
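
As an illustration of the behaviour described above, here is a minimal sketch. It assumes the keys are numeric and that a new dictionary is returned rather than the input being modified; `increment_keys_sketch` is a hypothetical stand-in, not the cvasl implementation.

```python
def increment_keys_sketch(input_dict, chosen_value=1):
    """Return a new dict whose keys are each shifted by chosen_value.

    Hypothetical stand-in for illustration; assumes numeric keys.
    """
    # Build a new dictionary so the input is left untouched.
    return {key + chosen_value: value for key, value in input_dict.items()}


site_labels = {0: "site_a", 1: "site_b", 2: "site_c"}
shifted = increment_keys_sketch(site_labels)
print(shifted)  # {1: 'site_a', 2: 'site_b', 3: 'site_c'}
```

This kind of shift can be handy when one tool numbers batches from 0 and another from 1.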

cvasl.harmony.log_out_columns(dataframe, column_list)

This function replaces the values in the specified columns of a dataframe with the log of those values, which can change the overall distributions.

Parameters:
  • dataframe (pandas.DataFrame) – dataframe variable

  • column_list (list) – column names

Returns:

dataframe with different (log) values in specified columns

Return type:

pandas.DataFrame
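
The log recoding can be sketched as follows. This is a minimal example under stated assumptions: it uses the natural logarithm and returns a copy, neither of which is confirmed by the documentation above, and `log_columns_sketch` is a hypothetical name rather than the cvasl function.

```python
import numpy as np
import pandas as pd


def log_columns_sketch(dataframe, column_list):
    """Return a copy of dataframe with the listed columns log-transformed.

    Hypothetical stand-in; the natural log and the copy are assumptions.
    """
    new_frame = dataframe.copy()
    for column in column_list:
        # Values of zero or below do not have a finite real logarithm.
        new_frame[column] = np.log(new_frame[column])
    return new_frame


example = pd.DataFrame(
    {
        "participant_id": ["p1", "p2", "p3"],
        "wmh_count": [1.0, 10.0, 100.0],
        "age": [30, 40, 50],
    }
)
logged = log_columns_sketch(example, ["wmh_count"])
print(logged)
```

Only the columns named in the list are changed; the others, such as `age` here, pass through unchanged.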

cvasl.harmony.make_topper(btF, row0, row1)

This function makes the top rows for harmonized output from the btF frame produced by the prep_for_neurocombat function, i.e. prep_for_neurocombat(dataframename1, dataframename2).

Parameters:
  • btF (pandas.DataFrame) – frame variable produced in prep_for_neurocombat

  • row0 (str) – frame column removed, e.g. age or sex

  • row1 (str) – frame column removed, e.g. age or sex

Returns:

dataframe called TopperF to add back

Return type:

pandas.DataFrame

cvasl.harmony.negative_harm_outcomes(folder, file_extension, number_columns=['sex', 'gm_vol', 'wm_vol', 'csf_vol', 'gm_icvratio', 'gmwm_icvratio', 'wmhvol_wmvol', 'wmh_count', 'deepwm_b_cov', 'aca_b_cov', 'mca_b_cov', 'pca_b_cov', 'totalgm_b_cov', 'deepwm_b_cbf', 'aca_b_cbf', 'mca_b_cbf', 'pca_b_cbf', 'totalgm_b_cbf'])

Given a directory, this function searches all subdirectories for files with the noted file extension. If all of the files are harmonization outcome files, it then returns a list of the files with negative values and prints information about the negatives in all files.
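
The search described above could look roughly like the sketch below. It is not the cvasl implementation: `find_files_with_negatives` is a hypothetical name, the extension is assumed to include its leading dot, and the choice of separator by file suffix is an assumption of this example.

```python
from pathlib import Path

import pandas as pd


def find_files_with_negatives(folder, file_extension, number_columns):
    """Return paths of files under folder with a negative value in any listed column.

    Hypothetical stand-in for illustration only.
    """
    flagged = []
    # rglob walks the folder and every subdirectory beneath it.
    for path in sorted(Path(folder).rglob("*" + file_extension)):
        # Assumption: .tsv files are tab-separated, everything else comma-separated.
        separator = "\t" if path.suffix == ".tsv" else ","
        frame = pd.read_csv(path, sep=separator)
        # Only check the listed columns that actually exist in this file.
        present = [name for name in number_columns if name in frame.columns]
        negative_counts = (frame[present] < 0).sum()
        negative_counts = negative_counts[negative_counts > 0]
        if not negative_counts.empty:
            # Report how many negative values each affected column holds.
            print(path, negative_counts.to_dict())
            flagged.append(str(path))
    return flagged
```

Negative values matter here because quantities such as volumes and counts cannot be below zero, so a negative in a harmonized file signals an implausible outcome.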

cvasl.harmony.prep_for_neurocombat(dataframe1, dataframe2)

This function takes two dataframes in the cvasl format, then turns them into the items needed for the neurocombat algorithm with re-identification.

Parameters:
  • dataframe1 – frame variable

  • dataframe2 – frame variable

Returns:

dataframes for the neurocombat algorithm and ints of some lengths

Return type:

tuple

cvasl.harmony.prep_for_neurocombat_5way(dataframe1, dataframe2, dataframe3, dataframe4, dataframe5)

This function takes five dataframes in the cvasl format, then turns them into the items needed for the neurocombat algorithm with re-identification.

Parameters:
  • dataframe1 – frame variable

  • dataframe2 – frame variable

  • dataframe3 – frame variable

  • dataframe4 – frame variable

  • dataframe5 – frame variable

Returns:

dataframes for the neurocombat algorithm and ints of some lengths

Return type:

tuple

cvasl.harmony.show_diff_on_var(dataset1, name_dataset1, dataset2, name_dataset2, var1, var2)

cvasl.harmony.show_diff_on_var3(dataset1, name_dataset1, dataset2, name_dataset2, dataset3, name_dataset3, var1, var2)

cvasl.harmony.show_diff_on_var5(dataset1, name_dataset1, dataset2, name_dataset2, dataset3, name_dataset3, dataset4, name_dataset4, dataset5, name_dataset5, var1, var2)

cvasl.harmony.split_frame_half_balanced_by_column(frame, column)

This function is made for a dataframe you want to split on a column with continuous values, e.g. age. It returns two dataframes in which the values in this column are about equally distributed, e.g. if age is the column variable, the average age in the two frames will be similar.

Parameters:
  • frame – frame variable

  • column (Series) – column name

Returns:

dataframes evenly distributed on values in specified column

Return type:

pandas.DataFrame
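
One simple way to get such a balanced split is to sort on the column and deal rows alternately into two frames. The sketch below takes that approach as an assumption; the actual cvasl strategy is not described here, and `split_half_balanced_sketch` is a hypothetical name.

```python
import pandas as pd


def split_half_balanced_sketch(frame, column):
    """Split frame into two halves with similar distributions of column.

    Hypothetical stand-in; alternate-row dealing is an assumed strategy.
    """
    ordered = frame.sort_values(by=column)
    # Every other row goes to each half, so both span the full value range.
    first_half = ordered.iloc[::2]
    second_half = ordered.iloc[1::2]
    return first_half, second_half


ages = pd.DataFrame(
    {
        "participant_id": list("abcdefgh"),
        "age": [21, 67, 34, 58, 45, 72, 29, 50],
    }
)
half_a, half_b = split_half_balanced_sketch(ages, "age")
print(half_a["age"].mean(), half_b["age"].mean())
```

Because neighbouring values land in different halves, both frames cover the youngest through oldest participants, which keeps their averages close.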

cvasl.harmony.top_and_bottom_by_column(frame, column)

This is useful in cases where you want to split on a column with continuous values, e.g. age, and you want the highest and lowest values separated.

Parameters:
  • frame – frame variable

  • column (Series) – column name

Returns:

dataframes unevenly distributed on values in specified column

Return type:

pandas.DataFrame
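
For contrast with the balanced split, a top-and-bottom split can be sketched by sorting on the column and cutting at the midpoint. The cut point and the order of the returned frames are assumptions of this example, and `top_and_bottom_sketch` is a hypothetical name rather than the cvasl function.

```python
import pandas as pd


def top_and_bottom_sketch(frame, column):
    """Split frame into the higher-valued and lower-valued halves of column.

    Hypothetical stand-in; the midpoint cut and return order are assumptions.
    """
    ordered = frame.sort_values(by=column)
    midpoint = len(ordered) // 2
    # Rows before the midpoint hold the lowest values, the rest the highest.
    bottom = ordered.iloc[:midpoint]
    top = ordered.iloc[midpoint:]
    return top, bottom


ages = pd.DataFrame(
    {
        "participant_id": list("abcdefgh"),
        "age": [21, 67, 34, 58, 45, 72, 29, 50],
    }
)
older, younger = top_and_bottom_sketch(ages, "age")
print(older["age"].tolist(), younger["age"].tolist())
```

A split like this deliberately produces two frames that differ on the chosen column, which can be useful for testing how harmonization behaves on mismatched groups.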