cvasl.vendor.comscan package¶
Submodules¶
cvasl.vendor.comscan.clustering module¶
- class cvasl.vendor.comscan.clustering.KElbowVisualizer(estimator, ax=None, k=10, metric='distortion', timings=True, locate_elbow=True, n_jobs=None, verbose=0, pre_dispatch='2*n_jobs', **kwargs)¶
Bases:
ClusteringScoreVisualizer
The K-Elbow Visualizer implements the “elbow” method of selecting the optimal number of clusters for K-means clustering. K-means is a simple unsupervised machine learning algorithm that groups data into a specified number (k) of clusters. Because the user must specify in advance what k to choose, the algorithm is somewhat naive – it assigns all members to k clusters even if that is not the right k for the dataset.
The elbow method runs k-means clustering on the dataset for a range of values of k (say from 1 to 10) and then, for each value of k, computes an average score for all clusters. By default, the distortion score is computed: the sum of squared distances from each point to its assigned center. Other metrics can also be used, such as the silhouette score (the mean silhouette coefficient for all samples) or the calinski_harabasz score (the ratio of dispersion between and within clusters).
When these overall metrics for each model are plotted, it is possible to visually determine the best value for k. If the line chart looks like an arm, then the “elbow” (the point of inflection on the curve) is the best value of k. The “arm” can be either up or down, but if there is a strong inflection point, it is a good indication that the underlying model fits best at that point.
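The idea of reading the elbow off a score curve can be sketched in a few lines. This is a simplified heuristic, not the knee point detection algorithm the visualizer uses: it picks the k with the largest vertical gap from the straight line joining the first and last points of the normalised curve. The function name `locate_elbow` and the sample scores are made up for illustration.

```python
def locate_elbow(k_values, scores):
    """Return the k with the largest vertical gap from the chord joining
    the first and last points of the curve (simplified heuristic)."""
    if len(k_values) != len(scores) or len(k_values) < 3:
        raise ValueError("need at least three (k, score) pairs")
    x0, x1 = k_values[0], k_values[-1]
    y0, y1 = scores[0], scores[-1]
    # Normalise both axes to [0, 1] so the result does not depend on units.
    x_span = (x1 - x0) or 1.0
    y_lo = min(scores)
    y_span = (max(scores) - y_lo) or 1.0
    best_k, best_gap = k_values[0], -1.0
    for k, s in zip(k_values, scores):
        x = (k - x0) / x_span
        y = (s - y_lo) / y_span
        # Height of the chord at this x, in normalised coordinates.
        chord = (y0 - y_lo) / y_span + ((y1 - y0) / y_span) * x
        gap = abs(y - chord)
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k


# Hypothetical distortion scores for k = 2..7.
k_values = [2, 3, 4, 5, 6, 7]
distortions = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
elbow = locate_elbow(k_values, distortions)
print(elbow)
```

With these made-up scores the curve flattens after k = 4, which is the value the heuristic selects.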
Note
yellowbrick.cluster.elbow
Implements the elbow method for determining the optimal number of clusters.
Author: Benjamin Bengfort
Created: Thu Mar 23 22:36:31 2017 -0400
Copyright (C) 2016 The scikit-yb developers
For license information, see LICENSE.txt
ID: elbow.py [5a370c8] benjamin@bengfort.com
Parameters¶
- estimator : a scikit-learn clusterer
Should be an instance of an unfitted clusterer, specifically KMeans or MiniBatchKMeans. If it is not a clusterer, an exception is raised.
- ax : matplotlib Axes, default: None
The axes to plot the figure on. If None is passed in the current axes will be used (or generated if required).
- k : integer, tuple, or iterable
The k values to compute scores for. If a single integer is specified, the range (2, k) is computed. If a tuple of 2 integers is specified, then k will be in np.arange(k[0], k[1]). Otherwise, specify an iterable of integers to use as values for k.
- metric : string, default: "distortion"
Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include:
distortion: mean sum of squared distances to centers
silhouette: mean ratio of intra-cluster and nearest-cluster distance
calinski_harabasz: ratio of between-cluster to within-cluster dispersion
- timings : bool, default: True
Display the fitting time per k to evaluate the amount of time required to train the clustering model.
- locate_elbow : bool, default: True
Automatically find the “elbow” or “knee” which likely corresponds to the optimal value of k using the “knee point detection algorithm”. The knee point detection algorithm finds the point of maximum curvature, which in a well-behaved clustering problem also represents the pivot of the elbow curve. The point is labeled with a dashed line and annotated with the score and k values.
- n_jobs : int, default: None
Number of jobs to run in parallel. Training the estimator and computing the score are parallelized over the cross-validation splits. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- verbose : int, default: 0
The verbosity level.
- pre_dispatch : int or str, default: ‘2*n_jobs’
Controls the number of jobs that get dispatched during parallel execution. Reducing this number can be useful to avoid an explosion of memory consumption when more jobs get dispatched than CPUs can process. This parameter can be:
  - None, in which case all the jobs are immediately created and spawned. Use this for lightweight and fast-running jobs, to avoid delays due to on-demand spawning of the jobs.
  - An int, giving the exact number of total jobs that are spawned.
  - A str, giving an expression as a function of n_jobs, as in ‘2*n_jobs’.
- kwargs : dict
Keyword arguments that are passed to the base class and may influence the visualization as defined in other Visualizers.
Attributes¶
- k_scores_ : array of shape (n,) where n is no. of k values
The silhouette score corresponding to each k value.
- k_timers_ : array of shape (n,) where n is no. of k values
The time taken to fit the KMeans model for each k value.
- elbow_value_ : integer
The optimal value of k.
- elbow_score_ : float
The silhouette score corresponding to the optimal value of k.
- estimators_ : BaseEstimator
A fitted scikit-learn estimator.
Examples¶
>>> import numpy as np
>>> from yellowbrick.cluster import KElbowVisualizer
>>> from sklearn.cluster import KMeans
>>> model = KElbowVisualizer(KMeans(), k=10)
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> model.fit(X)
>>> model.show()
Notes¶
The modification from yellowbrick consists of retrieving the best estimator based on the elbow_value that was found.
If you get a visualizer that doesn’t have an elbow or inflection point, then this method may not be working. The elbow method does not work well if the data is not very clustered; in this case, you might see a smooth curve and the value of k is unclear. Other scoring methods, such as BIC or SSE, also can be used to explore if clustering is a correct choice.
For a discussion on the Elbow method, read more at Robert Gove’s Block website. For more on the knee point detection algorithm see the paper “Finding a “kneedle” in a Haystack”.
See also
The scikit-learn documentation for the silhouette_score and calinski_harabasz_score. The default, distortion_score, is implemented in yellowbrick.cluster.elbow.
- draw()¶
Draw the elbow curve for the specified scores and values of K.
- finalize()¶
Prepare the figure for rendering by setting the title as well as the X and Y axis labels and adding the legend.
- fit(X, y=None, **kwargs)¶
Fits n KMeans models where n is the length of self.k_values_, storing the silhouette scores in the self.k_scores_ attribute. The “elbow” and silhouette score corresponding to it are stored in self.elbow_value and self.elbow_score respectively. This method finishes up by calling draw to create the plot.
- class cvasl.vendor.comscan.clustering.KMeansConstrainedMissing(n_clusters=8, size_min=None, size_max=None, em_iter=10, n_init=10, max_iter=300, features_reduction: str | None = None, n_components: int = 2, tol=0.0001, verbose=False, random_state=None, copy_x=True, n_jobs=1)¶
Bases: TransformerMixin, ClusterMixin, BaseEstimator
K-Means clustering with minimum and maximum cluster size constraints with possible missing values
Note
Inspired by https://stackoverflow.com/questions/35611465/python-scikit-learn-clustering-with-missing-data
Parameters¶
- n_clusters : int, optional, default: 8
The number of clusters to form as well as the number of centroids to generate.
- size_min : int, optional, default: None
Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constraints will be applied.
- size_max : int, optional, default: None
Constrain the label assignment so that each cluster has a maximum size of size_max. If None, no constraints will be applied.
- em_iter : int, default: 10
Number of expectation–maximization (EM) iterations used to converge the imputed missing values. Used when no feature reduction is applied and the data contain missing values.
- n_init : int, default: 10
Number of times the k-means algorithm will be run with different centroid seeds. The final result will be the best output of n_init consecutive runs in terms of inertia.
- max_iter : int, default: 300
Maximum number of iterations of the k-means algorithm for a single run.
- features_reduction : str, default: None
Method for reducing the embedded space to n_components dimensions. Can be pca or umap.
- n_components : int, default: 2
Dimension of the embedded space for feature reduction.
- tol : float, default: 1e-4
Relative tolerance with regard to inertia to declare convergence.
- verbose : int, default: 0
Verbosity mode.
- random_state : int, RandomState instance or None, optional, default: None
If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- copy_x : boolean, default: True
When pre-computing distances it is more numerically accurate to center the data first. If copy_x is True, then the original data is not modified. If False, the original data is modified, and put back before the function returns, but small numerical differences may be introduced by subtracting and then adding the data mean.
- n_jobs : int, default: 1
The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
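The n_jobs rule described above is plain arithmetic. The sketch below only restates the documented formula; `resolve_n_jobs` is a hypothetical helper, not part of this package, and rejecting n_jobs = 0 is an assumption.

```python
def resolve_n_jobs(n_jobs, n_cpus):
    """Number of workers implied by n_jobs on a machine with n_cpus CPUs."""
    if n_jobs == 0:
        # Assumption: zero workers is treated as an invalid request.
        raise ValueError("n_jobs == 0 has no meaning")
    if n_jobs > 0:
        return n_jobs
    # -1 -> all CPUs, -2 -> all but one, and so on; never fewer than 1.
    return max(1, n_cpus + 1 + n_jobs)


for requested in (1, -1, -2):
    print(requested, "->", resolve_n_jobs(requested, n_cpus=8))
```

On an 8-CPU machine, n_jobs = -2 resolves to 8 + 1 - 2 = 7 workers.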
Attributes¶
- cluster_centers_ : array, [n_clusters, n_features]
Coordinates of cluster centers
- labels_ :
Labels of each point
- inertia_ : float
Sum of squared distances of samples to their closest cluster center.
cls_ : KMeansConstrained classifier object
cls_features_reduction_ : PCA or UMAP reduction object
- centroids_ : array
Centroids found at the last iteration of k-means.
- X_hat_ : array
Copy of X with the missing values filled in.
mu_ : Columns means
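The X_hat_ and mu_ attributes suggest that missing entries start out filled with column means. The sketch below shows only that initial fill in plain Python; how the class refines the values over em_iter iterations is not stated here, and `fill_with_column_means` is a hypothetical name.

```python
import math


def fill_with_column_means(rows):
    """Return (filled_rows, column_means); NaN entries are replaced by the
    mean of the observed values in the same column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        if not observed:
            raise ValueError(f"column {j} has no observed values")
        means.append(sum(observed) / len(observed))
    filled = [
        [means[j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r in rows
    ]
    return filled, means


X = [[1.0, 2.0], [math.nan, 4.0], [3.0, math.nan]]
X_hat, mu = fill_with_column_means(X)
print(mu)     # column means computed from observed values only
print(X_hat)  # copy of X with the gaps filled
```

The original rows are left untouched; only the returned copy carries the filled values.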
Examples¶
>>> import numpy as np
>>> from cvasl.vendor.comscan.clustering import KMeansConstrainedMissing
>>> X = np.array([[1, 2], [1, 4], [1, 0],
...               [4, 2], [4, 4], [4, 0]])
>>> clf = KMeansConstrainedMissing(
...     n_clusters=2,
...     size_min=2,
...     size_max=5,
...     random_state=0
... )
>>> clf.fit_predict(X)
array([0, 0, 0, 1, 1, 1], dtype=int32)
>>> clf.cluster_centers_
array([[ 1.,  2.],
       [ 4.,  2.]])
>>> clf.labels_
array([0, 0, 0, 1, 1, 1], dtype=int32)
Notes¶
K-means problem constrained with a minimum and/or maximum size for each cluster.
The constrained assignment is formulated as a Minimum Cost Flow (MCF) linear network optimisation problem, which is then solved using a cost-scaling push-relabel algorithm. The implementation used is SimpleMinCostFlow from Google’s Operations Research tools (OR-Tools).
- Ref:
1. Bradley, P. S., K. P. Bennett, and Ayhan Demiriz. “Constrained k-means clustering.” Microsoft Research, Redmond (2000): 1-8.
2. Google’s SimpleMinCostFlow implementation: https://github.com/google/or-tools/blob/master/ortools/graph/min_cost_flow.h
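To see what the size constraints do to the assignment step, here is a brute-force sketch. It deliberately swaps the minimum cost flow solver for exhaustive search, so it is only usable on a handful of points; `constrained_assign` and the sample data are hypothetical.

```python
from itertools import product


def constrained_assign(points, centers, size_min, size_max):
    """Exhaustively search label assignments that respect the size bounds
    and return the one with the smallest total squared distance."""
    def sq_dist(p, c):
        return sum((a - b) ** 2 for a, b in zip(p, c))

    k = len(centers)
    best_labels, best_cost = None, float("inf")
    for labels in product(range(k), repeat=len(points)):
        counts = [labels.count(j) for j in range(k)]
        if any(c < size_min or c > size_max for c in counts):
            continue  # violates a cluster-size constraint
        cost = sum(sq_dist(p, centers[j]) for p, j in zip(points, labels))
        if cost < best_cost:
            best_labels, best_cost = list(labels), cost
    return best_labels, best_cost


points = [(0.0,), (1.0,), (2.0,), (10.0,)]
centers = [(1.0,), (10.0,)]
labels, cost = constrained_assign(points, centers, size_min=2, size_max=2)
print(labels, cost)
```

Without constraints the point at 2.0 would join the first cluster; forcing two members per cluster moves it to the second one at a higher total cost.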
- fit(X, y=None)¶
Compute k-means clustering with the given constraints.
Parameters¶
- X : array-like, shape=(n_samples, n_features)
Training instances to cluster.
y : Ignored
- fit_predict(X, y=None)¶
Compute cluster centers and predict cluster index for each sample.
Equivalent to calling fit(X) followed by predict(X) but also more efficient.
Parameters¶
- X : {array-like, sparse matrix}, shape = [n_samples, n_features]
New data to transform.
Returns¶
- labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
- predict(X, size_min='init', size_max='init')¶
Predict the closest cluster each sample in X belongs to given the provided constraints. The constraints can be temporarily overridden when determining which cluster each datapoint is assigned to.
Only computes the assignment step. It does not re-fit the cluster positions.
Parameters¶
- X : array-like, shape = [n_samples, n_features]
New data to predict.
- size_min : int, optional, default: size_min provided with initialisation
Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constraints will be applied. If ‘init’, the value provided during initialisation of the class will be used.
- size_max : int, optional, default: size_max provided with initialisation
Constrain the label assignment so that each cluster has a maximum size of size_max. If None, no constraints will be applied. If ‘init’, the value provided during initialisation of the class will be used.
Returns¶
- labels : array, shape [n_samples,]
Index of the cluster each sample belongs to.
- set_predict_request(*, size_max: bool | None | str = '$UNCHANGED$', size_min: bool | None | str = '$UNCHANGED$') KMeansConstrainedMissing ¶
Request metadata passed to the predict method.
Note that this method is only relevant if enable_metadata_routing=True (see sklearn.set_config()). Please see User Guide on how the routing mechanism works.
The options for each parameter are:
- True: metadata is requested, and passed to predict if provided. The request is ignored if metadata is not provided.
- False: metadata is not requested and the meta-estimator will not pass it to predict.
- None: metadata is not requested, and the meta-estimator will raise an error if the user provides it.
- str: metadata should be passed to the meta-estimator with this given alias instead of the original name.
The default (sklearn.utils.metadata_routing.UNCHANGED) retains the existing request. This allows you to change the request for some parameters and not others.
Added in version 1.3.
Note
This method is only relevant if this estimator is used as a sub-estimator of a meta-estimator, e.g. used inside a Pipeline. Otherwise it has no effect.
Parameters¶
- size_max : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for size_max parameter in predict.
- size_min : str, True, False, or None, default=sklearn.utils.metadata_routing.UNCHANGED
Metadata routing for size_min parameter in predict.
Returns¶
- selfobject
The updated object.
- cvasl.vendor.comscan.clustering.optimal_clustering(X: DataFrame | ndarray, size_min: int = 10, metric: str = 'distortion', features_reduction: str | None = None, n_components: int = 2, n_jobs: int = 1, random_state: int | None = None, visualize: bool = False) Tuple[KMeansConstrained, UMAP | PCA, int, ndarray, int, Sequence[float], float, ndarray, ndarray, ndarray] ¶
Find the optimal clustering using constrained k-means. Two methods are available to find the optimal number of clusters: silhouette or elbow.
- Parameters:
X – array-like or DataFrame of floats, shape (n_samples, n_features) The observations to cluster.
size_min – Constrain the label assignment so that each cluster has a minimum size of size_min. If None, no constraints will be applied. Default: 10
metric – Select the scoring metric to evaluate the clusters. The default is the mean distortion, defined by the sum of squared distances between each observation and its closest centroid. Other metrics include: distortion (mean sum of squared distances to centers), silhouette (mean ratio of intra-cluster and nearest-cluster distance), calinski_harabasz (ratio of between-cluster to within-cluster dispersion).
features_reduction – Method for reduction of the embedded space with n_components. Can be pca or umap. Default: None
n_components – Dimension of the embedded space for features reduction. Default 2.
n_jobs – int The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel. If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used.
random_state – int, RandomState instance or None, optional, default: None If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.
visualize – bool, default: False. If True, calls show().
- Returns:
cls: KMeansConstrained classifier object
cls_features_reduction: PCA or UMAP reduction object
cluster_nb: optimal number of clusters
labels: label[i] is the code or index of the centroid the i’th observation is closest to.
ref_label: cluster label with the minimal within-cluster sum-of-squares.
wicss_clusters: within-cluster sum-of-squares for each cluster
best_wicss_cluster: minimal wicss.
centroid: Centroids found at the last iteration of k-means.
X_hat: Copy of X with the missing values filled in.
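The relationship between wicss_clusters, best_wicss_cluster and ref_label can be illustrated with a small pure-Python sketch. It assumes labels and centroids are already known; `wicss_per_cluster` is a hypothetical helper, not the function's actual implementation.

```python
def wicss_per_cluster(points, labels, centroids):
    """Within-cluster sum of squares for each cluster index."""
    totals = [0.0] * len(centroids)
    for p, lab in zip(points, labels):
        totals[lab] += sum((a - b) ** 2 for a, b in zip(p, centroids[lab]))
    return totals


points = [(0.0, 0.0), (0.0, 2.0), (5.0, 5.0), (5.0, 6.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (5.0, 5.5)]

wicss_clusters = wicss_per_cluster(points, labels, centroids)
best_wicss_cluster = min(wicss_clusters)
# The reference label is the cluster with the smallest within-cluster spread.
ref_label = wicss_clusters.index(best_wicss_cluster)
print(wicss_clusters, best_wicss_cluster, ref_label)
```

Here the second cluster is the tighter one, so it would be chosen as the reference.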
cvasl.vendor.comscan.neurocombat module¶
- class cvasl.vendor.comscan.neurocombat.AutoCombat(features: List[str] | List[int] | str | int, sites_features: List[str] | List[int] | str | int = None, sites: str | int | None = None, size_min: int = 10, metric: str = 'distortion', use_ref_site: bool = False, scaler_clustering=StandardScaler(), discrete_cluster_features: List[str] | List[int] | str | int | None = None, continuous_cluster_features: List[str] | List[int] | str | int | None = None, features_reduction: str | None = None, n_components: int = 2, threshold_missing_sites_features=25, drop_site_columns: bool = False, discrete_combat_covariates: List[str] | List[int] | str | int | None = None, continuous_combat_covariates: List[str] | List[int] | str | int | None = None, empirical_bayes: bool = True, parametric: bool = True, mean_only: bool = False, return_only_features: bool = False, n_jobs: int = 1, random_state: int | None = 123, copy: bool = True)¶
Bases:
Combat
Harmonize/normalize features using Combat’s parametric empirical Bayes framework.
Combat needs well-defined acquisition sites or scanners in order to harmonize features. It can be difficult to define an imaging acquisition site when the imaging parameters at two sites are very similar. ComScan makes it possible to determine the number of sites and their assignment automatically, by clustering imaging features (e.g. DICOM tags). ComScan can therefore be used on data not seen during training, because it predicts which of the imagers seen during training best matches the new data.
Parameters¶
features : Target features to be harmonized.
sites_features : Features used to define the sites (acquisition sites or scanners) by clustering.
- sites : Target variable for ComScan problems (e.g. acquisition sites or scanner).
This argument is optional. If it is provided, traditional ComBat is run; otherwise AutoCombat is run. When it is provided, the arguments sites_features, size_min, metric, scaler_clustering, discrete_cluster_features, continuous_cluster_features, threshold_missing_sites_features and drop_site_columns are unused.
size_min : Minimum size of a site (cluster) for clustering.
- metric : “distortion”, “silhouette” or “calinski_harabasz”.
Metric used to define the optimal number of clusters. Default: distortion.
- use_ref_site : Use a reference site for batch adjustment. The reference site is the cluster
with the minimal inertia, i.e. the one minimizing the within-cluster sum of squares.
- scaler_clustering : Scaler to use for continuous site features. Must be a scikit-learn scaler.
Default is StandardScaler().
discrete_cluster_features : sites_features that are categorical and will be one-hot encoded (e.g. ManufacturerModelName).
continuous_cluster_features : sites_features that are continuous and will be scaled (e.g. EchoTime).
- features_reduction : Method for reducing the embedded space to n_components dimensions. Can be ‘pca’ or ‘umap’.
Default is None.
- n_components : Dimension of the embedded space for feature reduction.
Default is 2.
- threshold_missing_sites_features : Threshold (in percent) of acceptable missing values for a site feature used in clustering.
25 specifies that 75% of all samples must have this feature. Default is 25.
- drop_site_columns : Drop the site columns found by clustering from the returned data.
Default is False.
discrete_combat_covariates : Target covariates which are categorical (e.g. male or female).
continuous_combat_covariates : Target covariates which are continuous (e.g. age).
- empirical_bayes : Perform empirical Bayes.
Default is True.
- parametric : Perform parametric adjustments.
Default is True.
- mean_only : Adjust only the mean (no scaling).
Default is False.
- return_only_features : Return only features.
Default is False.
- n_jobs : The number of jobs to use for the computation. This works by computing each of the n_init runs in parallel.
If -1 all CPUs are used. If 1 is given, no parallel computing code is used at all, which is useful for debugging. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. Thus for n_jobs = -2, all CPUs but one are used. Default is 1.
- random_state : int, RandomState instance or None, optional, default: 123
If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- copy : Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
Default is True.
Attributes¶
cls_ : clustering classifier object
- info_clustering_ : Dictionary that stores clustering information from sites_features: cluster_nb, labels, ref_label, wicss_clusters, best_wicss_cluster
cls_feature_reduction_ : feature reduction object
clustering_data_features_mean_ : dict of mean for clustering data (use for imputation)
X_hat_ : array after fit
clustering_data_features_ : column features for clustering from train (after encoding + scaling)
clustering_data_discrete_features_: column features for clustering after one-hot encoding
dict_cls_fitted: dict of columns of fitted cls used for fitted clustering data
Examples¶
>>> import pandas as pd
>>> from cvasl.vendor.comscan.neurocombat import AutoCombat
>>> data = pd.DataFrame([{"features_1": 0.97, "site_features_0": 2, "site_features_1": 0},
...                      {"features_1": 1.35, "site_features_0": 1.01, "site_features_1": 1},
...                      {"features_1": 1.43, "site_features_0": 1.09, "site_features_1": 1},
...                      {"features_1": 0.85, "site_features_0": 2.3, "site_features_1": 0}])
>>> auto_combat = AutoCombat(features=["features_1"], sites_features=["site_features_0", "site_features_1"],
...                          continuous_cluster_features=["site_features_0", "site_features_1"], size_min=2)
>>> print(auto_combat.fit(data))
AutoCombat(continuous_cluster_features=['site_features_0', 'site_features_1'], discrete_cluster_features=[], features=['features_1'], sites=['sites'], sites_features=['site_features_0', 'site_features_1'], size_min=2)
Notes¶
NaN values are not handled.
Warning¶
Make sure the sites features are the same for fit and transform. By design, no input format is imposed, so neither column names nor slices are checked.
- fit(X: ndarray | DataFrame, *y: ndarray | DataFrame | None) AutoCombat ¶
Compute sites, ref_site using clustering. Then compute the stand mean, var pooled, gamma star, delta star to be used for later adjusted data from Combat.
Parameters¶
- X : array-like or DataFrame of shape (n_samples, n_features).
Requires the columns needed by the ComScan(). The data used to find adjustments.
- *y : y in scikit-learn: None
Ignored.
Returns¶
- self : object
Fitted ComScan estimator.
- transform(X: ndarray | DataFrame) ndarray ¶
Scale features of X according to combat estimator.
Parameters¶
- X : array-like or DataFrame of shape (n_samples, n_features). Requires the columns needed by the Combat().
Input data that will be transformed.
Returns¶
- Xt : array-like of shape (n_samples, n_features)
Transformed data.
- class cvasl.vendor.comscan.neurocombat.Combat(features: List[str] | List[int] | str | int, sites: str | int, discrete_covariates: List[str] | List[int] | str | int | None = None, continuous_covariates: List[str] | List[int] | str | int | None = None, ref_site: str | int | None = None, empirical_bayes: bool = True, parametric: bool = True, mean_only: bool = False, return_only_features: bool = False, raise_ref_site: bool = True, copy: bool = True)¶
Bases:
BaseEstimator
,TransformerMixin
Harmonize/normalize features using Combat’s parametric empirical Bayes framework
Parameters¶
features : Target features to be harmonized
- sites : Target variable for ComScan problems (e.g. acquisition sites or scanner).
- discrete_covariates : Target covariates which are categorical (e.g. male or female).
- continuous_covariates : Target covariates which are continuous (e.g. age).
- ref_site : Variable value (acquisition site or scanner) to be used as reference for batch adjustment. Default is None.
- empirical_bayes : Perform empirical Bayes.
Default is True.
- parametric : Perform parametric adjustments.
Default is True.
- mean_only : Adjust only the mean (no scaling).
Default is False.
- return_only_features : Return only features.
Default is False.
- raise_ref_site : Raise an error when the reference site passed as an argument does not exist; otherwise fall back to no reference.
Default is True.
- copy : Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
Default is True.
Attributes¶
- info_dict_fit_ : dictionary that stores batch info of the fitted data, with:
batch_levels, ref_level, n_batch, n_sample, sample_per_batch, batch_info
- stand_mean_ : array-like
Standardized mean
- var_pooled_ : array-like
Pooled variance
- mod_mean_ : array-like
Mod mean
- gamma_star_ : array-like
Adjustment gamma star
- delta_star_ : array-like
Adjustment delta star
- info_dict_transform_ : dictionary that stores batch info of the transformed data, with:
batch_levels, ref_level, n_batch, n_sample, sample_per_batch, batch_info
Examples¶
Notes¶
NaN values are not handled.
- fit(X: ndarray | DataFrame, *y: ndarray | DataFrame | None) Combat ¶
Compute the stand mean, var pooled, gamma star, delta star to be used for later adjusted data.
Parameters¶
- X : array-like or DataFrame of shape (n_samples, n_features).
Requires the columns needed by the Combat(). The data used to find adjustments.
- *y : y in scikit-learn: None
Ignored.
Returns¶
- self : object
Fitted combat estimator.
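As a rough intuition for what fit estimates, the sketch below removes an additive and a multiplicative site effect from a single feature. It is a plain per-site location and scale adjustment, not ComBat's estimator: there are no covariates, no empirical Bayes shrinkage, and the pooled spread is simply the standard deviation of all values. The function name `align_sites` and the data are hypothetical.

```python
from statistics import mean, pstdev


def align_sites(values, sites):
    """Align each site's mean and spread to the pooled values (one feature)."""
    grand_mean = mean(values)
    pooled_sd = pstdev(values)
    # Per-site additive (mean) and multiplicative (spread) effects.
    stats = {}
    for s in set(sites):
        site_vals = [v for v, t in zip(values, sites) if t == s]
        stats[s] = (mean(site_vals), pstdev(site_vals))
    adjusted = []
    for v, s in zip(values, sites):
        site_mean, site_sd = stats[s]
        if site_sd == 0:
            raise ValueError(f"site {s!r} has no spread to rescale")
        # Standardise within the site, then map back to the pooled scale.
        adjusted.append((v - site_mean) / site_sd * pooled_sd + grand_mean)
    return adjusted


values = [1.0, 2.0, 3.0, 11.0, 12.0, 13.0]
sites = ["A", "A", "A", "B", "B", "B"]
harmonized = align_sites(values, sites)
print(harmonized)
```

After the adjustment both sites share the same mean and spread, which is the effect harmonization aims for; with mean_only-style behaviour only the subtraction of the site mean would be applied.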
- load_fit(filepath: str) None ¶
Load the fitted model attributes info_dict_fit_, stand_mean_, var_pooled_, gamma_star_, delta_star_.
- Parameters:
filepath – filepath of the pkl file to load
- save_fit(filepath: str) None ¶
Save the fitted model attributes info_dict_fit_, stand_mean_, var_pooled_, gamma_star_, delta_star_.
- Parameters:
filepath – path where the file is saved. If the path has no .pkl extension, it is added.
- transform(X: ndarray | DataFrame) ndarray ¶
Scale features of X according to combat estimator.
Parameters¶
- X : array-like or DataFrame of shape (n_samples, n_features). Requires the columns needed by the Combat().
Input data that will be transformed.
Returns¶
- Xt : array-like of shape (n_samples, n_features)
Transformed data.
- class cvasl.vendor.comscan.neurocombat.ImageCombat(image_path: str | int, sites_features: ~typing.List[str] | ~typing.List[int] | str | int = None, sites: str | int = None, save_path_fit: str = 'fit_data', save_path_transform: str = 'transform_data', size_min: int = 10, method: str = 'silhouette', use_ref_site: bool = False, scaler_clustering=StandardScaler(), discrete_cluster_features: ~typing.List[str] | ~typing.List[int] | str | int | None = None, continuous_cluster_features: ~typing.List[str] | ~typing.List[int] | str | int | None = None, features_reduction: str | None = None, n_components: int = 2, threshold_missing_sites_features=25, drop_site_columns: bool = True, discrete_combat_covariates: ~typing.List[str] | ~typing.List[int] | str | int | None = None, continuous_combat_covariates: ~typing.List[str] | ~typing.List[int] | str | int | None = None, empirical_bayes: bool = True, parametric: bool = True, mean_only: bool = False, random_state: int | None = 123, flattened_dtype: ~numpy.dtype | None = <class 'numpy.float16'>, output_dtype: ~numpy.dtype | None = <class 'numpy.float32'>, copy: bool = True)¶
Bases:
AutoCombat
Harmonize/normalize features using Combat’s parametric empirical Bayes framework directly on image.
ImageCombat makes it possible to harmonize/normalize a set of NIfTI images. All images must have the same dimensions and orientation. A common mask is created based on a heuristic proposed by T. Nichols. The images are then vectorized for ComScan. ImageCombat can use either Combat (well-defined sites) or AutoCombat (sites found by clustering).
Parameters¶
image_path : image_path of nifti files.
sites_features : Features used to define the sites (acquisition sites or scanners) by clustering.
- sites : Target variable for ComScan problems (e.g. acquisition sites or scanner).
This argument is optional. If it is provided, traditional ComBat is run. In that case the arguments sites_features, size_min, method, scaler_clustering, discrete_cluster_features, continuous_cluster_features, threshold_missing_sites_features and drop_site_columns are unused.
size_min : Minimum size of a site (cluster) for clustering.
- method : “silhouette” or “elbow”. Method used to define the optimal number of clusters.
Default is silhouette.
- use_ref_site : Use a reference site for batch adjustment. The reference site is the cluster
with the minimal inertia, i.e. the one minimizing the within-cluster sum of squares. Default is False.
- scaler_clustering : Scaler to use for continuous site features. Must be a scikit-learn scaler.
Default is StandardScaler().
discrete_cluster_features : sites_features that are categorical and will be one-hot encoded (e.g. ManufacturerModelName).
continuous_cluster_features : sites_features that are continuous and will be scaled (e.g. EchoTime).
- features_reduction : Method for reducing the embedded space to n_components dimensions. Can be ‘pca’ or ‘umap’.
Default is None.
- n_components : Dimension of the embedded space for feature reduction.
Default is 2.
- threshold_missing_sites_features : Threshold (in percent) of acceptable missing values for a site feature used in clustering.
25 specifies that 75% of all samples must have this feature. Default is 25.
drop_site_columns : Drop the site columns found by clustering from the returned data.
discrete_combat_covariates : Target covariates which are categorical (e.g. male or female).
continuous_combat_covariates : Target covariates which are continuous (e.g. age).
- empirical_bayes : Perform empirical Bayes.
Default is True.
- parametric : Perform parametric adjustments.
Default is True.
- mean_only : Adjust only the mean (no scaling).
Default is False.
- random_state : int, RandomState instance or None, optional, default: 123
If int, random_state is the seed used by the random number generator; if None, the random number generator is the RandomState instance used by np.random.
- copy : Set to False to perform inplace row normalization and avoid a copy (if the input is already a numpy array).
Default is True.
Attributes¶
mask_ : array-like of the common brain mask
flattened_array_ : flattened array of all the training set
Notes¶
NaN values are not handled.
- fit(X: ndarray | DataFrame, *y: ndarray | DataFrame | None) ImageCombat ¶
Compute sites, ref_site using clustering. Then compute the stand mean, var pooled, gamma star, delta star to be used for later adjusted data from Combat.
Parameters¶
- X : array-like or DataFrame of shape (n_samples, n_features).
Requires the columns needed by the ComScan(). The data used to find adjustments.
- *y : y in scikit-learn: None
Ignored.
Returns¶
- self : object
Fitted ComScan estimator.
cvasl.vendor.comscan.nifti module¶
- cvasl.vendor.comscan.nifti.flatten_nifti_files(input_path: ~typing.List[str], mask: str | ~numpy.ndarray, output_flattened_array_path: str = 'flattened_array', dtype: [<class 'numpy.dtype'>, typing.Callable] = <class 'numpy.float16'>, save: bool = True, compress_save: bool = True)¶
Flatten a list of NIfTI files into a flattened array [n_images, n_masked_voxels] and save it as .npy, or as .npz if compressed.
- Parameters:
input_path – List of NIfTI file paths
mask – path of the mask, or the mask array
output_flattened_array_path – path of the output flattened array. No extension is needed. Saved as .npy if not compressed, else .npz
save – save the flattened array
dtype – dtype of the output flattened array. Default is float16 to save memory
compress_save – If True, compress the numpy array into .npz
- Returns:
flattened array [n_images, n_masked_voxels]
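The [n_images, n_masked_voxels] layout can be illustrated without any imaging library. The sketch below uses small 2D nested lists in place of 3D NIfTI volumes, skips file loading and saving entirely, and `flatten_masked` is a hypothetical name.

```python
def flatten_masked(images, mask):
    """Keep only the voxels where mask is truthy, one row per image."""
    keep = [bool(m) for row in mask for m in row]
    flattened = []
    for image in images:
        voxels = [v for row in image for v in row]
        if len(voxels) != len(keep):
            raise ValueError("image and mask shapes differ")
        flattened.append([v for v, k in zip(voxels, keep) if k])
    return flattened


mask = [[1, 0],
        [1, 1]]
images = [
    [[5, 6], [7, 8]],
    [[1, 2], [3, 4]],
]
flat = flatten_masked(images, mask)
print(flat)  # one row per image, one column per masked voxel
```

Because every image shares the same mask, column j of the result always refers to the same voxel, which is what the harmonization step relies on.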
cvasl.vendor.comscan.utils module¶
- cvasl.vendor.comscan.utils.check_exist_vars(df: DataFrame, _vars: List) ndarray ¶
Check that a list of column names exists in a DataFrame.
- Parameters:
df – a DataFrame
_vars – List of column names to check
- Returns:
indices of the column names
- Raise:
ValueError if missing features
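The check-then-index behaviour described above might look roughly like the following. `check_columns_exist` is a hypothetical stand-in written with pandas only; the actual function may differ in its error message and return details.

```python
import pandas as pd


def check_columns_exist(df, names):
    """Hypothetical stand-in: return column positions or raise ValueError."""
    missing = [name for name in names if name not in df.columns]
    if missing:
        raise ValueError(f"Missing features: {missing}")
    # get_indexer maps each label to its integer position in df.columns.
    return df.columns.get_indexer(names)


df = pd.DataFrame({"age": [30, 41], "sex": ["F", "M"], "site": ["a", "b"]})
idx = check_columns_exist(df, ["site", "age"])
print(idx)  # positions follow the order of the query list
```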
- cvasl.vendor.comscan.utils.check_is_nii_exist(input_file_path: str) str ¶
Check that a NIfTI file (.nii or .nii.gz) exists.
- Parameters:
input_file_path – string of the path of the nii or nii.gz.
- Returns:
the path string if the file exists; otherwise an error is raised.
- Raise:
FileNotFoundError or FileExistsError
- cvasl.vendor.comscan.utils.column_var_dtype(df: DataFrame, identify_dtypes: Sequence[str] = ('object',)) DataFrame ¶
Identify the columns of a DataFrame that have the given dtypes.
- Parameters:
df – input dataframe
identify_dtypes – pandas dtype
Note
see https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#basics-dtypes for pandas dtype
- Returns:
summary df with col index and col name for all identify_dtypes vars
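A summary of this kind can be approximated with `DataFrame.select_dtypes`. The sketch below is an assumption about the output shape: the function name `summarize_dtype_columns` and the column labels `col_index` and `col_name` are illustrative, not taken from the library.

```python
import pandas as pd


def summarize_dtype_columns(df, identify_dtypes=("object",)):
    """Hypothetical stand-in: index and name of columns with the given dtypes."""
    selected = df.select_dtypes(include=list(identify_dtypes)).columns
    return pd.DataFrame({
        "col_index": df.columns.get_indexer(selected),
        "col_name": list(selected),
    })


# Columns are created with an explicit object dtype so the example does not
# depend on how a given pandas version infers string columns.
df = pd.DataFrame({
    "age": [30, 41],
    "sex": pd.Series(["F", "M"], dtype=object),
    "site": pd.Series(["a", "b"], dtype=object),
})
summary = summarize_dtype_columns(df)
print(summary)
```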
- cvasl.vendor.comscan.utils.fix_columns(df: DataFrame, columns: List[str], inplace: bool = False, extra_nans: bool = False) DataFrame ¶
Fix columns for the test set when the training set was encoded with pd.get_dummies.
Note
inspired from: http://fastml.com/how-to-use-pd-dot-get-dummies-with-the-test-set
- Parameters:
df – input dataframe
columns – columns of the original dataframe
inplace – If False, return a copy. Otherwise, do operation inplace and return None
extra_nans – put extra columns as nans based on one hot encoding columns
- Returns:
the corrected version of DataFrame for test set
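The problem this function addresses is that `pd.get_dummies` produces different columns when the test set contains different categories from the training set. One common way to align them is `DataFrame.reindex`, shown below as a sketch; it is not necessarily how `fix_columns` is implemented, and the `extra_nans` option is not modelled.

```python
import pandas as pd

train = pd.DataFrame({"site": ["a", "b", "c"], "age": [30, 41, 55]})
test = pd.DataFrame({"site": ["a", "d"], "age": [28, 60]})

train_enc = pd.get_dummies(train, columns=["site"])
test_enc = pd.get_dummies(test, columns=["site"])

# Align the test columns with the training columns: categories unseen in
# training ("site_d") are dropped, and missing ones are added and filled with 0.
test_fixed = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(list(test_fixed.columns))
```

After alignment, a model fitted on `train_enc` sees the same feature layout at prediction time.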
- cvasl.vendor.comscan.utils.get_column_index(df: DataFrame, query_cols: List[str]) ndarray ¶
Get column indices from column names.
- Parameters:
df – input dataframe
query_cols – List of column names
- Returns:
array of column indices
- cvasl.vendor.comscan.utils.load_nifty_volume_as_array(input_path_file: str) Tuple[ndarray, Tuple[Tuple, Tuple, Tuple]] ¶
Load a NIfTI image into a numpy array with [z, y, x] axis order. The output array shape is [Depth, Height, Width].
- Parameters:
input_path_file – input path file, should be ‘.nii’ or ‘.nii.gz’
- Returns:
a numpy data array and its header
- cvasl.vendor.comscan.utils.mat_to_bytes(nrows: int, ncols: int, dtype: int = 32, out: str = 'GB') float ¶
Calculate the size of a numpy array in bytes.
Note
code from: https://gist.github.com/dimalik/f4609661fb83e3b5d22e7550c1776b90
- Parameters:
nrows – the number of rows of the matrix.
ncols – the number of columns of the matrix.
dtype – the size of each element in the matrix, in bits. Defaults to 32 bits.
out – the output unit. Defaults to gigabytes (GB)
- Returns:
the size of the matrix in the given unit
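The arithmetic is the number of elements times the element size in bits, divided by 8 to get bytes, then scaled to the requested unit. The sketch below assumes binary units (1 KB = 1024 bytes); `matrix_size` and its unit table are illustrative, not the library's code.

```python
def matrix_size(nrows, ncols, dtype=32, out="GB"):
    """Hypothetical stand-in: size of an nrows x ncols matrix in the given unit."""
    # Assumption: binary units, i.e. 1 KB = 1024 bytes.
    exponent = {"BYTES": 0, "KB": 1, "MB": 2, "GB": 3, "TB": 4}[out]
    n_bytes = nrows * ncols * dtype / 8  # dtype is the element size in bits
    return n_bytes / 1024 ** exponent


# 1024 x 1024 elements of 32 bits each is 4,194,304 bytes, i.e. 4 MB.
print(matrix_size(1024, 1024, dtype=32, out="MB"))  # 4.0
```

A quick estimate like this is useful before flattening many images, since the array has one row per image and one column per masked voxel.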
- cvasl.vendor.comscan.utils.one_hot_encoder(df: DataFrame, columns: List[str], drop_column: bool = True, dummy_na: bool = False, add_nan_columns: bool = False, inplace: bool = False) DataFrame ¶
Encode categorical features in the dataframe, with the option to keep NaN. The categorical feature index and name are from cat_var function. These columns need to be of “object” dtype.
- Parameters:
df – input dataframe
columns – List of columns to encode
drop_column – Set to True to drop the original column after encoding. Default to True.
dummy_na – Add a column to indicate NaNs, if False NaNs are ignored.
add_nan_columns – Add an empty NaN column if one was not created (can be used as an “other” category)
inplace – If False, return a copy. Otherwise, do operation inplace and return None
- Returns:
new dataframe where columns are one hot encoded
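The effect of `dummy_na` can be seen directly with `pd.get_dummies`, which this kind of encoder typically builds on. This sketch only illustrates the pandas behaviour; how `one_hot_encoder` handles `drop_column` and `add_nan_columns` internally is not shown.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [30, 41, 55],
    "scanner": pd.Series(["ge", np.nan, "siemens"], dtype=object),
})

# dummy_na=True adds an indicator column for missing values;
# with the default dummy_na=False, rows with NaN get all-zero indicators.
encoded = pd.get_dummies(df, columns=["scanner"], dummy_na=True)
print(list(encoded.columns))
```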
- cvasl.vendor.comscan.utils.save_to_nii(im: ~numpy.ndarray, header: (<class 'tuple'>, <class 'tuple'>, <class 'tuple'>), output_dir: str, filename: str, mode: str = 'image', gzip: bool = True) None ¶
Save numpy array to nii.gz format to submit.
- Parameters:
im – numpy array
header – header metadata (origin, spacing, direction).
output_dir – Output directory.
filename – Filename of the output file.
mode – save as ‘image’ or ‘label’
gzip – gzip the output file (i.e., .nii.gz)
- cvasl.vendor.comscan.utils.scaler_encoder(df: DataFrame, columns: List[str], scaler=StandardScaler(), inplace: bool = False) DataFrame ¶
Apply sklearn scaler to columns.
- Parameters:
df – input dataframe
columns – List of columns to encode
scaler – scaler object from sklearn
inplace – If False, return a copy. Otherwise, do operation inplace and return None
- Returns:
df: DataFrame scaled
dict_cls_fitted: dict by col of fitted cls
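A rough sketch of column-wise scaling with scikit-learn follows. Fitting one scaler per column and returning them in a dict is an assumption drawn from the "dict by col of fitted cls" return description; the variable names are illustrative.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "site": ["a", "b", "a"]})
columns = ["age"]

scaled = df.copy()  # mirrors inplace=False: the input frame is left untouched
fitted = {}
for col in columns:
    scaler = StandardScaler()
    # fit_transform expects 2D input, hence the double brackets.
    scaled[[col]] = scaler.fit_transform(scaled[[col]])
    fitted[col] = scaler  # keep the fitted scaler to transform new data later

print(scaled["age"].round(3).tolist())
```

Keeping the fitted scalers allows the same transformation to be applied to a test set with `fitted[col].transform(...)` instead of refitting.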
- cvasl.vendor.comscan.utils.split_filename(file_name: str) Tuple[str, str, str] ¶
Split file_name into folder path name, basename, and extension name.
- Parameters:
file_name – full path
- Returns:
path name, basename, extension name
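Splitting NIfTI paths needs care because `os.path.splitext` only removes the last suffix, so `.nii.gz` would come back as `.gz`. The sketch below shows one way to handle this with the standard library; `split_path` is a hypothetical stand-in, and treating `.nii.gz` as a single extension is an assumption about the intended behaviour.

```python
import os


def split_path(file_name):
    """Hypothetical stand-in: return (folder, basename, extension)."""
    folder, base = os.path.split(file_name)
    base, ext = os.path.splitext(base)
    # Treat double extensions such as ".nii.gz" as one extension (assumption).
    if ext == ".gz":
        base, inner = os.path.splitext(base)
        ext = inner + ext
    return folder, base, ext


print(split_path("data/sub-01/t1.nii.gz"))  # ('data/sub-01', 't1', '.nii.gz')
```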
- cvasl.vendor.comscan.utils.tsne(df: DataFrame, columns: List[str], n_components: int = 2, random_state: int | None = 123, n_jobs: int | None = -1)¶
t-distributed Stochastic Neighbor Embedding.
t-SNE is a tool to visualize high-dimensional data. It converts similarities between data points to joint probabilities and tries to minimize the Kullback-Leibler divergence between the joint probabilities of the low-dimensional embedding and the high-dimensional data. t-SNE has a cost function that is not convex, i.e. with different initializations we can get different results.
- Parameters:
df – input dataframe
columns – List of columns to use
n_components – Dimension of the embedded space. Default 2.
random_state – int, RandomState instance or None, optional, default: 123 If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.
n_jobs – default=-1 The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Returns:
array-like with projections
- cvasl.vendor.comscan.utils.u_map(df: DataFrame, columns: List[str], n_components: int = 2, random_state: int | None = 123, n_jobs: int | None = -1)¶
Just like t-SNE, UMAP is a dimensionality reduction technique specifically designed for visualizing complex data in low dimensions (2D or 3D). As the number of data points increases, UMAP becomes more time efficient than t-SNE.
- Parameters:
df – input dataframe
columns – List of columns to use
n_components – Dimension of the embedded space. Default 2.
random_state – int, RandomState instance or None, optional, default: 123 If int, random_state is the seed used by the random number generator; If None, the random number generator is the RandomState instance used by np.random.
n_jobs – default=-1 The number of parallel jobs to run for neighbors search. This parameter has no impact when metric="precomputed" or (metric="euclidean" and method="exact"). None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.
- Returns:
array-like with projections