BBKNN
Batch balanced KNN
bbknn.bbknn(adata, batch_key='batch', use_rep='X_pca', copy=False, **kwargs)

Batch balanced KNN, altering the KNN procedure to identify each cell’s top neighbours in each batch separately instead of the entire cell pool with no accounting for batch. The nearest neighbours for each batch are then merged to create a final list of neighbours for the cell. Aligns batches in a quick and lightweight manner. For use in the scanpy workflow as an alternative to scanpy.pp.neighbors().

- adata : AnnData
  Needs the PCA computed and stored in adata.obsm["X_pca"].
- batch_key : str, optional (default: “batch”)
  adata.obs column name discriminating between your batches.
- neighbors_within_batch : int, optional (default: 3)
  How many top neighbours to report for each batch; the total number of neighbours in the initial k-nearest-neighbours computation will be this number times the number of batches. This then serves as the basis for the construction of a symmetrical matrix of connectivities.
- use_rep : str, optional (default: “X_pca”)
  The dimensionality reduction in .obsm to use for neighbour detection. Defaults to PCA.
- n_pcs : int, optional (default: 50)
  How many dimensions (in case of PCA, principal components) to use in the analysis.
- trim : int or None, optional (default: None)
  Trim the neighbours of each cell to this many top connectivities. May help with population independence and improve the tidiness of clustering. The lower the value, the more independent the individual populations, at the cost of more conserved batch effect. If None, sets the parameter value automatically to 10 times neighbors_within_batch times the number of batches. Set to 0 to skip.
- approx : bool, optional (default: True)
  If True, use approximate neighbour finding (annoy or pyNNDescent). This results in a quicker run time for large datasets while also potentially increasing the degree of batch correction.
- use_annoy : bool, optional (default: True)
  Only used when approx=True. If True, will use annoy for neighbour finding. If False, will use pyNNDescent instead.
- annoy_n_trees : int, optional (default: 10)
  Only used with annoy neighbour identification. The number of trees to construct in the annoy forest. More trees give higher precision when querying, at the cost of increased run time and resource intensity.
- pynndescent_n_neighbors : int, optional (default: 30)
  Only used with pyNNDescent neighbour identification. The number of neighbours to include in the approximate neighbour graph. More neighbours give higher precision when querying, at the cost of increased run time and resource intensity.
- pynndescent_random_state : int, optional (default: 0)
  Only used with pyNNDescent neighbour identification. The RNG seed to use when creating the graph.
- use_faiss : bool, optional (default: True)
  If approx=False and the metric is “euclidean”, use the faiss package to compute nearest neighbours if installed. This improves performance at a minor cost to numerical precision as faiss operates on float32.
- metric : str or sklearn.neighbors.DistanceMetric or types.FunctionType, optional (default: “euclidean”)
  What distance metric to use. The options depend on the choice of neighbour algorithm.
  “euclidean”, the default, is always available.
  Annoy supports “angular”, “manhattan” and “hamming”.
  PyNNDescent supports metrics listed in pynndescent.distances.named_distances and custom functions, including compiled Numba code.

      >>> pynndescent.distances.named_distances.keys()
      dict_keys(['euclidean', 'l2', 'sqeuclidean', 'manhattan', 'taxicab', 'l1', 'chebyshev', 'linfinity', 'linfty', 'linf', 'minkowski', 'seuclidean', 'standardised_euclidean', 'wminkowski', 'weighted_minkowski', 'mahalanobis', 'canberra', 'cosine', 'dot', 'correlation', 'hellinger', 'haversine', 'braycurtis', 'spearmanr', 'kantorovich', 'wasserstein', 'tsss', 'true_angular', 'hamming', 'jaccard', 'dice', 'matching', 'kulsinski', 'rogerstanimoto', 'russellrao', 'sokalsneath', 'sokalmichener', 'yule'])

  KDTree supports members of the sklearn.neighbors.KDTree.valid_metrics list, or parameterised sklearn.neighbors.DistanceMetric objects:

      >>> sklearn.neighbors.KDTree.valid_metrics
      ['p', 'chebyshev', 'cityblock', 'minkowski', 'infinity', 'l2', 'euclidean', 'manhattan', 'l1']
- set_op_mix_ratio : float, optional (default: 1)
  UMAP connectivity computation parameter, float between 0 and 1, controlling the blend between a connectivity matrix formed exclusively from mutual nearest neighbour pairs (0) and a union of all observed neighbour relationships with the mutual pairs emphasised (1).
- local_connectivity : int, optional (default: 1)
  UMAP connectivity computation parameter, how many nearest neighbours of each cell are assumed to be fully connected (and given a connectivity value of 1).
- copy : bool, optional (default: False)
  If True, return a copy instead of writing to the supplied adata.
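
The per-batch neighbour search that bbknn.bbknn() is described as performing can be illustrated with a minimal, dependency-free sketch. This is not the library's implementation: it uses brute-force Euclidean distances instead of annoy, pyNNDescent, faiss or a KDTree, and it stops at the merged neighbour list without computing connectivities. The function name batch_balanced_neighbours and the toy coordinates are made up for illustration, and in this sketch each cell counts as a neighbour of itself within its own batch.

```python
import math


def batch_balanced_neighbours(points, batches, neighbors_within_batch=3):
    """For every cell, pick its closest cells from each batch separately, then merge.

    Illustrative brute-force search only (hypothetical helper, not part of bbknn).
    """
    # Group cell indices by batch label.
    members = {}
    for index, batch in enumerate(batches):
        members.setdefault(batch, []).append(index)

    neighbours = []
    for query in points:
        merged = []
        for batch in sorted(members):
            # Rank this batch's cells by Euclidean distance to the query cell.
            ranked = sorted(members[batch], key=lambda i: math.dist(query, points[i]))
            merged.extend(ranked[:neighbors_within_batch])
        neighbours.append(merged)
    return neighbours


# Two batches of three cells each, described by toy 2D coordinates.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (0.0, 0.2), (5.1, 5.0), (5.0, 5.2)]
batches = ["a", "a", "a", "b", "b", "b"]
neighbours = batch_balanced_neighbours(points, batches, neighbors_within_batch=1)
```

With two batches and neighbors_within_batch=1, every cell ends up with 1 times 2 = 2 neighbours, one drawn from each batch, which mirrors the "this number times the number of batches" rule given for neighbors_within_batch.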
bbknn.ridge_regression(adata, batch_key, confounder_key=[], chunksize=100000000.0, copy=False, **kwargs)

Perform ridge regression on scaled expression data, accepting both technical and biological categorical variables. The effect of the technical variables is removed while the effect of the biological variables is retained. This is a preprocessing step that can aid BBKNN integration (Park, 2020).

Alters the object’s .X to be the regression residuals, and creates .layers['X_explained'] with the expression explained by the technical effect.

- adata : AnnData
  Needs scaled data in .X.
- batch_key : list
  A list of categorical .obs columns to regress out as technical effects.
- confounder_key : list, optional (default: [])
  A list of categorical .obs columns to retain as biological effects.
- chunksize : int, optional (default: 1e8)
  How many elements of the expression matrix to process at a time. Potentially useful to manage memory use for larger datasets.
- copy : bool, optional (default: False)
  If True, return a copy instead of writing to the supplied adata.
- kwargs
  Any arguments to pass to Ridge.
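
As a rough usage sketch, ridge regression can be chained with BBKNN as follows. The helper name regress_and_integrate and the "cell_type" column are hypothetical, and the sketch assumes that adata.X is already scaled and that the PCA should be recomputed on the residuals before the neighbour graph is built; adapt it to your own workflow rather than treating it as the canonical recipe.

```python
def regress_and_integrate(adata, batch_key="batch", confounder_key="cell_type"):
    """Hypothetical helper: ridge regression followed by BBKNN.

    Assumes adata.X is already scaled and that both column names exist in
    adata.obs; "cell_type" is an illustrative name, not a library default.
    """
    # Imported lazily so the sketch can be read without the packages installed.
    import scanpy as sc
    import bbknn

    # Remove the technical effect, keep the biological one. Both arguments are lists.
    bbknn.ridge_regression(adata, batch_key=[batch_key], confounder_key=[confounder_key])
    # .X now holds the regression residuals, so recompute the PCA on them
    # (assumption: bbknn.bbknn() should see the corrected data in X_pca).
    sc.pp.pca(adata)
    bbknn.bbknn(adata, batch_key=batch_key)
    return adata
```

Note that ridge_regression takes lists of column names, while bbknn.bbknn takes a single column name as batch_key.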
bbknn.extract_cell_connectivity(adata, cell, key='extracted_cell_connectivity')

Helper post-processing function that extracts a single cell’s connectivity and stores it in adata.obs, ready for plotting. Connectivities range from 0 to 1; the higher the connectivity, the closer the cells are in the neighbour graph. Cells with a connectivity of 0 are unconnected in the graph.

- adata : AnnData
  An object that has had BBKNN run on it.
- cell : str
  The name of the cell to extract the connectivities for.
- key : str, optional (default: “extracted_cell_connectivity”)
  What name to store the connectivities under in adata.obs.
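
To make "extracting a single cell's connectivity" concrete, here is a toy illustration that does not use the library. The neighbour graph is stood in for by a plain dictionary of (row, column) weights, and the function name extract_connectivity and the cell names are invented for the example; the real function works on the graph stored in the AnnData object.

```python
def extract_connectivity(cell_names, connectivities, cell):
    """Return one connectivity value per cell, relative to the chosen cell.

    `connectivities` is a toy stand-in for the neighbour graph: a dict mapping
    (row, column) index pairs to weights; pairs that are absent count as 0.
    """
    row = cell_names.index(cell)
    return [connectivities.get((row, column), 0.0) for column in range(len(cell_names))]


cell_names = ["cell_a", "cell_b", "cell_c"]
# Symmetric toy graph: cell_b is connected to both other cells, which are not
# connected to each other.
connectivities = {(0, 1): 0.8, (1, 0): 0.8, (1, 2): 0.3, (2, 1): 0.3}
per_cell = extract_connectivity(cell_names, connectivities, "cell_b")
```

The resulting list has one entry per cell, which is the shape needed for a per-cell column such as the one the helper stores in adata.obs.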
bbknn.matrix.bbknn(pca, batch_list, neighbors_within_batch=3, n_pcs=50, trim=None, computation='annoy', annoy_n_trees=10, pynndescent_n_neighbors=30, pynndescent_random_state=0, metric='euclidean', set_op_mix_ratio=1, local_connectivity=1, approx=None, use_annoy=None, use_faiss=None, scanpy_logging=False)

Scanpy-independent BBKNN variant that runs on a PCA matrix and list of per-cell batch assignments instead of an AnnData object. Non-data-entry arguments behave the same way as bbknn.bbknn(). Returns a (distances, connectivities, parameters) tuple, like what would have been stored in the AnnData object. The connectivities are the actual neighbourhood graph.

- pca : numpy.array
  PCA (or other dimensionality reduction) coordinates for each cell, with cells as rows.
- batch_list : numpy.array or list
  A list of batch assignments for each cell.
- scanpy_logging : bool, optional (default: False)
  Whether to use scanpy logging to print updates rather than warnings.warn().
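
A minimal sketch of calling the matrix variant and keeping only the neighbourhood graph might look like this. The wrapper name neighbour_graph is hypothetical; the call itself relies only on the signature and the (distances, connectivities, parameters) return value documented above.

```python
def neighbour_graph(pca, batch_list, neighbors_within_batch=3):
    """Hypothetical wrapper around bbknn.matrix.bbknn that keeps only the graph."""
    # Imported lazily; requires the bbknn package to be installed.
    import bbknn.matrix

    distances, connectivities, parameters = bbknn.matrix.bbknn(
        pca, batch_list, neighbors_within_batch=neighbors_within_batch
    )
    # Per the documentation, the connectivities are the actual neighbourhood graph.
    return connectivities
```

Here pca would be an array of coordinates with cells as rows, and batch_list would hold one batch label per row.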