bbknn.bbknn

bbknn.bbknn(adata, batch_key='batch', use_rep='X_pca', key_added=None, copy=False, **kwargs)

Batch balanced KNN alters the KNN procedure to identify each cell’s top neighbours in each batch separately, rather than searching the entire cell pool with no accounting for batch. The nearest neighbours from each batch are then merged into a final list of neighbours for the cell. This aligns batches in a quick and lightweight manner. Intended for use in the scanpy workflow as an alternative to scanpy.pp.neighbors().
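The per-batch search and merge described above can be illustrated with a brute-force sketch in plain Python. This is not BBKNN's implementation, which relies on dedicated neighbour-search libraries operating on an embedding stored in .obsm; the function name batch_balanced_neighbours and the toy data are hypothetical and exist only for illustration.

```python
import math
from collections import defaultdict


def batch_balanced_neighbours(points, batches, neighbors_within_batch=3):
    """For each cell, return the indices of its nearest cells in every batch.

    Brute-force illustration only. A cell is allowed to appear as its own
    neighbour within its own batch (distance 0).
    """
    # Group cell indices by batch label.
    by_batch = defaultdict(list)
    for idx, label in enumerate(batches):
        by_batch[label].append(idx)

    result = []
    for query in points:
        merged = []
        for label in sorted(by_batch):
            # Rank the cells of this batch by Euclidean distance to the query.
            ranked = sorted(by_batch[label], key=lambda j: math.dist(query, points[j]))
            # Keep the top hits from this batch, then move on to the next batch.
            merged.extend(ranked[:neighbors_within_batch])
        result.append(merged)
    return result


# Toy 2-D "embedding": two batches of three cells each.
points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.1, 0.1), (1.2, 0.0), (5.0, 5.1)]
batches = ["a", "a", "a", "b", "b", "b"]
neighbours = batch_balanced_neighbours(points, batches, neighbors_within_batch=2)
print(neighbours[0])  # two neighbours from batch "a", then two from batch "b"
```

Every cell ends up with the same number of neighbours from each batch, which is what prevents a single batch from dominating a cell's neighbourhood.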

Input

adata : AnnData

Needs your dimensionality reduction of choice computed and stored in .obsm.

batch_key : str, optional (default: “batch”)

adata.obs column name discriminating between your batches.

neighbors_within_batch : int, optional (default: 3)

How many top neighbours to report for each batch; the total number of neighbours in the initial k-nearest-neighbours computation will be this number times the number of batches. This then serves as the basis for constructing a symmetrical matrix of connectivities.

use_rep : str, optional (default: “X_pca”)

The dimensionality reduction in .obsm to use for neighbour detection. Defaults to PCA.

n_pcs : int, optional (default: 50)

How many dimensions (in case of PCA, principal components) to use in the analysis.

trim : int or None, optional (default: None)

Trim the neighbours of each cell to this many top connectivities. May help with population independence and improve the tidiness of clustering. The lower the value, the more independent the individual populations, at the cost of a more conserved batch effect. If None, the value is set automatically to 10 times neighbors_within_batch times the number of batches. Set to 0 to skip trimming.
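As a rough illustration of the arithmetic behind neighbors_within_batch and trim, the sketch below computes the initial neighbour count and the automatic trim value, then trims one cell's connectivities. The helper names total_neighbours, default_trim and trim_row are hypothetical, and the per-cell trimming shown here is a simplification that may differ from what BBKNN does internally.

```python
def total_neighbours(neighbors_within_batch, n_batches):
    """Neighbours per cell in the initial KNN step."""
    return neighbors_within_batch * n_batches


def default_trim(neighbors_within_batch, n_batches):
    """Automatic trim value used when trim is None, as documented above."""
    return 10 * neighbors_within_batch * n_batches


def trim_row(connectivities, trim):
    """Keep the `trim` strongest connectivities of one cell; 0 skips trimming.

    `connectivities` maps a neighbour index to its connectivity value.
    """
    if trim == 0 or len(connectivities) <= trim:
        return dict(connectivities)
    # Sort neighbours from strongest to weakest and keep the first `trim`.
    top = sorted(connectivities.items(), key=lambda kv: kv[1], reverse=True)[:trim]
    return dict(top)


# Four batches with the default of three neighbours per batch.
print(total_neighbours(3, 4))
print(default_trim(3, 4))

# One cell's connectivities to neighbours 1-4, trimmed to the top two.
row = {1: 0.9, 2: 0.2, 3: 1.0, 4: 0.5}
print(trim_row(row, 2))
```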

computation : str, optional (default: “annoy”)

Which KNN algorithm to use. BBKNN supports the approximate neighbour search of “annoy” and “pynndescent”, and the exact neighbour search of “faiss”, “cKDTree” and “KDTree”. Available metric choices depend on the package used here.

annoy_n_trees : int, optional (default: 10)

Only used with annoy neighbour identification. The number of trees to construct in the annoy forest. More trees give higher precision when querying, at the cost of increased run time and resource intensity.

pynndescent_n_neighbors : int, optional (default: 30)

Only used with pyNNDescent neighbour identification. The number of neighbours to include in the approximate neighbour graph. More neighbours give higher precision when querying, at the cost of increased run time and resource intensity.

pynndescent_random_state : int, optional (default: 0)

Only used with pyNNDescent neighbour identification. The RNG seed to use when creating the graph.

metric : str or sklearn.neighbors.DistanceMetric or types.FunctionType, optional (default: “euclidean”)

What distance metric to use. The options depend on the choice of neighbour algorithm.

“euclidean”, the default, is always available.

Annoy supports “angular”, “manhattan” and “hamming”.

PyNNDescent supports metrics listed in pynndescent.distances.named_distances and custom functions, including compiled Numba code.

>>> pynndescent.distances.named_distances.keys()
dict_keys(['euclidean', 'l2', 'sqeuclidean', 'manhattan', 'taxicab', 'l1', 'chebyshev', 'linfinity', 
'linfty', 'linf', 'minkowski', 'seuclidean', 'standardised_euclidean', 'wminkowski', 'weighted_minkowski', 
'mahalanobis', 'canberra', 'cosine', 'dot', 'correlation', 'hellinger', 'haversine', 'braycurtis', 'spearmanr', 
'kantorovich', 'wasserstein', 'tsss', 'true_angular', 'hamming', 'jaccard', 'dice', 'matching', 'kulsinski', 
'rogerstanimoto', 'russellrao', 'sokalsneath', 'sokalmichener', 'yule'])

KDTree supports members of the sklearn.neighbors.KDTree.valid_metrics() list, or parameterised sklearn.metrics.DistanceMetric objects:

>>> sklearn.neighbors.KDTree.valid_metrics()
['euclidean', 'l2', 'minkowski', 'p', 'manhattan', 'cityblock', 'l1', 'chebyshev', 'infinity']

set_op_mix_ratio : float, optional (default: 1)

UMAP connectivity computation parameter, a float between 0 and 1 that controls the blend between a connectivity matrix formed exclusively from mutual nearest neighbour pairs (0) and a union of all observed neighbour relationships with the mutual pairs emphasised (1).
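The blend can be sketched in plain Python. The formula below follows the fuzzy union and intersection combination used in UMAP's graph construction as commonly described; treat it as an assumption about the underlying computation rather than a statement of BBKNN's exact code. The function name mix_connectivities and the toy matrix are hypothetical.

```python
def mix_connectivities(a, set_op_mix_ratio=1.0):
    """Blend the fuzzy union and fuzzy intersection of a directed weight matrix.

    `a` is a square list of lists, with a[i][j] the directed membership
    strength from cell i to cell j.
    """
    n = len(a)
    out = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            prod = a[i][j] * a[j][i]          # intersection: non-zero only for mutual pairs
            union = a[i][j] + a[j][i] - prod  # union: any observed relationship counts
            out[i][j] = set_op_mix_ratio * union + (1.0 - set_op_mix_ratio) * prod
    return out


# Cells 0 and 1 are mutual neighbours; cell 0 points at cell 2, but not vice versa.
a = [[0.0, 1.0, 0.5],
     [0.4, 0.0, 0.0],
     [0.0, 0.0, 0.0]]
union_only = mix_connectivities(a, 1.0)   # one-sided edge 0-2 is kept
mutual_only = mix_connectivities(a, 0.0)  # one-sided edge 0-2 is dropped
```

In both cases the result is symmetric, since the union and the product are unchanged when i and j are swapped.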

local_connectivity : int, optional (default: 1)

UMAP connectivity computation parameter: how many nearest neighbours of each cell are assumed to be fully connected (and given a connectivity value of 1).

copy : bool, optional (default: False)

If True, return a copy instead of writing to the supplied adata.