Clustering#

aitlas.clustering.kmeans module#

class Kmeans(k)[source]#

Bases: object

cluster(data, verbose=False)[source]#

Performs k-means clustering.

Parameters:

data (np.array (N * dim)) – data to cluster
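
As a rough illustration of what k-means does to an (N * dim) array, the following NumPy sketch runs plain Lloyd iterations. It is not the aitlas implementation: the helper name kmeans_sketch and its n_iter and seed parameters are hypothetical and exist only for this example.

```python
import numpy as np


def kmeans_sketch(data, k, n_iter=20, seed=0):
    """Hypothetical helper: plain Lloyd's k-means on an (N, dim) array."""
    rng = np.random.default_rng(seed)
    # Initialise the centroids from k distinct data points.
    centroids = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared L2 distance from every point to every centroid: shape (N, k).
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return labels, centroids


# Two well-separated blobs of 50 points each.
rng = np.random.default_rng(1)
x_data = np.vstack([rng.normal(0.0, 0.1, (50, 2)), rng.normal(5.0, 0.1, (50, 2))])
labels, centroids = kmeans_sketch(x_data, k=2)
```

Each entry of labels is the index of the centroid closest to the corresponding row of the input.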

aitlas.clustering.pic module#

class PIC(args=None, sigma=0.2, nnn=5, alpha=0.001, distribute_singletons=True)[source]#

Bases: object

Performs Power Iteration Clustering (PIC) on a graph of nearest neighbors. The args parameter is accepted for consistency with the k-means initializer.

Parameters:
  • sigma (float) – bandwidth of the Gaussian kernel (default 0.2)

  • nnn (int) – number of nearest neighbors (default 5)

  • alpha (float) – parameter in PIC (default 0.001)

  • distribute_singletons (bool) – If True, reassign each singleton to the cluster of its closest nonsingleton nearest neighbors (up to nnn nearest neighbors).

  • images_lists (list of lists of ints) – for each cluster, the list of image indexes belonging to this cluster

cluster(data, verbose=False)[source]#

aitlas.clustering.utils module#

preprocess_features(npdata, pca=256)[source]#

Preprocess an array of features.

Parameters:
  • npdata (np.array (N * dim)) – features to preprocess

  • pca (int) – dimension of the output

Returns:

data PCA-reduced, whitened and L2-normalized

Return type:

np.array (N * pca)
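
The three steps named above (PCA reduction, whitening, L2 normalization) can be sketched in NumPy as follows. This is a minimal sketch and not the library's code: the helper name preprocess_sketch and the eps parameter are assumptions made for the example.

```python
import numpy as np


def preprocess_sketch(npdata, pca=256, eps=1e-8):
    """Hypothetical helper: PCA-reduce, whiten, then L2-normalize the rows."""
    x = npdata.astype(np.float64)
    x = x - x.mean(axis=0)  # centre each feature
    # Eigendecomposition of the covariance matrix (eigh returns ascending order).
    cov = x.T @ x / (len(x) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:pca]  # keep the top `pca` components
    # Project, then divide each component by its standard deviation (whitening).
    projected = x @ eigvecs[:, order]
    whitened = projected / np.sqrt(eigvals[order] + eps)
    # Scale every row to unit L2 norm.
    norms = np.linalg.norm(whitened, axis=1, keepdims=True)
    return whitened / np.maximum(norms, eps)


features = np.random.default_rng(0).normal(size=(200, 16))
reduced = preprocess_sketch(features, pca=4)
```

After this step every row of reduced has unit length, so Euclidean distances between rows depend only on the angle between them.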

make_graph(xb, nnn)[source]#

Builds a graph of nearest neighbors.

Parameters:
  • xb (np.array (N * dim)) – data

  • nnn (int) – number of nearest neighbors

Returns:
  • for each data point, the list of ids of its nnn nearest neighbors

  • for each data point, the list of distances to its nnn nearest neighbors

Return type:

np.array (N * nnn)
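
To make the two returned arrays concrete, here is a brute-force nearest-neighbor sketch in NumPy. It only illustrates the idea: the helper name knn_graph_sketch is hypothetical, the use of squared L2 distances is an assumption, and excluding each point from its own neighbor list is a choice made for this example.

```python
import numpy as np


def knn_graph_sketch(xb, nnn):
    """Hypothetical helper: brute-force nearest-neighbor search on (N, dim) data."""
    # Pairwise squared L2 distances, shape (N, N).
    sq = (xb ** 2).sum(axis=1)
    dists = sq[:, None] + sq[None, :] - 2.0 * xb @ xb.T
    dists = np.maximum(dists, 0.0)  # clip tiny negatives caused by round-off
    np.fill_diagonal(dists, np.inf)  # assumption: a point is not its own neighbor
    ids = np.argsort(dists, axis=1)[:, :nnn]
    return ids, np.take_along_axis(dists, ids, axis=1)


points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 3.0], [10.0, 10.0]])
ids, distances = knn_graph_sketch(points, nnn=2)
```

Both outputs have shape (N, nnn): row i of ids lists the neighbors of point i from nearest to farthest, and row i of distances holds the matching distances.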

class ReassignedDataset(image_indexes, pseudolabels, dataset)[source]#

Bases: Dataset

A dataset whose new image labels are passed as arguments.

Parameters:
  • image_indexes (list of ints) – list of data indexes

  • pseudolabels (list of ints) – list of labels for each data

  • dataset (list of tuples with paths to images) – initial dataset

  • transform (callable, optional) – a function/transform that takes in a PIL image and returns a transformed version

make_dataset(image_indexes, pseudolabels)[source]#

cluster_assign(images_lists, dataset)[source]#

Creates a dataset from clustering, with clusters as labels.

Parameters:
  • images_lists – for each cluster, the list of image indexes belonging to this cluster

  • dataset – initial dataset

Returns:

dataset with clusters as labels

Return type:

ReassignedDataset(torch.utils.data.Dataset)
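
The per-cluster lists have to be turned into one label per image before a relabeled dataset can be built. One plausible way to do that is sketched below in plain Python. The helper name flatten_clusters is hypothetical, and this is not necessarily how cluster_assign does it internally.

```python
def flatten_clusters(images_lists):
    """Hypothetical helper: turn per-cluster index lists into parallel lists."""
    image_indexes, pseudolabels = [], []
    for cluster_id, members in enumerate(images_lists):
        image_indexes.extend(members)
        # Every image in this cluster gets the cluster id as its label.
        pseudolabels.extend([cluster_id] * len(members))
    return image_indexes, pseudolabels


# Cluster 0 holds images 0 and 3, cluster 1 holds image 2, cluster 2 holds 1 and 4.
images_lists = [[0, 3], [2], [1, 4]]
image_indexes, pseudolabels = flatten_clusters(images_lists)
print(image_indexes)  # [0, 3, 2, 1, 4]
print(pseudolabels)   # [0, 0, 1, 2, 2]
```

The two lists have the same shape as the image_indexes and pseudolabels arguments documented for ReassignedDataset.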

run_kmeans(x, nmb_clusters, verbose=False)[source]#

Runs k-means on 1 GPU.

Parameters:
  • x (np.array (N * dim)) – data

  • nmb_clusters (int) – number of clusters

Returns:

for each data point, the id of its nearest cluster

Return type:

list of ints

arrange_clustering(images_lists)[source]#

make_adjacencyW(I, D, sigma)[source]#

Create adjacency matrix with a Gaussian kernel.

Parameters:
  • I (numpy array) – for each vertex, the ids of its nnn linked vertices, preceded by a first column holding the vertex's own id.

  • D (numpy array) – for each data point, the L2 distances to its nnn linked vertices, preceded by a first column of zeros.

  • sigma (float) – bandwidth of the Gaussian kernel.

Returns:

affinity matrix of the graph.

Return type:

scipy.sparse.csr_matrix
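
The sketch below shows how a Gaussian-kernel affinity matrix can be filled in from the I and D arrays described above. It is illustrative only. The helper name gaussian_adjacency_sketch is hypothetical, the kernel form exp(-d / sigma^2) is an assumption, and a dense NumPy array is used for simplicity where the documented function returns a scipy.sparse.csr_matrix.

```python
import numpy as np


def gaussian_adjacency_sketch(I, D, sigma):
    """Hypothetical helper: dense Gaussian-kernel affinity from a k-NN graph."""
    n_vertices = I.shape[0]
    neighbors = I[:, 1:]  # drop the first column (each vertex's own id)
    dists = D[:, 1:]      # drop the first column (zeros)
    # Assumed kernel form: exp(-d / sigma^2).
    weights = np.exp(-dists / sigma ** 2)
    W = np.zeros((n_vertices, n_vertices))
    # Row index of every edge: vertex i repeated once per neighbor.
    rows = np.repeat(np.arange(n_vertices), neighbors.shape[1])
    W[rows, neighbors.ravel()] = weights.ravel()
    return W


# Three vertices, each linked to the two others; the first column is the vertex itself.
I = np.array([[0, 1, 2], [1, 0, 2], [2, 1, 0]])
D = np.array([[0.0, 1.0, 4.0], [0.0, 1.0, 1.0], [0.0, 1.0, 4.0]])
W = gaussian_adjacency_sketch(I, D, sigma=1.0)
```

Closer neighbors receive weights nearer to 1 and distant ones decay toward 0. A smaller sigma makes that decay steeper.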

run_pic(I, D, sigma, alpha)[source]#

Runs the PIC (Power Iteration Clustering) algorithm.

find_maxima_cluster(W, v)[source]#