Introduction to ClusTorus

ClusTorus is a package for clustering multivariate angular data, especially for protein structure data. ClusTorus provides various clustering algorithms designed with the conformal prediction framework, which can deal with the outliers. The package suggests various methods for fitting the algorithms, and some of them will be introduced soon. The package also provides some simple tools for handling angluar data, such as angular subtraction, computing angular distance, etc. Now, check how to use ClusTorus briefly.

Data Loading and Handling

ClusTorus provides two toy datasets, which are used in Jung, et.al.(2021), “Clustering on the Torus by Conformal Prediction”, Annals of Applied Statistics. The dataset we will use here is sampled from a mixture of K = 3 clusters, where the first cluster is sampled from a spherical normal distribution with size n₁ = 100, the second cluster of size n₂ = 350 is from the uniform distribution on a large “L”-shaped region, and the third cluster of size 50 is sampled from the uniform distribution on the entire 𝕋².

library(ClusTorus)
library(tidyverse)

data <- toydata2
head(data)
#>        phi       psi label
#> 1 2.730154 0.2819140     1
#> 2 3.004941 0.7895926     1
#> 3 2.733773 0.9507966     1
#> 4 2.976103 0.6336779     1
#> 5 2.655409 0.6599409     1
#> 6 3.196350 0.6545109     1

data %>% ggplot(aes(x = phi, y = psi, color = label)) + geom_point() +
  scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
  scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
  ggtitle('Data set 2 with true labels')

ClusTorus provides the function on.torus, which converts the radian scale data to be on [0, 2π)^d, where d means the number of dimension. In this case, the provided dataset is already on [0, 2π)^d, and thus we don’t need to use on.torus.

Clustering with Various Options

Now, we are ready to implement clustering algorithms to the data. ClusTorus provides various options for clustering/constructing prediction set, but we will provide only the case for “kmeans - general”. We need to choose hyperparameters: the number of modes or ellipsoids J and the significance level α. Before choosing the hyperparameter, we will implement the model fitting function icp.torus.score with various hyperparameter options, first.

set.seed(2021)

Jvec <- 5:30
l <- icp.torus(as.matrix(data[, 1:2]),
                            model = "kmeans",
                            kmeansfitmethod = "general",
                            init = "hierarchical",
                            J = Jvec,
                            verbose = FALSE)

The list l contains the icp.torus objects, which consist of fitted parameters for generating clusters, by varying the hypterparameter J. That is, these objects are optimally fitted ingredients for generating clusters for given J. By specifying the significance level, we can get the clusters. But, how to generate optimal clusters/conformal prediction sets? One may think that the hyperparameter which generates the cluster/prediction set of the minimum volume/area will be the optimum for given significance level. The other may think that we can choose the number of mixture components J by using information criteria such as AIC or BIC. These approaches are implemented in the function hyperparam.torus; the main arguments of hyperparam.torus is data, icp.torus.objects, and option. Analogously, the argument data is analogously for the data. icp.torus.objects is for the list object whose elements are icp.torus objects, such as the list l generated above. The argument option, which is the most important argument, is for the hyperparameter selection criterion. If option = "elbow", then hyperparam.torus selects J and α based on the volume based criterion as mentioned above. If option = "AIC" or option = "BIC", then hyperparam.torus selects J based on the designated information criterion, and selects the most stable α in the sense of the number of generated clusters. If option = "risk", then it chooses J which minimizes the sum of the conformity scores and analogously choose the stable α. We will use option = "risk" in this case, and the following codes show the criterion results(J versus the evaluated criterion), α results(α versus the number of clusters), and chosen J and α.

output <- hyperparam.torus(l, option = "risk")
#> ..........................
#> .......
output
#> Type of conformity score: kmeans general 
#> Optimizing method: risk 
#> -------------
#> Optimally chosen parameters. Number of components =  5 , alpha =  0.084 
#> Results based on criterion risk :
#>     J criterion
#> 1   5  2495.134
#> 2   6  2499.865
#> 3   7  2501.007
#> 4   8  2510.881
#> 5   9  2536.576
#> 6  10  2555.744
#> 7  11  2552.515
#> 8  12  2549.668
#> 9  13  2568.499
#> 10 14  2585.362
#> 11 15  2595.043
#> 12 16  2608.337
#> 13 17  2604.204
#> 14 18  2598.733
#> 15 19  2612.264
#> 16 20  2616.506
#> 17 21  2638.969
#> 18 22  2639.597
#> 19 23  2654.307
#> 20 24  2656.181
#> 21 25  2659.685
#> 22 26  2668.536
#> 23 27  2668.536
#> 24 28  2737.958
#> 25 29  2759.033
#> 26 30  2754.286
#> 
#> Clustering results by varying alpha on [0.0020.15] :
#>    alpha ncluster
#> 1  0.002        1
#> 2  0.004        1
#> 3  0.006        1
#> 4  0.008        1
#> 5  0.010        1
#> 6  0.012        1
#> 7  0.014        1
#> 8  0.016        1
#> 9  0.018        1
#> 10 0.020        2
#> 11 0.022        2
#> 12 0.024        2
#> 13 0.026        2
#> 14 0.028        3
#> 15 0.030        3
#> 16 0.032        4
#> 17 0.034        4
#> 18 0.036        4
#> 19 0.038        4
#> 20 0.040        4
#> 21 0.042        4
#> 22 0.044        4
#> 23 0.046        4
#> 24 0.048        4
#> 25 0.050        4
#> 26 0.052        4
#> 27 0.054        4
#> 28 0.056        4
#> 29 0.058        2
#> 30 0.060        2
#> 31 0.062        2
#> 32 0.064        2
#> 33 0.066        2
#> 34 0.068        2
#> 35 0.070        2
#> 36 0.072        2
#> 37 0.074        2
#> 38 0.076        2
#> 39 0.078        2
#> 40 0.080        2
#> 41 0.082        2
#> 42 0.084        2
#> 43 0.086        2
#> 44 0.088        2
#> 45 0.090        2
#> 46 0.092        2
#> 47 0.094        2
#> 48 0.096        2
#> 49 0.098        2
#> 50 0.100        2
#> 51 0.102        2
#> 52 0.104        2
#> 53 0.106        2
#> 54 0.108        2
#> 55 0.110        2
#> 56 0.112        3
#> 57 0.114        3
#> 58 0.116        3
#> 59 0.118        3
#> 60 0.120        3
#> 61 0.122        3
#> 62 0.124        3
#> 63 0.126        3
#> 64 0.128        3
#> 65 0.130        3
#> 66 0.132        3
#> 67 0.134        3
#> 68 0.136        3
#> 69 0.138        3
#> 70 0.140        3
#> 71 0.142        3
#> 72 0.144        3
#> 73 0.146        3
#> 74 0.148        3
#> 75 0.150        3
#> 
#> Available components:
#> [1] "model"         "option"        "IC.results"    "alpha.results"
#> [5] "icp.torus"     "Jhat"          "alphahat"
plot(output)

hyperparam.torus also provides the selected model; that is, since icp.torus.score is the function for the model fitting, hyperparam.torus selects the model of the chosen J. output$optim$icp.torus is the selected model, and we will generate the resulting clusters with it.

icp.torus.kmeans <- output$icp.torus
alphahat <- output$alphahat

Generating Clusters and Visualization of Clustering Results

With cluster.assign.torus, we can generate the cluster for option mixture and kmeans, and for each data point, the label of cluster is assigned. Moreover, for the case of kmeans, we can check how the cluster is generated directly as below:

c_kmeans <- cluster.assign.torus(icp.torus.kmeans, level = alphahat)
plot(icp.torus.kmeans, level = alphahat)

Also, with the assigned cluster labels, we can visualize the clusters with colored data points;

c_kmeans
#> Number of clusters: 2 
#> -------------
#> Clustering results by log density:
#>  [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 209 791 
#> 
#> Clustering results by posterior:
#>  [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 245 755 
#> 
#> Clustering results by representing outliers:
#>  [1] 1 1 1 1 1 1 1 3 1 1
#> cluster sizes: 196 704 100 
#> 
#>  Note: cluster id = 3 represents outliers.
#> 
#> Clustering results by Mahalanobis distance:
#>  [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 209 791 
#> 
#> 990 clustering results are omitted.
plot(c_kmeans)
#> Warning in get_plot_component(plot, "guide-box"): Multiple components found;
#> returning the first one. To return all, use `return_all = TRUE`.

It seems that the resulting clusters perform satisfactory compared to the original cluster assignments.

If you clearly follow the examples, it will be relatively easy to use the other options for the argument method = "kde" or method = "mixture" of the model fitting function icp.torus.score, and for the argument option = "AIC", option = "BIC", option = "elbow" of the hyperparameter selecting function hyperparam.torus. You may need the detailed manual. In the package manual, concrete examples and descriptions for the functions and their arguments are listed. Please check the package manual and see the details.