ClusTorus is a package for clustering multivariate angular data, especially for protein structure data. ClusTorus provides various clustering algorithms designed with the conformal prediction framework, which can deal with the outliers. The package suggests various methods for fitting the algorithms, and some of them will be introduced soon. The package also provides some simple tools for handling angluar data, such as angular subtraction, computing angular distance, etc. Now, check how to use ClusTorus briefly.
ClusTorus provides two toy datasets, which are used in Jung, et.al.(2021), “Clustering on the Torus by Conformal Prediction”, Annals of Applied Statistics. The dataset we will use here is sampled from a mixture of \(K = 3\) clusters, where the first cluster is sampled from a spherical normal distribution with size \(n_1 = 100\), the second cluster of size \(n_2 = 350\) is from the uniform distribution on a large “L”-shaped region, and the third cluster of size 50 is sampled from the uniform distribution on the entire \(\mathbb{T}^2\).
library(ClusTorus)
library(tidyverse)
data <- toydata2
head(data)
#> phi psi label
#> 1 2.730154 0.2819140 1
#> 2 3.004941 0.7895926 1
#> 3 2.733773 0.9507966 1
#> 4 2.976103 0.6336779 1
#> 5 2.655409 0.6599409 1
#> 6 3.196350 0.6545109 1
data %>% ggplot(aes(x = phi, y = psi, color = label)) + geom_point() +
scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+
ggtitle('Data set 2 with true labels')
ClusTorus provides the function
on.torus, which converts
the radian scale data to be on \([0,
2\pi)^d\), where \(d\) means the
number of dimension. In this case, the provided dataset is already on
\([0, 2\pi)^d\), and thus we don’t need
to use on.torus.
Now, we are ready to implement clustering algorithms to the data.
ClusTorus provides various options for clustering/constructing
prediction set, but we will provide only the case for “kmeans -
general”. We need to choose hyperparameters: the number of modes or
ellipsoids \(J\) and the significance
level \(\alpha\). Before choosing the
hyperparameter, we will implement the model fitting function
icp.torus.score with various hyperparameter options,
first.
set.seed(2021)
Jvec <- 5:30
l <- icp.torus(as.matrix(data[, 1:2]),
model = "kmeans",
kmeansfitmethod = "general",
init = "hierarchical",
J = Jvec,
verbose = FALSE)The list l contains the icp.torus objects,
which consist of fitted parameters for generating clusters, by varying
the hypterparameter \(J\). That is,
these objects are optimally fitted ingredients for generating clusters
for given \(J\). By specifying the
significance level, we can get the clusters. But, how to generate
optimal clusters/conformal prediction sets? One may think that the
hyperparameter which generates the cluster/prediction set of the minimum
volume/area will be the optimum for given significance level. The other
may think that we can choose the number of mixture components \(J\) by using information criteria such as
AIC or BIC. These approaches are implemented in the function
hyperparam.torus; the main arguments of
hyperparam.torus is data,
icp.torus.objects, and option. Analogously,
the argument data is analogously for the data.
icp.torus.objects is for the list object whose elements are
icp.torus objects, such as the list l
generated above. The argument option, which is the most
important argument, is for the hyperparameter selection criterion. If
option = "elbow", then hyperparam.torus
selects \(J\) and \(\alpha\) based on the volume based
criterion as mentioned above. If option = "AIC" or
option = "BIC", then hyperparam.torus selects
\(J\) based on the designated
information criterion, and selects the most stable \(\alpha\) in the sense of the number of
generated clusters. If option = "risk", then it chooses
\(J\) which minimizes the sum of the
conformity scores and analogously choose the stable \(\alpha\). We will use
option = "risk" in this case, and the following codes show
the criterion results(\(J\) versus the
evaluated criterion), \(\alpha\)
results(\(\alpha\) versus the number of
clusters), and chosen \(J\) and \(\alpha\).
output <- hyperparam.torus(l, option = "risk")
#> ..........................
#> .......
output
#> Type of conformity score: kmeans general
#> Optimizing method: risk
#> -------------
#> Optimally chosen parameters. Number of components = 5 , alpha = 0.084
#> Results based on criterion risk :
#> J criterion
#> 1 5 2495.134
#> 2 6 2499.865
#> 3 7 2501.007
#> 4 8 2510.881
#> 5 9 2536.576
#> 6 10 2555.744
#> 7 11 2552.515
#> 8 12 2549.668
#> 9 13 2568.499
#> 10 14 2585.362
#> 11 15 2595.043
#> 12 16 2608.337
#> 13 17 2604.204
#> 14 18 2598.733
#> 15 19 2612.264
#> 16 20 2616.506
#> 17 21 2638.969
#> 18 22 2639.597
#> 19 23 2654.307
#> 20 24 2656.181
#> 21 25 2659.685
#> 22 26 2668.536
#> 23 27 2668.536
#> 24 28 2737.958
#> 25 29 2759.033
#> 26 30 2754.286
#>
#> Clustering results by varying alpha on [0.0020.15] :
#> alpha ncluster
#> 1 0.002 1
#> 2 0.004 1
#> 3 0.006 1
#> 4 0.008 1
#> 5 0.010 1
#> 6 0.012 1
#> 7 0.014 1
#> 8 0.016 1
#> 9 0.018 1
#> 10 0.020 2
#> 11 0.022 2
#> 12 0.024 2
#> 13 0.026 2
#> 14 0.028 3
#> 15 0.030 3
#> 16 0.032 4
#> 17 0.034 4
#> 18 0.036 4
#> 19 0.038 4
#> 20 0.040 4
#> 21 0.042 4
#> 22 0.044 4
#> 23 0.046 4
#> 24 0.048 4
#> 25 0.050 4
#> 26 0.052 4
#> 27 0.054 4
#> 28 0.056 4
#> 29 0.058 2
#> 30 0.060 2
#> 31 0.062 2
#> 32 0.064 2
#> 33 0.066 2
#> 34 0.068 2
#> 35 0.070 2
#> 36 0.072 2
#> 37 0.074 2
#> 38 0.076 2
#> 39 0.078 2
#> 40 0.080 2
#> 41 0.082 2
#> 42 0.084 2
#> 43 0.086 2
#> 44 0.088 2
#> 45 0.090 2
#> 46 0.092 2
#> 47 0.094 2
#> 48 0.096 2
#> 49 0.098 2
#> 50 0.100 2
#> 51 0.102 2
#> 52 0.104 2
#> 53 0.106 2
#> 54 0.108 2
#> 55 0.110 2
#> 56 0.112 3
#> 57 0.114 3
#> 58 0.116 3
#> 59 0.118 3
#> 60 0.120 3
#> 61 0.122 3
#> 62 0.124 3
#> 63 0.126 3
#> 64 0.128 3
#> 65 0.130 3
#> 66 0.132 3
#> 67 0.134 3
#> 68 0.136 3
#> 69 0.138 3
#> 70 0.140 3
#> 71 0.142 3
#> 72 0.144 3
#> 73 0.146 3
#> 74 0.148 3
#> 75 0.150 3
#>
#> Available components:
#> [1] "model" "option" "IC.results" "alpha.results"
#> [5] "icp.torus" "Jhat" "alphahat"
plot(output)hyperparam.torus also provides the selected model; that
is, since icp.torus.score is the function for the model
fitting, hyperparam.torus selects the model of the chosen
\(J\).
output$optim$icp.torus is the selected model, and we will
generate the resulting clusters with it.
With cluster.assign.torus, we can generate the cluster
for option mixture and kmeans, and for each
data point, the label of cluster is assigned. Moreover, for the case of
kmeans, we can check how the cluster is generated directly
as below:
c_kmeans <- cluster.assign.torus(icp.torus.kmeans, level = alphahat)
plot(icp.torus.kmeans, level = alphahat)Also, with the assigned cluster labels, we can visualize the clusters with colored data points;
c_kmeans
#> Number of clusters: 2
#> -------------
#> Clustering results by log density:
#> [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 209 791
#>
#> Clustering results by posterior:
#> [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 245 755
#>
#> Clustering results by representing outliers:
#> [1] 1 1 1 1 1 1 1 3 1 1
#> cluster sizes: 196 704 100
#>
#> Note: cluster id = 3 represents outliers.
#>
#> Clustering results by Mahalanobis distance:
#> [1] 1 1 1 1 1 1 1 1 1 1
#> cluster sizes: 209 791
#>
#> 990 clustering results are omitted.
plot(c_kmeans)It seems that the resulting clusters perform satisfactory compared to the original cluster assignments.
If you clearly follow the examples, it will be relatively easy to use
the other options for the argument method = "kde" or
method = "mixture" of the model fitting function
icp.torus.score, and for the argument
option = "AIC", option = "BIC",
option = "elbow" of the hyperparameter selecting function
hyperparam.torus. You may need the detailed manual. In the
package manual, concrete examples and descriptions for the functions and
their arguments are listed. Please check the package manual and see the
details.