--- title: "Introduction to ClusTorus" author: "Seungki Hong" date: "1/21/2021" output: html_document vignette: > %\VignetteIndexEntry{Introduction to ClusTorus} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ClusTorus is a package for clustering multivariate angular data, especially for protein structure data. ClusTorus provides various clustering algorithms designed with the conformal prediction framework, which can deal with the outliers. The package suggests various methods for fitting the algorithms, and some of them will be introduced soon. The package also provides some simple tools for handling angluar data, such as angular subtraction, computing angular distance, etc. Now, check how to use ClusTorus briefly. ## Data Loading and Handling ClusTorus provides two toy datasets, which are used in Jung, et.al.(2021), "Clustering on the Torus by Conformal Prediction", Annals of Applied Statistics. The dataset we will use here is sampled from a mixture of $K = 3$ clusters, where the first cluster is sampled from a spherical normal distribution with size $n_1 = 100$, the second cluster of size $n_2 = 350$ is from the uniform distribution on a large ā€œLā€-shaped region, and the third cluster of size 50 is sampled from the uniform distribution on the entire $\mathbb{T}^2$. ```{r, message = FALSE, warning = FALSE} library(ClusTorus) library(tidyverse) data <- toydata2 head(data) data %>% ggplot(aes(x = phi, y = psi, color = label)) + geom_point() + scale_x_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+ scale_y_continuous(breaks = c(0,1,2,3,4)*pi/2, labels = c("0","pi/2","pi","3pi/2","2pi"), limits = c(0,2*pi))+ ggtitle('Data set 2 with true labels') ``` ClusTorus provides the function `on.torus`, which converts the radian scale data to be on $[0, 2\pi)^d$, where $d$ means the number of dimension. In this case, the provided dataset is already on $[0, 2\pi)^d$, and thus we don't need to use `on.torus`. ## Clustering with Various Options Now, we are ready to implement clustering algorithms to the data. ClusTorus provides various options for clustering/constructing prediction set, but we will provide only the case for "kmeans - general". We need to choose hyperparameters: the number of modes or ellipsoids $J$ and the significance level $\alpha$. Before choosing the hyperparameter, we will implement the model fitting function `icp.torus.score` with various hyperparameter options, first. ```{r, warning = FALSE} set.seed(2021) Jvec <- 5:30 l <- icp.torus(as.matrix(data[, 1:2]), model = "kmeans", kmeansfitmethod = "general", init = "hierarchical", J = Jvec, verbose = FALSE) ``` The list `l` contains the `icp.torus` objects, which consist of fitted parameters for generating clusters, by varying the hypterparameter $J$. That is, these objects are optimally fitted ingredients for generating clusters for given $J$. By specifying the significance level, we can get the clusters. But, how to generate optimal clusters/conformal prediction sets? One may think that the hyperparameter which generates the cluster/prediction set of the minimum volume/area will be the optimum for given significance level. The other may think that we can choose the number of mixture components $J$ by using information criteria such as AIC or BIC. These approaches are implemented in the function `hyperparam.torus`; the main arguments of `hyperparam.torus` is `data`, `icp.torus.objects`, and `option`. Analogously, the argument `data` is analogously for the data. `icp.torus.objects` is for the list object whose elements are `icp.torus` objects, such as the list `l` generated above. The argument `option`, which is the most important argument, is for the hyperparameter selection criterion. If `option = "elbow"`, then `hyperparam.torus` selects $J$ and $\alpha$ based on the volume based criterion as mentioned above. If `option = "AIC"` or `option = "BIC"`, then `hyperparam.torus` selects $J$ based on the designated information criterion, and selects the most stable $\alpha$ in the sense of the number of generated clusters. If `option = "risk"`, then it chooses $J$ which minimizes the sum of the conformity scores and analogously choose the stable $\alpha$. We will use `option = "risk"` in this case, and the following codes show the criterion results($J$ versus the evaluated criterion), $\alpha$ results($\alpha$ versus the number of clusters), and chosen $J$ and $\alpha$. ```{r} output <- hyperparam.torus(l, option = "risk") output plot(output) ``` `hyperparam.torus` also provides the selected model; that is, since `icp.torus.score` is the function for the model fitting, `hyperparam.torus` selects the model of the chosen $J$. `output$optim$icp.torus` is the selected model, and we will generate the resulting clusters with it. ```{r} icp.torus.kmeans <- output$icp.torus alphahat <- output$alphahat ``` ## Generating Clusters and Visualization of Clustering Results With `cluster.assign.torus`, we can generate the cluster for option `mixture` and `kmeans`, and for each data point, the label of cluster is assigned. Moreover, for the case of `kmeans`, we can check how the cluster is generated directly as below: ```{r} c_kmeans <- cluster.assign.torus(icp.torus.kmeans, level = alphahat) plot(icp.torus.kmeans, level = alphahat) ``` Also, with the assigned cluster labels, we can visualize the clusters with colored data points; ```{r} c_kmeans plot(c_kmeans) ``` It seems that the resulting clusters perform satisfactory compared to the original cluster assignments. If you clearly follow the examples, it will be relatively easy to use the other options for the argument `method = "kde"` or `method = "mixture"` of the model fitting function `icp.torus.score`, and for the argument `option = "AIC"`, `option = "BIC"`, `option = "elbow"` of the hyperparameter selecting function `hyperparam.torus`. You may need the detailed manual. In the package manual, concrete examples and descriptions for the functions and their arguments are listed. Please check the package manual and see the details.