Title: | Prediction and Clustering on the Torus by Conformal Prediction |
---|---|
Description: | Provides various tools of for clustering multivariate angular data on the torus. The package provides angular adaptations of usual clustering methods such as the k-means clustering, pairwise angular distances, which can be used as an input for distance-based clustering algorithms, and implements clustering based on the conformal prediction framework. Options for the conformal scores include scores based on a kernel density estimate, multivariate von Mises mixtures, and naive k-means clusters. Moreover, the package provides some basic data handling tools for angular data. |
Authors: | Sungkyu Jung [aut, cph], Seungki Hong [aut, cre], Kiho Park [ctb], Byungwon Kim [ctb] |
Maintainer: | Seungki Hong <[email protected]> |
License: | GPL-3 |
Version: | 0.2.4 |
Built: | 2025-02-07 04:28:02 UTC |
Source: | https://github.com/sungkyujung/clustorus |
ang.dist
computes element-wise angular distance between
two angular values in .
ang.dist(x, y)
ang.dist(x, y)
x , y
|
angular data(both scalar or vector) whose elements are in |
angular data (scalar or vector) whose elements are in
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
x <- c(pi/3, 0) y <- c(pi/4, pi/2) ang.dist(x, y)
x <- c(pi/3, 0) y <- c(pi/4, pi/2) ang.dist(x, y)
ang.minus
computes element-wise angular subtraction defined as
ang.minus(x, y)
ang.minus(x, y)
x , y
|
angular data(scalar or vector) whose elements are in |
returns a scalar or a vector whose elements are in
.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
x <- c(pi/2, 0) y <- c(pi, pi/3) ang.minus(x, y)
x <- c(pi/2, 0) y <- c(pi, pi/3) ang.minus(x, y)
ang.pdist
computes pairwise angular distances matrix.
ang.pdist(data)
ang.pdist(data)
data |
n x d angular data on |
ang.pdist
returns pairwise angular distances matrix
with the class dist
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
data <- matrix(c(pi/3, pi/3, pi/2, pi, pi/4, pi/2, 0, pi/3, pi/6), ncol = 3, byrow = TRUE) ang.pdist(data)
data <- matrix(c(pi/3, pi/3, pi/2, pi, pi/4, pi/2, 0, pi/3, pi/6), ncol = 3, byrow = TRUE) ang.pdist(data)
clus.torus
returns clustering results of data on the torus based on
inductive conformal prediction set
clus.torus( data, split.id = NULL, model = c("kmeans", "mixture"), mixturefitmethod = c("axis-aligned", "circular", "general"), kmeansfitmethod = c("general", "homogeneous-circular", "heterogeneous-circular", "ellipsoids"), J = NULL, level = NULL, option = NULL, verbose = TRUE, ... ) ## S3 method for class 'clus.torus' plot( x, panel = 1, assignment = "outlier", data = NULL, ellipse = TRUE, type = NULL, overlay = FALSE, out = FALSE, ... )
clus.torus( data, split.id = NULL, model = c("kmeans", "mixture"), mixturefitmethod = c("axis-aligned", "circular", "general"), kmeansfitmethod = c("general", "homogeneous-circular", "heterogeneous-circular", "ellipsoids"), J = NULL, level = NULL, option = NULL, verbose = TRUE, ... ) ## S3 method for class 'clus.torus' plot( x, panel = 1, assignment = "outlier", data = NULL, ellipse = TRUE, type = NULL, overlay = FALSE, out = FALSE, ... )
data |
n x d matrix of toroidal data on |
split.id |
a n-dimensional vector consisting of values 1 (estimation) and 2(evaluation) |
model |
A string. One of "mixture" and "kmeans" which
determines the model or estimation methods. If "mixture", the model is based
on the von Mises mixture, fitted
with an EM algorithm. It supports the von Mises mixture and its variants
based conformity scores. If "kmeans", the model is also based on the von
Mises mixture, but the parameter estimation is implemented with the
elliptical k-means algorithm. It supports the
log-max-mixture based conformity score only. If the
dimension of data space is greater than 2, only "kmeans" is supported.
Default is |
mixturefitmethod |
A string. One of "circular", "axis-aligned", and
"general" which determines the constraint of the EM fitting. Default is
"axis-aligned". This argument only works for |
kmeansfitmethod |
A string. One of "general", ellipsoids",
"heterogeneous-circular" or "homogeneous-circular". If "general", the
elliptical k-means algorithm with no constraint is used. If "ellipsoids",
only the one iteration of the algorithm is used. If"heterogeneous-circular",
the same as above, but with the constraint that ellipsoids must be spheres.
If "homogeneous-circular", the same as above but the radii of the spheres are
identical. Default is "general". This argument only works for |
J |
the number of components for mixture model fitting. If |
level |
a scalar in |
option |
A string. One of "elbow", "risk", "AIC", or "BIC", which determines the
criterion for the model selection. "risk" is based on the negative log-likelihood, "AIC"
for the Akaike Information Criterion, and "BIC" for the Bayesian Information Criterion.
"elbow" is based on minimizing the criterion used in Jung et. al.(2021).
This argument is only used if |
verbose |
boolean index, which indicates whether display
additional details as to what the algorithm is doing or
how many loops are done. Default is |
... |
Further arguments that will be passed to |
x |
|
panel |
One of 1 or 2 which determines the type of plot. If |
assignment |
A string. One of "outlier", "log.density", "posterior", "mahalanobis". Default is "outlier". |
ellipse |
A boolean index which determines whether plotting ellipse-intersections. Default is |
type |
A string. One of "mix", "max" or "e". This argument is only available if |
overlay |
A boolean index which determines whether plotting ellipse-intersections on clustering plots. Default is |
out |
An option for returning the ggplot object. Default is |
clus.torus
is a user-friendly all-in-one function which implements following
procedures automatically: 1. compute conformity scores for given model and fitting method,
2. choose optimal model and level based on prespecified criterion, and
3. make clusters based on the chosen model and level. Procedure 1-3 can be
independently done with icp.torus
, hyperparam.torus
,
hyperparam.J
, hyperparam.alpha
and cluster.assign.torus
.
If you want to see more detail for each procedure, please see
icp.torus
, hyperparam.J
, hyperparam.alpha
hyperparam.torus
, cluster.assign.torus
.
clus.torus
returns a clus.torus
object, which consists of following 3 different S3 objects;
cluster.obj
cluster.obj
object; clustering assignment results for
several methods. For detail, see cluster.assign.torus
.
icp.torus
icp.torus
object; containing model parameters and
conformity scores. For detail, see icp.torus
.
hyperparam.select
hyperparam.torus
object (if J = NULL
or a
sequence of numbers, and level = NULL
or a sequence of numbers), hyperparam.J
object (if level
is a scalar), or hyperparam.alpha
object (if J
is a scalar);
contains information for the optimally chosen model (number of components J) and level (alpha)
based on prespecified criterion. For detail, see hyperparam.torus
, hyperparam.J
, and hyperparam.alpha
.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Mardia, K. V., Kent, J. T., Zhang, Z., Taylor, C. C., & Hamelryck, T. (2012). Mixtures of concentrated multivariate sine distributions with applications to bioinformatics. Journal of Applied Statistics, 39(11), 2475-2492.
Shin, J., Rinaldo, A., & Wasserman, L. (2019). Predictive clustering. arXiv preprint arXiv:1903.08125.
icp.torus
, hyperparam.torus
,
hyperparam.J
, hyperparam.alpha
cluster.assign.torus
data <- toydata2[, 1:2] n <- nrow(data) clus.torus(data = data, model = "kmeans", kmeansfitmethod = "general", J = 5:30, option = "risk")
data <- toydata2[, 1:2] n <- nrow(data) clus.torus(data = data, model = "kmeans", kmeansfitmethod = "general", J = 5:30, option = "risk")
cluster.assign.torus
returns clustering assignment for data
given icp.torus
objects, which can be constructed with
icp.torus
.
plot.clus.torus
plots clustering results, which is given by cluster.obj
object, with some options.
cluster.assign.torus(icp.object, data = NULL, level = NULL) ## S3 method for class 'cluster.obj' plot( x, assignment = c("outlier", "log.density", "posterior", "mahalanobis"), overlay = FALSE, out = FALSE, ... )
cluster.assign.torus(icp.object, data = NULL, level = NULL) ## S3 method for class 'cluster.obj' plot( x, assignment = c("outlier", "log.density", "posterior", "mahalanobis"), overlay = FALSE, out = FALSE, ... )
icp.object |
an object must be an |
data |
n x d matrix of toroidal data on |
level |
a scalar in |
x |
|
assignment |
A string. One of "outlier", "log.density", "posterior", "mahalanobis". Default is "outlier". |
overlay |
A boolean index which determines whether plotting ellipse-intersections on clustering plots.
Default is |
out |
An option for returning the ggplot object. Default is |
... |
additional parameter for ggplot2::ggplot() |
clustering assignment for data, given icp.torus
objects
cluster.id.by.log.density
cluster assignment result based on approximate log-density.
cluster.id.by.posterior
cluster assignment result based on the posterior probability.
cluster.id.outlier
cluster assignment result which regards data not included in conformal prediction set as outliers.
cluster.id.by.Mah.dist
cluster assignment result based on Mahalanobis distance.
level
used level which determines the size of clusters(conformal prediction set).
data
input data which are assigned to each cluster.
icp.torus
icp.torus
object which is used for cluster assignment.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Gilitschenski, I., & Hanebeck, U. D. (2012, July). A robust computational test for overlap of two arbitrary-dimensional ellipsoids in fault-detection of kalman filters. In 2012 15th International Conference on Information Fusion (pp. 396-401). IEEE.
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", kmeansfitmethod = "general", J = 4, concentration = 25) level <- 0.1 cluster.assign.torus(icp.torus, level = level)
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", kmeansfitmethod = "general", J = 4, concentration = 25) level <- 0.1 cluster.assign.torus(icp.torus, level = level)
cp.torus.kde
computes conformal prediction set indices
(TRUE if in the set) using kernel density estimation as conformity score.
cp.torus.kde(data, eval.point = grid.torus(), level = 0.1, concentration = 25) ## S3 method for class 'cp.torus.kde' plot(x, level.id = 1, ...)
cp.torus.kde(data, eval.point = grid.torus(), level = 0.1, concentration = 25) ## S3 method for class 'cp.torus.kde' plot(x, level.id = 1, ...)
data |
n x d matrix of toroidal data on |
eval.point |
N x N numeric matrix on |
level |
either a scalar or a vector, or even |
concentration |
positive number which has the role of |
x |
|
level.id |
an integer among |
... |
additional parameter for ggplot2::ggplot() |
If level
is NULL
, then return kde at eval.point
and at data points.
If level
is a vector, return the above and prediction set indices
for each value of level.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Di Marzio, M., Panzera, A., & Taylor, C. C. (2011). Kernel density estimation on the torus. Journal of Statistical Planning and Inference, 141(6), 2156-2173.
data <- ILE[1:200, 1:2] cp.torus.kde(data, eval.point = grid.torus(), level = 0.05, concentration = 25)
data <- ILE[1:200, 1:2] cp.torus.kde(data, eval.point = grid.torus(), level = 0.05, concentration = 25)
The torsion angle dataset of SARS-CoV-2 spike glycopreotein.
data_6VXX
data_6VXX
data_6VXX
consists of following informations:
phi
main chain torsion angle for atoms C,N,CA,C.
psi
main chain torsion angle for atoms N,CA,C,N.
omega
main chain torsion angle for atoms CA,C,N,CA.
alpha
virtual torsion angle between consecutive C-alpha atoms.
chi1
side chain torsion angle for atoms N,CA,CB,*G.
chi2
side chain torsion angle for atoms CA,CB,*G,*D.
chi3
side chain torsion angle for atoms CB,*G,*D,*E.
chi4
side chain torsion angle for atoms *G,*D,*E,*Z.
chi5
side chain torsion angle for atoms *D,*E,*Z, NH1.
coords
numeric matrix of ‘justified’ coordinates.
tbl
a numeric matrix of psi, phi and chi torsion angles.
This data can be downloaded in
https://www.rcsb.org/structure/6VXX, or with using R package
bio3d
. Precisely, we use the code: bio3d::torsion.pdb(bio3d::read.pdb("6vxx"))
Walls, A. C., Park, Y. J., Tortorici, M. A., Wall, A., McGuire, A. T., & Veesler, D. (2020). Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell, 181(2), 281-292. Retrived from https://www.wwpdb.org/pdb?id=pdb_00006vxx
Description of the angluar information is from the 'value'
part of torsion.pdb
in the package bio3d
.
ellip.kmeans.torus
prepares the parameters for conformity scores
which are derived by k-means clustering on torus.
ellip.kmeans.torus( data, centers = 10, type = c("homogeneous-circular", "heterogeneous-circular", "ellipsoids", "general"), init = c("kmeans", "hierarchical"), d = NULL, additional.condition = TRUE, THRESHOLD = 1e-10, maxiter = 200, verbose = TRUE, ... )
ellip.kmeans.torus( data, centers = 10, type = c("homogeneous-circular", "heterogeneous-circular", "ellipsoids", "general"), init = c("kmeans", "hierarchical"), d = NULL, additional.condition = TRUE, THRESHOLD = 1e-10, maxiter = 200, verbose = TRUE, ... )
data |
data n x d matrix of toroidal data on |
centers |
either the number of clusters or a set of initial cluster centers. If a number, a random set of row in x is chosen as the initial centers. |
type |
character which must be "homogeneous-circular", "heterogeneous-circular", or "general". If "homogeneous-circular", the radii of k-spheres are identical. If "heterogeneous-circular", the radii of k-spheres may be different. If "ellipsoids", cluster with k-ellipsoids without optimized parameters. If, "general", clustering with k-ellipsoids. The parameters to construct the ellipses are optimized with elliptical k-means algorithm, which is modified for toroidal space. See references for the detail. Default is "homogeneous-circular". |
init |
determine the initial parameter for option "general". Must be "kmeans" or "hierarchical". If "kmeans", the initial parameters are obtained with extrinsic kmeans method. If "hierarchical", the initial parameters are obtained with hierarchical clustering method. Default is "hierarchical". |
d |
pairwise distance matrix( |
additional.condition |
boolean index.
If |
THRESHOLD |
number of threshold for difference between updating and updated parameters. Default is 1e-10. |
maxiter |
the maximal number of iteration. Default is 200. |
verbose |
boolean index, which indicates whether display
additional details as to what the algorithm is doing or
how many loops are done. Default is |
... |
Further arguments for argument |
returns a list, containing all values which determines the shape and location of spheres.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Mardia, K. V., Kent, J. T., Zhang, Z., Taylor, C. C., & Hamelryck, T. (2012). Mixtures of concentrated multivariate sine distributions with applications to bioinformatics. Journal of Applied Statistics, 39(11), 2475-2492.
Shin, J., Rinaldo, A., & Wasserman, L. (2019). Predictive clustering. arXiv preprint arXiv:1903.08125.
data <- ILE[1:200, 1:2] ellip.kmeans.torus(data, centers = 3, type = "general", init = "hierarchical")
data <- ILE[1:200, 1:2] ellip.kmeans.torus(data, centers = 3, type = "general", init = "hierarchical")
EMsinvMmix
returns fitted parameters of J-mixture of
bivariate sine von Mises distributions.
EMsinvMmix( data, J = 4, parammat = EMsinvMmix.init(data, J), THRESHOLD = 1e-10, maxiter = 100, type = c("circular", "axis-aligned", "general"), kmax = 500, verbose = TRUE )
EMsinvMmix( data, J = 4, parammat = EMsinvMmix.init(data, J), THRESHOLD = 1e-10, maxiter = 100, type = c("circular", "axis-aligned", "general"), kmax = 500, verbose = TRUE )
data |
n x 2 matrix of toroidal data on |
J |
number of components of mixture density |
parammat |
6 x J parameter data with the following components:
|
THRESHOLD |
number of threshold for difference between updating and updated parameters. |
maxiter |
the maximal number of iteration. |
type |
a string one of "circular", "axis-aligned", "general", and "Bayesian" which determines the fitting method. |
kmax |
the maximal number of kappa. If estimated kappa is
larger than |
verbose |
boolean index, which indicates whether display additional details as to what the algorithm is doing or how many loops are done. |
This algorithm is based on ECME algorithm. That is,
constructed with E - step and M - step and M - step
maximizes the parameters with given type
.
If type == "circular"
, then the mixture density is
just a product of two independent von Mises.
If type == "axis-aligned"
, then the mixture density is
the special case of type == "circular"
: only need to
take care of the common concentration parameter.
Iftype == "general"
, then the fitting the mixture
density is more complicated than before, check the detail of
the reference article.
returns approximated parameters for bivariate normal
distribution with list
:
list$Sigmainv[j]
: approximated covariance matrix for
j-th bivariate normal distribution, approximation of the j-th von Mises.
list$c[j]
: approximated for
j-th bivariate normal distribution, approximation of the j-th von Mises.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
data <- ILE[1:200, 1:2] EMsinvMmix(data, J = 3, THRESHOLD = 1e-10, maxiter = 200, type = "general", kmax = 500, verbose = FALSE)
data <- ILE[1:200, 1:2] EMsinvMmix(data, J = 3, THRESHOLD = 1e-10, maxiter = 200, type = "general", kmax = 500, verbose = FALSE)
grid.torus
returns an equally-spaced grid on torus.
grid.torus(d = 2, grid.size = 100)
grid.torus(d = 2, grid.size = 100)
d |
number for dimension. Default is 2. |
grid.size |
number of grid for each axis. Default value is 100. |
returns (grid.size) x (grid.size) numeric matrix which indicates the grid points on torus.
grid.torus(d = 2, grid.size = 100)
grid.torus(d = 2, grid.size = 100)
hyperparam.alpha
evaluates the numbers of clusters for various
levels, and select the optimal level based on the runs of the cluster numbers.
hyperparam.alpha(icp.torus, alphavec = NULL, alpha.lim = 0.15) ## S3 method for class 'hyperparam.alpha' plot(x, ...)
hyperparam.alpha(icp.torus, alphavec = NULL, alpha.lim = 0.15) ## S3 method for class 'hyperparam.alpha' plot(x, ...)
icp.torus |
an object containing all values to compute the conformity
score, which will be constructed with |
alphavec |
either a scalar or a vector, or even |
alpha.lim |
a positive number lower than 1, which is the upper bound of Default is 0.15. |
x |
|
... |
additional parameter for ggplot2::ggplot() |
returns a hyperparam.alpha
object which contains a data.frame
for
the numbers of clusters corresponding to the levels and the optimal level.
hyperparam.J
, hyperparam.torus
icp.torus
data <- toydata2[, 1:2] n <- nrow(data) split.id <- rep(2, n) split.id[sample(n, floor(n/2))] <- 1 icp.torus <- icp.torus(data, split.id = split.id, model = "kmeans", kmeansfitmethod = "ge", init = "h", J = 25, verbose = TRUE) hyperparam.alpha(icp.torus)
data <- toydata2[, 1:2] n <- nrow(data) split.id <- rep(2, n) split.id[sample(n, floor(n/2))] <- 1 icp.torus <- icp.torus(data, split.id = split.id, model = "kmeans", kmeansfitmethod = "ge", init = "h", J = 25, verbose = TRUE) hyperparam.alpha(icp.torus)
hyperparam.J
evaluates criterion for each icp.torus
objects, and select
the optimal number of mixture components based on the evaluated criterion.
hyperparam.J(icp.torus.objects, option = c("risk", "AIC", "BIC")) ## S3 method for class 'hyperparam.J' plot(x, ...)
hyperparam.J(icp.torus.objects, option = c("risk", "AIC", "BIC")) ## S3 method for class 'hyperparam.J' plot(x, ...)
icp.torus.objects |
a list whose elements are icp.torus objects, generated by
|
option |
a string one of "risk", "AIC", or "BIC", which determines the criterion for the model selection. "risk" is based on the negative log-likelihood, "AIC" for the Akaike Information Criterion, and "BIC" for the Bayesian Information Criterion. |
x |
|
... |
additional parameter for ggplot2::ggplot() |
returns a hyperparam.J
object which contains a data.frame
for
the evaluated criterion corresponding to each number of components, the optimal
number of components, and the corresponding icp.torus
object.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19(6), 716-723.
Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 461-464.
icp.torus
, hyperparam.torus
,
hyperparam.alpha
data <- toydata1[,1:2] n <- nrow(data) split.id <- rep(2,n) split.id[ sample(n,floor(n/2)) ] <- 1 Jvec = 4:20 icp.torus.objects <- icp.torus(data, split.id = split.id, model = "kmeans", J = Jvec) hyperparam.J(icp.torus.objects, option = "AIC")
data <- toydata1[,1:2] n <- nrow(data) split.id <- rep(2,n) split.id[ sample(n,floor(n/2)) ] <- 1 Jvec = 4:20 icp.torus.objects <- icp.torus(data, split.id = split.id, model = "kmeans", J = Jvec) hyperparam.J(icp.torus.objects, option = "AIC")
hyperparam.torus
selects optimal hyperparameters for constructing the conformal prediction
set, based on the type of postulated model and the criterion.
hyperparam.torus( icp.torus.objects, option = NULL, alphavec = NULL, alpha.lim = NULL, eval.point = NULL ) ## S3 method for class 'hyperparam.torus' plot(x, color = "auto", ...)
hyperparam.torus( icp.torus.objects, option = NULL, alphavec = NULL, alpha.lim = NULL, eval.point = NULL ) ## S3 method for class 'hyperparam.torus' plot(x, color = "auto", ...)
icp.torus.objects |
list whose elements are icp.torus objects, generated by
|
option |
A string. One of "elbow", "risk", "AIC", or "BIC", which determines the
criterion for the model selection. "risk" is based on the negative log-likelihood, "AIC"
for the Akaike Information Criterion, and "BIC" for the Bayesian Information Criterion.
"elbow" is based on minimizing the criterion used in Jung et. al.(2021). Default is
|
alphavec |
either a scalar or a vector, or even |
alpha.lim |
a positive number lower than 1. Default value is |
eval.point |
N x N numeric matrix on |
x |
|
color |
A string for plotting |
... |
additional parameter for ggplot2::ggplot() |
returns a list object which contains data.frame
objects for
the evaluated criterion corresponding to each hyperparameter,
selected hyperparameters based on the designated criterion, and
an icp.torus
object based the selected hyperparameters.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Akaike, H. (1974). A new look at the statistical model identification. IEEE transactions on automatic control, 19(6), 716-723.
Schwarz, G. (1978). Estimating the dimension of a model. The annals of statistics, 461-464.
data <- toydata2[, 1:2] n <- nrow(data) split.id <- rep(2, n) split.id[sample(n, floor(n/2))] <- 1 Jvec <- 3:35 icp.torus.objects <- icp.torus(data, split.id = split.id, model = "kmeans", kmeansfitmethod = "ge", init = "h", J = Jvec, verbose = TRUE) hyperparam.torus(icp.torus.objects, option = "risk")
data <- toydata2[, 1:2] n <- nrow(data) split.id <- rep(2, n) split.id[sample(n, floor(n/2))] <- 1 Jvec <- 3:35 icp.torus.objects <- icp.torus(data, split.id = split.id, model = "kmeans", kmeansfitmethod = "ge", init = "h", J = Jvec, verbose = TRUE) hyperparam.torus(icp.torus.objects, option = "risk")
icp.torus
prepares all values
for computing the conformity score for specified methods.
plot.icp.torus
plots icp.torus
object with some options.
icp.torus( data, split.id = NULL, model = c("kmeans", "kde", "mixture"), mixturefitmethod = c("axis-aligned", "circular", "general"), kmeansfitmethod = c("general", "homogeneous-circular", "heterogeneous-circular", "ellipsoids"), init = c("hierarchical", "kmeans"), d = NULL, additional.condition = TRUE, J = 4, concentration = 25, kmax = 500, THRESHOLD = 1e-10, maxiter = 200, verbose = TRUE, ... ) ## S3 method for class 'icp.torus' logLik(object, ...) ## S3 method for class 'icp.torus' predict(object, newdata, ...) ## S3 method for class 'icp.torus' plot( x, data = NULL, level = 0.1, ellipse = TRUE, out = FALSE, type = NULL, ... )
icp.torus( data, split.id = NULL, model = c("kmeans", "kde", "mixture"), mixturefitmethod = c("axis-aligned", "circular", "general"), kmeansfitmethod = c("general", "homogeneous-circular", "heterogeneous-circular", "ellipsoids"), init = c("hierarchical", "kmeans"), d = NULL, additional.condition = TRUE, J = 4, concentration = 25, kmax = 500, THRESHOLD = 1e-10, maxiter = 200, verbose = TRUE, ... ) ## S3 method for class 'icp.torus' logLik(object, ...) ## S3 method for class 'icp.torus' predict(object, newdata, ...) ## S3 method for class 'icp.torus' plot( x, data = NULL, level = 0.1, ellipse = TRUE, out = FALSE, type = NULL, ... )
data |
n x d matrix of toroidal data on |
split.id |
a n-dimensional vector consisting of values 1 (estimation) and 2(evaluation) |
model |
A string. One of "kde", "mixture", and "kmeans" which
determines the model or estimation methods. If "kde", the model is based
on the kernel density estimates. It supports the kde-based conformity score
only. If "mixture", the model is based on the von Mises mixture, fitted
with an EM algorithm. It supports the von Mises mixture and its variants
based conformity scores. If "kmeans", the model is also based on the von
Mises mixture, but the parameter estimation is implemented with the
elliptical k-means algorithm illustrated in Appendix. It supports the
log-max-mixture based conformity score only. If the
dimension of data space is greater than 2, only "kmeans" is supported.
Default is |
mixturefitmethod |
A string. One of "circular", "axis-aligned", and
"general" which determines the constraint of the EM fitting. Default is
"axis-aligned". This argument only works for |
kmeansfitmethod |
A string. One of "general", ellipsoids",
"heterogeneous-circular" or "homogeneous-circular". If "general", the
elliptical k-means algorithm with no constraint is used. If "ellipsoids",
only the one iteration of the algorithm is used. If"heterogeneous-circular",
the same as above, but with the constraint that ellipsoids must be spheres.
If "homogeneous-circular", the same as above but the radii of the spheres are
identical. Default is "general". This argument only works for |
init |
Methods for choosing initial values of "kmeans" fitting.
Must be "hierarchical" or "kmeans". If "hierarchical", the initial
parameters are obtained with hierarchical clustering method.
If "kmeans", the initial parameters are obtained with extrinsic k-means method.
Additional arguments for k-means clustering and hierarchical clustering can be designated
via argument |
d |
pairwise distance matrix( |
additional.condition |
boolean index.
If |
J |
A scalar or numeric vector for the number(s) of components for |
concentration |
A scalar or numeric vector for the concentration parameter(s) for |
kmax |
the maximal number of kappa. If estimated kappa is
larger than |
THRESHOLD |
number for difference between updating and updated parameters. Default is 1e-10. |
maxiter |
the maximal number of iteration. Default is 200. |
verbose |
boolean index, which indicates whether display
additional details as to what the algorithm is doing or
how many loops are done. Moreover, if |
... |
additional parameters. For plotting icp.torus, these parameters are for ggplot2::ggplot(). |
object |
|
newdata |
n x d matrix of toroidal data on |
x |
|
level |
either a numeric scalar or a vector in |
ellipse |
A boolean index which determines whether plotting ellipses from
mixture models. Default is |
out |
An option for returning the ggplot object. Default is |
type |
A string. One of "mix", "max" or "e". This argument is only available if |
icp.torus
returns an icp.torus
object, containing all values
to compute the conformity score (if J
or concentration
is a
single value). if J
or concentration
is a vector containing
multiple values, then icp.torus
returns a list of icp.torus
objects
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Mardia, K. V., Kent, J. T., Zhang, Z., Taylor, C. C., & Hamelryck, T. (2012). Mixtures of concentrated multivariate sine distributions with applications to bioinformatics. Journal of Applied Statistics, 39(11), 2475-2492.
Di Marzio, M., Panzera, A., & Taylor, C. C. (2011). Kernel density estimation on the torus. Journal of Statistical Planning and Inference, 141(6), 2156-2173.
Shin, J., Rinaldo, A., & Wasserman, L. (2019). Predictive clustering. arXiv preprint arXiv:1903.08125.
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", kmeansfitmethod = "general", J = 4, concentration = 25)
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", kmeansfitmethod = "general", J = 4, concentration = 25)
icp.torus.eval
evaluates whether each pre-specified evaluation point
is contained in the inductive conformal prediction sets for each given
level.
icp.torus.eval(icp.torus, level = 0.1, eval.point = grid.torus())
icp.torus.eval(icp.torus, level = 0.1, eval.point = grid.torus())
icp.torus |
an object containing all values to compute the conformity
score, which will be constructed with |
level |
either a scalar or a vector, or even |
eval.point |
N x N numeric matrix on |
returns a cp
object with the boolean values which
indicate whether each evaluation point is contained in the inductive
conformal prediction sets for each given level.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", mixturefitmethod = "general", J = 4, concentration = 25) icp.torus.eval(icp.torus, level = c(0.1, 0.08), eval.point = grid.torus())
data <- toydata1[, 1:2] icp.torus <- icp.torus(data, model = "kmeans", mixturefitmethod = "general", J = 4, concentration = 25) icp.torus.eval(icp.torus, level = c(0.1, 0.08), eval.point = grid.torus())
An isomer of leucine, essential branched-chain aliphatic amino acid found in many proteins.
ILE
ILE
This list contains the following components:
phi
main chain torsion angle for atoms C,N,CA,C.
psi
main chain torsion angle for atoms N,CA,C,N.
chi1
side chain torsion angle for atoms N,CA,CB,*G.
chi2
side chain torsion angle for atoms CA,CB,*G,*D.
ILE data is generated with collection of different pdb files. To select adequate protein data, we use PISCES server. (the method is introduced in articles of references.) To select high-quality protein data, we use several benchmarks: resolution : 1.6A(angstrom) or better, R-factor : 0.22 or better, Sequence percentage identity: <= 25 Then, we select ILE only angular data for each protein data. To see the detail code, visit https://github.com/sungkyujung/ClusTorus
This data is extracted from PISCES server http://dunbrack.fccc.edu/pisces/
Data description is from https://www.rcsb.org/ligand/ILE.
The data extracting method is from Harder, T., Boomsma, W., Paluszewski, M., Frellsen, J., Johansson, K. E., & Hamelryck, T. (2010). Beyond rotamers: a generative, probabilistic model of side chains in proteins. BMC bioinformatics, 11(1), 1-13.
Mardia, K. V., Kent, J. T., Zhang, Z., Taylor, C. C., & Hamelryck, T. (2012). Mixtures of concentrated multivariate sine distributions with applications to bioinformatics. Journal of Applied Statistics, 39(11), 2475-2492.
Description of the angluar information is from the 'value'
part of torsion.pdb
in the package bio3d
.
kde.torus
returns a kde using independent multivariate von mises kernel.
kde.torus(data, eval.point = NULL, concentration = 25)
kde.torus(data, eval.point = NULL, concentration = 25)
data |
n x d matrix of toroidal data on |
eval.point |
N x N numeric matrix on |
concentration |
positive number which has the role of |
kde.torus
returns N-dimensional vector of kdes evaluated at eval.point
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Di Marzio, M., Panzera, A., & Taylor, C. C. (2011). Kernel density estimation on the torus. Journal of Statistical Planning and Inference, 141(6), 2156-2173.
data <- ILE[1:200, 1:2] kde.torus(data)
data <- ILE[1:200, 1:2] kde.torus(data)
kmeans.torus
implements extrinsic k-means clustering
on toroidal space.
kmeans.torus(data, centers = 10, ...) ## S3 method for class 'kmeans.torus' predict(object, newdata, ...)
kmeans.torus(data, centers = 10, ...) ## S3 method for class 'kmeans.torus' predict(object, newdata, ...)
data |
n x d matrix of toroidal data on |
centers |
either the number of clusters or a set of initial cluster centers. If a number, a random set of row in x is chosen as the initial centers. |
... |
additional parameter |
object |
|
newdata |
n x d matrix of toroidal data on |
In Euclidean space, we know that the total sum of squares is equal to the summation of the within cluster sum of squares and the between cluster centers sum of squares. However, toroidal space does not satisfy the property; the equality does not hold. Thus, you need to be careful to use the sum of squares.
Extrinsic k-means algorithm uses the ambient space for .
Each datum is transformed to a vector in 2d-dimensional
Euclidean space, whose elements are sine and cosine values of the datum,
then a usual k-means algorithm is applied to transformed data.
returns a kmeans
object, which contains
extrinsic.results
extrinsic k-means clustering results using ordinary kmeans algorithm.
centers
A matrix of cluster centers.
membership
A vector of integers indicating the cluster to which each point is allocated.
size
The number of points in each cluster.
withinss
Vector of within-cluster sum of squares, one component per cluster.
totss
The total sum of squares, based on angular distance.
betweenss
The between-cluster sum of squares.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Gao, Y., Wang, S., Deng, M., & Xu, J. (2018). RaptorX-Angle: real-value prediction of protein backbone dihedral angles through a hybrid method of clustering and deep learning. BMC bioinformatics, 19(4), 73-84.
data <- ILE[1:200, 1:2] kmeans.torus(data, centers = 2, iter.max = 100, nstart = 1)
data <- ILE[1:200, 1:2] kmeans.torus(data, centers = 2, iter.max = 100, nstart = 1)
on.torus
transforms d-dimensional angular data
to be on .
on.torus(x)
on.torus(x)
x |
d-dimensional angular data(vector or matrix) whose unit is the radian. |
d-dimensional radian-unit angular data on .
data <- SARS_CoV_2 * pi / 180 on.torus(data)
data <- SARS_CoV_2 * pi / 180 on.torus(data)
The torsion angle dataset of the chain B of SARS-CoV-2 spike glycopreotein. This data is originally
from first two main torsion angles of data_6VXX
.
SARS_CoV_2
SARS_CoV_2
This data.frame contains the following columns:
phi
main chain torsion angle for atoms C,N,CA,C.
psi
main chain torsion angle for atoms N,CA,C,N.
This data is obtained with following codes:
This data can be downloaded in
https://www.rcsb.org/structure/6VXX, or with using R package
bio3d
. To see the precise extracting code, visit https://github.com/sungkyujung/ClusTorus/tree/master/data-raw
Walls, A. C., Park, Y. J., Tortorici, M. A., Wall, A., McGuire, A. T., & Veesler, D. (2020). Structure, function, and antigenicity of the SARS-CoV-2 spike glycoprotein. Cell, 181(2), 281-292. Retrived from https://www.wwpdb.org/pdb?id=pdb_00006vxx
Description of the angluar information is from the 'value'
part of torsion.pdb
in the package bio3d
.
tor.minus
computes angular subtraction bewtween n x d toroidal data and
a d dimensional vector.
tor.minus(data, mu)
tor.minus(data, mu)
data |
n x d matrix of toroidal data |
mu |
a d-dimensinal vector |
angular subtraction bewtween n x d toroidal data and a d dimensional vector.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
data <- ILE[1:200, 1:2] Mu1 <- c(4.5, 3) tor.minus(data, Mu1)
data <- ILE[1:200, 1:2] Mu1 <- c(4.5, 3) tor.minus(data, Mu1)
Artificially generated data on the 2 dimensional torus
toydata1
toydata1
This data.frame
contains the following components:
phi
column for the first angle
psi
column for the second angle
label
column for the clustering membership
toydata1 is an artificial data generated from a mixture of 5 clusters, where three clusters are sampled from bivariate normal distributions and the other two are each sampled from the uniform distribution on a rectangle.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
Artificially generated data on the 2 dimensional torus
toydata2
toydata2
This data.frame
contains the following components:
phi
column for the first angle
psi
column for the second angle
label
column for the clustering membership
toydata2 is an artificial data generated from a mixture of 3 clusters, where the first cluster is sampled from a spherical normal distribution, the second cluster is from the uniform distribution on a large “L”-shaped region, and the third cluster of size 50 is sampled from the uniform distribution on the entire 2-dimensional torus.
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
wtd.stat.ang
computes weighted extrinsic mean direction and
mean resultant length.
wtd.stat.ang(data, w)
wtd.stat.ang(data, w)
data |
angular data whose elements are in |
w |
numeric vector whose each element is non-negative and
|
list which is consisting of the following components:
Mean
weighted extrinsic mean direction
R
mean resultant length
Jung, S., Park, K., & Kim, B. (2021). Clustering on the torus by conformal prediction. The Annals of Applied Statistics, 15(4), 1583-1603.
data <- matrix(c(pi/3, pi/3, pi/2, pi, pi/4, pi/2, 0, pi/3, pi/6), ncol = 3, byrow = TRUE) w <- c(0.3, 0.3, 0.4) wtd.stat.ang(data, w)
data <- matrix(c(pi/3, pi/3, pi/2, pi, pi/4, pi/2, 0, pi/3, pi/6), ncol = 3, byrow = TRUE) w <- c(0.3, 0.3, 0.4) wtd.stat.ang(data, w)