--- title: "Demonstration of latrend package" author: "Niek Den Teuling" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Demonstration of latrend package} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} %\VignetteDepends{ggplot2} %\VignetteDepends{kml} %\VignetteDepends{mclustcomp} --- ```{r setup, include = FALSE} set.seed(1) knitr::opts_chunk$set( cache = TRUE, collapse = TRUE, fig.width = 7, fig.align = "center", fig.topcaption = TRUE, comment = "#>", eval = all(vapply(c('ggplot2', 'kml', 'lme4', 'mclustcomp'), requireNamespace, FUN.VALUE = TRUE, quietly = TRUE)) # needed to prevent errors for _R_CHECK_DEPENDS_ONLY_=true despite VignetteDepends declaration ) ``` This vignette describes the core functionality of the package by identifying common trends in the longitudinal dataset that is included with the package. We begin by loading the required package and the `latrendData` dataset. ```{r, results='hide',message=FALSE,warning=FALSE} library(latrend) library(ggplot2) ``` # Initial data exploration The `latrendData` is a synthetic dataset for which the reference group of each trajectory is available, as indicated by the `Class` column. We will use this column at the end of this vignette to validate the identified model. ```{r} data(latrendData) head(latrendData) ``` Many of the functions of the package require the specification of the trajectory identifier variable (named `Id`) and the time variable (named `Time`). For convenience, we specify these variables as package options. ```{r} options(latrend.id = "Id", latrend.time = "Time") ``` Prior to attempting to model the data, it is worthwhile to visually inspect it. ```{r, fig.asp = .6, fig.cap='Visualizing the trajectories of the `latrend` dataset.'} plotTrajectories(latrendData, response = "Y") ``` The presence of clusters is not apparent from the plot. With any longitudinal analysis, one should first consider whether clustering brings any benefit to the representation of heterogeneity of the data over a single common trend representation, or a multilevel model. In this demonstration, we omit this step under the prior knowledge that the data was generated via distinct mechanisms. # Non-parametric trajectory clustering Assuming the appropriate (cluster) trajectory model is not known in advance, non-parametric longitudinal cluster models can provide a suitable starting point. As an example, we apply longitudinal $k$-means (KML). First, we need to define the method. At the very least we need to indicate the response variable the method should operate on. Secondly, we should indicate how many clusters we expect. We do not need to define the `id` and `time` arguments as we have set these as package options. We use the `nbRedrawing` argument provided by the KML package for reducing the number of repeated random starts to only a single model estimation, in order to reduce the run-time of this example. ```{r} kmlMethod <- lcMethodKML(response = "Y", nClusters = 2, nbRedrawing = 1) kmlMethod ``` As seen in the output from the `lcMethodKML` object, the KML method is defined by additional arguments. These are specific to the `kml` package. The KML model is estimated on the dataset via the `latrend` function. ```{r} kmlModel <- latrend(kmlMethod, data = latrendData) ``` Now that we have fitted the KML model with 2 clusters, we can print a summary by calling: ```{r} kmlModel ``` ## Identifying the number of clusters As we do not know the best number of clusters needed to represent the data, we should consider fitting the KML model for a range of clusters. We can then select the best representation by comparing the solutions by one or more cluster metrics. We can specify a range of `lcMethodKML` methods based on a prototype method using the `lcMethods` function. This method outputs a list of `lcMethod` objects. A structured summary is obtained by calling `as.data.frame`. ```{r} kmlMethods <- lcMethods(kmlMethod, nClusters = 1:7) as.data.frame(kmlMethods) ``` The list of `lcMethod` objects can be fitted using the `latrendBatch` function, returning a list of `lcModel` objects. ```{r} kmlModels <- latrendBatch(kmlMethods, data = latrendData, verbose = FALSE) kmlModels ``` We can compare each of the solutions via one or more cluster metrics. Considering the consistent improvements achieved by KML for an increasing number of clusters, identifying the best solution by minimizing a metric would lead to an overestimation. Instead, we perform the selection via a manual elbow method, using the `plotMetric` function. ```{r warning=FALSE, fig.cap = 'Elbow plots of three relevant cluster metrics across the fitted models.'} plotMetric(kmlModels, c("logLik", "BIC", "WMAE")) ``` ## Investigating the preferred model We have selected the 4-cluster model as the preferred representation. We will now inspect this solution in more detail. Before we can start, we first obtain the fitted `lcModel` object from the list of fitted models. ```{r} kmlModel4 <- subset(kmlModels, nClusters == 4, drop = TRUE) kmlModel4 ``` The `plotClusterTrajectories` function shows the estimated cluster trajectories of the model. ```{r, fig.cap = 'Cluster trajectories for KML model with 4 clusters.'} plotClusterTrajectories(kmlModel4) ``` We can get a better sense of the representation of the cluster trajectories when plotted against the trajectories that have been assigned to the respective cluster. ```{r, fig.asp = .8, fig.cap = 'Cluster trajectories for KML model with 4 clusters, along with the assigned trajectories.'} plot(kmlModel4) ``` The cluster assignment for each available subject is obtained using: ```{r} trajectoryAssignments(kmlModel4) ``` We can use this to construct a table of the cluster per subject, and merge it into the original data for further analysis. ```{r} # make sure to change the Id column name for your respective id column name subjectClusters = data.frame(Id = ids(kmlModel4), Cluster = trajectoryAssignments(kmlModel4)) head(subjectClusters) posthocAnalysisData = merge(latrendData, subjectClusters, by = 'Id') head(posthocAnalysisData) aggregate(Y ~ Cluster, posthocAnalysisData, mean) ``` ## Model adequacy The list of currently supported internal model metrics can be obtained by calling the `getInternalMetricNames` function. ```{r} getInternalMetricNames() ``` As an example, we will compute the APPA (a measure of cluster separation), and the WRSS and WMAE metrics (measures of model error). ```{r} metric(kmlModel, c("APPA.mean", "WRSS", "WMAE")) ``` The quantile-quantile (QQ) plot can be used to assess the model for structural deviations. ```{r, fig.width = 5, fig.cap = 'QQ-plot of the selected KML model.'} qqPlot(kmlModel4) ``` Overall, the unexplained errors closely follow a normal distribution. In situations where structural deviations from the expected distribution are apparent, it may be fruitful to investigate the QQ plot on a per-cluster basis. ```{r, fig.asp = .8, fig.cap = 'Cluster-specific detrended QQ-plot for the selected KML model.'} qqPlot(kmlModel4, byCluster = TRUE, detrend = TRUE) ``` # Parametric trajectory clustering The KML analysis has provided us with clues on an appropriate model for the cluster trajectories. We can use these insights to define a parametric representation of each of the trajectories and cluster them using k-means. ```{r} lmkmMethod <- lcMethodLMKM(formula = Y ~ Time) lmkmMethod ``` We fit LMKM for 1 to 4 clusters. ```{r, echo=TRUE, results='hide'} lmkmMethods <- lcMethods(lmkmMethod, nClusters = 1:5) lmkmModels <- latrendBatch(lmkmMethods, data = latrendData, verbose = FALSE) ``` ```{r warning=FALSE, fig.cap = 'Three cluster metrics for each of the GBTMs.'} plotMetric(lmkmModels, c("logLik", "BIC", "WMAE")) ``` All metrics clearly point to the three-cluster solution. ```{r} bestLmkmModel <- subset(lmkmModels, nClusters == 3, drop=TRUE) plot(bestLmkmModel) ``` # Comparison to the reference Since we have established a preferred clustered representation of the data heterogeneity, we can now compare the resulting cluster assignments to the ground truth from which the `latrendData` data was generated. Using the reference assignments, we can also plot a non-parametric estimate of the cluster trajectories. Note how it looks similar to the cluster trajectories found by our model. ```{r, fig.width = 5, fig.cap = 'Non-parametric estimates of the cluster trajectories based on the reference assignments.'} plotClusterTrajectories(latrendData, response = "Y", cluster = "Class") ``` In order to compare the reference assignments to the trajectory assignments generated by our model, we can create a `lcModel` object based on the reference assignments using the `lcModelPartition` function. ```{r} refTrajAssigns <- aggregate(Class ~ Id, data = latrendData, FUN = data.table::first) refModel <- lcModelPartition(data = latrendData, response = "Y", trajectoryAssignments = refTrajAssigns$Class) refModel ``` ```{r, fig.cap = 'Cluster trajectories of the reference model.'} plot(refModel) ``` By constructing a reference model, we can make use of the standardized way in which `lcModel` objects can be compared. A list of supported comparison metrics can be obtained via the `getExternalMetricNames` function. ```{r} getExternalMetricNames() ``` Lastly, we compare the agreement in trajectory assignments via the adjusted Rand index. ```{r} externalMetric(bestLmkmModel, refModel, "adjustedRand") ``` With a score of `externalMetric(bestLmkmModel, refModel, "adjustedRand")`, we have a near-perfect match. This result is expected, as the dataset was generated using a growth mixture model.