Clustering Over Time and Data Set Comparison
6 Pages · Posted: 1 Sep 2009 · Last revised: 4 Aug 2018
Date Written: August 30, 2009
Cluster analysis is useful for data interpretation. Instead of studying thousands of records, one can create a smaller number of clusters and interpret a prototype for each. Often, however, the world being interpreted via clusters can change. The naive approach of independently reclustering the data each period has a significant drawback: even if the data's distribution is unchanged, sampling variation can cause cluster prototypes to differ from one period to the next, making cluster solutions difficult to compare. In this paper we present a method for clustering sequential data sets and comparing cluster solutions over time. At a macro level, we examine how cluster prototypes change over time; at a micro level, we examine how objects transition among these prototypes. The method works as follows. We take as given the cluster prototypes from the first data set. In clustering the new data, the previous prototypes are constrained to remain unchanged; this enforces consistency between old and new prototypes. To fit the new data well, however, the second clustering must be flexible enough to add new prototypes where needed. This amounts to an optimization criterion that trades off consistency (reuse of old prototypes) against model fit (cluster fit on the new data). We formulate this as a constrained optimization problem and present a solution technique. A feature of the technique is its ability to incorporate prior knowledge from the first period to define an appropriate consistency-fit tradeoff. We envision the method will have particular relevance for business, as firms increasingly manage their customers through segments for which new data arrives over time.
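To make the consistency-fit tradeoff concrete, the following is a minimal sketch of the idea in Python, not the paper's actual formulation or solution technique: old prototypes stay fixed, and a k-means-style heuristic adds a new prototype only when the reduction in squared-error fit exceeds a per-prototype penalty. The function name, the `penalty` and `max_new` parameters, and the greedy seed-at-worst-point rule are all illustrative assumptions.

```python
import numpy as np

def cluster_with_fixed_prototypes(X, old_prototypes, penalty=1.0, max_new=5, n_iter=20):
    """Greedy sketch of period-two clustering (illustrative, not the paper's method).

    Old prototypes are held fixed for consistency; new prototypes are added
    one at a time, each accepted only if it improves total squared-error fit
    by more than `penalty`.
    """
    X = np.asarray(X, dtype=float)
    old = np.asarray(old_prototypes, dtype=float)
    new = np.empty((0, X.shape[1]))

    def total_cost(protos):
        # squared distance of every point to its nearest prototype
        d = ((X[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        return d, d.min(1).sum()

    for _ in range(max_new):
        d, cost = total_cost(np.vstack([old, new]))
        # propose a new prototype at the worst-fit point
        worst = d.min(1).argmax()
        cand = np.vstack([new, X[worst]])
        # refine only the new prototypes with k-means steps (old ones stay fixed)
        for _ in range(n_iter):
            d2, _ = total_cost(np.vstack([old, cand]))
            lab2 = d2.argmin(1)
            for j in range(len(cand)):
                mask = lab2 == len(old) + j
                if mask.any():
                    cand[j] = X[mask].mean(0)
        _, new_cost = total_cost(np.vstack([old, cand]))
        # accept the extra prototype only if the fit gain beats the penalty
        if cost - new_cost > penalty:
            new = cand
        else:
            break

    protos = np.vstack([old, new])
    d, _ = total_cost(protos)
    return protos, d.argmin(1)
```

In this toy version the penalty plays the role of the paper's consistency term: a large penalty reuses the old prototypes as-is, while a small one lets the solution drift toward an independent reclustering.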
Keywords: clustering, cluster analysis, resampling, penalty methods