Clustering Over Time and Data Set Comparison

6 pages. Posted: 1 Sep 2009. Last revised: 4 Aug 2018.

Daniel M. Fleder

University of Pennsylvania - The Wharton School

Balaji Padmanabhan

University of South Florida - College of Business Administration

Date Written: August 30, 2009

Abstract

Cluster analysis is useful for data interpretation. Instead of studying thousands of records, one can create a smaller number of clusters and interpret a prototype for each. Often, however, the world being interpreted via clusters changes. The naive approach of independently reclustering the data each period has a significant drawback: even if the distribution of the data is unchanged, sampling variation can cause cluster prototypes to differ from one period to the next, which makes cluster solutions difficult to compare. In this paper we present a method for clustering sequential data sets and comparing cluster solutions over time. At a macro level, we examine how cluster prototypes change over time; at a micro level, we examine how objects transition among these prototypes. The method works as follows. We take as given the cluster prototypes from the first data set. In clustering the new data, the previous prototypes are constrained to remain unchanged; this ensures consistency between old and new prototypes. However, to fit the new data well, the second clustering must be flexible enough to add new prototypes where needed. This amounts to an optimization criterion that trades off consistency (reuse of old prototypes) against model fit (cluster fit on the new data). We formulate this as a constrained optimization problem and present a solution technique. A feature of the technique is its ability to incorporate prior knowledge from the first period to define an appropriate consistency-fit tradeoff. We envision that the method will have particular relevance for business, as firms increasingly manage their customers through segments for which new data arrive over time.
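
The abstract does not give algorithmic details, so the following is only a minimal sketch of the consistency-fit idea, not the authors' solution technique. It assumes a k-means-style squared-error fit, a farthest-first initialization for new prototypes, and a fixed penalty per added prototype. The names fit_with_frozen, cluster_new_period, transition_counts, max_new, and penalty are hypothetical, and at least one old prototype is assumed.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign(points, protos):
    """Index of the nearest prototype for each point."""
    return [min(range(len(protos)), key=lambda j: sq_dist(p, protos[j]))
            for p in points]

def fit_with_frozen(points, old_protos, n_new, n_iter=50):
    """k-means-style fit: old prototypes never move, n_new extra ones are free."""
    old = [list(p) for p in old_protos]
    new = []
    for _ in range(n_new):
        # Assumption: seed each new prototype at the worst-fitting point.
        far = max(points, key=lambda p: min(sq_dist(p, q) for q in old + new))
        new.append(list(far))
    for _ in range(n_iter):
        labels = assign(points, old + new)
        updated = []
        for j in range(n_new):
            members = [p for p, l in zip(points, labels) if l == len(old) + j]
            # Only the free prototypes are re-estimated as cluster means.
            updated.append([sum(c) / len(members) for c in zip(*members)]
                           if members else new[j])
        if updated == new:
            break
        new = updated
    protos = old + new
    labels = assign(points, protos)
    sse = sum(sq_dist(p, protos[l]) for p, l in zip(points, labels))
    return protos, labels, sse

def cluster_new_period(points, old_protos, max_new=3, penalty=1.0):
    """Pick the number of added prototypes minimizing SSE + penalty * n_new."""
    best = None
    for n_new in range(max_new + 1):
        protos, labels, sse = fit_with_frozen(points, old_protos, n_new)
        cost = sse + penalty * n_new  # fit term plus consistency penalty
        if best is None or cost < best[0]:
            best = (cost, protos, labels)
    return best[1], best[2]

def transition_counts(labels_before, labels_after):
    """Micro-level view: how many objects moved from prototype i to prototype j."""
    return dict(Counter(zip(labels_before, labels_after)))
```

In this sketch a larger penalty favors reusing the old prototypes, while a smaller one lets the new-period fit add prototypes more readily. How the paper actually sets this tradeoff from first-period knowledge is not specified in the abstract.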

Keywords: clustering, cluster analysis, resampling, penalty methods

Suggested Citation

Fleder, Daniel M. and Padmanabhan, Balaji, Clustering Over Time and Data Set Comparison (August 30, 2009). Available at SSRN: https://ssrn.com/abstract=1464537 or http://dx.doi.org/10.2139/ssrn.1464537

Daniel M. Fleder (Contact Author)

University of Pennsylvania - The Wharton School

Philadelphia, PA 19104
United States

Balaji Padmanabhan

University of South Florida - College of Business Administration

4202 E. Fowler Avenue, BSN 3403
Tampa, FL 33620-5500
United States
