𝔖 Bobbio Scriptorium
✦   LIBER   ✦

Sample size and dimensionality in multivariate classification: Implications for body surface potential mapping

✍ Scribed by Gy. Kozmann; R.L. Lux; Marshall Scott


Publisher
Elsevier Science
Year
1991
Tongue
English
Weight
779 KB
Volume
24
Category
Article
ISSN
0010-4809


✦ Synopsis


This paper presents empirically determined guidelines for specifying the number of features appropriate for multivariate classification studies for given sample sizes. Sample size was considered adequate if the mean distance between two sample sets, taken from the same continuous multivariate distribution and projected onto the best separating direction, remained below a prescribed level. To quantify the sample size requirement, the homogeneity of pairs of sample sets of equal size, N, taken from the same continuous multivariate distribution was studied as a function of dimensionality, M. Homogeneity was characterized by the maximum absolute distance (D_max) between the corresponding pairs of empirical cumulative probability distributions on the best separating projection. Computer-generated data sets were used to estimate the cumulative probability distribution, P(D)_{M,N}, for sample sizes, N, ranging from 5 to 100 and dimensionality, M, ranging from 1 to 4. An empirical relationship between the estimated step-polygons and the Kolmogorov-type one-dimensional limiting distribution L(Z) has been established. Based on the sample size data of 34 key papers on clinical body surface potential mapping (BSPM), it is noted that in 30% of the cases only one, and in 6% of the cases only two, parameters could be used for statistical group representation to ensure a reasonable reliability (D_max < 0.2). In 56% of the published cases the sample sizes could not guarantee this reliability even for one feature or parameter. © 1991 Academic Press, Inc.
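The experiment described in the synopsis can be illustrated with a small Monte Carlo sketch. It is not the authors' procedure. The abstract does not say which distribution was sampled or how the best separating direction was found, so this sketch assumes standard normal data and approximates the best direction by trying a fixed number of random unit vectors. All function names and defaults (`ks_distance`, `best_projection_distance`, `mean_distance`, `n_directions`, `trials`) are hypothetical.

```python
import math
import random


def ks_distance(a, b):
    """Maximum absolute gap between the empirical CDFs of two 1-D samples."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both step functions past every point <= x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def random_unit_vector(m, rng):
    """Draw a direction uniformly on the unit sphere in m dimensions."""
    v = [rng.gauss(0.0, 1.0) for _ in range(m)]
    norm = math.sqrt(sum(c * c for c in v))
    return [c / norm for c in v]


def project(points, w):
    """Project each m-dimensional point onto the direction w."""
    return [sum(p_k * w_k for p_k, w_k in zip(p, w)) for p in points]


def best_projection_distance(x, y, rng, n_directions=200):
    """Largest KS distance found over random projections.

    Assumption: random search stands in for the paper's (unspecified)
    best separating direction, so this is a lower bound on the true maximum.
    """
    m = len(x[0])
    if m == 1:
        return ks_distance(project(x, [1.0]), project(y, [1.0]))
    best = 0.0
    for _ in range(n_directions):
        w = random_unit_vector(m, rng)
        best = max(best, ks_distance(project(x, w), project(y, w)))
    return best


def mean_distance(m, n, trials=200, n_directions=200, seed=0):
    """Average distance for two size-n samples from the same m-dim normal."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        x = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
        y = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(n)]
        total += best_projection_distance(x, y, rng, n_directions)
    return total / trials


if __name__ == "__main__":
    for m in (1, 2, 4):
        for n in (5, 20, 100):
            print(m, n, round(mean_distance(m, n, trials=50, n_directions=50), 3))
```

Under these assumptions the average distance should grow with M and shrink with N, which is the qualitative behaviour the synopsis relies on when it limits the number of usable features for a given sample size.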

An important problem in the development of practical multivariate classification schemes is the proper selection of the number of classification variables relative to class sample sizes. Use of too many classification variables will likely yield overly optimistic predictions of classifier performance and will prove unacceptable when prospectively tested on data with large class size. In this paper we addressed the problem by designing a classifier to separate two data classes, each of size N and generated from the same M-dimensional multivariate probability distribution. The rationale for this approach is that the ability to separate two samples of data generated from the same distribution is a chance event


📜 SIMILAR VOLUMES


Counting, measuring, and mapping in fish
✍ Andrew D. Carothers 📂 Article 📅 1994 🏛 John Wiley and Sons 🌐 English ⚖ 685 KB

Abstract: Statistical models are used to investigate the need for automation in several potential areas of application of FISH-labelling techniques, including perinatal and tumour cytogenetics, genetic toxicology, and gene mapping. Predictions of the models, based on current estimates of likely