Numero: a statistical framework to define multivariable subgroups in complex population-based datasets

Song Gao, Stefan Mutter, Aaron Casey, Ville-Petteri Makinen

Research output: Contribution to journal (Article)

1 Citation (Scopus)

Abstract

Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.
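The first two steps named in the abstract (a self-organizing map followed by permutation analysis) can be illustrated with a minimal sketch. It does not use the Numero library or reproduce its API: it is written in plain Python, and every function name, parameter value, the synthetic dataset and the choice of test statistic (between-unit variation of an outcome) are assumptions made for illustration only.

```python
import math
import random


def best_unit(protos, x):
    """Index of the prototype closest to x (squared Euclidean distance)."""
    return min(
        range(len(protos)),
        key=lambda k: sum((p - v) ** 2 for p, v in zip(protos[k], x)),
    )


def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a small rectangular SOM with online updates; returns prototypes."""
    rng = random.Random(seed)
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Start every unit from a randomly chosen data row.
    protos = [list(rng.choice(data)) for _ in grid]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)
        for i in order:
            frac = step / total_steps
            rate = 0.5 * (1.0 - frac) + 0.01  # learning rate decays over time
            radius = 0.5 * max(rows, cols) * (1.0 - frac) + 0.5  # neighbourhood shrinks
            br, bc = grid[best_unit(protos, data[i])]
            for k, (r, c) in enumerate(grid):
                # Gaussian neighbourhood weight on the map grid.
                h = math.exp(-((r - br) ** 2 + (c - bc) ** 2) / (2.0 * radius ** 2))
                for j, value in enumerate(data[i]):
                    protos[k][j] += rate * h * (value - protos[k][j])
            step += 1
    return protos


def between_unit_variation(assignments, values):
    """Sum over occupied units of n_k * (unit mean - grand mean)^2."""
    grand = sum(values) / len(values)
    sums, counts = {}, {}
    for unit, v in zip(assignments, values):
        sums[unit] = sums.get(unit, 0.0) + v
        counts[unit] = counts.get(unit, 0) + 1
    return sum(counts[u] * (sums[u] / counts[u] - grand) ** 2 for u in sums)


def permutation_p_value(assignments, values, n_perm=200, seed=1):
    """Share of value shuffles whose statistic reaches the observed one."""
    rng = random.Random(seed)
    observed = between_unit_variation(assignments, values)
    shuffled = list(values)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if between_unit_variation(assignments, shuffled) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Synthetic data without cluster structure: points drawn uniformly from a square.
rng = random.Random(42)
data = [[rng.random(), rng.random()] for _ in range(200)]
# Hypothetical outcome that rises with the first variable, plus noise.
outcome = [x[0] + rng.gauss(0.0, 0.1) for x in data]

protos = train_som(data, rows=3, cols=3)
assignments = [best_unit(protos, x) for x in data]
p_value = permutation_p_value(assignments, outcome)
print("permutation p-value for outcome:", p_value)
```

The final expert-driven subgrouping step described in the abstract is a manual judgment on the resulting map and is deliberately left out of the sketch.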

Language: English
Pages: 369-374
Number of pages: 6
Journal: International Journal of Epidemiology
Volume: 48
Issue number: 2
DOI: 10.1093/ije/dyy113
Publication status: Published - 1 Jan 2019

Keywords

  • Data-driven subgrouping
  • Multivariable statistics
  • Population data
  • Self-organizing map

ASJC Scopus subject areas

  • Epidemiology

Cite this

@article{77224a06acba4b42a51de53c6a865864,
title = "Numero: a statistical framework to define multivariable subgroups in complex population-based datasets",
abstract = "Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.",
keywords = "Data-driven subgrouping, Multivariable statistics, Population data, Self-organizing map",
author = "Song Gao and Stefan Mutter and Aaron Casey and Ville-Petteri Makinen",
year = "2019",
month = "1",
day = "1",
doi = "10.1093/ije/dyy113",
language = "English",
volume = "48",
pages = "369--374",
journal = "International Journal of Epidemiology",
issn = "0300-5771",
publisher = "Oxford University Press",
number = "2",
}

Numero: a statistical framework to define multivariable subgroups in complex population-based datasets. / Gao, Song; Mutter, Stefan; Casey, Aaron; Makinen, Ville-Petteri.

In: International Journal of Epidemiology, Vol. 48, No. 2, 01.01.2019, p. 369-374.

TY - JOUR

T1 - Numero: a statistical framework to define multivariable subgroups in complex population-based datasets

AU - Gao, Song

AU - Mutter, Stefan

AU - Casey, Aaron

AU - Makinen, Ville-Petteri

PY - 2019/1/1

Y1 - 2019/1/1

AB - Large-scale epidemiological and population data provide opportunities to identify subgroups of people who are at risk of disease or exposed to adverse environments. Clustering algorithms are popular data-driven tools to identify these subgroups; however, relying exclusively on algorithms may not produce the best results if the dataset does not have a clustered structure. For this reason, we propose a framework (the R-library Numero) that combines the self-organizing map algorithm, permutation analysis for statistical evidence and a final expert-driven subgrouping step. We used Numero to define subgroups in two examples without an obvious clustering structure: a biomedical dataset of kidney disease and another dataset of community-level socioeconomic indicators. We benchmarked the Numero subgroupings against popular clustering algorithms (principal components, K-means and hierarchical clustering). The Numero subgroupings were more intuitive and easier to interpret without losing mathematical quality. Therefore, we expect Numero to be useful for exploratory analyses of population-based epidemiological datasets.

KW - Data-driven subgrouping

KW - Multivariable statistics

KW - Population data

KW - Self-organizing map

UR - http://www.scopus.com/inward/record.url?scp=85067563708&partnerID=8YFLogxK

U2 - 10.1093/ije/dyy113

DO - 10.1093/ije/dyy113

M3 - Article

VL - 48

SP - 369

EP - 374

JO - International Journal of Epidemiology

T2 - International Journal of Epidemiology

JF - International Journal of Epidemiology

SN - 0300-5771

IS - 2

ER -