Unsupervised Discovery of Depression Biomarkers

Using multimodal machine learning to identify hidden subtypes of Major Depressive Disorder

  • 2 subtypes discovered
  • p-value: 0.0112 (significant at p < 0.05)
  • 33 clinical interviews (14 depressed, 19 healthy)
  • 396 features extracted (100 text + 296 acoustic)

Project Overview

The Problem

Depression affects an estimated 280 million people worldwide, yet diagnosis remains subjective and relies on clinical interviews and questionnaires. These methods are not designed to reveal hidden subtypes or objective biomarkers.

The Solution

Apply unsupervised machine learning to discover hidden depression subtypes from multimodal clinical data: speech acoustics, linguistic patterns, and behavioral features.

The Impact

Enable objective, data-driven depression diagnosis. Move from subjective assessment to personalized treatment based on discovered subtypes and biomarker patterns.

Key Results

Statistical Validation

  • Chi-square (χ²): 6.44
  • p-value: 0.0112 (significant, p < 0.05)
  • Degrees of freedom: 1
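The reported p-value can be sanity-checked from the statistic alone. With one degree of freedom, the chi-square upper-tail probability reduces to erfc(sqrt(χ²/2)), so a standard-library sketch is enough. The function name `chi2_sf_df1` is illustrative and not part of the project; `scipy.stats.chi2.sf(6.44, 1)` should give the same value.

```python
import math


def chi2_sf_df1(statistic: float) -> float:
    """Upper-tail p-value of a chi-square distribution with 1 degree of freedom."""
    # With df = 1 the variable is a squared standard normal, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(statistic / 2.0))


p_value = chi2_sf_df1(6.44)
print(f"p = {p_value:.4f}")
```

This only confirms that χ² = 6.44 and p = 0.0112 are consistent with each other at df = 1. It does not reproduce the test itself, which needs the cluster-by-label contingency table.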

Clustering Performance

  • Optimal k: 2 (two distinct subtypes)
  • Silhouette score: 0.168 (positive but weak separation; range is -1 to 1)
  • Davies-Bouldin index: 1.871 (moderate compactness; lower is better)
  • Calinski-Harabasz index: 8.3 (moderate density; higher is better)
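For readers unfamiliar with these three scores, the sketch below shows how they can be computed with scikit-learn. It runs on synthetic data shaped like the reduced feature matrix (33 × 27), so the printed values will not match the figures above; the data, seeds, and variable names are all assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)

# Synthetic stand-in for a 33 x 27 PCA-reduced matrix (NOT the study data).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(19, 27)),
    rng.normal(loc=1.0, scale=1.0, size=(14, 27)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

metrics = {
    "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
    "davies_bouldin": davies_bouldin_score(X, labels),        # >= 0, lower is better
    "calinski_harabasz": calinski_harabasz_score(X, labels),  # > 0, higher is better
}
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```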

Key Findings

  • 2 distinct depression subtypes identified through unsupervised learning
  • Statistically significant association with PHQ-8 clinical labels
  • Multimodal features (text + acoustic) enable subtype discovery
  • 95.4% of variance retained after a 93% reduction in dimensionality (396 → 27)
  • Validated against PHQ-8 labels from the gold-standard DAIC-WOZ benchmark dataset

Methodology

1. Data Collection: DAIC-WOZ Depression Database → 33 interviews
2. Feature Extraction: TF-IDF (100) + COVAREP (296) → 396 features
3. Dimensionality Reduction: PCA (95% variance) → 27 components
4. Clustering: K-Means algorithm → k=2
5. Validation: Chi-square test → p=0.0112
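Steps 3 and 4 can be sketched with scikit-learn as follows. This is a minimal illustration on random data of the same shape, not the project's code. Standardizing before PCA, the seeds, and all variable names are assumptions, since the page does not say how features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder for the 33 x 396 matrix (100 TF-IDF + 296 acoustic columns).
rng = np.random.default_rng(42)
features = rng.normal(size=(33, 396))

# A float n_components keeps the fewest components whose cumulative
# explained variance exceeds that fraction (here 95%).
reducer = make_pipeline(
    StandardScaler(),  # assumption: scaling is not stated in the source
    PCA(n_components=0.95, svd_solver="full"),
)
reduced = reducer.fit_transform(features)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_labels = kmeans.fit_predict(reduced)

pca = reducer.named_steps["pca"]
retained = pca.explained_variance_ratio_.sum()
print(f"{features.shape[1]} -> {reduced.shape[1]} dimensions, {retained:.1%} variance retained")
```

On random data the component count will differ from the 27 reported here, because it depends on the correlation structure of the real features.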

Text Features (100)

  • TF-IDF Vectorization
  • Unigrams + Bigrams
  • min_df=2, max_df=0.8
  • Participant responses only
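The settings listed above map directly onto scikit-learn's `TfidfVectorizer`. The sketch below uses a toy corpus invented for illustration. Using `max_features=100` to reach the 100-feature count is an assumption, because the page does not say how the vocabulary was capped.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy participant responses (invented for illustration, not DAIC-WOZ text).
responses = [
    "i feel tired most days and i sleep badly",
    "i sleep badly and feel tired at work",
    "work is going well and i feel good",
    "i feel good about my family and work",
    "my family is doing well",
]

vectorizer = TfidfVectorizer(
    ngram_range=(1, 2),  # unigrams + bigrams
    min_df=2,            # drop terms that appear in fewer than 2 documents
    max_df=0.8,          # drop terms that appear in more than 80% of documents
    max_features=100,    # assumption: cap the vocabulary at 100 terms
)
tfidf = vectorizer.fit_transform(responses)

vocabulary = sorted(vectorizer.vocabulary_)
print(tfidf.shape)
print(vocabulary)
```

Terms that occur in only one response (for example "most") are removed by `min_df=2`, while bigrams shared across responses (for example "feel tired") are kept.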

Acoustic Features (296)

  • 74 COVAREP features
  • F0, NAQ, QOQ, H1H2, PSP
  • 4 statistics per feature: mean, std, min, max (74 × 4 = 296)
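The 296-dimensional acoustic vector follows from summarizing each of the 74 COVAREP features with four statistics. A minimal NumPy sketch is shown below. The helper name `summarize_frames`, the frame count, and the statistic-major ordering of the output are assumptions for illustration.

```python
import numpy as np


def summarize_frames(frames: np.ndarray) -> np.ndarray:
    """Collapse a (n_frames, n_features) matrix into [mean, std, min, max] per feature."""
    stats = [
        frames.mean(axis=0),
        frames.std(axis=0),
        frames.min(axis=0),
        frames.max(axis=0),
    ]
    # Output layout: all means, then all stds, then all mins, then all maxes.
    return np.concatenate(stats)


# Hypothetical session: 1000 frames x 74 COVAREP features (random placeholder values).
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 74))

session_vector = summarize_frames(frames)
print(session_vector.shape)  # (296,)
```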

Technologies Used

Python
Scikit-learn
Pandas
NumPy
Matplotlib
Seaborn
SciPy
Jupyter

DAIC-WOZ Dataset

Distress Analysis Interview Corpus - Wizard of Oz

Gold standard clinical depression database from USC Institute for Creative Technologies (AVEC 2017)

  • Total interviews: 189
  • Training / test: 107 / 82
  • Sessions analyzed: 33
  • Depressed / healthy: 14 / 19
  • Mean PHQ-8 score: 8.4 ± 5.3

Available Modalities

  • Text Transcripts: full interview transcripts with timestamps
  • Audio Features: COVAREP acoustic features (74 dimensions)
  • Video/Facial: Facial Action Units (not used in this study)
  • Clinical Labels: PHQ-8 depression severity scores (0-24)
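The binary depressed/healthy labels used for validation come from thresholding the PHQ-8 score at 10, the cutoff given in this page's t-SNE caption. A small sketch of that mapping follows. The function name and the example scores are hypothetical.

```python
def phq8_label(score: int, cutoff: int = 10) -> str:
    """Map a PHQ-8 total score (0-24) to a binary label using the given cutoff."""
    if not 0 <= score <= 24:
        raise ValueError(f"PHQ-8 score must be between 0 and 24, got {score}")
    return "depressed" if score >= cutoff else "healthy"


scores = [3, 9, 10, 17]  # hypothetical scores, not taken from the dataset
labels = [phq8_label(s) for s in scores]
print(labels)  # ['healthy', 'healthy', 'depressed', 'depressed']
```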

Visualizations & Analysis

PCA Variance Analysis

[Figure: PCA variance plot] 27 components retain 95.4% of variance (396 → 27 dimensions)

t-SNE by Clinical Labels

[Figure: t-SNE colored by true labels] Green: Healthy (PHQ-8 < 10), Red: Depressed (PHQ-8 ≥ 10)

t-SNE Discovered Clusters

[Figure: t-SNE colored by cluster] Unsupervised K-Means clustering (k=2)
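A two-dimensional t-SNE embedding like the ones in these plots can be produced with scikit-learn. The sketch below uses random placeholder data. The perplexity, initialization, and seed are assumptions, since the page does not report the settings used. With only 33 samples the perplexity must stay below 33.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for the 33 x 27 PCA-reduced matrix (NOT the study data).
rng = np.random.default_rng(0)
reduced = rng.normal(size=(33, 27))

# t-SNE is used here for visualization only; clustering is done on the PCA space.
tsne = TSNE(n_components=2, perplexity=10, init="pca", random_state=0)
embedding = tsne.fit_transform(reduced)
print(embedding.shape)  # (33, 2)
```

The same embedding can then be colored once by PHQ-8 label and once by cluster assignment to compare the two views.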

Cluster vs Depression Status

[Figure: cluster heatmap] χ² = 6.44, p = 0.0112 (significant association)
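The heatmap is a 2 × 2 contingency table of cluster assignment against depression status, and the chi-square test is run on that table. The sketch below builds such a table from hypothetical labels for ten participants and tests it with SciPy. The labels are invented, and `correction=False` (no Yates correction) is an assumption, since the page does not state which variant was used.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical labels for 10 participants (NOT the study's assignments).
clusters = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
depressed = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])

# 2 x 2 contingency table: rows = cluster, columns = depression status.
table = np.zeros((2, 2), dtype=int)
np.add.at(table, (clusters, depressed), 1)

chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(table)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
```

For tables with small expected counts, as in this toy example, Fisher's exact test (`scipy.stats.fisher_exact`) is a common alternative.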

Clustering Metrics

[Figure: clustering metrics] Elbow method and silhouette score optimization (k=2 optimal)
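Choosing k by the elbow method and silhouette score amounts to fitting K-Means over a range of k and comparing inertia and silhouette values. A minimal sketch on synthetic data is given below. The candidate range 2 to 6, the seeds, and the data are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the 33 x 27 reduced matrix (NOT the study data).
rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal(0.0, 1.0, size=(19, 27)),
    rng.normal(1.5, 1.0, size=(14, 27)),
])

inertias, silhouettes = {}, {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=1).fit(X)
    inertias[k] = model.inertia_                         # elbow curve: within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, model.labels_)  # higher is better

best_k = max(silhouettes, key=silhouettes.get)
for k in sorted(inertias):
    print(f"k={k}: inertia={inertias[k]:.1f}, silhouette={silhouettes[k]:.3f}")
print("k with highest silhouette:", best_k)
```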

Research Questions & Answers

Q: Can unsupervised algorithms detect meaningful latent subtypes of MDD patients?
A: Yes. K-Means clustering identified 2 distinct subtypes with a statistically significant association to clinical labels (p=0.0112).

Q: What multimodal biomarkers define these subtypes?
A: The subtypes are defined over combined text features (TF-IDF linguistic patterns) and acoustic features (COVAREP voice analysis), with 396 dimensions reduced to 27 via PCA.

Q: Do discovered clusters correlate with PHQ-8 depression severity?
A: Yes. A chi-square test confirms a significant association (χ²=6.44, p=0.0112 < 0.05).

Q: Can dimensionality reduction reveal interpretable patterns?
A: Yes. PCA reduced 396 features to 27 components while retaining 95.4% of the variance, enabling effective clustering.

Acknowledgments

USC Institute for Creative Technologies - DAIC-WOZ Depression Database (AVEC 2017)

COVAREP Team - Acoustic feature extraction toolkit

MIT License - For research and educational purposes

Ethical Note: This project is for research purposes only. It is not intended to replace professional medical diagnosis or treatment. If you or someone you know is experiencing depression, please seek help from qualified mental health professionals.

Crisis Resources:

  • National Suicide Prevention Lifeline: 988
  • Crisis Text Line: Text HOME to 741741