WGTDA is a framework for topological biomarker discovery, network analysis, and classification using persistent homology applied to gene co-expression structures. It enables researchers to uncover higher-order gene–gene interaction motifs that are not detectable through classical statistics or machine learning alone.
This repository provides:
- Biomarker discovery pipeline → Generates persistent interactions and co-occurrence networks.
- Prediction pipeline → Uses topological embeddings (e.g., persistence landscapes) for classification.
- Interactive Streamlit dashboard → Visualizes WGTDA networks, hub genes, and scale-free topology.
WGTDA/
│
├── wgtda_discovery.py      # Biomarker discovery pipeline (interactions.csv)
├── wgtda_prediction.py     # TDA landscape-based classification
├── wgtda_app.py            # Streamlit dashboard
│
├── src/
│   ├── correlation/        # DTEM, wTO, Pearson, DC computation
│   ├── tda/                # Rips complex, persistence, biomarker pipeline
│   ├── web_app/            # Visualization + network stats
│   └── filters/            # Lifespan filtering for interactions
│
├── interactions/           # Stored interactions CSVs
└── README.md
This pipeline generates:
- Persistence diagrams
- Topological interaction tables (interactions.csv)
- Genesets for hubs and cycles
- WGTDA co-occurrence network
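To make the notions of a persistence diagram and a lifespan concrete, here is a minimal sketch. It is not the repository's implementation: it only computes 0-dimensional persistence (connected components) of a Vietoris-Rips filtration over a small distance matrix, using union-find. The function name `h0_persistence` and the toy matrix are illustrative assumptions; the actual pipeline also tracks higher-dimensional features such as cycles.

```python
from itertools import combinations


def h0_persistence(dist):
    """Return (birth, death) pairs for connected components of a Rips filtration.

    `dist` is a symmetric distance matrix (list of lists). Every point is born
    at 0; a component dies at the edge length that merges it into another one.
    The last surviving component never dies (death = inf).
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Process edges from shortest to longest, as the filtration grows.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    pairs = []
    for length, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            pairs.append((0.0, length))
    pairs.append((0.0, float("inf")))
    return pairs


# Toy distance matrix (hypothetical values): two tight pairs of points.
toy = [
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.2],
    [0.8, 0.9, 0.2, 0.0],
]
diagram = h0_persistence(toy)
lifespans = [death - birth for birth, death in diagram]
```

The lifespan of a feature is simply `death - birth`; long-lived features are the ones usually treated as signal rather than noise.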
python wgtda_discovery.py -p data/treatment_response/cptac_radiotherapy/fpkm_matrix.csv -pp data/treatment_response/cptac_radiotherapy/sig_genes.csv -padj 3 -l 2
streamlit run wgtda_app.py
An example input file for the Streamlit web application is provided in the interactions/ folder.
Run manually:
streamlit run wgtda_app.py
Upload interactions.csv and explore:
Features:
- ✓ Interactive gene–gene network
- ✓ WGTDA hub gene identification
- ✓ Betweenness, degree, and lifespan metrics
- ✓ Scale-free topology fitting
- ✓ Downloadable CSVs for all tables
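As a rough illustration of the degree and scale-free metrics listed above, the sketch below computes node degrees from an edge list and fits a straight line to the log-log degree distribution. This is a common heuristic, not necessarily the method used by the dashboard. The edge list, the gene labels `G1`..`G5`, and the function names are hypothetical; the column layout of interactions.csv is not assumed here.

```python
import math
from collections import Counter


def degree_table(edges):
    """Count how many distinct neighbours each gene has."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {gene: len(ns) for gene, ns in neighbours.items()}


def loglog_fit(degrees):
    """Least-squares fit of log10 p(k) against log10 k.

    Returns (slope, r_squared). A clearly negative slope with R^2 near 1 is
    the usual heuristic signal of scale-free topology. Requires at least two
    distinct degree values.
    """
    counts = Counter(degrees.values())
    total = sum(counts.values())
    xs = [math.log10(k) for k in counts]
    ys = [math.log10(c / total) for c in counts.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    r_squared = (sxy ** 2) / (sxx * syy) if syy else 1.0
    return slope, r_squared


# Hypothetical gene-gene interactions: G1 acts as a hub.
edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"), ("G2", "G3")]
degrees = degree_table(edges)
hubs = sorted(degrees, key=degrees.get, reverse=True)[:1]
slope, r_squared = loglog_fit(degrees)
```

On real networks the fit is only meaningful with many nodes; the five-gene example is there to show the mechanics.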
- Classification Pipeline (TDA Landscapes)
This pipeline uses:
- Coexpression matrix
- Rips complex
- Persistence diagrams
- Persistence landscapes
python wgtda_prediction.py -p data/treatment_response/cptac_radiotherapy/fpkm_matrix.csv -pp data/treatment_response/cptac_radiotherapy/sig_genes.csv
Outputs:
- landscapes.npy
- Accuracy & F1 score
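Persistence landscapes turn a persistence diagram into fixed-length vectors that a classifier can consume. The sketch below evaluates the standard definition on a grid: for each point t, every (birth, death) pair contributes a tent value max(0, min(t - birth, death - t)), and the k-th landscape is the k-th largest tent. The function name `landscape`, the toy diagram, and the grid are illustrative assumptions, and this is not the code that produces landscapes.npy.

```python
def landscape(diagram, k, grid):
    """Evaluate the k-th persistence landscape (k = 1 is the outermost) on `grid`.

    For each t, every (birth, death) pair contributes a "tent" value
    max(0, min(t - birth, death - t)); the landscape is the k-th largest tent.
    """
    values = []
    for t in grid:
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in diagram), reverse=True
        )
        values.append(tents[k - 1] if k <= len(tents) else 0.0)
    return values


# Toy diagram (hypothetical): one long-lived and one nested short-lived feature.
diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]

first = landscape(diagram, 1, grid)
second = landscape(diagram, 2, grid)

# Concatenating the sampled landscapes gives one feature vector per sample.
features = first + second
```

Because every sample is evaluated on the same grid, the resulting vectors have equal length regardless of how many points each diagram contains.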
This module provides a clean API and command-line interface for computing gene-to-gene (G2G) matrices used in Weighted Gene Topological Data Analysis (WGTDA). These matrices describe pairwise relationships between genes using various similarity or correlation measures:
- Pearson correlation
- Distance correlation
- Weighted Topological Overlap (WTO)
- DTEM (Distance to Empirical Measure)
The G2G matrix is the first stage of the WGTDA pipeline before building simplicial complexes, computing persistent homology, and generating topological features.
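For intuition about what a G2G matrix contains, here is a minimal pure-Python sketch of the Pearson variant, followed by a conversion to a dissimilarity suitable as input to a Rips complex. The helper names and the `1 - |r|` conversion are assumptions made for illustration; the repository's own methods live in src/correlation/ and may differ.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (non-zero variance assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def g2g_pearson(samples):
    """Build a genes x genes correlation matrix; rows of `samples` are samples."""
    genes = list(zip(*samples))  # transpose: one tuple of values per gene
    return [[pearson(gi, gj) for gj in genes] for gi in genes]


def to_distance(corr):
    """Assumed conversion: strongly (anti-)correlated genes end up close together."""
    return [[1.0 - abs(r) for r in row] for row in corr]


# Toy expression matrix (hypothetical): 3 samples x 3 genes.
samples = [
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 1.0],
    [3.0, 6.0, 2.0],
]
corr = g2g_pearson(samples)
dist = to_distance(corr)
```

In this toy matrix the second gene is an exact multiple of the first, so their correlation is 1 and their distance is 0.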
- Modular correlation methods in src/correlation/
All correlation methods are available directly from Python.
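Import the factory:
import numpy as np
from src.correlation.gene_to_gene_factory import compute_gene_to_gene_matrix

# Random expression matrix: rows are samples, columns are genes
df = np.random.randn(50, 100)  # 50 samples, 100 genes

# Compute the gene-to-gene matrix with the Pearson method
matrix = compute_gene_to_gene_matrix(df, method="pearson")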
If you use WGTDA for research, please consider citing the reference paper:
@article{nyase2024wgtda,
title={WGTDA: A Topological Perspective to Biomarker Discovery in Gene Expression Data},
author={Nyase, Ndivhuwo and Mashatola, Lebohang and Kohlakala, Aviwe and Rhrissorrakrai, Kahn and Muller, Stephanie},
journal={arXiv preprint arXiv:2402.08807},
year={2024}
}
Contributions to the WGTDA codebase are welcome! Please see contributions.md for details.
To install WGTDA:
- Make sure the Python version you use is in line with our setup file. Using a fresh environment is always a good idea:
  conda create -n wgtda python=3.9 -y
  conda activate wgtda
- Install the main branch to keep up to date with the latest supported features:
  pip install git+https://github.com/IBM/WGTDA
Have a look at the tutorial for more detailed usage of WGTDA.
The output file contains the proposed biomarkers identified through the WGTDA analysis. Each row in this file represents a topological interaction between genes in
In the context of topological features, higher Betti numbers indicate a more complex topological structure with more independent cycles or voids. This increased topological complexity may be associated with more intricate and robust biological processes. By focusing on these top persistent interactions, researchers can prioritize genes for further experimental validation and study, ultimately contributing to the understanding and manipulation of lifespan-associated pathways.
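As a small worked example of Betti numbers, consider a network treated as a 1-dimensional complex (vertices and edges only, with no filled triangles). There, the zeroth Betti number counts connected components and the first counts independent cycles, via b1 = E - V + b0. The sketch below assumes distinct edges without self-loops; the function name and the vertex labels are hypothetical, and higher-dimensional complexes such as a Rips complex need a full homology computation instead.

```python
def betti_numbers(vertices, edges):
    """Return (b0, b1) for a graph given as vertex and edge lists.

    b0 = number of connected components (via union-find),
    b1 = number of independent cycles = E - V + b0.
    """
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for a, b in edges:
        parent[find(a)] = find(b)

    b0 = len({find(v) for v in vertices})
    b1 = len(edges) - len(vertices) + b0
    return b0, b1


# A 4-cycle A-B-C-D plus a separate edge E-F (hypothetical labels).
vertices = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A"), ("E", "F")]
b0, b1 = betti_numbers(vertices, edges)

# Adding a chord across the cycle creates a second independent cycle.
b0_chord, b1_chord = betti_numbers(vertices, edges + [("A", "C")])
```

Here the 4-cycle contributes one independent cycle and the graph has two components; the extra chord raises the cycle count to two without changing the number of components.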



