WGTDA (Weighted Gene Topological Data Analysis)

WGTDA is a framework for topological biomarker discovery, network analysis, and classification using persistent homology applied to gene co-expression structures. It enables researchers to uncover higher-order gene–gene interaction motifs that are not detectable through classical statistics or machine learning alone.

This repository provides:

Biomarker discovery pipeline → Generates persistent interactions and co-occurrence networks.
Prediction pipeline → Uses topological embeddings (e.g., persistence landscapes) for classification.
Interactive Streamlit dashboard → Visualizes WGTDA networks, hub genes, and scale-free topology.

Repo Structure

WGTDA/
│
├── wgtda_discovery.py      # Biomarker discovery pipeline (interactions.csv)
├── wgtda_prediction.py     # TDA landscape-based classification
├── wgtda_app.py            # Streamlit dashboard
│
├── src/
│   ├── correlation/        # DTEM, wTO, Pearson, DC computation
│   ├── tda/                # Rips complex, persistence, biomarker pipeline
│   ├── web_app/            # Visualization + network stats
│   └── filters/            # Lifespan filtering for interactions
│
├── interactions/           # Stored interactions CSVs
└── README.md

1. Biomarker Discovery Pipeline

This pipeline generates:

Persistence diagrams
Topological interaction tables (interactions.csv)
Genesets for hubs and cycles
WGTDA co-occurrence network

python wgtda_discovery.py -p data/treatment_response/cptac_radiotherapy/fpkm_matrix.csv -pp data/treatment_response/cptac_radiotherapy/sig_genes.csv -padj 3 -l 2
streamlit run wgtda_app.py

An example input file for the Streamlit web application is available in the interactions/ folder.
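To give a feel for what a persistence diagram records, the sketch below computes 0-dimensional persistence (the scales at which connected components merge) from a small gene-to-gene distance matrix. It uses only the standard library and is not the repository's implementation; the function name `h0_persistence` and the toy distance matrix are assumptions made for illustration.

```python
from itertools import combinations


def h0_persistence(dist):
    """Return (birth, death) pairs for connected components of a Rips filtration.

    `dist` is a symmetric matrix (list of lists) of pairwise distances.
    Every point is born at scale 0; a component dies when it merges into
    another one. The last surviving component never dies (death = inf).
    """
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent pointers to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Process edges from shortest to longest, as in Kruskal's algorithm.
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    pairs = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # two components merge at scale d
            pairs.append((0.0, d))   # one of them dies here
    pairs.append((0.0, float("inf")))  # the component that survives
    return pairs


# Toy example: genes 0 and 1 are close, gene 2 is far from both.
toy_dist = [
    [0.0, 0.1, 0.9],
    [0.1, 0.0, 0.8],
    [0.9, 0.8, 0.0],
]
diagram = h0_persistence(toy_dist)
print(diagram)
```

Pairs with a late death correspond to groups of genes that stay separated over a wide range of scales. Higher-dimensional features (loops and voids) need a full persistent homology computation, which this sketch does not attempt.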

2. Interactive WGTDA Dashboard

Run manually:

streamlit run wgtda_app.py

Upload interactions.csv and explore:

Features:

✓ Interactive gene–gene network
✓ WGTDA hub gene identification
✓ Betweenness, degree, and lifespan metrics
✓ Scale-free topology fitting
✓ Downloadable CSVs for all tables
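The hub and scale-free features listed above can be illustrated with a small standard-library sketch. It counts node degrees in an undirected edge list, picks the highest-degree gene as a hub, and estimates a power-law exponent by a least-squares fit on the log-log degree distribution. The function names, the gene labels, and the choice of a log-log fit are assumptions for illustration only; the dashboard may compute these quantities differently.

```python
import math
from collections import Counter


def degree_table(edges):
    """Count how many distinct partners each gene has in an undirected edge list."""
    neighbours = {}
    for a, b in edges:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {gene: len(partners) for gene, partners in neighbours.items()}


def power_law_exponent(degrees):
    """Least-squares slope of log P(k) against log k, returned as gamma.

    For a scale-free network P(k) ~ k^(-gamma), so the fitted slope is -gamma.
    This is a quick diagnostic, not a rigorous statistical test, and it needs
    at least two distinct degree values.
    """
    counts = Counter(degrees.values())
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx


# Toy star-like network with hypothetical gene labels: G1 touches every other gene.
toy_edges = [("G1", "G2"), ("G1", "G3"), ("G1", "G4"), ("G1", "G5"), ("G2", "G3")]
degrees = degree_table(toy_edges)
hubs = sorted(degrees, key=degrees.get, reverse=True)[:1]
gamma = power_law_exponent(degrees)
print(hubs, degrees, gamma)
```

A network this small cannot say anything meaningful about scale-free behaviour; the point is only to show which quantities enter the fit.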

Classification Pipeline (TDA Landscapes)

This pipeline uses:

Co-expression matrix
Rips complex
Persistence diagrams
Persistence landscapes

python wgtda_prediction.py -p data/treatment_response/cptac_radiotherapy/fpkm_matrix.csv -pp data/treatment_response/cptac_radiotherapy/sig_genes.csv

Outputs:

landscapes.npy
Accuracy & F1 score
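For readers new to persistence landscapes, the following sketch evaluates them directly from a persistence diagram. Each (birth, death) pair contributes a tent-shaped function, and the k-th landscape at a point is the k-th largest tent value there. The function name `landscape`, the toy diagram, and the grid are assumptions for illustration; this is not the code that produces landscapes.npy.

```python
def landscape(diagram, k, grid):
    """Evaluate the k-th persistence landscape (k = 1 is the outermost) on `grid`.

    Each (birth, death) pair contributes a tent function
    max(0, min(t - birth, death - t)); the k-th landscape at t is the
    k-th largest tent value at t, or 0 if there are fewer than k pairs.
    """
    values = []
    for t in grid:
        tents = sorted(
            (max(0.0, min(t - b, d - t)) for b, d in diagram),
            reverse=True,
        )
        values.append(tents[k - 1] if k <= len(tents) else 0.0)
    return values


# Toy diagram with two nested features.
toy_diagram = [(0.0, 4.0), (1.0, 3.0)]
grid = [0.0, 1.0, 2.0, 3.0, 4.0]
first = landscape(toy_diagram, 1, grid)
second = landscape(toy_diagram, 2, grid)
print(first, second)
```

Concatenating a few landscape levels sampled on a fixed grid gives a fixed-length vector that a standard classifier can consume. Whether the prediction pipeline builds its features in exactly this way is an assumption.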

Gene-to-Gene Matrix Computation (WGTDA)

This module provides an API and a command-line interface for computing gene-to-gene matrices (G2G matrices) used in Weighted Gene Topological Data Analysis (WGTDA). These matrices describe pairwise relationships between genes using one of several similarity or correlation measures:

Pearson correlation
Distance correlation
Weighted Topological Overlap (wTO)
DTEM (Distance to Empirical Measure)
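Of the measures listed, distance correlation is probably the least familiar. The sketch below computes the sample distance correlation between two expression profiles with NumPy and contrasts it with Pearson correlation on a purely nonlinear relationship. The function name `distance_correlation` is an assumption, and the code is an independent illustration of the standard definition, not the implementation in src/correlation/.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays of equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    def centred_distances(v):
        # Pairwise absolute differences, then double-centre rows and columns.
        d = np.abs(v[:, None] - v[None, :])
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()

    a = centred_distances(x)
    b = centred_distances(y)
    dcov2 = (a * b).mean()
    dvar_x = (a * a).mean()
    dvar_y = (b * b).mean()
    if dvar_x == 0.0 or dvar_y == 0.0:
        return 0.0  # a constant variable carries no dependence information
    return float(np.sqrt(max(dcov2, 0.0) / np.sqrt(dvar_x * dvar_y)))


t = np.linspace(-1.0, 1.0, 201)
linear = distance_correlation(t, 3.0 * t + 1.0)   # perfectly linear relationship
parabola = distance_correlation(t, t ** 2)        # dependent, but not linearly
pearson_parabola = np.corrcoef(t, t ** 2)[0, 1]   # Pearson misses this dependence
print(linear, parabola, pearson_parabola)
```

The parabola case shows why a G2G matrix built from distance correlation can pick up gene pairs that a Pearson-based matrix would score near zero.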

The G2G matrix is the first stage of the WGTDA pipeline before building simplicial complexes, computing persistent homology, and generating topological features.
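A Rips complex is built from pairwise distances, so a correlation-based G2G matrix is typically converted into a dissimilarity before the topological stage. The sketch below uses `1 - |r|`, which is one common convention; which transformation WGTDA applies is not stated here, so treat this as an assumption. The data are random and only serve to show the shapes involved.

```python
import numpy as np

rng = np.random.default_rng(0)
expr = rng.normal(size=(50, 6))          # 50 samples, 6 genes (toy data)
expr[:, 1] = expr[:, 0] * 2.0 + 0.01 * rng.normal(size=50)  # gene 1 tracks gene 0

corr = np.corrcoef(expr, rowvar=False)   # genes are columns, so rowvar=False
dist = 1.0 - np.abs(corr)                # strongly correlated genes end up close
np.fill_diagonal(dist, 0.0)              # remove floating-point noise on the diagonal

print(dist.shape, dist[0, 1])
```

The result is a symmetric genes-by-genes matrix with a zero diagonal, which is the form a distance-matrix-based Rips construction expects.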

The correlation methods are implemented as separate modules in src/correlation/.

Using as a Python Package

All correlation methods are available directly from Python.

Import the factory:

import numpy as np
from src.correlation.gene_to_gene_factory import compute_gene_to_gene_matrix

df = np.random.randn(50, 100)  # 50 samples, 100 genes

matrix = compute_gene_to_gene_matrix(df, method="pearson")

Citation

If you use WGTDA for research, please consider citing the reference paper:

@article{nyase2024wgtda,
  title={WGTDA: A Topological Perspective to Biomarker Discovery in Gene Expression Data},
  author={Nyase, Ndivhuwo and Mashatola, Lebohang and Kohlakala, Aviwe and Rhrissorrakrai, Kahn and Muller, Stephanie},
  journal={arXiv preprint arXiv:2402.08807},
  year={2024}
}

Contribution

Contributions to the WGTDA codebase are welcome! Please see CONTRIBUTIONS.md.

Getting Started

To install WGTDA:

Make sure that the Python version you use is in line with our setup file. Using a fresh environment is always a good idea:
```
conda create -n wgtda python=3.9 -y
conda activate wgtda
```
Install the main branch to keep up to date with the latest supported features:
```
pip install git+https://github.com/IBM/WGTDA
```

Have a look at the tutorial for more detailed usage of WGTDA.

Outputs (Topological Gene Interactions)

The output file contains the proposed biomarkers identified through the WGTDA analysis. Each row in this file represents a topological interaction between genes in $\mathrm{Betti}_0$, $\mathrm{Betti}_1$, or $\mathrm{Betti}_2$ space. Betti numbers distinguish topological spaces by the connectivity of their $n$-dimensional simplicial complexes. For example, $\mathrm{Betti}_0$ is the number of connected components or clusters, $\mathrm{Betti}_1$ is the number of non-contractible loops or cycles, and $\mathrm{Betti}_2$ is the number of voids (enclosed cavities) in the data space.
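The first two Betti numbers are easy to compute by hand for a graph, which is a simplicial complex with vertices and edges only. The sketch below counts connected components for Betti 0 and uses the identity Betti 1 = edges - vertices + components for the number of independent cycles. The function name and the toy graphs are assumptions for illustration, and the code assumes distinct edges with no self-loops; complexes with filled triangles or tetrahedra need a proper homology computation.

```python
def betti_numbers_of_graph(num_vertices, edges):
    """Betti_0 and Betti_1 of a graph viewed as a 1-dimensional simplicial complex."""
    parent = list(range(num_vertices))

    def find(i):
        # Union-find root lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a, b in edges:
        parent[find(a)] = find(b)  # merge the components containing a and b

    betti_0 = len({find(v) for v in range(num_vertices)})
    betti_1 = len(edges) - num_vertices + betti_0
    return betti_0, betti_1


# Four genes joined in a closed loop: one component, one cycle.
square = betti_numbers_of_graph(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
# Two disjoint gene pairs: two components, no cycle.
two_pieces = betti_numbers_of_graph(4, [(0, 1), (2, 3)])
print(square, two_pieces)
```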

Higher Betti numbers indicate a more complex topological structure, with more independent cycles or voids, which may be associated with more intricate and robust biological processes. By focusing on the most persistent interactions, that is, those that survive over the widest range of filtration scales, researchers can prioritize genes for further experimental validation and study, ultimately contributing to the understanding and manipulation of lifespan-associated pathways.
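Ranking interactions by persistence can be sketched as follows. The record layout (`genes`, `dimension`, `birth`, `death`) and the function name `top_persistent` are hypothetical and are not taken from interactions.csv; the sketch only shows the idea of keeping features whose lifespan (death minus birth) exceeds a threshold and ordering them from longest-lived to shortest.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interaction:
    genes: tuple      # genes taking part in the topological feature
    dimension: int    # 0, 1 or 2, matching Betti 0, Betti 1, Betti 2
    birth: float      # filtration scale at which the feature appears
    death: float      # filtration scale at which it disappears

    @property
    def lifespan(self):
        return self.death - self.birth


def top_persistent(interactions, min_lifespan, limit):
    """Keep features living at least `min_lifespan`, longest-lived first."""
    kept = [i for i in interactions if i.lifespan >= min_lifespan]
    return sorted(kept, key=lambda i: i.lifespan, reverse=True)[:limit]


# Toy records with hypothetical gene labels.
toy = [
    Interaction(("G1", "G2"), 0, 0.0, 0.15),
    Interaction(("G1", "G3", "G4"), 1, 0.30, 0.90),
    Interaction(("G2", "G5", "G6"), 1, 0.40, 0.45),
]
ranked = top_persistent(toy, min_lifespan=0.1, limit=2)
for item in ranked:
    print(item.genes, item.dimension, item.lifespan)
```

Short-lived features are usually treated as noise, which is why a lifespan threshold is applied before any ranking.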
