WatchTree-19/README.md

Hi, I'm Sandeep

This is my open research page. Feel free to reach out by email if you think I could help. Columbia University alumnus, UK/US-based.

I see myself as a methodological researcher with an interest in the ideological expansion of emerging areas.

What I'm building

  • Independent writing on AI evaluation methodology, observability, and the structural overlap between quant trading and LLM eval.
  • Asymmetric-information solutions in ML evaluation, surfacing what labs know internally about benchmark noise and drift.
  • Calibration tooling for benchmark drift, distinguishing genuine model improvement from eval movement.

Currently working on

  • A foundational essay on production observability for LLM agents.
  • A weekly paper digest series on alignment, evaluation methodology, and AI safety research.
  • "Benchmark crowding": mapping factor decay in quant finance to benchmark saturation in LLM evaluation.

Around the web

Pinned

  1. llm-judge-calibration (Public)

    Measure how much your LLM judges actually agree. Inter-judge agreement metrics for LLM-as-a-judge evaluations.

    Python

  2. Tracer-Cloud/opensre (Public)

    Build your own AI SRE agents. The open source toolkit for the AI era.

    Python, 7.7k stars, 1k forks

  3. UKGovernmentBEIS/inspect_ai (Public)

    Inspect: A framework for large language model evaluations

    Python, 2.3k stars, 583 forks

  4. EleutherAI/lm-evaluation-harness (Public)

    A framework for few-shot evaluation of language models.

    Python, 13.1k stars, 3.4k forks

  5. pola-rs/polars (Public)

    Extremely fast query engine for DataFrames, written in Rust

    Rust, 38.9k stars, 2.9k forks

  6. pixie-io/pixie (Public)

    Instant Kubernetes-Native Application Observability

    C++, 6.5k stars, 503 forks