I am working on a person recognition system for learning purposes.
My goals are:
- Maintain a small gallery of known people (multiple images per person)
- Given a new query image, return the most similar person with a confidence score
- Stay robust when clothing or accessories change
- Handle query images showing front, back, or partial body views
I am currently experimenting with person re-identification (re-ID) models, but matching accuracy drops significantly whenever clothing changes, which makes me question whether my objective is feasible.
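For context, my matching logic is roughly the following sketch. The embedding dimension (512), the random vectors, and the per-person best-match scoring are stand-ins for whatever a real re-ID model would produce; only the gallery-lookup structure reflects what I actually do:

```python
import numpy as np

# Stand-in for a re-ID model: fixed random vectors so the matching
# logic below is runnable on its own.
rng = np.random.default_rng(0)

def l2_normalize(v):
    """Normalize embeddings so a dot product equals cosine similarity."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Gallery: several embeddings per known person.
gallery = {
    "alice": l2_normalize(rng.normal(size=(3, 512))),
    "bob": l2_normalize(rng.normal(size=(4, 512))),
}

def match(query_emb, gallery):
    """Return (best_person, best_cosine_similarity) for a query embedding."""
    query_emb = l2_normalize(query_emb)
    scores = {}
    for name, embs in gallery.items():
        # Compare the query against every image of this person; keep the best.
        scores[name] = float(np.max(embs @ query_emb))
    best = max(scores, key=scores.get)
    return best, scores[best]

name, score = match(rng.normal(size=512), gallery)
print(name, score)
```

With real embeddings the cosine score is what I am currently thresholding as a "confidence", which is exactly where clothing changes hurt.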
From a technical perspective, I would like to understand:
- Are most image-based person re-ID models inherently appearance-driven (i.e., heavily dependent on clothing)?
- Without video (and hence no gait information), is clothing-invariant person recognition realistically achievable?
- Is combining multiple modalities (e.g., face + body embeddings) a sensible direction?
- Or is this problem fundamentally limited when relying only on still RGB images?
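To make the multimodal question concrete, this is the kind of late fusion I have in mind. It is only a sketch: `fuse_scores` and the 0.7 weight are my own hypothetical choices, not taken from any library, and both inputs are assumed to be cosine similarities for the same gallery person:

```python
def fuse_scores(body_sim, face_sim=None, w_face=0.7):
    """Late fusion of per-person similarity scores.

    face_sim is None when no face is visible (back or partial views),
    in which case we fall back to the body embedding alone.
    The w_face weight is an arbitrary starting point, not a tuned value.
    """
    if face_sim is None:
        return body_sim
    return w_face * face_sim + (1.0 - w_face) * body_sim

# Face visible: weighted blend of face and body similarity.
print(fuse_scores(body_sim=0.4, face_sim=0.9))
# Face not visible: body similarity passes through unchanged.
print(fuse_scores(body_sim=0.4))
```

My open question is whether this kind of score-level fusion is the right level to combine modalities at, or whether embeddings should be fused earlier.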
I am looking for library recommendations, an assessment of whether this objective is technically realistic, and guidance on what general approach is appropriate.
Any architectural insights would be appreciated.