I am working on a person recognition system for learning purposes.

My goal is:

  • Maintain a small gallery of known people (multiple images per person)

  • Given a new query image, return the most similar person with a confidence score

  • The system should work even if clothing or accessories change

  • The query image may show front, back, or partial body views

I am currently experimenting with person re-identification models, but matching accuracy drops sharply whenever a person's clothing changes between the gallery and query images.
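For context, my matching logic looks roughly like the sketch below: each person keeps several embeddings, a query is compared against all of them by cosine similarity, and the per-person maxima are turned into a softmax-style confidence. The embeddings here are random placeholders standing in for the re-ID model's output, and the temperature value is an arbitrary choice, not a tuned one.

```python
import numpy as np

# Placeholder gallery: person -> list of embeddings. In my real setup
# these come from a re-ID backbone; random vectors here just demonstrate
# the matching logic.
rng = np.random.default_rng(0)
gallery = {
    "alice": [rng.normal(size=128) for _ in range(3)],
    "bob":   [rng.normal(size=128) for _ in range(3)],
}

def l2_normalize(v):
    return v / np.linalg.norm(v)

def match(query_emb, gallery, temperature=0.1):
    """Return (best_person, confidence).

    Per person, take the max cosine similarity over that person's
    gallery embeddings, then softmax the per-person scores so the
    best match comes with a pseudo-probability confidence.
    """
    q = l2_normalize(query_emb)
    people = list(gallery)
    scores = np.array([
        max(float(q @ l2_normalize(e)) for e in gallery[p])
        for p in people
    ])
    exp = np.exp(scores / temperature)
    probs = exp / exp.sum()
    best = int(np.argmax(probs))
    return people[best], float(probs[best])

person, conf = match(rng.normal(size=128), gallery)
```

This works well when appearance is stable, but the cosine scores collapse once clothing changes, which is what prompted the questions below.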

This makes me question the feasibility of my objective.

From a technical perspective, I would like to understand:

  1. Are most image-based person re-identification models inherently appearance-driven (i.e., heavily dependent on clothing)?

  2. Without video (no gait information), is clothing-invariant person recognition realistically achievable?

  3. Is combining multiple modalities (e.g., face + body embeddings) the correct direction?

  4. Or is this problem fundamentally limited when relying only on still RGB images?
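To make question 3 concrete, the direction I am imagining is simple score-level fusion: weight face similarity heavily when a face is detected, and fall back to the body embedding alone for back or partial views. This is only a sketch of the idea; the weight is a hypothetical value I would tune on validation data, not a recommended setting.

```python
def fused_similarity(face_sim: float, body_sim: float,
                     face_detected: bool, w_face: float = 0.7) -> float:
    """Score-level fusion sketch (not an established recipe).

    face_sim / body_sim: cosine similarities from separate face and
    body embedding models. When no face is visible (back or partial
    view), only the body score is usable.
    """
    if not face_detected:
        return body_sim
    return w_face * face_sim + (1.0 - w_face) * body_sim
```

Is this kind of late fusion a sensible baseline, or is feature-level fusion (concatenating or jointly training the embeddings) the more standard approach?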

I am looking for library recommendations, along with guidance on whether this objective is technically realistic and, if so, what general approach is appropriate.

Any architectural insights would be appreciated.