Skip to content

Auto-generate JSON Schema for Pyrefly config using schemars (#2448 Task 2)#2473

Open
Prathamesh-tech-eng wants to merge 8 commits into
facebook:mainfrom
Prathamesh-tech-eng:autogenerate-schema-schemars
Open

Auto-generate JSON Schema for Pyrefly config using schemars (#2448 Task 2)#2473
Prathamesh-tech-eng wants to merge 8 commits into
facebook:mainfrom
Prathamesh-tech-eng:autogenerate-schema-schemars

Conversation

@Prathamesh-tech-eng

Copy link
Copy Markdown
Contributor

Summary

Implements Task 2 from #2448: auto-generate the Pyrefly JSON Schema from Rust config types using the schemars crate, replacing the need to manually maintain schemas/pyrefly.json.

Changes

  • Add schemars = "0.8" as an optional dependency behind a jsonschema feature flag to 4 crates: pyrefly_util, pyrefly_python, pyrefly_build, pyrefly_config
  • Add JsonSchema implementations for all config-related types:
    • Derive-based : Tool, Severity, ErrorKind, UntypedDefBehavior, RecursionOverflowHandler, BxlArgs, BuildSystemArgs, BuildSystem
    • Manual impls (for types with custom serde or external types without JsonSchema): Glob, Globs, PythonVersion, PythonPlatform, CustomQueryArgs, ErrorDisplayConfig, ModuleWildcard, ConfigOrigin, ExtraConfigs, ConfigBase, PythonEnvironment, Interpreters, SubConfig, ConfigFile
  • Add generate_schema binary in pyrefly_config that outputs Draft-07 JSON Schema to stdout
  • All schema code is gated with #[cfg(feature = "jsonschema")] , zero impact on normal builds

Usage

cargo run -p pyrefly_config --features jsonschema --bin generate_schema > schemas/pyrefly.json

@meta-cla meta-cla Bot added the cla signed label Feb 20, 2026
@Prathamesh-tech-eng

Copy link
Copy Markdown
Contributor Author

@connernilsen This implements Task 2 from #2448. I had few doubts:

  1. Should the generated schema replace the existing schemas/pyrefly.json directly, or should they coexist during a transition period?
  2. Would you like me to add a CI step that runs the generator and checks the output matches the checked-in schema (to catch drift)?
  3. Cargo.toml files: I noticed the # @generated by autocargo comment , will Meta's internal tooling need to be updated separately for these dependency additions, or is my change sufficient?
@connernilsen

connernilsen commented Mar 9, 2026

Copy link
Copy Markdown
Contributor

@Prathamesh-tech-eng sorry for the delay.

  1. Should the generated schema replace the existing schemas/pyrefly.json directly, or should they coexist during a transition period?

I think it would be alright to replace them directly. Is there any specific reason you're thinking of to have them coexist?

  1. Would you like me to add a CI step that runs the generator and checks the output matches the checked-in schema (to catch drift)?

Yes! This would be amazing!

  1. Cargo.toml files: I noticed the # @generated by autocargo comment , will Meta's internal tooling need to be updated separately for these dependency additions, or is my change sufficient?

The @generated should be fine to leave in there. Our internal tooling handles that comment nicely right now :)

Thanks for working on this, and sorry again for the delay

@connernilsen connernilsen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Prathamesh-tech-eng, sorry for the delay in review. I have a few questions on how this works, since I haven't used this before, but I like the look of things so far.

The main question below is about whether we can take advantage of macro settings on fields, variants, and the overall structs instead of manually implementing the schemars::JsonSchema trait, since I'm worried we might forget to update it in the future.

Let me know if you have any questions or want to discuss stuff :)

}

#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, Eq, Default, Hash)]
#[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we enable that as a feature everywhere in the pyrefly_build, pyrefly_config, and pyrefly_python crates?

Here's an example of how we did it for another feature in one of our crates

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! The jsonschema feature already propagates through all crates via Cargo.toml:

pyrefly_config → pyrefly_build/jsonschema, pyrefly_python/jsonschema, pyrefly_util/jsonschema
pyrefly_build → pyrefly_python/jsonschema, pyrefly_util/jsonschema
pyrefly_python → pyrefly_util/jsonschema

We use the same #[cfg(feature = "jsonschema")] pattern as in pyrefly_util/src/globs.rs (lines 331-353). Running cargo run -p pyrefly_config --features jsonschema enables the feature across all dependent crates.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think I see what you're saying, but I also think I misread what is going on here. Can we remove the cfg_attr(feature = "jsonschema" part? If it's set as an enabled feature in the cargo file, do we also need to gate this here?

Comment thread crates/pyrefly_config/src/bin/generate_schema.rs
Comment thread crates/pyrefly_config/src/bin/generate_schema.rs
Comment thread .gitignore Outdated
pub repo_root: Option<PathBuf>,
}

#[cfg(feature = "jsonschema")]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are all of the extra non-derived schemas adding?

For cases where we need custom behavior for fields, can we instead use #[schemars(schema_with = "some::function")] (override specific fields), #[schemars(transform = some::transform)] (mutate the schema after auto generating it), or other macros instead of implementing schemars::JsonSchema completely for some of the structs below?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactored CustomQueryArgs to use schemars derive macros as suggested. The 60+ line manual implementation is now replaced with:

#[derive(Debug, Clone, ...)]
#[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))]
pub struct CustomQueryArgs {
#[cfg_attr(feature = "jsonschema", schemars(schema_with = "vec1_string_schema"))]
pub command: Vec1,
// ...
}

A small helper function vec1_string_schema generates the array schema with minItems: 1. This is much more maintainable and the descriptions come from doc comments automatically.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing that schema generator! It looks much cleaner and will be a lot easier to maintain in the future.

Would you also be able to update the other custom schema generators to attempt to move as much logic as possible into those configuration macros?

@Prathamesh-tech-eng Prathamesh-tech-eng force-pushed the autogenerate-schema-schemars branch from b88cabf to c1c5406 Compare March 18, 2026 10:39
@Prathamesh-tech-eng

Copy link
Copy Markdown
Contributor Author

Thanks for the review! I've pushed the changes addressing all feedback.

For the next step (Task 3 - SchemaStore submission), I have a few questions:

  1. Should I wait for this PR to be merged before submitting to SchemaStore, or can I work on the SchemaStore PR in parallel?
  2. The schema currently has $schema: "http://json-schema.org/draft-07/schema#" but no $id. Should I add $id: "https://pyrefly.org/schemas/pyrefly.json" or similar before SchemaStore submission?
  3. Are there any specific SchemaStore requirements or guidelines I should follow? I see there's a schemas/SCHEMASTORE.md file - should I update that with submission instructions?
  4. For the SchemaStore catalog entry, what file patterns should pyrefly.json match? I'm thinking:
    pyrefly.toml
    pyrefly.json
    .pyrefly.toml
@Prathamesh-tech-eng Prathamesh-tech-eng force-pushed the autogenerate-schema-schemars branch from c1c5406 to 5b4d343 Compare March 18, 2026 10:54
@connernilsen

Copy link
Copy Markdown
Contributor

Awesome, thanks for the updates.

Should I wait for this PR to be merged before submitting to SchemaStore, or can I work on the SchemaStore PR in parallel?

Yeah, I think waiting until we nail down the generation logic is the best approach. It'll help us reduce the likelihood of errors in any schema we push.

The schema currently has $schema: "http://json-schema.org/draft-07/schema#" but no $id. Should I add $id: "https://pyrefly.org/schemas/pyrefly.json" or similar before SchemaStore submission?

That might be a good idea. Probably something like https://raw.githubusercontent.com/facebook/pyrefly/refs/heads/main/schemas/pyrefly.json? What is that $id field used for?

Are there any specific SchemaStore requirements or guidelines I should follow? I see there's a schemas/SCHEMASTORE.md file - should I update that with submission instructions?

I think a lot of the recommendations here would be good to follow. Maybe some validation checks here. Other than that, I think once this is committed we should be good to start working on a submission to schemastore.

For the SchemaStore catalog entry, what file patterns should pyrefly.json match? I'm thinking: ...

We should only match pyrefly.toml, since that's the only file name we parse (other than pyproject.toml).

@connernilsen connernilsen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the direction this is going in. Thanks for working on it, and as always, let me know if you have questions

Comment thread crates/pyrefly_config/src/bin/generate_schema.rs
}

#[derive(Debug, Clone, Deserialize, Serialize, PartialEq, Eq, Default, Hash)]
#[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I think I see what you're saying, but I also think I misread what is going on here. Can we remove the cfg_attr(feature = "jsonschema" part? If it's set as an enabled feature in the cargo file, do we also need to gate this here?

Comment thread crates/pyrefly_build/src/query/custom.rs Outdated
pub repo_root: Option<PathBuf>,
}

#[cfg(feature = "jsonschema")]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for fixing that schema generator! It looks much cleaner and will be a lot easier to maintain in the future.

Would you also be able to update the other custom schema generators to attempt to move as much logic as possible into those configuration macros?

pub(crate) struct ExtraConfigs(pub(crate) Table);

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for ExtraConfigs {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get schemars to generate something that says this field shouldn't exist? It might just be that we have to mark it as something schemars should skip

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered using #[schemars(skip)] but kept the current manual implementation because it more accurately represents the actual behavior.

ExtraConfigs is a #[serde(flatten)] catch-all that captures any unknown fields via toml::Table. The current manual impl generates an object schema with additionalProperties: true, which correctly tells schema validators that extra properties are allowed.

Using #[schemars(skip)] would hide this field from the schema entirely, suggesting only explicitly defined properties are valid. This would cause false validation errors for users with custom/future config fields.

We could use #[schemars(skip)] and rely on the parent schema's additionalProperties: true setting. Let me know if you'd prefer that approach.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I think if that correctly shows the current behavior, that would be great. The less manual implementation we have here the better.

@@ -1,266 +1,117 @@
#!/usr/bin/env python3

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the changes to this file are getting rid of tests that validate schemars should fail when given invalid values. Can we add those back?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done! I've restored the full test suite with all 23 negative test cases from upstream/main.

Note: 3 tests are commented out with an explanation - they validate mutual exclusivity between build-system types (buck shouldn't have command, custom shouldn't have extras/isolation-dir). The auto-generated schema doesn't include additionalProperties: false in the oneOf variants, so these specific constraints can't be enforced. The comments document this known limitation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh good point. In that case, we can delete the extra tests instead of just commenting them out.

- Extract configBase into reusable  subschema
- Flatten configBase properties into top-level schema
- Reuse configBase in sub-config via
- Fix build-system: move repo_root to all types, add not conditions
- Simplify pyproject-tool-pyrefly.json to reference pyrefly.json
- All tests passing
…ow additionalProperties, make python-platform non-exhaustive
- errors: accept both boolean and severity strings (ignore/info/warn/error)
- Add missing configBase fields: tensor-shapes, recursion-depth-limit, recursion-overflow-handler
- Add 'ty' to enabled-ignores enum
- Fix project-includes default to include **/*.ipynb
- Add baseline field (top-level)
- Remove non-existent repo_root from build-system
- Add ignore-if-build-system-missing and search-path-prefix to build-system
- Fix pyproject-tool-pyrefly.json  to use relative path
- Update test files to exercise new fields and severity strings
…#2448 Task 2)

Add schemars 0.8 as an optional dependency behind a `jsonschema` feature
flag to pyrefly_util, pyrefly_python, pyrefly_build, and pyrefly_config.

Add JsonSchema implementations for all config-related types:
- Derive-based: Tool, Severity, ErrorKind, UntypedDefBehavior,
  RecursionOverflowHandler, BxlArgs, BuildSystemArgs, BuildSystem
- Manual impls (custom serde / external types): Glob, Globs,
  PythonVersion, PythonPlatform, CustomQueryArgs, ErrorDisplayConfig,
  ModuleWildcard, ConfigOrigin<T>, ExtraConfigs, ConfigBase,
  PythonEnvironment, Interpreters, SubConfig, ConfigFile

Add generate_schema binary that outputs Draft-07 JSON Schema to stdout:
  cargo run -p pyrefly_config --features jsonschema --bin generate_schema

All schema code is gated with #[cfg(feature = \"jsonschema\")] for zero
impact on normal builds."
- Simplify .gitignore: Remove complex schemas/* pattern with exceptions.
  This makes it easier to add new schemas in the future without needing
  to modify .gitignore.

- Add documentation for schema settings: Document option_nullable and
  option_add_null_type in generate_schema.rs explaining their purpose
  and why we set them to false.

- Refactor CustomQueryArgs: Replace 60+ line manual JsonSchema impl with
  schemars derive macro using #[schemars(schema_with = "...")] for the
  Vec1<String> field. This is more maintainable and less error-prone.

- Regenerate schema: Update pyrefly.json to reflect upstream changes
  from rebase.
@Prathamesh-tech-eng Prathamesh-tech-eng force-pushed the autogenerate-schema-schemars branch from 5b4d343 to 3366229 Compare March 25, 2026 16:41
Prathamesh-tech-eng added a commit to Prathamesh-tech-eng/pyrefly that referenced this pull request Mar 25, 2026
… fix schema

- Move vec1_string_schema to pyrefly_util::schema_helpers for reusability
- Restore full validate_schemas.py test suite with 23 negative test cases
  (3 tests skipped due to schemars not supporting additionalProperties:false)
- Add missing strict-callable-subtyping field to ConfigBase schema
- Use referencing library for proper $ref resolution in tests

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
… fix schema

Move vec1_string_schema to pyrefly_util::schema_helpers for reusability
Restore full validate_schemas.py test suite with 23 negative test cases
(3 tests skipped due to schemars not supporting additionalProperties:false)
Add missing strict-callable-subtyping field to ConfigBase schema
Use referencing library for proper $ref resolution in tests
@Prathamesh-tech-eng Prathamesh-tech-eng force-pushed the autogenerate-schema-schemars branch from 3366229 to d6115d3 Compare March 25, 2026 16:45
@Prathamesh-tech-eng

Copy link
Copy Markdown
Contributor Author

Thanks for the review! I've addressed all the feedback:

I kept the manual implementation to accurately represent additionalProperties: true. Would you prefer me to use #[schemars(skip)] instead? Both approaches work, but have different schema output.

Also fixed a bug I discovered during the review: the strict-callable-subtyping field was missing from the ConfigBase manual JsonSchema implementation. This field is now included in the generated schema.

@github-actions

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had recent activity for more than 2 weeks.

If you are still working on this this pull request, please add a comment or push new commits to keep it active. Otherwise, please unassign yourself and allow someone else to take over.

Thank you for your contributions!

@connernilsen connernilsen left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, thanks for continuing to work on this, and sorry for the delay. We're busy getting ready for our V1 release, so things are hectic right now.

Things are progressing well! For your question about ExtraConfigs,

I kept the manual implementation to accurately represent additionalProperties: true. Would you prefer me to use #[schemars(skip)] instead? Both approaches work, but have different schema output.

I think if additionalProperties: true and #[schemars(skip)] get Pyrefly to allow additional properties for now but hide the extra-configs value, that would be perfect. If not, feel free to leave the custom implementation for ExtraConfigs.

Also, generally, we want to keep the manual schema generation to the absolute minimum we can. I expect config stuff to change over time, so I'd like to avoid a case where we're doing something wrong because someone forgot to update the schema generator when changing the struct. For anything (including possibly ExtraConfigs) that you leave custom schema generation for, can you put a comment detailing why this is so that future editors/reviewers understand why things are the way they are now? And for anything that doesn't have a great reason, can you derive #[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))]? If it's just a few fields that are making it difficult, prefer to use schemars(schema_with = <function>) like we did for Vec1 types instead of defining schema generation for the whole type.

}

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for PythonEnvironment {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we can't generate most of this? It seems like as long as generators for the subschemas exist, we can just derive #[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))] for `PythonEnvironment.

}

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for Interpreters {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here, can we derive #[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))] on Interpreters?

use schemars::schema::*;

let mut properties = schemars::Map::new();
let required: std::collections::BTreeSet<String> = std::collections::BTreeSet::new();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is being used. Could you double check on what this was supposed to do?

}

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for ConfigBase {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to derive this with #[cfg_attr(feature = "jsonschema", derive(schemars::JsonSchema))]?

}

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for SubConfig {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we try flattening here with the #[serde(flatten)] value macro? If so, I'm hoping we can get rid of the manual generator here.

fn json_schema(generator: &mut schemars::r#gen::SchemaGenerator) -> schemars::schema::Schema {
use schemars::schema::*;

// Each error code maps to either a bool or a severity string

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the default derivation for severity enum not automatically allow for one_of a severity string or bool?

}

#[cfg(feature = "jsonschema")]
impl schemars::JsonSchema for ModuleWildcard {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we derive this too?


/// ConfigOrigin is transparent in schema — it serializes as the inner type T.
#[cfg(feature = "jsonschema")]
impl<T: schemars::JsonSchema> schemars::JsonSchema for ConfigOrigin<T> {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one is somewhat interesting, makes sense why we need a manual implementation here :)

..Default::default()
})),
string: Some(Box::new(schemars::schema::StringValidation {
pattern: Some(r"^\d+(\.\d+)?(\.\d+)?$".to_owned()),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we derive and use [schemars(regex...)] for this? https://docs.rs/schemars/latest/schemars/derive.JsonSchema.html#regex

.to_owned(),
),
default: Some(serde_json::json!("linux")),
examples: vec![

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to derive all of this, specify that we prefer linux, darwin, or win32, but take anything here?

@github-actions github-actions Bot removed the stale label Apr 16, 2026
@github-actions

github-actions Bot commented May 3, 2026

Copy link
Copy Markdown

This pull request has been automatically marked as stale because it has not had recent activity for more than 2 weeks.

If you are still working on this this pull request, please add a comment or push new commits to keep it active. Otherwise, please unassign yourself and allow someone else to take over.

Thank you for your contributions!

@github-actions github-actions Bot added the stale label May 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment