Generate YAML with comments indicating which line of code generated each item#105
Generate YAML with comments indicating which line of code generated each item#105lihaoyi-databricks merged 18 commits intomasterfrom
Conversation
iuliand-db
left a comment
There was a problem hiding this comment.
I left a couple of comments, but overall ship it!
sjsonnet/src/sjsonnet/Util.scala
Outdated
|
|
||
|
|
||
| object Util{ | ||
| def binarySearch(lineStartsMin: Int, lineStartsMax: Int, lineStarts: Array[Int], index: Int): Int = { |
There was a problem hiding this comment.
I think Arrays.binarySearch would work as well.
There was a problem hiding this comment.
The issue with Arrays.binarySearch is that it only can check for exact value == input, whereas this binary search needs to check for value > input. In theory it should be about the same logic, but I didn't find any convenient verison built in :/
There was a problem hiding this comment.
I think it returns the -insertion_point for a value that's not in the array, if I understand this correctly:
Returns:
index of the search key, if it is contained in the array within the specified range; otherwise, (-(insertion point) - 1). The insertion point is defined as the point at which the key would be inserted into the array: the index of the first element in the range greater than the key, or toIndex if all elements in the range are less than the specified key. Note that this guarantees that the return value will be >= 0 if and only if the key is found.
There was a problem hiding this comment.
Ah cool find, I'll make use of that then!
There was a problem hiding this comment.
Turns out that the comment explaining how we use the result of java.util.Arrays.binarySearch is longer than the binary search implementation in the first place! OTOH it lets me delete 115 lines of test suite, so definitely still a win overall
| def saveCurrentPos() = { | ||
| val current = getCurrentPosition() | ||
| if (current != null){ | ||
| bufferedComment = " # " + current.currentFile.renderOffsetStr(current.offset, loadedFileContents) |
There was a problem hiding this comment.
This is in a hot code path (almost every line of output will have a comment), so maybe it would be better to avoid string concatenation. The logic where the buffer is appended to the StringWriter could do out.append(" # ") before appending the actual offset. Not sure if this would result in an observable speedup, but operations on Strings used to be a significant slowdown in the Scala compiler. Since this seems like an easy change, I'd be curious to know the outcome for very large files.
There was a problem hiding this comment.
I'll try this and see if there's any measurable difference
|
All tests pass, merging this |
Generates YAML that looks like this:
The idea being to make it easier to figure out how the materialized values are being constructed: by letting the user get back to the last line of jsonnet that generated a value, they can then use the https://github.com/databricks/intellij-jsonnet IDE plugin to further explore the jsonnet code to see how the values are passed into that line.
How it works
We capture a
Positionvalue along with every runtime valuesjsonnet.Valin our program. This is a thing wrapper around a file path and a character offset; we don't do any heavy computation until the end when we need to generate path+linenumber comments, and so it's reasonably lightweight and cheap as long as comment-generation isn't enabled (see perf numbers below). Every time a new value is computed, we store the currentPositionof the expression that computed it, and the positions of any upstream values that fed into the computation are discarded/ignoredAlthough we could in theory capture the entire computation graph (DAG) leading up to each value, the size of this forest of graphs is likely to be large and too verbose for general usage. The IDE effectively lets the user interactively explore a subset of the graph, which is probably good enough for now. Capturing and making sense of the entire computation forest can be explored in future
Only works with direct YAML output, since JSON doesn't allow comments, though in theory we could generate a look-aside-table/source-map-file with the same information and somehow get the IDE to hook them up.
As the
upickle.core.Visitorinterface doesn't support anything other thanInts as offsets to each callback method, we instead plumb thecurrentPosin a mutable variable which theMaterializersets using a callback and thePrettyYamlRendererreads using a callback. Not pretty, but it works. While in theory we could encode the entirePositionvalue using theoffset: Intthatupickle.core.Visitorreceives, I suspect it would be even more inelegant than this shared mutable variable thing.I had to plumb
offset: Intorpos: Positioninformation into a bunch of places where we previously didn't need it (e.g. all ofStd.scala). Where previously we only needed position information where we have a chance of throwing errors, now we need position information in all places where we construct newsjsonnet.Valobjects.Output comment generation is controlled under a flag
--yaml-debugComment Positioning
Due to the somewhat heterogenous output style of
PrettyYamlRenderer, which is copied from the PyYAML package, there is some subtlety in how the positioning of comments works. At a high-level:For most terminal values (bools, strings, nulls, numbers, empty arrays, empty lists), these are only ever printed at the end of a line, and so the source line comment is simply appended at the end of the line after the value, EXCEPT
For Strings rendered block-style (
|-,|+,|2, etc.) the only place we can put a comment is after the block delimiter, before the first line of the string, so we put the source line comment there.Lists and dictionaries in general are not guaranteed to appear on a line on their own (e.g. you can have nested lists in one line
- - - fooor nested dictionaries- foo: bar) and so in general we do not include a source line comment for where collections values were constructed, even though we have the data available, EXCEPTLists and dictionaries that are themselves values in another dictionary do have a spot of space where we can place a comment: after the preceding key of the enclosing dictionary. This is because nested lists/dictionaries on one line such as
- - foo:can only ever have a dictionary in the right-most position, and if the dictionary key is a collection (whether list or dict) that is always placed on the following lines. Thus in this case we put the source line comment of the following dictionary value at the end of the line of the preceding dictionary keyPerformance
Performance on my laptop, # of iterations per 20s of generation some arbitrary set of config files:
In addition to the code in this PR, I tried disabling
Position-management "globally" via an environment variable (madePosition.applyreturnnull, wrapped a lot of position-related code blocks inifguards) to see if it made a significant difference. It turns out that even with positions enabled, the performance degradation ofPR Generating JSON, positions enabledcompared to theMaster generating JSONbenchmark is small enough (2-3%) that it's probably not worth the complexity of juggling environment variablesThe
PR Generating PrettyYAML + Comments, positions enabledIs about 2x slower than without the source line comment generation; this is about as much a slowdown as I expected. To reach this, I had to cache calls tofastparse.internal.Util.lineNumberLookupthat performs the building of line-number tables for the Jsonnet files we load, andI re-implemented
fastparse.IndexedParserInput.prettyIndexinsjsonnet.Util.prettyIndexusing a binary search for performance rather than a linear scan; without these changes, it's about 300x slower on the arbitrary set of configs I was profiling aboveMaterializing the largest configuration file i could find on my laptop (not part of the benchmarks above) with a hot Sjsonnet JVM took ~3s with
--yaml-debugvs ~2.7s with--yaml-out, and on a cold JVM it was ~19s vs ~17s. Either way it's probably reasonably fast enough for now, especially since it's intended mostly for manual usage when debugging things and not really intended to be part of automated workflowsMerge-Patch
In order to preserve
Positions, I rewrote themergePatchstd lib function to avoid round-tripping thesjsonnet.Valdata structures throughujson.Values, and instead performing the merge-patch algorithm lazily directly on the lazysjsonnet.Valdata structure. This allows us to do a best-effort preservation ofPositions:std.mergePatchcallThere are probably other
std.*methods that we could rewrite in order to better preservePositioninformation while they do their work, but empirically it seemsmergePatchis the most commonly used one in our company's config codebaseTesting
Added some rudimentary unit tests in
PrettyYamlRendererTests, covering both this new line-comment generation as well as the previously-untestedPrettyYamlRenderer. Not very thorough, but at least it's something...Added a small fuzz-test to compare the output of the two
prettyIndeximplementations on a range of inputs, to try and catch any deviations between them. All such tests pass both on Scala-JVM and Scala.JSAdded few test cases in
StdMergePatchTeststo cover cases where I had trouble with semantic differences during implementation. These are in addition to the existing upstream test cases vendored insjsonnet/test/resources/test_suite/merge.jsonnet.Cleaned up the test suite a bit to DRY up copy-paste code, extracting a few common helpers in
TestUtils.scalaAll existing unit tests pass. The correctness of the changes when
--yaml-debugis not passed is also validated by running this on our work configuration codebase, and verifying that no output YAMLs were changed as the result of these refactorsWe don't actually have any fuzz tests to validate that adding the source-line comments doesn't change the parsed contents of the YAML output, but since it's mostly for manual debugging that's probably fine for now