Difference between revisions of "Vol-3194/paper12"
Latest revision as of 17:54, 30 March 2023
Paper | |
---|---|
description | |
id | Vol-3194/paper12 |
wikidataid | Q117344902 |
title | The Usage of Negation in Real-World JSON Schema Documents |
pdfUrl | https://ceur-ws.org/Vol-3194/paper12.pdf |
dblpUrl | https://dblp.org/rec/conf/sebd/BaaziziCGSS22 |
volume | Vol-3194 |
session | |
The Usage of Negation in Real-World JSON Schema Documents

Mohamed-Amine Baazizi (1), Dario Colazzo (2), Giorgio Ghelli (3), Carlo Sartiani (4) and Stefanie Scherzinger (5)

(1) Sorbonne Université, LIP6 UMR 7606, France
(2) Université Paris-Dauphine, PSL Research University, France
(3) Dipartimento di Informatica, Università di Pisa, Italy
(4) DIMIE, Università della Basilicata, Italy
(5) Universität Passau, Passau, Germany

Abstract

Many software tools, but also formal frameworks for working with JSON Schema, do not fully support negation. This motivates us to study whether negation is actually used in practice, for which aims, and whether it could, in principle, be replaced by simpler operators. We have collected a large corpus of 80k open source JSON Schema documents. We perform a systematic analysis, quantify usage patterns of negation, and also qualitatively analyze schemas. We show that negation is indeed used, albeit infrequently, following a stable set of patterns.

Keywords

Empirical Study, Conceptual Modeling, JSON Schema

1. Introduction

JSON has become one of the most popular formats for data exchange. While many schema languages for JSON have been proposed [1], JSON Schema [2] is receiving considerable attention. In this language, a schema is a logical combination of assertions, describing classes of constraints on objects, arrays, and base values. JSON Schema is constantly evolving and new drafts always introduce new features. The language is increasingly used for defining domain-specific data exchange formats [3] and as a meta-language for defining other languages; a subset of JSON Schema serves as the schema language inside MongoDB [4]. As a consequence, an active and quite broad development community is releasing JSON Schema tools (validators [5], in particular).

JSON Schema is powerful but complex, and its semantics is based on an intricate interplay among logical assertions. A distinctive feature is the not operator, whereby negation can be applied to any assertion.
Negation is quite rare in type and schema languages, as it poses severe challenges.

SEBD 2022: The 30th Italian Symposium on Advanced Database Systems, June 19-22, 2022, Tirrenia (PI), Italy
baazizi@ia.lip6.fr (M. Baazizi); dario.colazzo@dauphine.fr (D. Colazzo); ghelli@di.unipi.it (G. Ghelli); carlo.sartiani@unibas.it (C. Sartiani); stefanie.scherzinger@uni-passau.de (S. Scherzinger)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

(a)
    1  { "not":
    2      { "required": ["DisplaceModules"] }
    3  }

(b)
    1  { "description": "...",
    2    "@errorMessages":
    3      { "not": "Invalid target: ..." },
    4    "not": { "pattern": "..." } ... }

(c)
    1  { "title": "Object w/ required foo.",
    2    "type": "object",
    3    "properties": {
    4      "foo": { "type": "integer" },
    5      "bar": { "type": "string" } },
    6    "patternProperties": {
    7      "f.*o": { "type": "integer" } },
    8    "required": ["foo"]
    9  }

Figure 1: Snippets of JSON Schema documents.

Example 1. One usage of not that startles novices (as discussed on StackOverflow [6]) is in combination with the keyword required, as shown in Figure 1(a). While “not required” may sound like “optional”, it enforces that the object must violate the assertion, so member "DisplaceModules" must be absent.

Indeed, the not-operator is often not fully supported, whether in academic prototype tools [7], commercial tools (e.g., [4]), or even formal frameworks [8], mostly because of the inherent complexity of handling negation. This inspired us to investigate the usage of this operator in real-world schemas, in a principled analysis of 80k JSON Schema documents crawled from GitHub. We formulate these research questions: (1) how frequent is negation in practice, (2) how is negation used, and (3) what are common usage patterns?
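To make the reading of Example 1 concrete, the following minimal Python sketch hand-codes the semantics of required and of its negation. The function names are hypothetical and this is not a JSON Schema validator. It relies on the standard rule that required constrains objects only, so any non-object instance satisfies it.

```python
def satisfies_required(instance, names):
    """`required` only constrains objects; any non-object instance passes."""
    if not isinstance(instance, dict):
        return True
    return all(name in instance for name in names)


def satisfies_not_required(instance, names):
    """`not` succeeds exactly when the negated assertion fails."""
    return not satisfies_required(instance, names)


names = ["DisplaceModules"]
print(satisfies_not_required({"other": 1}, names))            # True: member is absent
print(satisfies_not_required({"DisplaceModules": 1}, names))  # False: member is present
print(satisfies_not_required("a string", names))              # False: non-objects satisfy `required`
```

Under this reading, not.required accepts only objects, and when several names are listed it demands that at least one of them be absent, rather than all of them.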
Contributions. The contribution of this systematic empirical study is threefold. We first established a method for collecting and preparing JSON Schema documents. Next, we measured the frequency of use of JSON Schema operators and of paths that include not, and quantified the main patterns of use. Finally, we identified well-supported jargons, i.e., common uses of not that have the potential to mature into JSON Schema design patterns. An extended version of this study can be found in [9].

2. Preliminaries

JSON data model. The grammar below captures the syntax of JSON values, which are basic values, objects, or arrays. Basic values 𝐵 include the null value, booleans, numbers 𝑛, and strings 𝑠. Objects 𝑂 represent sets of members, each member being a name-value pair, and arrays 𝐴 represent sequences of values.

    𝐽 ::= 𝐵 | 𝑂 | 𝐴                                                      JSON expressions
    𝐵 ::= null | true | false | 𝑛 | 𝑠      𝑛 ∈ Num, 𝑠 ∈ Str             Basic values
    𝑂 ::= {𝑙₁ : 𝐽₁, …, 𝑙ₙ : 𝐽ₙ}            𝑛 ≥ 0, 𝑖 ≠ 𝑗 ⇒ 𝑙ᵢ ≠ 𝑙ⱼ      Objects
    𝐴 ::= [𝐽₁, …, 𝐽ₙ]                      𝑛 ≥ 0                        Arrays

JSON Schema. JSON Schema is a language for defining constraints and requirements on the content of JSON documents. We discuss here the main keywords, and continue with two illustrative examples:

Assertions include required, enum, const, pattern and type, and indicate a test that is performed on the corresponding instance.

Applicators include the boolean operators anyOf, allOf, oneOf, not, the object operators properties, patternProperties, additionalProperties, the array operator items, and the reference operator $ref. Applicators indicate a request to apply a different operator to the same instance or to a component of the current instance.

Annotations include title, description, and $comment; they do not affect validation, but they indicate an annotation that should be associated with the instance. Since we are mostly interested in validation, and since, moreover, annotations are removed by the not operator, we will ignore them.

Example 2. In the schema in Figure 1(c), inspired by [5], line 1 carries an annotation. In defining an object (line 2), applicators define constraints on properties (line 3), and the type of the properties matching a pattern (see line 6). Using an assertion, it is possible to indicate required properties (line 8).

Example 3. JSON Schema is an open standard: in Figure 1(b), @errorMessages is a user-defined keyword whose value is an object that describes the error, and not a JSON Schema assertion. Hence, not in line 3 is just a member name, whereas negation does occur in line 4. The same string token has different semantics, depending on its context, which complicates parsing.

2.1. Pattern Queries

To study which keywords occur below an instance of the not operator, we introduce a simple path language. A path such as .**.not.required matches any path that ends with an object field named required found inside an object field whose name is not. Paths are expressed using the following language. Path matching is defined as in JSONPath [10].

    𝑝 ::= 𝑠𝑡𝑒𝑝 | 𝑠𝑡𝑒𝑝 𝑝
    𝑠𝑡𝑒𝑝 ::= .𝑘𝑒𝑦 | .* | [*] | .**

The step .* retrieves all member values of an object, [*] retrieves all items of an array, and .** is the reflexive and transitive closure of the union of .* and [*], navigating to all nodes of the JSON tree to which it is applied.

Complex sub-schemas. We say that not has a complex sub-schema when its object argument contains more than one keyword. In this case, we say these keywords co-occur in the negated schema; otherwise, a sub-schema is simple. As an example, consider the schema of Figure 3(b): the argument of not is complex, and we match the paths .not.enum and .not.type.

3. Methodology

Context. We explored GitHub for open source JSON Schema documents. We identified 91.6k URLs in July 2020, of which 85.6k could be retrieved (using wget). Discarding files with invalid syntax yields 82k files.
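The path language of Section 2.1 can be sketched as a small matcher over parsed JSON. This is an illustrative Python sketch under stated assumptions, not the implementation used in the study: the function names and the sample schema are hypothetical, and results are returned as a list of matched nodes, with multiplicity.

```python
import re

# One token per step of the path grammar: .**  |  .*  |  [*]  |  .key
STEP = re.compile(r"\.\*\*|\.\*|\[\*\]|\.[^.\[\]*]+")


def parse(path):
    """Split a path such as '.**.not.required' into its steps."""
    steps = STEP.findall(path)
    if "".join(steps) != path:
        raise ValueError(f"cannot parse path: {path!r}")
    return steps


def descendants_or_self(node):
    """Reflexive and transitive closure of .* and [*]."""
    yield node
    children = node.values() if isinstance(node, dict) else node if isinstance(node, list) else []
    for child in children:
        yield from descendants_or_self(child)


def apply_step(step, node):
    """Yield the nodes reached from `node` by a single step."""
    if step == ".**":
        yield from descendants_or_self(node)
    elif step == ".*":
        if isinstance(node, dict):
            yield from node.values()
    elif step == "[*]":
        if isinstance(node, list):
            yield from node
    elif isinstance(node, dict) and step[1:] in node:
        yield node[step[1:]]


def match(path, root):
    """Return all nodes reached from `root` by following `path`."""
    nodes = [root]
    for step in parse(path):
        nodes = [out for node in nodes for out in apply_step(step, node)]
    return nodes


schema = {
    "properties": {"a": {"not": {"required": ["x"]}}},
    "not": {"enum": ["v"], "type": "string"},
}
print(match(".**.not.required", schema))  # [['x']]
print(len(match(".**.not.*", schema)))    # 3
```

The second query returns three matches for two occurrences of not, because the top-level not has a complex sub-schema with two keywords. The matcher is purely syntactic: a member that merely happens to be named not would be matched as well.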
For each retrieved file, we analyzed the $schema declarations to identify the version of JSON Schema. Draft 2019-09 is still quite new, and not really represented. Draft-04 is declared in the vast majority of the files (79%), while Draft-07, Draft-06, and the old Draft-03 are each below 5%. An analysis of the file contents showed that the actual version that a schema follows is often different from the version declared.

Data Preparation. As a first step, we renamed every reference ($ref) to a new keyword $eref, with the target of the reference as its child, but we did not expand references recursively. We expanded references to external documents, provided that we were able to locate the referenced document (e.g., either contained within our corpus, or by downloading the document). References were renamed to $fref when expansion failed. We observed that by expanding references we lose the conceptual information encoded in the reference path itself. Thus, $ref is often more than just a syntactic macro.

The schema corpus contains a large share of near-duplicate schemas, with small variations in syntax. We performed duplicate elimination by comparing compact schema signatures, defined as a function that maps each keyword to the number of its occurrences in the schema (encoded as a vector of keyword counts); we assumed that two schemas with the same signature are, with high probability, versions of the same schema, and we retained just one. After duplicate elimination our corpus shrank to 11,500 distinct schemas.

As illustrated in Example 3, correctly recognizing keywords can be a challenge. For this reason, we renamed all property names to avoid confusion when searching for patterns that involve the keyword not. As schema authors can define their own keywords, we have no way to know whether their value should be interpreted as an assertion.
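The signature-based duplicate elimination described above can be sketched as follows. This is a minimal Python illustration, not the authors' code: the function names are hypothetical, and it assumes that every object member name counts as a keyword, which the text does not specify.

```python
from collections import Counter


def keyword_counts(node, counts=None):
    """Count how often each member name occurs anywhere in the schema."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        for key, value in node.items():
            counts[key] += 1
            keyword_counts(value, counts)
    elif isinstance(node, list):
        for item in node:
            keyword_counts(item, counts)
    return counts


def signature(schema):
    """Hashable form of the keyword-count vector."""
    return frozenset(keyword_counts(schema).items())


def deduplicate(schemas):
    """Keep the first schema seen for each signature."""
    kept = {}
    for schema in schemas:
        kept.setdefault(signature(schema), schema)
    return list(kept.values())


a = {"type": "object", "description": "first version", "required": ["x"]}
b = {"type": "object", "description": "second version", "required": ["y"]}
c = {"type": "object", "not": {"required": ["x"]}}
print(len(deduplicate([a, b, c])))  # 2: a and b share a signature
```

Because the signature ignores values, schemas a and b collapse into one entry even though their descriptions and required names differ, which matches the stated goal of merging near-duplicates at the cost of occasional false merges.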
We experimented with two approaches: a “strict” approach in which we renamed everything that was inside a user-defined keyword, hence making it inaccessible by the analysis, and a “lax” approach in which we kept the content of any user-defined keyword, so that all instances of not in Figure 1(b) would be counted as keywords. With the strict approach, some interesting usage patterns are lost, and keyword usage is under-estimated. With the lax approach, we risk “false positives”, and hence over-estimation. We decided that the over-estimation of the lax approach was preferable.

Analysis Process. The bulk of our effort was invested in data preparation. After experimenting with different data analysis platforms, we resorted to a relational encoding of the JSON Schema documents in PostgreSQL. This setup met our performance expectations, and allowed us to write queries in plain SQL.

4. Results of the Study

4.1. RQ1: How frequent is negation in practice?

We study the frequency of JSON Schema keywords within our corpus, and the Boolean operators (among them, negation). The reported absolute values are mainly interesting as indicators of the relative occurrences of operators.

Figure 2 visualizes the results. From left to right, we sort keywords by their number of occurrences (note the log-scaled vertical axis). We also show the number of files in which keywords occur, as a further indicator of keyword relevance.

[Figure 2 (bar chart; vertical axis "Number of Matches", log scale from 10^0 to 10^6; horizontal axis: JSON Schema keywords; series #Occ and #Files)]
Figure 2: Number of total occurrences (#Occ), and number of files (#Files), where a JSON Schema keyword appears. Boolean operators are highlighted.

The operator not appears in approx. 3% of all schemas, and occupies the 30th position, out of 46 keywords analyzed. Thus, it is a comparatively rare operator. The most common Boolean operator is oneOf, more frequent than anyOf. allOf is even less common. The Boolean operator if-then-else is even less common than not, but it was only introduced in Draft-07.

We found the dissemination of oneOf surprising, since the exclusive-disjunctive semantics of oneOf is more complicated than the purely disjunctive anyOf: oneOf takes as argument a collection of subschemas 𝑆₁, …, 𝑆ₙ, and a value 𝐽 satisfies oneOf only if it matches exactly one subschema; anyOf is satisfied by any value 𝐽 that matches at least one of the subschemas. Our hypothesis is that the description of a class as a oneOf-combination of a set of “subclasses” is familiar from the exclusive-subclassing mechanism of object-oriented languages.

The operator not appears 787 times in 298 different files out of 11,500. While not very frequent, its usage nevertheless merits a systematic study.

4.2. RQ2: How is negation used in practice?

We evaluated pattern queries to identify keywords below not. Table 1 summarizes the results. Consider the left half. We match the path .**.not.* 840 times (#Occ) in 289 files (#Files). Below the top summary row, we list the individual keywords, breaking down shares of matches in percent (visualized by progress bars). The right half of the table provides statistics for sub-schemas that are negated and referenced, and therefore reachable via a path .**.not.$eref.*.
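The two counts reported per path, #Occ and #Files, can be illustrated on a toy corpus. The Python sketch below is not the SQL-based analysis used in the study: the file names and schemas are made up, the function names are hypothetical, and only keywords directly below not are considered.

```python
from collections import Counter


def negated_keywords(node):
    """Yield every keyword found directly below a `not` member."""
    if isinstance(node, dict):
        negated = node.get("not")
        if isinstance(negated, dict):
            yield from negated.keys()
        for value in node.values():
            yield from negated_keywords(value)
    elif isinstance(node, list):
        for item in node:
            yield from negated_keywords(item)


def count_not_paths(corpus):
    """Return (#Occ, #Files) counters for each path not.k in the corpus."""
    occ, files = Counter(), Counter()
    for schema in corpus.values():
        found = list(negated_keywords(schema))
        occ.update(found)          # every match counts
        files.update(set(found))   # each file counts at most once per keyword
    return occ, files


corpus = {
    "a.json": {"not": {"required": ["x"]},
               "properties": {"p": {"not": {"required": ["y"]}}}},
    "b.json": {"not": {"enum": ["v"], "type": "string"}},
    "c.json": {"type": "object"},
}
occ, files = count_not_paths(corpus)
print(occ["required"], files["required"])  # 2 1
print(sum(occ.values()))                   # 4
```

In this toy corpus there are three occurrences of not but four matches of not.*, since the negated sub-schema in b.json holds two keywords.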
In the following, we will omit the prefix “.**” from path queries, assuming the context is clear to our readers. We sorted the table on the total number of not.𝑘+not.$eref.𝑘 occurrences, and it is interesting to compare the weight of different keywords in both parts.

A not may not correspond to any not.* pattern, when followed by { }. We found 16 such occurrences, expressing the schema false, which is not satisfied by any instance. This use of not is a consequence of the fact that false has only been introduced with Draft-06.

Table 1 indicates a total of 840 occurrences of not.*, while Figure 2 reported 787 occurrences of not. The values differ since the negated sub-schema can be complex. Most instances of not have a simple sub-schema. Most negated complex schemas have two keywords, but some have three or four.

The situation is very different with $eref, i.e., references expanded in pre-processing. Here, 93 occurrences of not.$eref correspond to 338 occurrences of not.$eref.*. Thanks to the mediation of $eref, the schema designer implicitly applies negation to a complex argument, with an average of 3-4 members.

Table 1: Occurrences of not.𝑘 paths (overall #Occ, and counting #Files). The first row gives absolute counts; all other rows give percentages.

Keyword 𝑘 | not.𝑘 #Occ | not.𝑘 #Files | not.$eref.𝑘 #Occ | not.$eref.𝑘 #Files |
---|---|---|---|---|
* (total) | 840 | 289 | 338 | 28 |
required | 28.6 % | 29.1 % | 10.7 % | 53.6 % |
items | 15.0 % | 9.3 % | 0.0 % | 0.0 % |
type | 7.4 % | 17.7 % | 15.1 % | 71.4 % |
properties | 8.5 % | 16.3 % | 11.8 % | 64.3 % |
$eref | 11.1 % | 9.7 % | 0.0 % | 0.0 % |
enum | 7.3 % | 18.0 % | 3.6 % | 28.6 % |
allOf | 2.7 % | 8.0 % | 11.2 % | 17.9 % |
pattern | 5.6 % | 9.7 % | 0.0 % | 0.0 % |
anyOf | 5.4 % | 12.5 % | 0.6 % | 7.1 % |
description | 0.5 % | 1.4 % | 12.1 % | 25.0 % |
title | 0.2 % | 0.7 % | 11.5 % | 25.0 % |
$schema | 0.0 % | 0.0 % | 12.1 % | 32.1 % |
$fref | 3.2 % | 4.8 % | 0.0 % | 0.0 % |
oneOf | 0.7 % | 1.4 % | 5.3 % | 10.7 % |
additionalProperties | 1.3 % | 3.8 % | 2.7 % | 25.0 % |
patternProperties | 1.8 % | 5.2 % | 0.0 % | 0.0 % |
const | 0.7 % | 0.4 % | 0.0 % | 0.0 % |
definitions | 0.0 % | 0.0 % | 0.9 % | 10.7 % |
id | 0.0 % | 0.0 % | 0.6 % | 7.1 % |
dependencies | 0.0 % | 0.0 % | 0.6 % | 7.1 % |
not | 0.0 % | 0.0 % | 0.6 % | 7.1 % |
$ref | 0.0 % | 0.0 % | 0.6 % | 7.1 % |
$comment | 0.1 % | 0.4 % | 0.0 % | 0.0 % |

The most common argument of negation is required. The pattern not.items is second-most common, followed by not.type and not.properties. While not.required dominates the not.* case, the two most common cases of the not.$eref group are not.$eref.type, whose value is object in 80% of the cases, and not.$eref.properties, which indicates that not.$eref is mostly used to negate complex object definitions. This explains the much higher occurrence of descriptive keywords inside the referenced argument.

4.3. RQ3: What are common real-world usage patterns?

Field and value exclusion. Field exclusion via not.required is the most frequent path. Paths not.enum and not.const are used to exclude values. Snippets of example schemas are shown in Figures 3(a) and (b). Such schemas have an obvious interpretation: the instance may have any type and must be different from the string or strings listed.

(a)
    "not": {
      "enum": ["markdown",
               "code",
               "raw"] }

(b)
    "not": {
      "enum": ["generic-linux"],
      "type": "string" }

(c)
    "not": {
      "items": {
        "not": {
          "type": "string",
          "enum": [
            "Dataset", "Image",
            "Video", "Sound",
            "Text" ] } } }

(d)
    { "type": "object",
      "oneOf": [
        { "properties":
            { "when": {"enum": ["delayed"]}},
          "required": ["when", "start_in"] },
        { "properties":
            { "when": { "not": {"enum": ["delayed"]}
          }}} ] }

(e)
    { "type": "object",
      "if": {
        "required": ["when"],
        "properties":
          { "when": {"enum": ["delayed"]} }},
      "then": {
        "properties":
          { "when": {"enum": ["delayed"] }},
        "required": ["when", "start_in"] }}

Figure 3: JSON Schema snippets exemplifying real-world usage patterns.

In the majority of cases, the sub-schema is simple, as in Figure 3(a).
In the complex cases, enum is always paired with a "type" : "string" assertion, as in Figure 3(b). This assertion is redundant, since all values listed by enum are strings. This co-occurrence is not specific to negation, since also in positive schemas, enum is paired with a type assertion in the vast majority of cases.

Paraphrasing contains. The pattern not.items is among the most common not-paths. All such schemas have either the structure not.items.not (as in Figure 3(c)) or not.items.enum. The items assertion is verified by any instance that is not an array, or that is an empty array, or that is an array where every element satisfies the schema associated with items. Hence, it is only violated by instances that are arrays, and which contain at least one element that violates the schema. While items specifies a universally quantified property, not.items can be used to specify an existentially quantified property, as does the contains keyword.

The jargon not.items.enum specifies that the array must contain at least one value that is not listed in the argument of enum. The jargon not.items.not specifies that the instance is an array that contains at least one value that satisfies 𝑆, according to the following equivalence:

    { "not": { "items": { "not": 𝑆 } } }  ⇔  { "type": "array", "contains": 𝑆 }

These two cases cover, with minimal variations, all occurrences of not.items. To sum up, not.items can be used to express contains. This is an instance of a pattern that may be replaced by a single (and thus simpler) operator.

Paraphrasing Discriminated Unions. The schema snippet in Figure 3(d) allows interesting observations about the use of oneOf. JSON Schema specifications do not prescribe that the branches of oneOf are mutually exclusive, but they state that a value must match a single branch only. However, the two branches of oneOf happen to be mutually exclusive: if "when" is absent, then only the second branch holds.
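The equivalence between not.items.not and contains stated earlier in this section can be checked on sample instances. The Python sketch below is a deliberately tiny, hand-written validator and not a conforming JSON Schema implementation: it assumes only the keywords type, enum, items (single-schema form), contains and not, and the sample instances are made up. The sub-schema 𝑆 is taken from Figure 3(c).

```python
TYPE_CHECKS = {
    "array": lambda v: isinstance(v, list),
    "object": lambda v: isinstance(v, dict),
    "string": lambda v: isinstance(v, str),
}


def valid(instance, schema):
    """Validate against a tiny subset: type, enum, items, contains, not."""
    if "type" in schema and not TYPE_CHECKS[schema["type"]](instance):
        return False
    if "enum" in schema and instance not in schema["enum"]:
        return False
    if "items" in schema and isinstance(instance, list):
        # Universal: every element must satisfy the items schema.
        if not all(valid(item, schema["items"]) for item in instance):
            return False
    if "contains" in schema and isinstance(instance, list):
        # Existential: at least one element must satisfy the contains schema.
        if not any(valid(item, schema["contains"]) for item in instance):
            return False
    if "not" in schema and valid(instance, schema["not"]):
        return False
    return True


S = {"type": "string", "enum": ["Dataset", "Image", "Video", "Sound", "Text"]}
lhs = {"not": {"items": {"not": S}}}
rhs = {"type": "array", "contains": S}

samples = [[], ["Image"], ["x", "Text"], ["x"], "Image", {"a": "b"}, [["Image"]]]
for instance in samples:
    print(instance, valid(instance, lhs), valid(instance, rhs))
```

Both schemas accept exactly the arrays holding at least one listed string, here ["Image"] and ["x", "Text"], and reject the empty array as well as every non-array instance.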
If "when" is present, then it is associated with complementary types in the two branches, so here, oneOf is actually anyOf. Applying equivalent rewritings (from ¬𝑎∨𝑏 to 𝑎 ⇒ 𝑏, and pushing down negation), the schema can be rewritten as shown in Figure 3(e).

Now the specification is clearer: if "when" has the value "delayed", then "start_in" is required. This suggests that oneOf is used to express a form of discriminated unions.

References

[1] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, Schemas and types for JSON data: From theory to practice, in: Proc. SIGMOD 2019, 2019, pp. 2060–2063.
[2] json-schema org, JSON Schema, 2021. Available at https://json-schema.org.
[3] B. Maiwald, B. Riedle, S. Scherzinger, What Are Real JSON Schemas Like? — An Empirical Analysis of Structural Properties, in: Proc. EmpER 2019, 2019, pp. 95–105.
[4] MongoDB, Inc., MongoDB Manual: $jsonSchema (Version 4.4), 2021.
[5] JSON Schema Test Suite, 2021. Available at https://github.com/json-schema-org/JSON-Schema-Test-Suite, version of commit hash #09fd353.
[6] StackOverflow, JSON Schema – valid if object does *not* contain a particular property, 2015. Available at https://stackoverflow.com/questions/30515253/json-schema-valid-if-object-does-not-contain-a-particular-property.
[7] M. Fruth, M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, S. Scherzinger, Challenges in Checking JSON Schema Containment over Evolving Real-World Schemas, in: Proc. EmpER 2020, 2020, pp. 220–230.
[8] A. Habib, A. Shinnar, M. Hirzel, M. Pradel, Finding data compatibility bugs with JSON subschema checking, in: Proc. ISSTA 2021, 2021, pp. 620–632.
[9] M. A. Baazizi, D. Colazzo, G. Ghelli, C. Sartiani, S. Scherzinger, An empirical study on the “usage of not” in real-world JSON schema documents, in: Proceedings of ER 2021, October 18-21, 2021, 2021, pp. 102–112.
[10] J. Friesen, Java XML and JSON: Document Processing for Java SE, Apress, 2019, pp. 299–322.