In the 20 years since YAML first appeared, its flexible and approachable way of representing data has become ubiquitous. Codethink is driven by YAML: we manage our infrastructure using Ansible, we integrate software stacks using BuildStream, we define CI pipelines using GitLab CI or GitHub Actions, we run safety analysis using STPA tools, and the list goes on. YAML has even made it to Mars.
The "human friendly" design principle of YAML means it's convenient to read and write by hand, and is most likely the secret to its dominance over competing formats. This convenience comes at a cost for developers who need to work with YAML data, though. Read on to find out more.
Safety and Validation
YAML's data model can represent arbitrary data, so when an app parses a YAML document it might get back anything. It's up to the developer to check that the data is structured how the app expects and to control what happens when it isn't. Does it report an error to the user? Is the behaviour undefined? Does it crash?
The YAML data model has strong, implicit typing. Each time it reads a value, the YAML parser will guess the type tag in a process called "tag resolution". The following example shows the "Norway problem", an exciting edge case present in YAML 1.1 and earlier:
country-list: [DK, NO, SE]
That's a list of two-letter ISO country codes. Let's load this into Python using pyyaml:
>>> import yaml
>>> yaml.safe_load("country-list: [DK, NO, SE]")
{'country-list': ['DK', False, 'SE']}
If your program expects two-letter country codes, what happens if it gets boolean False instead?
This happens because YAML 1.1 specified that YES and NO become boolean types.
You can work around this by quoting 'NO' or explicitly setting the type tag with !!str NO, but it's easy to forget and there are many similar edge cases.
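Here is a short sketch of both workarounds, using the same pyyaml `safe_load` call as the example above. The variable names are only for illustration.

```python
import yaml  # PyYAML, the same library used in the example above

# Unquoted NO is resolved as a boolean under PyYAML's YAML 1.1 rules.
plain = yaml.safe_load("country-list: [DK, NO, SE]")

# Quoting the scalar, or tagging it with !!str, keeps it a string.
quoted = yaml.safe_load("country-list: [DK, 'NO', SE]")
tagged = yaml.safe_load("country-list: [DK, !!str NO, SE]")

print(plain)
print(quoted)
print(tagged)
```

Both workarounds rely on whoever writes the file remembering to apply them, which is why validating on the reading side still matters.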
One way or another, your app must validate the data it receives before processing it.
Schemas
You can write code to manually verify the data structure and report any issues to the user. The more times you implement this the more boring and error-prone it becomes, and you think: is there a library to make this more convenient?
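As a rough idea of what hand-written checking looks like, here is a minimal sketch for the country list from earlier. The function name `check_country_list` and the wording of the messages are hypothetical, and the input is assumed to be data that a YAML parser has already returned.

```python
def check_country_list(data):
    """Return a list of problems found in already-parsed YAML data."""
    if not isinstance(data, dict):
        return ["top level must be a mapping"]
    codes = data.get("country-list")
    if not isinstance(codes, list):
        return ["'country-list' must be a list"]

    problems = []
    for index, code in enumerate(codes):
        if not isinstance(code, str):
            # Catches the Norway problem, where NO was parsed as False.
            problems.append(f"item {index}: expected a string, got {code!r}")
        elif len(code) != 2 or not code.isalpha() or not code.isupper():
            problems.append(f"item {index}: {code!r} is not a two-letter code")
    return problems


print(check_country_list({"country-list": ["DK", False, "SE"]}))
```

Even for a single key this takes a dozen lines, and every new field needs similar treatment.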
The answer is yes, but the correct choice depends on the language you're working with.
Some languages have built-in types that easily map to YAML's data model. JavaScript is a good example: when you load a YAML document using js-yaml you receive a tree of JavaScript objects and values which hasn't been validated against any schema. Your code must handle situations where the input data is not formed how you expect, and you can use a general purpose validator that operates on JavaScript objects directly.
In many cases a validator supports a reasonable subset of the YAML data model, which is fine when you know what to expect. JSON-Schema is one example: the JSON data model is very close to YAML, and it's commonly used to validate data read from YAML files. We might validate our country-list above with the following JSON-Schema:
$id: https://example.com/schema.json
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  country-list:
    type: array
    items: { "type": "string" }
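To show how such a schema might be applied, here is a sketch in Python that assumes the pyyaml and jsonschema packages are installed. The schema is embedded as a string (without the `$id` line) to keep the example self-contained, and `error_message` is just an illustrative variable name.

```python
import yaml
from jsonschema import ValidationError, validate

# The schema itself is written in YAML and loaded like any other document.
schema = yaml.safe_load("""
$schema: https://json-schema.org/draft/2020-12/schema
type: object
properties:
  country-list:
    type: array
    items: { "type": "string" }
""")

# PyYAML turns the unquoted NO into False, which the schema should reject.
document = yaml.safe_load("country-list: [DK, NO, SE]")

try:
    validate(instance=document, schema=schema)
    error_message = None
except ValidationError as error:
    error_message = error.message

print(error_message)
```

The validator reports the boolean as a type mismatch, so the Norway problem is caught before the rest of the app sees the data.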
Other languages, such as C, have no built-in way to represent YAML's model. A YAML library for a language like C must define its own types to represent the data, or use an alternative approach: the popular libyaml C library provides a low-level token-based API, which can be inconvenient to use. Some libraries improve on this by allowing developers to define a schema and data model in code. This has two benefits: the library can validate the data before returning it to your app, and it can return it using types defined by you. Examples of this approach are libcyaml (for C), go-yaml (Go) and serde_yaml (Rust).
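The idea of a schema defined in code can be sketched in Python, although this is only an analogy and not how libcyaml, go-yaml or serde_yaml are implemented. The `Config` class and `load_config` function are hypothetical names, and the input is assumed to be data already returned by a YAML parser.

```python
from dataclasses import dataclass


@dataclass
class Config:
    """Hypothetical data model: the class definition doubles as the schema."""
    country_list: list[str]


def load_config(data):
    """Validate parsed YAML data and return it as a typed Config."""
    if not isinstance(data, dict) or set(data) != {"country-list"}:
        raise ValueError("expected a mapping with exactly one key, 'country-list'")
    codes = data["country-list"]
    if not isinstance(codes, list) or not all(isinstance(c, str) for c in codes):
        raise ValueError("'country-list' must be a list of strings")
    return Config(country_list=codes)


config = load_config({"country-list": ["DK", "SE"]})
print(config.country_list)
```

The rest of the app then works with `Config` objects and never touches unvalidated data.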
Libraries
So what's the safest and most convenient library to use for loading and validating YAML? Here are some recommendations grouped by language ecosystem.
C
Developed by Codethink engineer Michael Drake, libcyaml wraps the standard libyaml parser with a strongly-typed API for defining your data schema and loading it into structures defined by you.
See the guide to find out more about how it works.
C++
There is no schema-driven YAML loader for C++ that we know of. Both yaml-cpp and rapidyaml return a tree of C++ objects, and it's up to you to validate and process the results.
There are several libraries that handle JSON-Schema for C++, but most are tied to specific JSON parser libraries and can't be extended to YAML. The exception is valijson, whose flexible adaptor-based API integrates with many JSON libraries and could be extended to cover YAML libraries too.
If you are using Boost and can represent your data using Boost.PropertyTree (boost::property_tree::ptree), you can already validate it using valijson.
Go
Go libraries can see your program's type information at runtime using the powerful reflect package. The yaml package and the newer go-yaml both use this feature to make schema validation simple: you define the expected structure of the data using Go's built-in types, and the library will return an error if anything in the file doesn't match.
JavaScript
The most popular JavaScript YAML loader is js-yaml. There are several other options too, and all of them return the data as JavaScript objects without validating that it's what you want.
Once you have the data, you can check it against a JSON-Schema using ajv, or the newer djv library.
Python
The safest way to load YAML in Python is with strictyaml. When you know what structure the input data should have, strictyaml is perfect: it's specifically designed to avoid the "Norway problem", and provides its own way to define a schema using Python code so it can verify the structure of the input and raise exceptions if it sees a problem.
You should be aware that it imposes some limits on the incoming data, so for some use cases you'll want pyyaml or ruamel.yaml instead. These libraries work well with the Python jsonschema package for data validation.
Rust
With its emphasis on performance, Rust requires that any type introspection is done at compile time. The Serde crate provides Serialize and Deserialize traits which your data structures can implement in a single line of code, using #[derive] macros.
This enables the serde_yaml library to check data loaded from a YAML file against the data structures you expect to hold it, and return an error if there is anything unexpected.
See the tutorial How to read and write YAML in Rust using Serde for a detailed example.
Other languages
We've listed some common languages that we use here at Codethink. Many more language ecosystems have some way to parse YAML, and you can start by checking the list at yaml.org. If you need an easy way to validate the data, start with the list of JSON-Schema validator implementations for your language.
Whatever your language, see if you can use a library to validate incoming data instead of checking everything by hand.