Here at Open mHealth, we build free and open-source tools to get data out of silos, understand it, and make magical stuff with it. But what if you don’t have troves of data sitting around waiting for you to use? How do you test your systems before you go live, or before your study starts? Or what if the data you do have access to doesn’t match what you’re trying to process or visualize? If you don’t have access to data containing the measures, time scales, and patterns you need, you’ll need to generate some. We’ve just created a sample data generator that does that for you.
There are plenty of shapes a data generator can take. To figure out what to build, we asked ourselves three critical questions.
Who’s it for?
The people who need to generate data come from different backgrounds. Developers use sample data to test code. Researchers and data scientists use sample data to test feature computations and models. Designers use sample data to test visualizations and get product feedback. We needed to give all of these people a simple way to express the data to generate, which ruled out asking them to modify code. The approach needed to be declarative. It needed to be easy to experiment with. It needed to be quick.
We settled on using a configuration file to define the data to generate. The configuration file is written in YAML, which is human-readable and doesn’t take much getting used to. It’s also easy to change one line in the file and regenerate, which makes it trivial to experiment with different configurations. The end result is akin to a domain-specific language, just in a familiar format.
The data generator is run as a command-line tool, which our target audience is accustomed to using. We’ll find out in due course if a command-line tool is too cumbersome, and if it is, we’ll layer a web interface on top. The semantics of the configuration file map directly to a web interface, so we should be able to pull that off quickly.
What can users do with it?
In some cases, a user will just want basic data. For example, a developer might want blood pressure data for a year, just to check if it’s ingested properly in the system under test. In other cases, a user will want something specific. For example, a UI designer might want blood pressure data trending upwards from a normal level to a high level, to check that a graph correctly colors data points as they cross a warning threshold. The goal of the data generator, and everything else we build at Open mHealth, is to make simple things easy, and complex things possible.
To accomplish this, the YAML configuration makes heavy use of convention over configuration. If a user doesn’t care about something, they don’t have to specify it. Here’s an example of a minimal configuration to create body weight data:
And here’s the whole shebang:
As you can see, the configuration is straightforward, but it’s also quite powerful. It lets you control:
- what kind of data to create, e.g. body weight, blood pressure, heart rate, etc.
- when the data occurs, e.g. this week, last year, the past decade
- how spread out the data is, e.g. one data point per second, one data point per week, etc.
- whether data that occurs during night-time hours should be generated
- what bounds the data has to stay within
- how much variance is in the data
- how the data trends over time, e.g. blood pressure increases from 120/80 to 150/90 over a six-month period
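To make the last few settings concrete, here is a minimal Java sketch of how a trend, variance, bounds, and night-time suppression can combine into a series of values. It is an illustration under stated assumptions, not the generator's actual implementation: the class and method names are hypothetical, points are evenly spaced, the noise is Gaussian, and "night" is taken to mean 22:00 to 06:00.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical sketch for illustration only; not the generator's real code.
class TrendSketch {

    /** One generated data point: when it occurred and its value. */
    static final class Point {
        final LocalDateTime time;
        final double value;

        Point(LocalDateTime time, double value) {
            this.time = time;
            this.value = value;
        }
    }

    static List<Point> generate(LocalDateTime start, LocalDateTime end, Duration step,
                                double startValue, double endValue, double stdDev,
                                double min, double max, boolean suppressNight, long seed) {
        if (step.isZero() || step.isNegative()) {
            throw new IllegalArgumentException("step must be positive");
        }
        Random random = new Random(seed);
        List<Point> points = new ArrayList<>();
        double totalSeconds = Duration.between(start, end).getSeconds();

        for (LocalDateTime t = start; t.isBefore(end); t = t.plus(step)) {
            // Assumption: night-time runs from 22:00 to 06:00.
            boolean night = t.getHour() >= 22 || t.getHour() < 6;
            if (suppressNight && night) {
                continue;
            }
            // Trend: linear interpolation between the start and end values.
            double fraction = Duration.between(start, t).getSeconds() / totalSeconds;
            double trend = startValue + fraction * (endValue - startValue);
            // Variance: add Gaussian noise, then clamp to the bounds.
            double noisy = trend + random.nextGaussian() * stdDev;
            points.add(new Point(t, Math.max(min, Math.min(max, noisy))));
        }
        return points;
    }

    public static void main(String[] args) {
        // Systolic blood pressure rising from 120 to 150 over six months,
        // one point every six hours, bounded to the range 90 to 180.
        LocalDateTime start = LocalDateTime.of(2015, 1, 1, 0, 0);
        List<Point> points = generate(start, start.plusMonths(6), Duration.ofHours(6),
                120, 150, 3, 90, 180, true, 42L);
        Point first = points.get(0);
        Point last = points.get(points.size() - 1);
        System.out.println(points.size() + " points");
        System.out.println(first.time + " -> " + first.value);
        System.out.println(last.time + " -> " + last.value);
    }
}
```

Changing a single argument and rerunning is the code-level equivalent of changing one line in the configuration file and regenerating.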
A graph of the end result could look something like this:
Or even like this:
(If you want to make graphs like these, check out our visualization library.)
We documented the generator’s features in depth on GitHub should anyone want to dig in, but we know people love to just get up and running. To help with that, we’ve filled the sample configuration file with explanations of the different settings, and included commented-out configuration blocks that you can uncomment to use. Finally, if you really just want a ready-made data set, you can get a year’s worth of data over here that matches the defaults in the configuration file.
What if users want to do more with the generator?
It’s open-source under Apache 2.0, so you can add features and change it as you see fit. And if you need any help, we’re always reachable on our forum or on GitHub.
We strive to make our open-source code easy to change. To ensure many people understand it, we write code in mainstream languages. In this case, the data generator is written in Java. We also try to keep the code as small as possible. The less code there is, the lower its conceptual weight, making it simpler to understand and change. We keep code small by leveraging frameworks to handle infrastructure concerns and let them do as much of the heavy lifting as possible. Since this is a Java project, we depend on the Spring Framework, Spring Boot, and Jackson to handle configuration and serialization. The mapping from the configuration YAML to Java objects, for example, is handled transparently by Spring Boot and SnakeYAML.
A code change we foresee people making is the addition of more measure generators. A measure generator is responsible for creating data specific to one data type, such as body weight. The following code snippet is the entirety of the body weight generator.
As you can see, it’s clear and small. A measure generator is typically fewer than 50 lines of code.
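For a sense of what such an extension point can look like, here is a minimal, self-contained Java sketch. The `MeasureGenerator` interface, its method names, the `weight-in-kg` key, and the simplified output shape are all hypothetical and chosen for illustration; the generator's real classes are in the GitHub repository.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch for illustration only; not the project's actual API.
class MeasureGeneratorSketch {

    /** Extension point: turns generated raw values into one measure. */
    interface MeasureGenerator {
        /** Name used to select this generator, e.g. from a configuration file. */
        String name();

        /** Keys of the raw values this generator needs to build a measure. */
        Set<String> requiredKeys();

        /** Builds one measure from the generated values. */
        Map<String, Object> newMeasure(Map<String, Double> values);
    }

    /** Body weight generator: needs a single value, the weight in kilograms. */
    static final class BodyWeightGenerator implements MeasureGenerator {
        static final String WEIGHT_KEY = "weight-in-kg"; // assumed key name

        @Override
        public String name() {
            return "body-weight";
        }

        @Override
        public Set<String> requiredKeys() {
            return Set.of(WEIGHT_KEY);
        }

        @Override
        public Map<String, Object> newMeasure(Map<String, Double> values) {
            Double kilograms = values.get(WEIGHT_KEY);
            if (kilograms == null) {
                throw new IllegalArgumentException("missing value: " + WEIGHT_KEY);
            }
            // Simplified output shape: a value paired with its unit.
            Map<String, Object> weight = new LinkedHashMap<>();
            weight.put("unit", "kg");
            weight.put("value", kilograms);
            Map<String, Object> measure = new LinkedHashMap<>();
            measure.put("body_weight", weight);
            return measure;
        }
    }

    public static void main(String[] args) {
        MeasureGenerator generator = new BodyWeightGenerator();
        System.out.println(generator.newMeasure(Map.of("weight-in-kg", 72.5)));
    }
}
```

In a design like this, adding a new measure means writing one more small class. The timing, trend, and variance logic stays in shared code and does not need to change.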
To release the data generator quickly, we supported a handful of popular measures out of the box. By keeping the obvious extension points encapsulated and clear, we hope to improve the odds that pull requests and community contributions will grow the product.
Now that you’ve seen what the generator is for, what it can do, and how it does it, the last step is to download it and try it out. You can pull it as a Docker image, or download a JAR. We’d love to hear what you think of it and where you think we can do better. Feel free to comment below, or post issues on the GitHub repository.