When it’s appropriate to use fake data (no, really!!!)

It isn’t uncommon for me to include data examples whenever I’m writing documentation. I’ve written before about how good examples will enhance documentation.

Let me make one thing clear. I am not talking about using data or statistics in and of itself to back up any assertions that I make. Rather, I am talking about illustrating a concept that just happens to include data as part of the picture. In this scenario, the illustration is the important part; the data itself is irrelevant. In other words, the information within the data isn’t the example; the data is the example. Take a moment to let that sink in, then read on. Once you grasp that, you’ll understand the point of this article.

I need to strike a type of balance as to what kind of examples I use. Since I work in a multi-client data application environment, I need to take extra steps to ensure that any data examples I use are client agnostic. Clients should not see — nor is it appropriate to use — data examples that are specific to, or identifies, a client.

There’s also a matter of data security. I needn’t explain how big of a deal data security is these days. We are governed by laws such as HIPAA, GDPR, and a number of other data protection laws. Lawsuits and criminal charges have come about because of the unauthorized release of data.

For me, being mindful of what data I use for examples is a part of my daily professional life. Whenever I need data examples, I’ll go through the data to make sure that I’m not using any live or customer data. If I don’t have any other source, I’ll make sure I alter the data to make it appear generic and untraceable, sometimes even going as far as to alter screen captures pixel by pixel to change the data. I’ll often go through great pains to ensure the data I display is agnostic.

I remember a situation years ago when a person asking a question on the SQLServerCentral forums posted live data as an example. I called him out on it, telling him, “you just broke the law.” He insisted that it wasn’t a big deal because he “mixed up names so they didn’t match the data.” I, along with other forum posters, kept trying to tell him that what he was doing was illegal and unethical, and to cease and desist, but he just didn’t get it. Eventually, one of the system moderators removed his post. I don’t know what happened to the original poster, but it wouldn’t surprise me if, at a minimum, he lost his job over that post.

Whenever I need to display data examples, there are a number of sources I’ll employ to generate the data that I need.

  • Data sources that are public domain or publically available — Data that is considered public domain is pretty much fair game. Baseball (and other sports) statistics come to mind off the top of my head.
  • Roll your own — I’ll often make up names (e.g. John Doe, Wile E. Coyote, etc.) and data wherever I need to do so. As an added bonus, I often have fun while I’m doing it!

Are there any other examples I missed? If you have any others, feel free to comment below.

So if you’re writing documentation in which you’re using an illustration that includes data, be mindful of the data in your illustration. Don’t be the person who is inadvertently responsible for a data breach in your organization because you exposed live data in your illustration.