#SQL101: Create tables from CSV flat files

With my previous article about getting into REST applications, I figured it would be a good idea for me to set up a data source so I could practice. Besides, I had reinstalled SQL Server 2019 on my machine, and I needed to import some data so that I could brush up on my SQL skills as well.

Being the baseball nut that I am, of course, I had to import baseball statistics, so I decided to reimport the most recent data from Sean Lahman’s baseball database. The last time I did this exercise, I downloaded a database format. I don’t remember what format I used (the links all say “Access” — which I don’t remember downloading), but the files I used had an .sql extension. This time, I used the comma-delimited version, which downloaded a zip file containing files with a .csv extension.

I wanted to import the files directly into my database and have them create the tables upon doing so, so I opened up my SSMS, created a new Baseball database, and looked into how to do this. After poking around a bit (and a little bit of Googling), I found that flat files could be imported by right-clicking the database of your choice (in the below example, “Baseball”), clicking Tasks, and selecting Import Flat File.

Selecting this opened an Import Flat File wizard. First, it prompted me to select the input file. (Note: if you are importing multiple files, as I did for this little exercise, the wizard is smart enough to remember your last folder when you click Browse.)

When it looks at the flat file, it gives you a preview of the data that you’re importing. Since, for this exercise, I’m importing comma-delimited flat files, it was able to put my data into nice, neat columns.

Clicking “Next” brought me to a screen where I could modify the columns. I like this option a lot, as it gives me an opportunity to set up my data schema the way I want. If you’re a SQL or database newbie, I strongly suggest that you learn about primary keys and data types and take the time to set them up at this point.

In this particular example, I set my yearID to char(4), stint to int (I will likely change this to tinyint), teamID to char(3), lgID to char(2), and pretty much everything after lgID to int. I also set my first five columns as a composite primary key and everything else to be nullable.

I must have set these columns up successfully, because when I ran it, it did so without complaining.

I wish I could say that I imported all of my flat files without a hitch, but I did run into a few that didn’t run successfully the first time. Here are some of the issues that I came across.

  • I had opened a file in Excel to check data types, forgotten to close it, and the import complained that it couldn’t work because the file was still open.
  • I miscalculated a few field sizes, and came across messages saying that my column sizes were too short (for example, setting nvarchar(10) for a column that included data with 15 characters).
  • There were a few cases where I simply had the wrong data type.
  • My Pitching table included a column for ERA, which I was surprised to see. Reason: ERA (Earned Run Average, for those of you who are baseball-challenged) is a calculated statistic, like batting average. However, batting average was not included in the Batting table. So, I set the column data type to float. However, when I tried to import it, it failed. When I looked at the data, I found entries under ERA that said “inf” (for “infinity”)*. In this case, I did some data cleansing. I got rid of these entries and saved the flat file. It then imported with no problem

(*Some of you might be wondering, how do you get an ERA of infinity? Answer: you give up runs without getting anyone out! Mathematically, you would get a divide-by-zero error for calculating ERA, but in baseball parlance, it means you give up runs and can’t get anyone out!)

So hopefully at this point, you now have an idea as to how to import flat files into a SQL Server database (and maybe even got a small taste of data types and primary keys). And hopefully, this little utility saves you a lot of grief when trying to import flat files.

Notes are not documentation

Since I speak frequently at SQL Saturday, much of my audience is usually made up of data professionals. So whenever I do my presentation about technical writing or talking to non-technical people, I always ask the following question: do you understand the difference between data and information?

For those of you who don’t understand what I’m saying, let me elaborate. The terms data and information are not, I repeat, not interchangeable. Data refers to the collection of statistics, facts, and figures that are gathered through observation, research, and log captures. Information, on the other hand, is an interpretation of that data. Its creation involves analyzing the data, interpreting it, drawing conclusions from it, and presenting it in a way that can be understood by others. To put it another way, data refers to the raw materials, while information is the “finished” product.

(We could also get into a discussion that talks about how bias is introduced when data is analyzed and interpreted by humans, all of whom have some measure of preconceived bias, no matter how small, about the data they’re researching, but it goes beyond the scope of this article, and I won’t get into that now. That is another topic for another day.)

Once I establish that distinction, that’s when I go into what I believe is an important point about written communication: there is a difference between notes and documentation.

I believe this distinction is critical, and is often overlooked by people trying to write documentation. (One of the biggest things that disparaged one of my previous employers was that they did not understand — or worse, did not care about — that difference.) How many times have you been frustrated whenever you’ve asked for documentation about something, and the person you asked pointed you to something like a loose collection of seemingly-meaningless and disorganized scribbled notes that are kept in a three-ring binder? (Okay, maybe they’re kept digitally these days, not in a binder, but humor me, here.)

Granted, notes are important. They are thoughts in someone’s head that are expressed in written form. They are often important concepts that are jotted down so they can be remembered later. Those notes can come in many forms. How many of the best ideas, for example, started out as notes scribbled on a cocktail napkin?

However, more often than not, notes are only meaningful to the person who wrote them. Notes are mnemonics. They are not conveying information to a wide audience. In all likelihood, most people who read notes will not understand what they mean. Only the person writing the notes (or, if they’re jotted down during a meeting, the people attending the meeting) will understand what the notes are.

Documentation, however, is an interpretation of those notes. Documentation is notes that are presented in a way so that they can be understood by a wider audience. This is what professional communicators, such as technical writers, do. They’re in the business of taking loose data, such as notes, and making it meaningful.

Let me say it again for emphasis: notes are not documentation. To go back to my data and information analogy, data is to information what notes are to documentation. Notes are the raw material, but documentation is what makes the notes meaningful.

When it’s appropriate to use fake data (no, really!!!)

It isn’t uncommon for me to include data examples whenever I’m writing documentation. I’ve written before about how good examples will enhance documentation.

Let me make one thing clear. I am not talking about using data or statistics in and of itself to back up any assertions that I make. Rather, I am talking about illustrating a concept that just happens to include data as part of the picture. In this scenario, the illustration is the important part; the data itself is irrelevant. In other words, the information within the data isn’t the example; the data is the example. Take a moment to let that sink in, then read on. Once you grasp that, you’ll understand the point of this article.

I need to strike a type of balance as to what kind of examples I use. Since I work in a multi-client data application environment, I need to take extra steps to ensure that any data examples I use are client agnostic. Clients should not see — nor is it appropriate to use — data examples that are specific to, or identifies, a client.

There’s also a matter of data security. I needn’t explain how big of a deal data security is these days. We are governed by laws such as HIPAA, GDPR, and a number of other data protection laws. Lawsuits and criminal charges have come about because of the unauthorized release of data.

For me, being mindful of what data I use for examples is a part of my daily professional life. Whenever I need data examples, I’ll go through the data to make sure that I’m not using any live or customer data. If I don’t have any other source, I’ll make sure I alter the data to make it appear generic and untraceable, sometimes even going as far as to alter screen captures pixel by pixel to change the data. I’ll often go through great pains to ensure the data I display is agnostic.

I remember a situation years ago when a person asking a question on the SQLServerCentral forums posted live data as an example. I called him out on it, telling him, “you just broke the law.” He insisted that it wasn’t a big deal because he “mixed up names so they didn’t match the data.” I, along with other forum posters, kept trying to tell him that what he was doing was illegal and unethical, and to cease and desist, but he just didn’t get it. Eventually, one of the system moderators removed his post. I don’t know what happened to the original poster, but it wouldn’t surprise me if, at a minimum, he lost his job over that post.

Whenever I need to display data examples, there are a number of sources I’ll employ to generate the data that I need.

  • Data sources that are public domain or publically available — Data that is considered public domain is pretty much fair game. Baseball (and other sports) statistics come to mind off the top of my head.
  • Roll your own — I’ll often make up names (e.g. John Doe, Wile E. Coyote, etc.) and data wherever I need to do so. As an added bonus, I often have fun while I’m doing it!

Are there any other examples I missed? If you have any others, feel free to comment below.

So if you’re writing documentation in which you’re using an illustration that includes data, be mindful of the data in your illustration. Don’t be the person who is inadvertently responsible for a data breach in your organization because you exposed live data in your illustration.

SQL Saturday #741, Albany, NY — Come to upstate New York!

On Saturday, July 28 (a week from tomorrow as I write this), our local Albany-area SQL user group will host our fifth SQL Saturday!  I have participated in all five; I worked as a volunteer at the first one, and I presented at the other four (including next Saturday).  This is one of my favorite events, and I look forward to it every single year!

This year, I am debuting a brand-new presentation: “Networking: it isn’t just for breakfast anymore,” based upon my ‘blog article of the same name.  (An alternate name for it could be “Networking 101: networking for beginners.”)  This presentation is primarily for people who want to get better at networking but don’t know how to do it, although seasoned veterans might be able to get something out of it as well.  It’s one of the first sessions of the day (8:30 am!), so come early!

As much as I promote my own presentations, mine is not the only one on the docket.  There are many wonderful speakers and presentations being given at this event, and I encourage you to come out and check out as many as you can that interest you!

SQL Saturday is always a great time, a great opportunity for free learning, and a great opportunity to network with data professionals.  The Capital District region here in upstate New York has been my home for many years.  I hope to see you here in my home turf!

#BI101: An introduction to BI using baseball

Edit: This is the first of a series of articles (I hope!) in which I’m trying to teach myself about BI.  Any articles I write that are related to this, starting with this one, will be preceded with “#BI101” in the title.

As I stated in a previous article, one topic about which I’m interested in learning more is business intelligence (BI).  For those of you who are new to BI, it is a broad topic.  In a nutshell, it can probably be described as “consuming and interpreting data so it can be used for business decisions and/or applications.”

I’ll admit that I don’t know a lot about BI (at least the fine details, anyway).  I did work a previous job where I touched upon it; I was tasked with performing some data analysis, and I was introduced to concepts such as OLAP cubes and pivot tables.  I’ve gotten better at creating pivot tables — I’ve done a few of them using MS Excel — but I’ll admit that I’m still not completely comfortable with building cubes.  I suppose that’ll come as I delve further into this.

A while back, my friend, Paresh Motiwala, suggested that I submit a presentation for Boston SQL Saturday BI edition.  At the time, I said to him, “the only thing I know about BI is how to spell it!”  He said to me (something like), “hey, you know how to spell SQL, don’t you?”  Looking back at the link, I might have been able to submit (I didn’t realize, at the time, that they were running a professional development track).  That said, Paresh did indeed had a point.  As I often tell people, I am not necessarily a SQL expert — I know enough SQL to be dangerous — nevertheless, that does not stop me from applying to speak at SQL Saturday.  Likewise, as I dive further into this topic, I’m finding that I probably know more about BI than I’ve led myself to believe.  Still, there is always room for improvement.

To tackle this endeavor, once again, I decided to jump into this using a subject that I enjoy profusely: baseball.  Baseball is my favorite sport, and it is a great source of data for stat-heads, mathematicians, and data geeks.  I’ve always been of the opinion that if I’m going to learn something new, I should make it fun!

Besides, the use of statistical analysis in baseball has exploded.  Baseball analytics is a big deal, ever since Bill James introduced sabermetrics (there is some debate as to whether James has enhanced or ruined baseball).  So what better way to introduce myself to BI concepts?

For starters, I came across some articles (listed below, for my own reference as much as anything else):

I also posted a related question in the SSC forums.  We’ll see what kind of responses (if any) I get to my query.

Let’s start with the basics — what is BI?

Since I’m using baseball to drive this concept, let’s use a baseball example to illustrate this.

Let’s say you’re (NY Yankees manager) Aaron Boone.  You’re down by a run with two outs in the bottom of the 9th.  You have Brett Gardner on first, Aaron Judge at bat, and you’re facing Craig Kimbrel on the mound.

What do you do?  How does BI come into play here?

Let’s talk a little about what BI is.  You have all these statistics available — Judge’s batting average, Kimbrel’s earned run average, Gardner’s stolen base percentage, and so on.  In years BS — “before sabermetrics” — a manager likely would have “gone with his gut,” decided that Judge is your best bet to hit the game-winning home run, and let him swing away.  But is this the best decision to make?

Let’s put this another way.  You have a plethora of data available at your fingertips.  BI represents the ability to analyze all this data and provide information that allows you to make a good decision.

If Aaron Boone (theoretically) had this data available at his fingertips (to my knowledge, Major League Baseball bans the use of electronic devices in the dugout during games), he could use the data to consider Kimbrel’s pitching tendencies, Judge’s career numbers against Kimbrel, and so on.  BI enables Boone to make the best possible decision based upon the information he has at hand.

I do want to make one important distinction.  In the above paragraphs, I used the words data and information.  These two words are not interchangeable.  Data refers to the raw numbers that are generated by the players.  Information refers to the interpretation of that data.  Therein lies the heart of what BI is — it is the process of generating information based upon data.

What’s there to know about BI?

I’ve already mentioned some buzzwords, including OLAP, cubes, and pivot tables.  That’s just scratching the surface.  There’s also KPIs, reporting services, decision support systems, data mining, data warehousing, and a number of others that I haven’t thought of at this point (if you have any suggestions, please feel free to add them in the comments section below).  Other than including the Wikipedia definition links, I won’t delve too deeply into them now, especially when I’m trying to learn about these myself.

So why bother learning about BI?

I have my reasons for learning more about BI.  Among other things…

  • It is a way to keep myself technically relevant.  I’ve written before about how difficult it is to stay up-to-date with technology.  (For further reading regarding this, I highly recommend Eugene Meidinger’s article about keeping up with technology; he also has a related SQL Saturday presentation that I also highly recommend.)  I feel that BI is a subject I’m able to grasp, learn about, and contribute.  By learning about BI, I can continue making myself technically valuable, even as my other technical skills become increasingly obsolete.  Speaking of which…
  • It’s another adjustment.  Again, I’ve written before about making adjustments to keep myself professionally relevant.  If there’s one thing I’ve learned, it’s that if you want to survive professionally, you need to learn to adjust to your environment.
  • It is a subject that interests me.  I’m sure that many of you, as kids, had “imaginary friends.”  (I’ll bet some adults have, too — just look at Lieutenant Kije and Captain Tuttle.)  When I was a kid, I actually had an imaginary baseball team.  I went as far as to create an entire roster full of fictitious ballplayers, even coming up with full batting and pitching statistics for them.  My star player was a power-hitting second baseman who had won MVP awards in both the National and American leagues, winning several batting titles (including a Triple Crown) and leading my imaginary team to three World Series championships.  I figured, if my interest in statistics went that far back, there must be something behind it.  Granted, now that I’ve grown up older, I’m not as passionate about baseball statistics as I was as a kid, but some level of interest still remains, nevertheless.
  • It is a baseline for learning new things.  I’ve seen an increasing number of SQL Saturday presentations related to BI, such as PowerBI, reporting services, and R.  I’m recognizing that these potentially have value for my workplace.  But before I learn more about them, I also need to understand the fundamental baseline that they support.  I feel that I need to learn the “language” of BI before I can learn about the tools that support it.

So, hopefully, this article makes a good introduction (for both you and myself) for talking about BI.  I’ll try to write more as I learn new things.  We’ll see where this journey goes, and I hope you enjoy coming along for the ride.