#SQL101: Create tables from CSV flat files

With my previous article about getting into REST applications, I figured it would be a good idea for me to set up a data source so I could practice. Besides, I had reinstalled SQL Server 2019 on my machine, and I needed to import some data so that I could brush up on my SQL skills as well.

Being the baseball nut that I am, of course, I had to import baseball statistics, so I decided to reimport the most recent data from Sean Lahman’s baseball database. The last time I did this exercise, I downloaded a database format. I don’t remember what format I used (the links all say “Access” — which I don’t remember downloading), but the files I used had an .sql extension. This time, I used the comma-delimited version, which downloaded a zip file containing files with a .csv extension.

I wanted to import the files directly into my database and have them create the tables upon doing so, so I opened up my SSMS, created a new Baseball database, and looked into how to do this. After poking around a bit (and a little bit of Googling), I found that flat files could be imported by right-clicking the database of your choice (in the below example, “Baseball”), clicking Tasks, and selecting Import Flat File.

Selecting this opened an Import Flat File wizard. First, it prompted me to select the input file. (Note: if you are importing multiple files, as I did for this little exercise, the wizard is smart enough to remember your last folder when you click Browse.)

When it looks at the flat file, it gives you a preview of the data that you’re importing. Since, for this exercise, I’m importing comma-delimited flat files, it was able to put my data into nice, neat columns.

Clicking “Next” brought me to a screen where I could modify the columns. I like this option a lot, as it gives me an opportunity to set up my data schema the way I want. If you’re a SQL or database newbie, I strongly suggest that you learn about primary keys and data types and take the time to set them up at this point.

In this particular example, I set my yearID to char(4), stint to int (I will likely change this to tinyint), teamID to char(3), lgID to char(2), and pretty much everything after lgID to int. I also set my first five columns as a composite primary key and everything else to be nullable.

I must have set these columns up successfully, because when I ran it, it did so without complaining.

I wish I could say that I imported all of my flat files without a hitch, but I did run into a few that didn’t run successfully the first time. Here are some of the issues that I came across.

  • I had opened a file in Excel to check data types, forgotten to close it, and the import complained that it couldn’t work because the file was still open.
  • I miscalculated a few field sizes, and came across messages saying that my column sizes were too short (for example, setting nvarchar(10) for a column that included data with 15 characters).
  • There were a few cases where I simply had the wrong data type.
  • My Pitching table included a column for ERA, which I was surprised to see. Reason: ERA (Earned Run Average, for those of you who are baseball-challenged) is a calculated statistic, like batting average. However, batting average was not included in the Batting table. So, I set the column data type to float. However, when I tried to import it, it failed. When I looked at the data, I found entries under ERA that said “inf” (for “infinity”)*. In this case, I did some data cleansing. I got rid of these entries and saved the flat file. It then imported with no problem

(*Some of you might be wondering, how do you get an ERA of infinity? Answer: you give up runs without getting anyone out! Mathematically, you would get a divide-by-zero error for calculating ERA, but in baseball parlance, it means you give up runs and can’t get anyone out!)

So hopefully at this point, you now have an idea as to how to import flat files into a SQL Server database (and maybe even got a small taste of data types and primary keys). And hopefully, this little utility saves you a lot of grief when trying to import flat files.

#SQL101: Setting up a primary key

When I set up my baseball database, I noticed something in the SQL code used to build the tables.  In the CREATE TABLE T-SQL code that was supplied by the Sean Lahman‚Äôs baseball database, I saw code that looked like this:

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Granted, there are a number of things about this code that strikes me, but one thing hit me immediately: all the columns are nullable.  That tells me that there are no primary keys defined.  I checked the other CREATE TABLE statements in the code as well; likewise, all of them had all their columns set to be nullable.

(There are some other things about the CREATE TABLE statements that struck me as well, but I’ll save those topics for another time.  For the purposes of this article, I’ll concentrate on primary keys.)

First, if you’re getting your feet wet with relational databases, it behooves the question…

What is a primary key?

First, I’ll start with a few links that help explain this:

primary key uniquely identifies a row of data.  To illustrate this, here’s an example.

Let’s say we have a person named John Smith taking night classes at State University.  In his database file (which, for the purposes of this example, has his name and address), he has the following information:

Name Address
John Smith 123 Main Street

Now, let’s say his son, John Smith Jr. (who lives at the same address) enrolls at State University (for the purposes of this example, let’s assume the database doesn’t allow suffixes like “Jr.”).  He is entered in the database.  So now, you have this.

Name Address
John Smith 123 Main Street
John Smith 123 Main Street

How do you know which record is which?  Answer: you don’t.  So, let’s add an ID column to our table.

ID Name Address
1 John Smith 123 Main Street
2 John Smith 123 Main Street

Now we know that we have two different records for two different people who are both named John Smith who live at 123 Main Street.

Granted, this is an overly-simplistic example, but the point is that the ID column — which is always unique and is never NULL — uniquely identifies each row.

So now that we know what a primary key is, let’s go back to our baseball database.

Creating primary keys for our baseball tables

We could create an ID column for our baseball tables, but I don’t want to do that here.  Instead, I want to make use of existing columns for our primary keys.  I’ll start with the MASTER table, which keeps track of ballplayers.

I ran a simple SELECT * FROM MASTER to get an idea of what’s in the table.  Running the query gave me this.

BaseballMasterQuery

I noted a couple of things.  First, I see a column called “playerID” which could be unique and used as a primary key.  Second, I see that the table contains 18,846 rows.  So can I use “playerID” for my primary key?  Let’s find out.

I ran two query statements.  The first gives me a count of unique entries I have for “playerID.”  The second tells me if any NULL values for “playerID” exist.

Here’s what I got.

BaseballMasterPlayerID.png

The first statement returns 18,846, which matches the number of rows in my earlier query, telling me that all values are unique.  The second statement doesn’t return anything, which tells me that there are no NULL values for “playerID.”  Therefore, the “playerID” column can be used as the primary key for this table.

So how do we make this column the primary key?  There are a couple of ways to do it in SQL Server.  For this, I decided to use T-SQL.

alter table dbo.Master
add constraint PK_playerID primary key clustered (playerID)

When I ran this, it returned an error.

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Master'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Okay.  Why did this appear?  Keep in mind that a primary key cannot be null.  When our code created our tables, it made them all nullable.  So we need to add a NOT NULL constraint before we create the primary key.

alter table dbo.Master
alter column playerID varchar(50) not null

Running this statement sets the column to not allow NULL values.  Once I ran this, I re-ran the command to create the primary key, and it completed successfully.  So I now have a primary key on my Master table.

What if I don’t have a single unique column?

Let’s take a look at the Batting table.

BaseballBatting

So, what should we use for our primary key?  A batter can play multiple seasons, and can play for multiple teams in a single season.  Chances are that you won’t be able to make use of a single column.  However, we can make use of multiple columns for our primary key (this is called a “composite” primary key).

We find that the Batting table contains 101,332 rows, so we need a combination of columns that will give us 101,332 unique results.  I figured that playerID, yearID, and stint should give us those results.

I ran the following query.

select distinct playerID, yearID, stint from Batting

This returned 101,332 results.  Indeed, I seem to have identified my composite primary key.

My first task is to set these three columns to not accept NULL values (otherwise, we’d get the same error from earlier).

alter table dbo.Batting alter column playerID varchar(50) not null
alter table dbo.Batting alter column yearID varchar(50) not null
alter table dbo.Batting alter column stint varchar(50) not null

Once the NOT NULL constraint is in place, we can create our composite primary key.  Note that in my T-SQL statement below, I provide three columns (playerID, yearID, stint), not just one.

alter table dbo.Batting 
add constraint PK_playerID_yearID_stint 
primary key clustered (playerID, yearID, stint)

The query ran successfully.  We now have a composite primary key on the Batting table that uses three columns.

Hopefully, this gives you a good introduction to primary keys on database tables.  In upcoming articles, I’ll cover other aspects such as foreign keys, indexes, and data types as we continue to further refine and improve our baseball database.