#SQL101: Create tables from CSV flat files

With my previous article about getting into REST applications, I figured it would be a good idea for me to set up a data source so I could practice. Besides, I had reinstalled SQL Server 2019 on my machine, and I needed to import some data so that I could brush up on my SQL skills as well.

Being the baseball nut that I am, of course, I had to import baseball statistics, so I decided to reimport the most recent data from Sean Lahman’s baseball database. The last time I did this exercise, I downloaded a database format. I don’t remember what format I used (the links all say “Access” — which I don’t remember downloading), but the files I used had an .sql extension. This time, I used the comma-delimited version, which downloaded a zip file containing files with a .csv extension.

I wanted to import the files directly into my database and have them create the tables upon doing so, so I opened up my SSMS, created a new Baseball database, and looked into how to do this. After poking around a bit (and a little bit of Googling), I found that flat files could be imported by right-clicking the database of your choice (in the below example, “Baseball”), clicking Tasks, and selecting Import Flat File.

Selecting this opened an Import Flat File wizard. First, it prompted me to select the input file. (Note: if you are importing multiple files, as I did for this little exercise, the wizard is smart enough to remember your last folder when you click Browse.)

When it looks at the flat file, it gives you a preview of the data that you’re importing. Since, for this exercise, I’m importing comma-delimited flat files, it was able to put my data into nice, neat columns.

Clicking “Next” brought me to a screen where I could modify the columns. I like this option a lot, as it gives me an opportunity to set up my data schema the way I want. If you’re a SQL or database newbie, I strongly suggest that you learn about primary keys and data types and take the time to set them up at this point.

In this particular example, I set my yearID to char(4), stint to int (I will likely change this to tinyint), teamID to char(3), lgID to char(2), and pretty much everything after lgID to int. I also set my first five columns as a composite primary key and everything else to be nullable.

I must have set these columns up successfully, because when I ran it, it did so without complaining.

I wish I could say that I imported all of my flat files without a hitch, but I did run into a few that didn’t run successfully the first time. Here are some of the issues that I came across.

  • I had opened a file in Excel to check data types, forgotten to close it, and the import complained that it couldn’t work because the file was still open.
  • I miscalculated a few field sizes, and came across messages saying that my column sizes were too short (for example, setting nvarchar(10) for a column that included data with 15 characters).
  • There were a few cases where I simply had the wrong data type.
  • My Pitching table included a column for ERA, which I was surprised to see. Reason: ERA (Earned Run Average, for those of you who are baseball-challenged) is a calculated statistic, like batting average. However, batting average was not included in the Batting table. So, I set the column data type to float. However, when I tried to import it, it failed. When I looked at the data, I found entries under ERA that said “inf” (for “infinity”)*. In this case, I did some data cleansing. I got rid of these entries and saved the flat file. It then imported with no problem

(*Some of you might be wondering, how do you get an ERA of infinity? Answer: you give up runs without getting anyone out! Mathematically, you would get a divide-by-zero error for calculating ERA, but in baseball parlance, it means you give up runs and can’t get anyone out!)

So hopefully at this point, you now have an idea as to how to import flat files into a SQL Server database (and maybe even got a small taste of data types and primary keys). And hopefully, this little utility saves you a lot of grief when trying to import flat files.

#SQL101: Installing #SQLServer 2019 Developer on my laptop

A while back, I wrote about installing SQL Server 2016 on my laptop. Since then, Microsoft has released SQL Server 2019. Additionally, I bought a new laptop last November; my tired old HP 4430s had served me well for several years, but it was showing its age, so I decided it was time to upgrade. Since we have updated versions of SQL Server, and I have a (still relatively) new laptop on which to install it, not to mention that I have some time with this COVID-19 isolation, I figured this would be a good time to install SQL Server 2019 on my new laptop.

Before we begin, let me start with my laptop specs. I make no claims that these are the recommended specs for SQL Server, but this is what I have.

  • HP Pavilion x360 Convertible 14m-dh1xxx
  • Intel Core i5-10210U @ 160 GHz
  • 8 GB RAM
  • Windows 10 Home Edition (it’s what came installed)
  • 129GB available disk space

I started by going to the SQL Server downloads page and downloaded the freeware version of SQL Server 2019 Developer. There are a number of versions on this page, including (among other things) a trial version of SQL Server 2019 on-premises and SQL Server 2019 on Azure (and, of course, the Express version of SQL Server). For my rather modest needs, which includes practicing SQL Server skills, writing about it from time to time, and having some fun with it, Developer version should suit my needs.

The link downloaded SQL2019-SSEI-Dev.exe to my machine. I ran the file and was greeted by a screen asking for the installation type.

I decided to use the Custom option. The lazy body in me thought about running the Basic installation type, but since I’m documenting this installation, I figured it would defeat part of the purpose.

The next screen asked where to download the media. By default, it goes to C:\SQL2019. Since most everything I download goes to my Downloads folder, I decided to switch it there. I set it to download to a SQL2019 folder within my Downloads folder. It also indicated that I would need 8.9GB free space, with a download size of 1.4GB. My new laptop doesn’t have the disk size that my old one did, but I still have plenty available, so it shouldn’t be a problem. (One thing I should note: my new laptop uses a SSD, as opposed to the traditional storage disk on my old machine.)

I clicked Install, and the install package started to download.

Once the Installer finished downloading, the SQL Server Installation Center appeared.

I ran the System Configuration Checker, and it came up cleanly. I decided to proceed with installation. I clicked the Installation tab and selected New SQL Server stand-alone installation or add features to an existing installation.

On the Product Key page, I selected Developer under free edition. The next few screens were straightforward — the only warning I saw was for my firewall — until it got to the Feature Selection screen. I went ahead and selected all features, which would take up 14GB of disk space. If you’re installing SQL Server on your own machine, you’ll need to decide what features you want to install at this point.

I went with the default instance for the instance configuration.

I selected standalone PolyBase-enabled instance. I’m using this on a standalone configuration, after all.

Since I don’t have Java installed on my machine, I used the Open JRE included with the installation. If you have Java on your machine, you’ll need to determine what instance of Java you want to use.

Under Server Configuration, I used the default service accounts. There’s a note that reads: “Microsoft recommends that you use a separate account for each SQL Server service.” I am not sure about the implications of using the default service accounts; this would be a question for someone who knows more about SQL Server than I do.

Under the Database Engine Configuration, I went with Windows authentication mode. If I was installing this under any configuration other than my own machine and login, I would not go with this option; I would use Mixed Mode and specify a SQL user account. I added myself (clicking Add Current User) as the administrator for this account. Again, this is not something I, personally, would recommend for a large-scale installation, but since I’m the only one who’ll be using this instance, and I have no intention of using this for anything other than demo, practice, and documentation purposes, I went ahead and used Windows Authentication.

I pretty much went with the defaults for the rest of the installation. I did need to consent to install R and Python. I got to the Ready to Install screen, clicked Install, and let it do its thing.

Installation was straightforward and painless. In years and installations past, I’d be writing about the errors that came up and the number of times I’d have to click or press Enter to continue with the installation, but there were no such prompts. I let it go and went off to do other things. I’m not sure how long it took — I’ll guess around twenty minutes, although it seemed longer — but when I looked again, SQL Server was installed on my laptop.

That was as far as I got for this installation. I still need to tinker with post-installation configurations, including SSMS, SSIS, and any tools that I need to actually do something with SQL Server. That’ll likely come later when I have a chance to tinker some more.

#SQL101: Developing an introductory presentation to SQL Server

So far, all of my SQL Saturday presentations have been professional development talks — “soft topics,” as they’re often described. I don’t present about technical topics, but I do present topics that are of interest to technology (and perhaps other) professionals.

This is not to say that I don’t have technical skills. I do have a background in development and databases, but as I often introduce myself during my SQL Saturday presentations, I probably fall under the category of “knows enough SQL to be dangerous.” I am neither a SQL expert nor an MVP. While I am knowledgeable about SQL Server, I likely won’t be doing any presentations about power BI, data compression, or data security anytime soon.

I can, however, discuss rudimentary topics about SQL Server that might be of interest to people who are just getting started with SQL Server. When I first started my ‘blog, I wrote some articles about how to get started with SQL Server. As my ‘blog (and my professional life) has evolved, I’ve been moving more toward the soft topics about which I’m more knowledgeable and tend to present, and away from the hardcore technological topics.

An idea that has been in the back of my mind for some time is to develop presentations geared toward people who are just getting started with SQL Server and even databases in general. This idea is not new; I’ve toyed with it for a while, and only lack of time has kept me from developing it further.

One observation I’ve made during my frequent trips to SQL Saturday events is that many of the presentations are geared more toward “seasoned” SQL personnel; that is, people who already have some background knowledge of SQL Server and its workings. They are all very good topics, but for a person who is just getting started, they can be overwhelming — as is often described, a proverbial “drink from a firehose.”

There does seem to be a market for this idea. I’ve spoken to Grant Fritchey a few times about my idea, and he has encouraged me to pursue it. One thing that was mentioned to me was that part of the reason why many SQL Saturday presentations tend to be more advanced is that the presenters themselves are fairly advanced. A lot of them are SQL experts and MVPs, and are presenting topics at a much higher level from where a SQL beginner would need to start. It would be akin to asking a college professor to teach kindergarten.

Grant even suggested that I make these presentations into an entire precon — as there is way too much material to cover in a single SQL Saturday presentation. This is an idea that intrigues me, and it’s something that I’m interested in developing. It’s just a matter of me taking the time to sit down and putting it together.

I have a few reasons for writing this article. Among them are a form of self-encouragement to pursue this endeavor and a forum to list some of my thoughts. On the latter, I wanted to list a few topic ideas listed so that I can refer to and develop it as I go along.

Some of the topics I would cover would likely include the following.

  • A general high-level introduction to SQL Server and databases in general
  • Basics of T-SQL
  • An introduction to relational tables
  • Basics of data normalization
  • An introduction to database applications

I’m sure there are some other topics that haven’t occurred to me. If you have any suggestions, feel free to list them below in the comments.

This is an idea that has been kicking around my head for at least a few years. Maybe sometime, I’ll actually sit down and start working on it. Hopefully, that sometime will be soon.

#SQL101: On data types

Previously, I talked about something I saw in the code that built the tables in my baseball database.  I wrote last time about primary keys.  In this article, I want to talk about data types.

Let’s revisit the code that builds the Batting table again.

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Last time, I mentioned that every single column allowed NULL values (and I’d like to expand on that in a future article).  Today, I’d like to talk about the fact that every single column is defined as varchar(50).

For those of you who are getting your feet wet with SQL Server (or don’t know what I’m talking about), let’s take a moment to discuss…

What are data types, and why are they important?

Imagine that you’re presented with a math problem (don’t worry; we’ll keep it simple).  However, when you’re presented with the problem, it looks like this:

XII + IX

Want to give it a shot?  The truth is, there’s a reason why we don’t normally use Roman numerals in our society.  Unlike the Arabic numeral system that we use in everyday life, the Roman numeral system is inefficient and not very intuitive at a glance.  For us to use Roman numerals, we’d first need to convert them into a format that we can easily understand.

So, knowing that XII = 12 and IX = 9, our new math problem is 12 + 9.  We now know our answer is 21.  (And if we want to convert back to Roman, it’d be XXI.)

Data types operate along the same principle.  A number that’s stored as a data type of, let’s say, varchar(50) (for those of you who don’t yet know T-SQL, that’s “variable character of up to 50 characters”) needs to be converted to a numeric type of (for example) int (for “integer”) before it can be processed.

“Okay,” you might ask, “why is that such a big deal?  Just let the system do the conversion and let it do the work.”

Let’s go back to that Roman numeral example for a moment.  In order for you to understand what the Roman numerals were, your brain had to do some work to perform the conversion.  Imagine that you had to process dozens, even hundreds, of those problems.  That’s a lot of Roman numerals that you have to process.  That makes for a lot of work.

Likewise, imagine that a computer system has to convert millions of data types.  That’s a lot of extra work that your computer has to do.  That can make for a very inefficient system.  One of the big issues that a DBA or a developer often has to deal with is a slow system.  Using proper data types can often help with trying to fix a slow system.

What kinds of data types are there?

In a word, many.  A number of web sites list data types for various systems, including MSDN and W3Schools.  Since I’m writing primarily about SQL Server, I’ll focus on SQL Server data types.  I won’t talk about all SQL Server data types (I’ll leave you to look at the links for that); rather, I’ll just discuss a few that are relevant to our conversation.

Let’s start first with the varchar data type, since that’s what’s defined in our tables.  Varchar is for variable character.  What that means is the data type accepts character strings up to a length that you specify.  A data type of varchar(50) means that the field accepts strings up to 50 characters.

There is also a char data type.  Like the varchar type, you specify the number of characters allowed by the data column.  Unlike the varchar type, however, it is a fixed length string.  A column of type char(50) means that the data will always have 50 characters.  If data that is fewer than 50 characters is stored, it pads the rest with spaces.  This is important to consider, because those characters (even the space characters) take up space.  If you’re storing millions of rows of data, and you’re using char data types for data that takes up fewer than the number of characters you specify, that’s a lot of wasted space.

So, let’s revisit our Batting table.

BaseballBatting

When we built our primary key, we created a composite primary key from the playerID, yearID, and stint columns.  Can we improve upon these columns?

Let’s look at playerID.  Do we need to reserve fifty characters for this column?  To find out, I ran the following query.

select distinct playerID, len(playerID) from batting
order by len(playerID) desc

To translate this query into English*, this means “give me each distinct playerID, along with the length of the playerID, from the batting table, and give me the results in descending order of length.”

(*For those of you who don’t speak T-SQL, yes, I do intend to write an article about it.)

The results looked like this:

PlayerIDLength

So what does this mean?  It means the longest playerID is nine characters.  Could it be longer than nine?  Possibly.  Will it ever get up to fifty?  Unlikely.  So while the varchar(50) will definitely accommodate this, my thinking is that it’s overkill.  I’m going to shorten the column.  I’m going to cut it down to 25 from 50.

Now, let’s look at the yearID column.  I ran the same query as earlier, but this time, I replaced playerID with yearID.  Every column was four characters long.

SQL Server provides us with a number of date and time data types.  My first thought was to use one of these data types, but at that point, I stopped to think.  What, exactly, is the purpose of the yearID column?  It identifies a season.  That’s it.  Will I perform any calculations with it?  It’s possible, but unlikely.  Will it ever change from four characters/digits?  No.  I’m going to make this a fixed four-character field.  For that, we’ll use char(4).

Now, let’s look at stint.  What is this column?  It keeps track of a player’s stint with a team in a season.  So, let’s see how this works.

I went into the database and looked to see what stint values were in the Batting table.  There were five.  I looked up the player who had five stints, and found a man named Frank Huelsman, who played in the early 1900s.  I ran a query that gave me this.

FrankHuelsmanStint

In 1904, he had five stints with four different teams, having played for the White Sox (twice), Tigers, Browns, and Senators.

So, what data type should we use for stint?  Let’s think about this.  We could use a char(1), but I don’t think we should.  Is it possible that a player could have ten or more stints in a year?  Unlikely, but yes.  So we’d need at least two characters.  Let’s try char(2) — or should we?

I say no.  I say that we should use a tinyint data type for stint.  Why?

I say it for a couple of reasons.  First, there the matter of ordering.  Let’s order the numerals 1-10.  How would you expect them to be ordered if they were stored as characters?  If you say 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, you’d be wrong.  If it was defined as a character data type, the order would look like this: 1, 10, 2, 3, 4, 5, 6, 7, 8, 9.  When we’re talking about characters, 10 comes after 1.

Second, there’s a matter of memory.  tinyint uses only 1 byte, as opposed to 4 bytes for int. The memory used by char and varchar data types are dependent on their sizes (i.e. how much is allocated).  Also, tinyint accommodates values from 0 to 255, which is plenty for our purposes.

So, let’s start by redefining the data types for our composite primary key.  Because T-SQL in SQL Server won’t let us alter multiple columns at once, we need to run them as separate queries.  I wrote three statements like this:

alter table Batting
alter column playerID varchar(25)

alter table Batting
alter column yearID char(4)

alter table Batting
alter column stint tinyint

This should do the trick — or so I thought.  When I ran the queries, SSMS gave me error messages that looked like this:

Msg 5074, Level 16, State 1, Line 1
The object 'PK_playerID_yearID_stint' is dependent on column 'playerID'.
Msg 4922, Level 16, State 9, Line 1
ALTER TABLE ALTER COLUMN playerID failed because one or more objects 
access this column.

What’s going on here?  Keep in mind what we’re trying to change: three columns that are part of a primary key.  So before we can alter these columns, let’s drop our primary key.

alter table Batting
drop constraint PK_playerID_yearID_stint

Now when we run our alter table statements, we should be set.

Or are we?  I tried to recreate the primary key (I decided to rename the primary key, while I was at it).

alter table Batting
add constraint PK_Batting primary key (playerID, yearID, stint)

Instead, it gave me the following error:

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Batting'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Now what’s going on?  Apparently, I’d forgotten to include the NOT NULL statement when I altered the columns.  So this time, I’ll rewrite the statements like this and rerun them.

alter table Batting
alter column playerID varchar(25) not null

alter table Batting
alter column yearID char(4) not null

alter table Batting
alter column stint tinyint not null

After running these statements (which ran successfully), I retried my statement that added the primary key.  This time, it worked.

So, our primary key columns are now altered.  Let’s look at the rest of our table.

Our first two columns are teamID (for “team ID”) and lgID (for “league ID”).  I checked the data in these two columns (I’ll save you the trouble of looking at the T-SQL I used).  I found that they are never null, and they are always 3 characters and 2 characters, respectively.  They are also not unique.  For our table, I’ll redefine them as char(3) and char(2).

Now let’s look at the rest of the table.  For the benefit of anyone who doesn’t understand baseball, let’s write out what these columns are.

G Games
AB At Bats
R Runs (Scored)
H Hits
2B Doubles
3B Triples
HR Home Runs
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base On Balls (Walks)
SO Strikeouts
IBB Intentional Base On Balls (Walks)
HBP Hit By Pitch
SH Sacrifice Hits
SF Sacrifice Flies
GIDP Grounded Into Double Play

They’re all statistics, and they’re all integer values.  For now, we’ll make them all integer values.  Do we make them NOT NULL?  Not necessarily.  Why?  Because for some records, certain statistics didn’t exist during their time, so they might not exist.  Some of these numbers may very well be NULL.  So we won’t add the NULL constraint.

Here’s what I wrote for my T-SQL:

alter table Batting alter column [teamID] char(3) not null
alter table Batting alter column [lgID] char(2) not null
alter table Batting alter column [G] int
alter table Batting alter column [AB] int
alter table Batting alter column [R] int
alter table Batting alter column [H] int
alter table Batting alter column [2B] int
alter table Batting alter column [3B] int
alter table Batting alter column [HR] int
alter table Batting alter column [RBI] int
alter table Batting alter column [SB] int
alter table Batting alter column [CS] int
alter table Batting alter column [BB] int
alter table Batting alter column [SO] int
alter table Batting alter column [IBB] int
alter table Batting alter column [HBP] int
alter table Batting alter column [SH] int
alter table Batting alter column [SF] int
alter table Batting alter column [GIDP] int

The query, which took 20 seconds to run, ran successfully.

Note that I only dealt with a few data types in this article.  I didn’t talk about decimal data types or computed values.  (What if, for example, we wanted to keep track of batting averages?)  I’ll cover those in a later article.  Hopefully, at this point, you’ll have a better idea of what data types are and what they can do.

#SQL101: Setting up a primary key

When I set up my baseball database, I noticed something in the SQL code used to build the tables.  In the CREATE TABLE T-SQL code that was supplied by the Sean Lahman’s baseball database, I saw code that looked like this:

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Granted, there are a number of things about this code that strikes me, but one thing hit me immediately: all the columns are nullable.  That tells me that there are no primary keys defined.  I checked the other CREATE TABLE statements in the code as well; likewise, all of them had all their columns set to be nullable.

(There are some other things about the CREATE TABLE statements that struck me as well, but I’ll save those topics for another time.  For the purposes of this article, I’ll concentrate on primary keys.)

First, if you’re getting your feet wet with relational databases, it behooves the question…

What is a primary key?

First, I’ll start with a few links that help explain this:

primary key uniquely identifies a row of data.  To illustrate this, here’s an example.

Let’s say we have a person named John Smith taking night classes at State University.  In his database file (which, for the purposes of this example, has his name and address), he has the following information:

Name Address
John Smith 123 Main Street

Now, let’s say his son, John Smith Jr. (who lives at the same address) enrolls at State University (for the purposes of this example, let’s assume the database doesn’t allow suffixes like “Jr.”).  He is entered in the database.  So now, you have this.

Name Address
John Smith 123 Main Street
John Smith 123 Main Street

How do you know which record is which?  Answer: you don’t.  So, let’s add an ID column to our table.

ID Name Address
1 John Smith 123 Main Street
2 John Smith 123 Main Street

Now we know that we have two different records for two different people who are both named John Smith who live at 123 Main Street.

Granted, this is an overly-simplistic example, but the point is that the ID column — which is always unique and is never NULL — uniquely identifies each row.

So now that we know what a primary key is, let’s go back to our baseball database.

Creating primary keys for our baseball tables

We could create an ID column for our baseball tables, but I don’t want to do that here.  Instead, I want to make use of existing columns for our primary keys.  I’ll start with the MASTER table, which keeps track of ballplayers.

I ran a simple SELECT * FROM MASTER to get an idea of what’s in the table.  Running the query gave me this.

BaseballMasterQuery

I noted a couple of things.  First, I see a column called “playerID” which could be unique and used as a primary key.  Second, I see that the table contains 18,846 rows.  So can I use “playerID” for my primary key?  Let’s find out.

I ran two query statements.  The first gives me a count of unique entries I have for “playerID.”  The second tells me if any NULL values for “playerID” exist.

Here’s what I got.

BaseballMasterPlayerID.png

The first statement returns 18,846, which matches the number of rows in my earlier query, telling me that all values are unique.  The second statement doesn’t return anything, which tells me that there are no NULL values for “playerID.”  Therefore, the “playerID” column can be used as the primary key for this table.

So how do we make this column the primary key?  There are a couple of ways to do it in SQL Server.  For this, I decided to use T-SQL.

alter table dbo.Master
add constraint PK_playerID primary key clustered (playerID)

When I ran this, it returned an error.

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Master'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Okay.  Why did this appear?  Keep in mind that a primary key cannot be null.  When our code created our tables, it made them all nullable.  So we need to add a NOT NULL constraint before we create the primary key.

alter table dbo.Master
alter column playerID varchar(50) not null

Running this statement sets the column to not allow NULL values.  Once I ran this, I re-ran the command to create the primary key, and it completed successfully.  So I now have a primary key on my Master table.

What if I don’t have a single unique column?

Let’s take a look at the Batting table.

BaseballBatting

So, what should we use for our primary key?  A batter can play multiple seasons, and can play for multiple teams in a single season.  Chances are that you won’t be able to make use of a single column.  However, we can make use of multiple columns for our primary key (this is called a “composite” primary key).

We find that the Batting table contains 101,332 rows, so we need a combination of columns that will give us 101,332 unique results.  I figured that playerID, yearID, and stint should give us those results.

I ran the following query.

select distinct playerID, yearID, stint from Batting

This returned 101,332 results.  Indeed, I seem to have identified my composite primary key.

My first task is to set these three columns to not accept NULL values (otherwise, we’d get the same error from earlier).

alter table dbo.Batting alter column playerID varchar(50) not null
alter table dbo.Batting alter column yearID varchar(50) not null
alter table dbo.Batting alter column stint varchar(50) not null

Once the NOT NULL constraint is in place, we can create our composite primary key.  Note that in my T-SQL statement below, I provide three columns (playerID, yearID, stint), not just one.

alter table dbo.Batting 
add constraint PK_playerID_yearID_stint 
primary key clustered (playerID, yearID, stint)

The query ran successfully.  We now have a composite primary key on the Batting table that uses three columns.

Hopefully, this gives you a good introduction to primary keys on database tables.  In upcoming articles, I’ll cover other aspects such as foreign keys, indexes, and data types as we continue to further refine and improve our baseball database.

#SQL101: Installing a Baseball Database

Earlier, I wrote about how I installed SQL Server 2016 on my machine.  In this article, I talk about creating and populating a baseball database.

I am a huge baseball fan.  I started following the New York Yankees when I was about 12.  (In an act that is likely anathema to religious fanatics, I married a Boston Red Sox fan; during baseball season, one of us — metaphorically speaking — ends up sleeping on the couch!)  I have been known to schedule vacations around Major League Baseball schedules.  I believe that Cooperstown is Mecca.

Why, you might ask, is this relevant to installing SQL Server?  Because I believe that one of the best ways to learn something is to have fun while you’re doing it.  So in this implementation, instead of installing NorthWind Traders or AdventureWorks (the standard practice databases used by SQL Server enthusiasts everywhere), I will instead install a copy of Sean Lahman’s baseball database.  (At the time of this article, the most recent version goes to and includes the 2015 baseball season.)

I should note that this isn’t to say NorthWind or AdventureWorks isn’t fun; rather, baseball is something about which I’m passionate, and is more likely to keep my attention.  At some point, I’ll likely install NorthWind or AdventureWorks as well, as many SQL references and guides refer to them.

I downloaded the most recent SQL version from Sean Lahman’s website, which left me with a .ZIP file.  Inside the file, I found a MSSQLMASTER.SQL file (you’ll find it in \mysql core\core\) that I extracted from the .ZIP file.  I opened the file with SSMS.

The first thing I noticed was that it created a database called “stats.”  I did not want to use this as my database name, so I first replaced all instances of “stats’ with “baseball”.  Once this was done, I ran the database build query.

Unfortunately, it would not be that easy.  The query file is very large.  When I ran the query, I received the following message.

SSMSOutOfMemory

Okay.  It has a lot of rows to insert (the data includes over a hundred years of statistics, after all).  Instead, I opted for just creating the database and tables.  However, even that proved to be problematic.

SQLBaseballBuildErrors

“Okay,” I said to myself, “what’s going on now?”

First, I had a permission issue to deal with (this doesn’t appear in the above log; I neglected to include that message).  I looked for the C:\Program Files\Microsoft SQL Server\MSSQL13.MSSQLSERVER\MSSQL\DATA folder.  When I tried to open it, I got a message saying that I had insufficient permissions to open it.  “What?” I said, “this is my own machine, and I’m the administrator!”  However, it also prompted me asking whether I wanted to open it.  I answered yes.  Once I did, I tried running the query again, and the permission error seemed to disappear.

However, that’s when I got the errors shown above.

That’s when I noticed that the filename references MSSQL11.  My database path includes MSSQL13.  Okay.  I changed the path in the filename from MSSQL11 to MSSQL13 and reran the query.  This time, it ran successfully.

I did notice one thing about the table infrastructure.  All columns were defined as NULL, which indicated to me that there were no unique indices or primary keys defined.  All tables would be heaps.  That’s something that I’d need to fix, but that was likely another project for another time.  For now, I was more concerned about building the database infrastructure and inserting the data.

Once my database and tables were built, I commented out the CREATE DATABASE and CREATE TABLE lines from the code, leaving only the INSERT commands.  Just for grins, I tried running the query again, and once again got the “insufficient memory” message.  Okay, no dice.  How was I going to get around this?  I suppose I could’ve selected and run the INSERT queries piecemeal, but that was going to be a long and painful process.

I ran a Google search, and came across my old friend, StackOverflow, where I found this entry.  So I tried the command line, substituting my database name and SQL filename (including the path).

sqlcmd -d baseball -i Downloads\mssqlmaster.sql

Here’s what I got:

sqlcmd: Error: Connection failure. SQL Native Client is not 
installed correctly. To correct this, run SQL Server Setup.

Interesting.  I went to look up SQL Server 2016 Native Client, and found this.  So SQL Server 2016 does not include Native Client.  So, I tried to download and install Native Client.

Of course, nothing ever goes as planned.  I got this message.

NativeClientErrorMsg

So, I went into my Settings to check my Native Client.  It told me that 2012 Native Client was already installed.

After Google-searching my error messages, I realized that I had SQL Server 2008 Express installed, and my SQLCMD was likely getting confused between versions.  That was likely the culprit.  So I went ahead and uninstalled SQL Server 2008, and tried my SQLCMD command line again.  This time, the SQLCMD did not give me any errors, and my command prompt started scrolling with the familiar “(1 rows affected)” message that appears when SQL Server inserts rows into tables.  The script finished without any problem.

Once the script finished, I ran a query to see if my data had made it into the tables.

BattingQuery

So it appears that we are good to go!  Our baseball database is populated!

I now have a baseball database with which I can play to my heart’s content.  I’ll be using this to practice my SQL skills — and ‘blog about my adventures (or misadventures, as the case may be) as I do so.

#SQL101: Installing SQL Server 2016

In an earlier article, I discussed how to get started with SQL Server.  I mentioned downloading a copy of SQL Server Developer Edition and installing a sample database.

It had been quite some time since I performed a SQL Server installation (my current job only requires that I know T-SQL, not how to install or implement it; after all, I am a developer, not a DBA).  Additionally, I figured that having SQL Server installed on my own local machine gave me a platform on which I could practice my skills in a safe environment.  With that, I decided that I would shake off the rust and perform an installation on my own machine.  This article documents my efforts.

I’ll start by describing the platform on which I’m installing SQL Server.  I am installing this on my own laptop PC, not a server.  It might or might not be an ideal environment for running SQL Server, but it’s what I have.

  • HP ProBook 4430s laptop
  • Intel Core i5-2450M processor @ 2.50 GHz
  • Windows 10 Pro 64 bit
  • 8GB RAM
  • 325GB available disk space

I decided that I wanted to keep up with the latest version (which, at the time of this article, is SQL Server 2016), so I needed to find a copy to install on my machine.  To do that, I need to go into Visual Studio Dev Essentials.  Visual Studio Dev Essentials is free, but it does require you to create an account.  If you don’t have an account, you will be prompted to create one when you click “Join Now.”

Once your account is established (or if you already have one), you are redirected to the Visual Studio Dev Essentials page.

VSDevEss

SQL Server can be downloaded from the Downloads menu at the top of the page, as well as the Microsoft SQL Server icon under Tools.

I went to the Dev Essentials page, and saw a note that SQL Server 2016 had a critical prerequisite, so I started with downloading and installing the patch.  Select the download that matches your system (for me, it was English-United States x64.exe).  Note that this requires a restart after setup completes.  Once the patch was installed, I went ahead and downloaded SQL Server 2016.  It’s a 2.1GB download.

Once the download finished, I ran the setup.exe program, which brought me to the SQL Server Installation Center.SQLServerInstallationCenter

Clicking the Hardware and Software Requirements brought me to the corresponding page.  Reading through the requirements, I didn’t see anything that would immediately be an issue.  The only thing I saw was that it required .NET Framework 4.6, which supposedly is installed by the program.

I decided to go ahead with the installation.  After clicking the Installation tab, I clicked New SQL Server stand-alone installation or add features to an existing installation.  For the Product Key, I specified the free Developer edition, clicked Next, and followed the subsequent prompts.  I received a warning about Windows Firewall being enabled.  I opened my Settings, went into Network & Internet, and checked my Windows Firewall.  Interestingly, my Windows Firewall would not let me turn it off; it said it was controlled by Norton 360, which is installed on my computer.  This was interesting; I had no idea that Norton 360 was controlling my Windows Firewall.

For the moment, my notification regarding Windows Firewall was a warning, not a failure.  I decided to continue to see what would happen.  I went ahead and clicked Next.  This brought me to the Feature Selection window.

SQLServerFeatureSelection

I went ahead and clicked Select All.  I had no idea what features I would use, I figured that I was installing SQL Server 2016 so I could practice and learn about its features, and it would only take 8GB of disk space — considering that I have 325GB available, it’s only a drop in the bucket.

However, when I clicked Next this time, I came across an error.

PolybaseInstallFailOracle

Interesting.  Should I install the Oracle SE Java Runtime Environment or not?  To answer that question, I asked myself whether or not I needed the PolyBase Query Service for External Data.  For that matter, what, exactly, is the PolyBase Query Service?  I found my answer here.  (Note that this feature is new to SQL Server 2016.)

For the moment, I decided that I didn’t need to work with data in either Hadoop or Azure blob storage, nor did I foresee an immediate need to interface with BI tools.  If I ever come across a need for them, I figure that I could always add it later.  So I unchecked the PolyBase Query Service (for now) and decided to proceed.  (Unchecking this, by the way, brought my disk requirement down to 6.5GB.)

I went with the default named instance (MSSQLSERVER).  In fact, I pretty much went with the default configurations (the one exception being selecting Mixed Mode instead of Windows authentication mode; I did add myself as a SQL Server administrator, as well as other services configurations).

I didn’t think to keep track of how long it took to install SQL Server 2016, but I would guess it took about an hour.  SQL Server installed with no further issues.

SQL Server Management Studio 2016

For as long as I can remember, SQL Server Management Studio (SSMS) was included as a standard part of installing SQL Server.  That is not the case with SQL Server 2016; SSMS must be installed separately.  Clicking Install SQL Server Management Tools in the SQL Server Installation Center takes you to this page.  Because SSMS tends to be my go-to tool for using SQL Server, I wanted to make sure it was included with my installation.

I downloaded the SSMS .EXE file, which is 825MB.  Installing SSMS is straightforward; click Accept to accept the license agreement, and the package installs.

Other SQL Server 2016 Tools

The SQL Server Installation Center also includes support for SQL Server data tools (SSDT) and for a new R server installation.  For now, I am bypassing these tools; I will likely install them at a later date.

After all was said and done…

Once everything was installed, I opened SSMS, which ran successfully.  (One thing I’ll add is that SSMS now looks like Visual Studio, which makes sense since it is powered by the VS engine.)

I now have SQL Server 2016 installed on my machine.

What’s next?

I will be implementing Sean Lahman’s baseball database into my SQL Server.  My exploits will be documented in a future ‘blog article called “Installing a Baseball Database.”  Stay tuned!

#SQL101: SQL Server: How to get started

A few weeks ago, my friend Jim called me.  His company was ending its relationship with a major client, and as a result, his position was likely to be reduced (or eliminated — I don’t remember which one he said) over the long term.  He knew that I’d done a lot of work with SQL Server (mind you, I don’t know enough to be an expert, but I know enough to get by), and had some questions for me — mainly to the effect of, “I want to learn SQL Server.  How do I get started?”

I see this question come up a lot in various forums, especially on SQLServerCentral.com (which, by the way, was one the first references that I gave Jim).  I suggested that he create a free account on the site so he could access and ask questions on the forum.

I mentioned he should look into attending SQL Saturday.  And on hindsight, I’d forgotten to mention that he should look into joining a PASS chapter.

I suggested that he obtain and install a copy of Microsoft SQL Server Developer Edition, which is a free download.  I also mentioned downloading the famous (infamous?) Northwind/AdventureWorks sample databases.  Practice makes perfect, after all, and if he could get hands-on experience with SQL Server in a sandbox environment, all the better.

Side note: Jim (a Red Sox fan) and I (a Yankee fan) are both big baseball fans (yes, we argue frequently!).  When I think of databases on which to practice, my preference is to download a copy of Sean Lahman’s baseball database.  I have nothing against either the Northwind or AdventureWorks sample databases; my thinking is that if you’re going to learn a new technology, you might as well make it fun!

And, of course, no introduction to SQL Server would be complete without Books Online (frequently abbreviated “BOL”).

Those are my suggestions.  Do you have any more?  If so, please feel free to leave them in the comments below!