It was the best of times, it was the worst of times

Steve Jones recently asked: what were your best days at work, and what was your worst?  He also issued a challenge to write about our typical days at work.  In terms of writing about a typical workday, I have the first of four (per Steve’s challenge instructions) draft articles warming up in the bullpen; hopefully, I’ll crank that one out sometime within the next week or so.  But in the meantime, for this article, I want to take a moment to address the best and worst days.

In terms of the worst day, I don’t think there is any contest.  I think it’s pretty safe to say that 9/11 was my worst day at work.  (For the benefit of those of you who don’t feel like clicking my article link: when 9/11 happened, I worked for a company that had an office in the World Trade Center.)

In terms of the best days, however, that requires a little more thought.  It’s not that I haven’t had any great days — that isn’t true — it’s that there have been a number of them, and having been a working professional for (cough! cough!!!) years, picking out a few that stand out is difficult for me to do.  So what I’ll do is pick out a few project victories from over the course of my career.  Granted, I’m picking these off the top of my head, and there very well might be others that were more significant that I’m not remembering right now, but for purposes of this exercise, I’ll write about some projects in which I was involved and in which I take a measure of pride.

I’ll start with a project related to the worst day that I mentioned above.  One of my tasks was to maintain an inventory of the servers in our data centers.  For a long time, this was a tediously manual task.  I went through our data centers with a clipboard, checking to see what servers were in each rack, noting any changes and adding new servers and racks that I found.  I drew a map of each room in Visio, even going as far as to count the floor tiles so that I could draw them to scale.  I populated the maps with boxes representing server racks and came up with a naming scheme directly tied into a row-and-column location scheme, making it possible to identify and label each rack so they’d be easy to find.  Included with the maps was a listing of servers within each rack.  I maintained the Visio file on my PC, making sure it was backed up to a departmental file server, and keeping hardcopies in each server room as a reference for various IT workers.

Because this was a manual process, the maps were never completely accurate — I have no doubt that new servers were continually added in between map updates — and the process remained tedious.  All the while, I kept thinking, “there has to be a better way to do this.”  Sure enough, I found one!

I discovered that all our servers included a product called Insight Manager.  Among other things, it included the ability to collect server BIOS information and store it in a format suitable for importing into a database such as SQL Server.  Using the Insight Manager data structure as a template, I set up a SQL Server database on one of our departmental servers and created a process that could import data from any server on demand through Insight Manager.  I now had a central database of server data that could be updated at any time!

Of course, data isn’t information unless it can be interpreted and understood, so the next step was to create an interface for it.  I was responsible for maintaining a departmental intranet site; although it was internal, I treated it as though it were a full-blown web site.  I created a web site to display the server data stored in my back-end.  I took my Visio server room maps and created image files from them.  From the image files, I created image maps that enabled a user to click a server rack on the map and drill down to a list of servers in that rack.  Clicking on a server displayed data for that server — serial numbers, IP addresses, applications, and so on.

My server inventory system, which I previously had to update manually, was now automated!

This project was a major milestone for my career.  It was my first significant foray into SQL Server.  (At that time, I hadn’t yet learned about data normalization; had I known about it, I could’ve made the back-end even better.)  It gave me some experience with image maps, HTML, and classic ASP (the technology used on that intranet server at that time).  Most of all, it was my first taste of what it was like to be a web applications developer.

Memories of this project also reminded me of another good day I had on the job.  One particular day, I traveled to remote offices in Yorktown Heights and Middletown to survey their data centers for the server inventory system that I just described.  I hopped into my truck (I owned a small Toyota pickup truck at the time) and drove to Yorktown.  It was a gorgeous picture-perfect day; the sun was out and it was comfortably warm with low humidity.  It was comfortable enough that I left the air conditioner turned off and drove the entire trip with my windows rolled all the way down.  It was the kind of day where I wished I owned a convertible!  My route between the Yorktown and Middletown offices took me through the Bear Mountain area, including traversing the Hudson River over the Bear Mountain Bridge.  If you’ve never driven through that area of New York State, it is an absolutely picturesque and beautiful drive.  The entire trip was so fun and relaxing that it did not feel like a business trip at all; I actually felt as though I was on a vacation!

Finally, I want to talk about one last project in which I had a hand.  In one of my first jobs out of college (my second job out of school, actually), I was responsible for supporting my company’s document imaging system at a client site (a client that eventually ended up hiring me directly).  One of the system’s components was a WORM optical platter jukebox that was in constant use and occasionally needed operations maintenance, which ranged from simple tasks such as inserting an optical platter to complex ones such as restarting it when it locked up.  The device had to be operational 24/7, even at hours when we were not in the office.

I was tasked with putting together a small set of instructions — nothing big, just a few pages — that explained how to perform these tasks, including the correct way to insert a platter, what to do when (unfortunately, not if) the machine stopped working, and so on.  It needed to be written in such a way that the night maintenance staff could maintain the device.  So I sat down in front of my PC with a blank MS Word file and said to myself, “if I were a member of the night staff, what would I want to read that would enable me to perform these tasks?”

I had intended for the document to be a simple three- to four-page quick reference.  As I wrote and came up with more ideas that would help the readers who would be using it (including the innovative — for me — use of illustrations), the document kept getting bigger and bigger.  I don’t remember how big it eventually became, but I think it was somewhere in the ballpark of thirty pages.  I had absolutely zero technical writing experience at the time; I wrote the document completely by instinct.  I didn’t really know what I was doing; I just did things (illustrations, headings, subject organization, etc.) that made sense to me.

The final product was a huge success — so much so, in fact, that management developed a training program based around this document.

A couple of years later, I discovered that nearby Rensselaer Polytechnic Institute had a master’s degree program in technical communication (note: I am not entirely sure if the program still exists).  I had been interested in pursuing a graduate degree, and I thought the program sounded interesting, so I decided to apply.  During my application interview with the faculty, I brought along a copy of that jukebox operations document I’d written.  I explained that I’d written the document completely by instinct and with no knowledge or experience in technical writing whatsoever.  The faculty seemed to be impressed with my effort on that project.

I was accepted into the program.  I now have an MS from RPI hanging on my home office wall and listed on my resume.

This article ended up being a lot longer than I expected.  Looking back on this exercise that Steve assigned, I suppose my good days at work were more significant than I thought.  These projects were major events that ended up shaping my professional career.  I suppose the moral of the story is not to underestimate your job achievements.  You never know where they might lead!

SQL Saturday #741, Albany, NY — Come to upstate New York!

On Saturday, July 28 (a week from tomorrow as I write this), our local Albany-area SQL user group will host our fifth SQL Saturday!  I have participated in all five; I worked as a volunteer at the first one, and I presented at the other four (including next Saturday).  This is one of my favorite events, and I look forward to it every single year!

This year, I am debuting a brand-new presentation: “Networking: it isn’t just for breakfast anymore,” based upon my ‘blog article of the same name.  (An alternate name for it could be “Networking 101: networking for beginners.”)  This presentation is primarily for people who want to get better at networking but don’t know how to do it, although seasoned veterans might be able to get something out of it as well.  It’s one of the first sessions of the day (8:30 am!), so come early!

As much as I promote my own presentations, mine is not the only one on the docket.  There are many wonderful speakers and presentations at this event, and I encourage you to come out and check out as many sessions that interest you as you can!

SQL Saturday is always a great time, a great opportunity for free learning, and a great opportunity to network with data professionals.  The Capital District region here in upstate New York has been my home for many years.  I hope to see you here in my home turf!

#BI101: Some SQL Saturday speakers to check out

This is part of a series of articles in which I’m trying to teach myself about BI.  Any related articles I write are preceded with “#BI101” in the title.

As a speaker on the SQL Saturday circuit, I’ve had the honor and privilege of having met, connected with, and even befriended a number of experts in SQL, data, and BI.  If you can get to a SQL Saturday, you can have that opportunity, too.

In a couple of weeks (July 28), we will be hosting SQL Saturday here in Albany, NY.  I was going through the schedule, and noticed a number of speakers on the docket who will be talking about various BI topics.  I’ve attended a lot of their sessions, and I recommend these speakers highly!

(Note: for purposes of this article, I am limiting this list to BI topics, although these speakers may be giving other presentations as well.)

SQL Saturday is a great free learning resource, a great opportunity to network, and is always a good time!  If you’re looking to learn about BI or other data-related or professional topics, go check out a SQL Saturday event near you!

#BI101: Introduction to data warehousing

This is part of a series of articles in which I’m trying to teach myself about BI.  Any related articles I write are preceded with “#BI101” in the title.

Because this is a new (to me) topic, it’s possible that what I write might be inaccurate.  I invite you to correct me in the comments, and I will make it a point to edit this article for accuracy.

As part of my personal education about business intelligence, I looked at the SkillSoft training made available to me through my employer, and it turned out that BI was one of the training options.  My strategy is to take the course, perform some supportive research, and write about what I learn.  So, I kicked off the course and began my training.

The initial topic discussed the concept of data warehousing.  The course began with a pre-test — a proverbial “how much do I really know?”  There were a few subtopics that were familiar to me — normalization and denormalization, for example — so my initial thought was, how much do I have to learn?  As it turned out, the answer was: a lot.

What is a data warehouse, and what does it do?

The short and simple answer is that a data warehouse is, as the name implies, a storage repository for data.

That’s the short answer.  The longer answer gets a little more complicated.

Data warehouse architecture (source: Wikipedia)

I learned that a lot of the concepts behind a data warehouse break much of what I thought I knew about relational databases.  For starters, I had always been under the impression that all relational databases needed to be normalized.  In the case of a data warehouse, however, that isn’t necessarily true.  While a normalized database is designed to minimize data redundancy, a denormalized table may contain redundant data.  Although the redundant data occupies more space, it reduces the number of table joins, thus reducing query time (as the SkillSoft lesson put it, sacrificing storage space for speed).  When a data warehouse stores millions of rows of data, reducing the number of joins in a query can be significant.
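To make that concrete, here is a minimal sketch (the table and column names are hypothetical, not from the course) contrasting a normalized design that requires a join with a denormalized design that repeats the same attributes on every row:

-- Normalized: customer attributes live in their own table,
-- so a report by region needs a join.
CREATE TABLE Customer (
    CustomerID   int          PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,
    Region       varchar(50)  NOT NULL
);

CREATE TABLE SalesNormalized (
    OrderID    int   PRIMARY KEY,
    CustomerID int   NOT NULL REFERENCES Customer (CustomerID),
    OrderDate  date  NOT NULL,
    Amount     money NOT NULL
);

-- Denormalized: the customer name and region are repeated on every sales row.
-- More storage, but a report by region scans one table with no join at all.
CREATE TABLE SalesDenormalized (
    OrderID      int          PRIMARY KEY,
    CustomerName varchar(100) NOT NULL,   -- redundant copy
    Region       varchar(50)  NOT NULL,   -- redundant copy
    OrderDate    date         NOT NULL,
    Amount       money        NOT NULL
);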

Populating a data warehouse

How does data get into a data warehouse?  It turns out that a data warehouse can have multiple data sources — SQL Server, Oracle, Access, Excel, flat files, and so on.  The trick is getting all of that disparate data into the warehouse.  This is where the integration layer comes in.

Because the SkillSoft course I took was SQL Server-specific, it focused on SSIS.  SSIS provides tools that connect to multiple data sources, as well as tools for ETL (Extract, Transform, Load).  ETL is the process of obtaining data from the various sources and preparing it for storage in the data warehouse.  A big part of ETL is data cleansing — formatting data so it is usable and consistent.  For example, imagine that several data sources use different formats for the same Boolean data field.  One uses “1” and “0”, another uses “T” and “F”, another uses “Yes” and “No”, and so on.  Data cleansing converts these values into a single, consistent format so that the data warehouse can use them.  Because there are typically many such fields, ETL is often a long and involved process.
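To make the Boolean example concrete, here’s a minimal T-SQL sketch of that kind of cleansing step (the staging table and column names are hypothetical):

-- Convert several source-specific "active" flags into one consistent value.
SELECT
    CustomerID,
    CASE
        WHEN ActiveFlagRaw IN ('1', 'T', 'Y', 'Yes', 'TRUE')  THEN 1
        WHEN ActiveFlagRaw IN ('0', 'F', 'N', 'No',  'FALSE') THEN 0
        ELSE NULL  -- anything unrecognized is left NULL for review
    END AS IsActive
FROM Staging.Customer;   -- hypothetical staging table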

Just the facts, ma’am…

I had heard of fact and dimension tables before, but I wasn’t entirely sure about their application until I started diving into BI.  In a nutshell, facts are raw data, while dimensions provide context to the facts.

To illustrate facts and dimensions, let me go back to one of my favorite subjects: baseball.  How about Derek Jeter’s hitting statistics?  Clicking “Game Log” displays a list of game-by-game statistics for a given season (the link provided defaults to the 2014 season, Jeter’s last season).  Looking at his last home game on Sept. 25, Jeter accumulated two hits, including a double, driving in three runs, scoring once, and striking out once in five at-bats.  These raw, single-game numbers are the facts, and a fact table stores each game’s statistics as a row.  Dimensions provide context to those numbers: the player himself (name, hometown, birth date, etc.), the team, the opponent, the date of the game.  That descriptive information lives in dimension tables, while derived numbers such as season totals and batting average are typically calculated by aggregating the fact table.
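Here’s a rough sketch of how those numbers might be laid out; the table and column names are mine, not from any real warehouse design:

-- Dimension: descriptive context about the player (the "who").
CREATE TABLE DimPlayer (
    PlayerKey  int          PRIMARY KEY,
    PlayerName varchar(100) NOT NULL,
    Hometown   varchar(100) NULL,
    BirthDate  date         NULL
);

-- Fact: the raw, per-game numbers (the "what happened").
CREATE TABLE FactBattingGame (
    PlayerKey  int  NOT NULL REFERENCES DimPlayer (PlayerKey),
    GameDate   date NOT NULL,
    AtBats     int  NULL,
    Hits       int  NULL,
    Doubles    int  NULL,
    RBI        int  NULL,
    Runs       int  NULL,
    Strikeouts int  NULL
);

-- Season totals and batting average are derived by aggregating the facts.
SELECT p.PlayerName,
       SUM(f.Hits) AS SeasonHits,
       CAST(SUM(f.Hits) AS decimal(6,3)) / NULLIF(SUM(f.AtBats), 0) AS BattingAverage
FROM FactBattingGame f
JOIN DimPlayer p ON p.PlayerKey = f.PlayerKey
WHERE YEAR(f.GameDate) = 2014
GROUP BY p.PlayerName;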

If you’re still confused about fact and dimension tables, I found this question on StackOverflow that does a pretty good job of answering it, and I also came across this article that describes fact and dimension tables well.

Stars and snowflakes

Star schema (source: Wikipedia)

A fact table usually maintains a relationship with a number of dimension tables.  The fact table connects to the dimension tables through a foreign key relationship.  The visual table design usually resembles a star, in which the fact table is at the center and the dimension tables branch out from the center.  For this reason, this structure is called a star schema.

A snowflake schema is related to the star schema.  The main difference is that the dimension tables are further normalized.  The resulting foreign key relationships result in a schema that visually resembles a snowflake.

Snowflake schema (source: Wikipedia)
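A minimal sketch of the difference, again using my own hypothetical table names: in the star version, the fact table references each dimension directly; in the snowflake version, one of those dimensions is normalized further into its own lookup table.

-- Star schema: the fact table references each dimension directly.
CREATE TABLE DimDate (
    DateKey  int  PRIMARY KEY,
    GameDate date NOT NULL
);

CREATE TABLE DimTeam (
    TeamKey    int         PRIMARY KEY,
    TeamName   varchar(50) NOT NULL,
    LeagueName varchar(50) NOT NULL   -- league details kept inline
);

CREATE TABLE FactBatting (
    TeamKey  int NOT NULL REFERENCES DimTeam (TeamKey),
    DateKey  int NOT NULL REFERENCES DimDate (DateKey),
    Hits     int NULL,
    HomeRuns int NULL
);

-- Snowflake schema: DimTeam is normalized further, so the league details
-- move into their own table and the team dimension references it.
CREATE TABLE DimLeague (
    LeagueKey  int         PRIMARY KEY,
    LeagueName varchar(50) NOT NULL
);

CREATE TABLE DimTeamSnowflake (
    TeamKey   int         PRIMARY KEY,
    TeamName  varchar(50) NOT NULL,
    LeagueKey int         NOT NULL REFERENCES DimLeague (LeagueKey)
);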

Attention, data-mart shoppers…

A data mart can probably be considered either a smaller version of a data warehouse, or a subset of a data warehouse (a “dependent” data mart, according to Wikipedia).  As I understand it, a data warehouse stores data for an entire corporation, organization, or enterprise, whereas a data mart stores data for a specific business unit, division, or department.  Each business unit uses data marts for various information purposes, such as reporting, forecasting, and data mining.  (I’ll likely talk about these functions in a separate article; they’re outside the scope of this one.)

So, this wraps up what I’ve learned about data warehousing (so far).  Hopefully, you’ll have learned as much reading this article as I have writing it.  And hopefully, you’ll keep reading along as I continue my own education in BI.  Enjoy the ride.

 

On data types

Previously, I talked about something I saw in the code that built the tables in my baseball database: last time, I wrote about primary keys.  In this article, I want to talk about data types.

Let’s revisit the code that builds the Batting table again.

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Last time, I mentioned that every single column allowed NULL values (and I’d like to expand on that in a future article).  Today, I’d like to talk about the fact that every single column is defined as varchar(50).

For those of you who are getting your feet wet with SQL Server (or don’t know what I’m talking about), let’s take a moment to discuss…

What are data types, and why are they important?

Imagine that you’re presented with a math problem (don’t worry; we’ll keep it simple).  However, when you’re presented with the problem, it looks like this:

XII + IX

Want to give it a shot?  The truth is, there’s a reason why we don’t normally use Roman numerals in our society.  Unlike the Arabic numeral system that we use in everyday life, the Roman numeral system is inefficient and not very intuitive at a glance.  For us to use Roman numerals, we’d first need to convert them into a format that we can easily understand.

So, knowing that XII = 12 and IX = 9, our new math problem is 12 + 9.  We now know our answer is 21.  (And if we want to convert back to Roman, it’d be XXI.)

Data types operate along the same principle.  A number that’s stored as a data type of, let’s say, varchar(50) (for those of you who don’t yet know T-SQL, that’s “variable character of up to 50 characters”) needs to be converted to a numeric type of (for example) int (for “integer”) before it can be processed.
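A quick way to see the difference in SSMS (the literal values here are just for illustration):

SELECT '12' + '9';                            -- varchar values: concatenation, returns '129'
SELECT CAST('12' AS int) + CAST('9' AS int);  -- converted to int first, returns 21
SELECT '12' + 9;                              -- implicit conversion to int, returns 21
                                              -- (SQL Server quietly does the extra work)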

“Okay,” you might ask, “why is that such a big deal?  Just let the system do the conversion and let it do the work.”

Let’s go back to that Roman numeral example for a moment.  In order for you to understand what the Roman numerals were, your brain had to do some work to perform the conversion.  Imagine that you had to process dozens, even hundreds, of those problems.  That’s a lot of Roman numerals that you have to process.  That makes for a lot of work.

Likewise, imagine that a computer system has to convert millions of data values.  That’s a lot of extra work that your computer has to do.  That can make for a very inefficient system.  One of the big issues that a DBA or a developer often has to deal with is a slow system, and using proper data types can often help fix one.

What kinds of data types are there?

In a word, many.  A number of web sites list data types for various systems, including MSDN and W3Schools.  Since I’m writing primarily about SQL Server, I’ll focus on SQL Server data types.  I won’t talk about all SQL Server data types (I’ll leave you to look at the links for that); rather, I’ll just discuss a few that are relevant to our conversation.

Let’s start first with the varchar data type, since that’s what’s defined in our tables.  Varchar stands for “variable character.”  What that means is that the data type accepts character strings up to a length that you specify.  A data type of varchar(50) means that the field accepts strings of up to 50 characters.

There is also a char data type.  Like the varchar type, you specify the number of characters allowed in the column.  Unlike the varchar type, however, it is a fixed-length string.  A column of type char(50) means that the data will always occupy 50 characters.  If the stored data is fewer than 50 characters, the remainder is padded with spaces.  This is important to consider, because those characters (even the space characters) take up space.  If you’re storing millions of rows of data, and you’re using char data types for data that takes up fewer than the number of characters you specify, that’s a lot of wasted space.
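Here’s a small illustration of the padding, using DATALENGTH (which returns the number of bytes a value occupies):

DECLARE @fixed    char(50)    = 'Jeter';
DECLARE @variable varchar(50) = 'Jeter';

SELECT DATALENGTH(@fixed)    AS FixedBytes,     -- 50: padded out with trailing spaces
       DATALENGTH(@variable) AS VariableBytes;  -- 5: only what was actually stored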

So, let’s revisit our Batting table.

[Image: Batting table]

When we built our primary key, we created a composite primary key from the playerID, yearID, and stint columns.  Can we improve upon these columns?

Let’s look at playerID.  Do we need to reserve fifty characters for this column?  To find out, I ran the following query.

select distinct playerID, len(playerID) from batting
order by len(playerID) desc

To translate this query into English*: “give me each distinct playerID, along with the length of the playerID, from the batting table, and give me the results in descending order of length.”

(*For those of you who don’t speak T-SQL, yes, I do intend to write an article about it.)

The results looked like this:

[Image: playerID length query results]

So what does this mean?  It means the longest playerID is nine characters.  Could it be longer than nine?  Possibly.  Will it ever get up to fifty?  Unlikely.  So while varchar(50) will definitely accommodate this, my thinking is that it’s overkill.  I’m going to shorten the column from 50 characters down to 25.

Now, let’s look at the yearID column.  I ran the same query as earlier, but this time, I replaced playerID with yearID.  Every value was four characters long.
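For reference, the modified query looked like this:

select distinct yearID, len(yearID) from batting
order by len(yearID) desc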

SQL Server provides us with a number of date and time data types.  My first thought was to use one of these data types, but at that point, I stopped to think.  What, exactly, is the purpose of the yearID column?  It identifies a season.  That’s it.  Will I perform any calculations with it?  It’s possible, but unlikely.  Will it ever change from four characters/digits?  No.  I’m going to make this a fixed four-character field.  For that, we’ll use char(4).

Now, let’s look at stint.  What is this column?  It keeps track of a player’s stint with a team in a season.  So, let’s see how this works.

I went into the database and looked to see what stint values were in the Batting table.  The highest value was five.  I looked up the player who had five stints and found a man named Frank Huelsman, who played in the early 1900s.  I ran a query that gave me this.

[Image: Frank Huelsman’s 1904 batting records]

In 1904, he had five stints with four different teams, having played for the White Sox (twice), Tigers, Browns, and Senators.
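A query along these lines would return those rows (it assumes the Master table’s nameFirst and nameLast columns):

select b.playerID, b.yearID, b.stint, b.teamID, b.lgID
from Batting b
join Master m on m.playerID = b.playerID
where m.nameFirst = 'Frank' and m.nameLast = 'Huelsman'
  and b.yearID = '1904'
order by b.stint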

So, what data type should we use for stint?  Let’s think about this.  We could use a char(1), but I don’t think we should.  Is it possible that a player could have ten or more stints in a year?  Unlikely, but yes.  So we’d need at least two characters.  Let’s try char(2) — or should we?

I say no.  I say that we should use a tinyint data type for stint.  Why?

I say it for a couple of reasons.  First, there’s the matter of ordering.  Let’s order the numerals 1-10.  How would you expect them to be ordered if they were stored as characters?  If you say 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, you’d be wrong.  If the column were defined as a character data type, the order would look like this: 1, 10, 2, 3, 4, 5, 6, 7, 8, 9.  When we’re talking about characters, 10 comes after 1.
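You can see this for yourself with a quick throwaway test:

-- Build a small list of the values 1 through 10 and sort it both ways.
declare @numbers table (AsChar varchar(2), AsInt int);
insert into @numbers
values ('1',1),('2',2),('3',3),('4',4),('5',5),
       ('6',6),('7',7),('8',8),('9',9),('10',10);

select AsChar from @numbers order by AsChar;  -- 1, 10, 2, 3, 4, 5, 6, 7, 8, 9
select AsInt  from @numbers order by AsInt;   -- 1, 2, 3, 4, 5, 6, 7, 8, 9, 10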

Second, there’s the matter of storage.  tinyint uses only 1 byte, as opposed to 4 bytes for int.  The space used by the char and varchar data types depends on their defined sizes (i.e. how much is allocated).  Also, tinyint accommodates values from 0 to 255, which is plenty for our purposes.

So, let’s start by redefining the data types for our composite primary key.  Because T-SQL won’t let us alter multiple columns in a single ALTER TABLE statement, we need to run them as separate queries.  I wrote three statements like this:

alter table Batting
alter column playerID varchar(25)

alter table Batting
alter column yearID char(4)

alter table Batting
alter column stint tinyint

This should do the trick — or so I thought.  When I ran the queries, SSMS gave me error messages that looked like this:

Msg 5074, Level 16, State 1, Line 1
The object 'PK_playerID_yearID_stint' is dependent on column 'playerID'.
Msg 4922, Level 16, State 9, Line 1
ALTER TABLE ALTER COLUMN playerID failed because one or more objects 
access this column.

What’s going on here?  Keep in mind what we’re trying to change: three columns that are part of a primary key.  So before we can alter these columns, let’s drop our primary key.

alter table Batting
drop constraint PK_playerID_yearID_stint

Now when we run our alter table statements, we should be set.

Or are we?  I tried to recreate the primary key (I decided to rename the primary key while I was at it).

alter table Batting
add constraint PK_Batting primary key (playerID, yearID, stint)

Instead, it gave me the following error:

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Batting'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Now what’s going on?  Apparently, I’d forgotten to include the NOT NULL clause when I altered the columns.  So this time, I’ll rewrite the statements like this and rerun them.

alter table Batting
alter column playerID varchar(25) not null

alter table Batting
alter column yearID char(4) not null

alter table Batting
alter column stint tinyint not null

After running these statements (which ran successfully), I retried my statement that added the primary key.  This time, it worked.

So, our primary key columns are now altered.  Let’s look at the rest of our table.

The next two columns are teamID (for “team ID”) and lgID (for “league ID”).  I checked the data in these two columns (I’ll save you the trouble of looking at the T-SQL I used).  I found that they are never NULL, and they are always 3 characters and 2 characters long, respectively.  They are also not unique.  For our table, I’ll redefine them as char(3) and char(2).

Now let’s look at the rest of the table.  For the benefit of anyone who doesn’t understand baseball, let’s write out what these columns are.

G: Games
AB: At Bats
R: Runs (Scored)
H: Hits
2B: Doubles
3B: Triples
HR: Home Runs
RBI: Runs Batted In
SB: Stolen Bases
CS: Caught Stealing
BB: Base On Balls (Walks)
SO: Strikeouts
IBB: Intentional Base On Balls (Intentional Walks)
HBP: Hit By Pitch
SH: Sacrifice Hits
SF: Sacrifice Flies
GIDP: Grounded Into Double Play

They’re all statistics, and they’re all integer values, so for now, we’ll make them all int.  Do we make them NOT NULL?  Not necessarily.  Why?  Because some of these statistics weren’t tracked (or didn’t yet exist) during certain eras, some of these values may very well be NULL.  So we won’t add the NOT NULL constraint.

Here’s what I wrote for my T-SQL:

alter table Batting alter column [teamID] char(3) not null
alter table Batting alter column [lgID] char(2) not null
alter table Batting alter column [G] int
alter table Batting alter column [AB] int
alter table Batting alter column [R] int
alter table Batting alter column [H] int
alter table Batting alter column [2B] int
alter table Batting alter column [3B] int
alter table Batting alter column [HR] int
alter table Batting alter column [RBI] int
alter table Batting alter column [SB] int
alter table Batting alter column [CS] int
alter table Batting alter column [BB] int
alter table Batting alter column [SO] int
alter table Batting alter column [IBB] int
alter table Batting alter column [HBP] int
alter table Batting alter column [SH] int
alter table Batting alter column [SF] int
alter table Batting alter column [GIDP] int

The batch of statements, which took 20 seconds to run, completed successfully.

Note that I only dealt with a few data types in this article.  I didn’t talk about decimal data types or computed values.  (What if, for example, we wanted to keep track of batting averages?)  I’ll cover those in a later article.  Hopefully, at this point, you’ll have a better idea of what data types are and what they can do.

Setting up a primary key

When I set up my baseball database, I noticed something in the SQL code used to build the tables.  In the CREATE TABLE T-SQL code supplied with Sean Lahman’s baseball database, I saw code that looked like this:

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Granted, there are a number of things about this code that strike me, but one thing hit me immediately: all the columns are nullable.  That tells me that there are no primary keys defined.  I checked the other CREATE TABLE statements in the code as well; likewise, all of them had all their columns set to be nullable.

(There are some other things about the CREATE TABLE statements that struck me as well, but I’ll save those topics for another time.  For the purposes of this article, I’ll concentrate on primary keys.)

First, if you’re getting your feet wet with relational databases, it raises the question…

What is a primary key?

A primary key uniquely identifies a row of data.  To illustrate this, here’s an example.

Let’s say we have a person named John Smith taking night classes at State University.  In his database record (which, for the purposes of this example, contains his name and address), he has the following information:

Name        Address
John Smith  123 Main Street

Now, let’s say his son, John Smith Jr. (who lives at the same address) enrolls at State University (for the purposes of this example, let’s assume the database doesn’t allow suffixes like “Jr.”).  He is entered in the database.  So now, you have this.

Name        Address
John Smith  123 Main Street
John Smith  123 Main Street

How do you know which record is which?  Answer: you don’t.  So, let’s add an ID column to our table.

ID  Name        Address
1   John Smith  123 Main Street
2   John Smith  123 Main Street

Now we know that we have two different records for two different people who are both named John Smith who live at 123 Main Street.

Granted, this is an overly simplistic example, but the point is that the ID column — which is always unique and is never NULL — uniquely identifies each row.
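In T-SQL, a table like that might be defined with an automatically generated ID column (the table and column names are just for this example):

create table Student (
    ID      int identity(1,1) not null primary key,  -- always unique, never NULL
    Name    varchar(100) not null,
    Address varchar(200) not null
);

insert into Student (Name, Address)
values ('John Smith', '123 Main Street'),
       ('John Smith', '123 Main Street');
-- The two rows get ID values 1 and 2, so we can tell them apart.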

So now that we know what a primary key is, let’s go back to our baseball database.

Creating primary keys for our baseball tables

We could create an ID column for our baseball tables, but I don’t want to do that here.  Instead, I want to make use of existing columns for our primary keys.  I’ll start with the MASTER table, which keeps track of ballplayers.

I ran a simple SELECT * FROM MASTER to get an idea of what’s in the table.  Running the query gave me this.

[Image: Master table query results]

I noted a couple of things.  First, I see a column called “playerID” which could be unique and used as a primary key.  Second, I see that the table contains 18,846 rows.  So can I use “playerID” for my primary key?  Let’s find out.

I ran two query statements.  The first gives me a count of unique entries I have for “playerID.”  The second tells me if any NULL values for “playerID” exist.
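The statements were something like this:

-- How many distinct playerID values are there?
select count(distinct playerID) from Master;

-- Are there any rows where playerID is NULL?
select * from Master where playerID is null;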

Here’s what I got.

[Image: playerID uniqueness and NULL check results]

The first statement returns 18,846, which matches the number of rows in my earlier query, telling me that all values are unique.  The second statement doesn’t return anything, which tells me that there are no NULL values for “playerID.”  Therefore, the “playerID” column can be used as the primary key for this table.

So how do we make this column the primary key?  There are a couple of ways to do it in SQL Server.  For this, I decided to use T-SQL.

alter table dbo.Master
add constraint PK_playerID primary key clustered (playerID)

When I ran this, it returned an error.

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Master'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Okay.  Why did this appear?  Keep in mind that a primary key cannot be null.  When our code created our tables, it made every column nullable.  So we need to add a NOT NULL constraint before we create the primary key.

alter table dbo.Master
alter column playerID varchar(50) not null

Running this statement sets the column to not allow NULL values.  Once I ran this, I re-ran the command to create the primary key, and it completed successfully.  So I now have a primary key on my Master table.

What if I don’t have a single unique column?

Let’s take a look at the Batting table.

[Image: Batting table]

So, what should we use for our primary key?  A batter can play multiple seasons, and can play for multiple teams in a single season.  Chances are that you won’t be able to make use of a single column.  However, we can make use of multiple columns for our primary key (this is called a “composite” primary key).

We find that the Batting table contains 101,332 rows, so we need a combination of columns that will give us 101,332 unique results.  I figured that playerID, yearID, and stint should give us those results.

I ran the following query.

select distinct playerID, yearID, stint from Batting

This returned 101,332 results.  Indeed, I seem to have identified my composite primary key.

My first task is to set these three columns to not accept NULL values (otherwise, we’d get the same error as before).

alter table dbo.Batting alter column playerID varchar(50) not null
alter table dbo.Batting alter column yearID varchar(50) not null
alter table dbo.Batting alter column stint varchar(50) not null

Once the NOT NULL constraint is in place, we can create our composite primary key.  Note that in my T-SQL statement below, I provide three columns (playerID, yearID, stint), not just one.

alter table dbo.Batting 
add constraint PK_playerID_yearID_stint 
primary key clustered (playerID, yearID, stint)

The query ran successfully.  We now have a composite primary key on the Batting table that uses three columns.

Hopefully, this gives you a good introduction to primary keys on database tables.  In upcoming articles, I’ll cover other aspects such as foreign keys, indexes, and data types as we continue to further refine and improve our baseball database.