My hometown SQL Saturday: Albany, NY, July 29

My local SQL user group is hosting SQL Saturday on July 29, a week from this Saturday!

I will be speaking; I will be giving my presentation on documentation.  There are also a number of other presentations that people might find of interest.

When I attended SQL Saturday in New York City a couple of months ago, I sat in on Lisa Margerum’s session on networking.  It is an excellent session, and I recommend it highly.

A number of my friends are also presenting, including Greg Moore, Thomas Grohser, George Walters, John Miner, and Ed Pollack.  They always give good presentations, and I recommend them highly.  Check out the schedule for more details.

Hope to see you there!

 

Advertisements

SQL Saturday #638, Philadelphia

This coming Saturday, June 3, I will be speaking at SQL Saturday #638, Philadelphia (okay, it’s actually in a town called Whitpain Township, not Philadelphia, but that’s what they call the event, so…)!

I will be giving the following two presentations:

  • Tech Writing for Techies: A Primer — Documentation is one of the most critical, yet most blatantly ignored and disrespected tasks when it comes to technology. Businesses and technical professionals ignore documentation at their own risk. This session discusses what tech writing and documentation is about and why it’s critical for business. It also explores possible reasons for why it’s ignored, how documentation can be improved, and how “non-writers” can contribute to the process.
  • Disaster Documents: The role of documentation in disaster recovery — I was an employee of a company that had an office in the World Trade Center on Sept. 11, 2001. Prior to that infamous date, I had written several departmental documents that ended up being critical to our recovery. In this presentation, I provide a narrative of what happened in the weeks following 9/11, and how documentation played a role in getting the organization back on its feet.

    While other disaster recovery presentations talk about strategies, plans, and techniques, this presentation focuses on the documentation itself. We will discuss the documents we had and how they were used in our recovery. We will also discuss what documents we didn’t have, and how they could have made the process better.

Hope to see you there!

My 2017 SQL Saturday Schedule (so far)

My 2017 SQL Saturday schedule is slowly taking shape!  For those of you who are interested, here’s where I’ll be (so far)!

Come on out to a SQL Saturday near you!  The events are free (although there is usually a nominal fee for lunch), there are lots of networking opportunities, and you might just learn something new!

And as an added bonus, you might even get to hear me speak! 😉

SQL Saturday #615, Baltimore, MD

My next speaking engagement is set!  On Saturday, May 6, I will be speaking at SQL Saturday #615 in Baltimore, MD.  I will be giving not one, but two presentations that day: “Tech Writing for Techies: A Primer” and “Disaster Documents: The role of documentation in disaster recovery.”

Hope to see you there!

SQL Saturday #545, Pittsburgh

I’ll be speaking in Pittsburgh this Saturday, October 1.  Hope to see you there!

On data types

Previously, I talked about something I saw in the code that built the tables in my baseball database.  I wrote last time about primary keys.  In this article, I want to talk about data types.

Let’s revisit the code that builds the Batting table again.

CREATE TABLE [dbo].[Batting](
	[playerID] [varchar](50) NULL,
	[yearID] [varchar](50) NULL,
	[stint] [varchar](50) NULL,
	[teamID] [varchar](50) NULL,
	[lgID] [varchar](50) NULL,
	[G] [varchar](50) NULL,
	[AB] [varchar](50) NULL,
	[R] [varchar](50) NULL,
	[H] [varchar](50) NULL,
	[2B] [varchar](50) NULL,
	[3B] [varchar](50) NULL,
	[HR] [varchar](50) NULL,
	[RBI] [varchar](50) NULL,
	[SB] [varchar](50) NULL,
	[CS] [varchar](50) NULL,
	[BB] [varchar](50) NULL,
	[SO] [varchar](50) NULL,
	[IBB] [varchar](50) NULL,
	[HBP] [varchar](50) NULL,
	[SH] [varchar](50) NULL,
	[SF] [varchar](50) NULL,
	[GIDP] [varchar](50) NULL
) ON [PRIMARY]

Last time, I mentioned that every single column allowed NULL values (and I’d like to expand on that in a future article).  Today, I’d like to talk about the fact that every single column is defined as varchar(50).

For those of you who are getting your feet wet with SQL Server (or don’t know what I’m talking about), let’s take a moment to discuss…

What are data types, and why are they important?

Imagine that you’re presented with a math problem (don’t worry; we’ll keep it simple).  However, when you’re presented with the problem, it looks like this:

XII + IX

Want to give it a shot?  The truth is, there’s a reason why we don’t normally use Roman numerals in our society.  Unlike the Arabic numeral system that we use in everyday life, the Roman numeral system is inefficient and not very intuitive at a glance.  For us to use Roman numerals, we’d first need to convert them into a format that we can easily understand.

So, knowing that XII = 12 and IX = 9, our new math problem is 12 + 9.  We now know our answer is 21.  (And if we want to convert back to Roman, it’d be XXI.)

Data types operate along the same principle.  A number that’s stored as a data type of, let’s say, varchar(50) (for those of you who don’t yet know T-SQL, that’s “variable character of up to 50 characters”) needs to be converted to a numeric type of (for example) int (for “integer”) before it can be processed.

“Okay,” you might ask, “why is that such a big deal?  Just let the system do the conversion and let it do the work.”

Let’s go back to that Roman numeral example for a moment.  In order for you to understand what the Roman numerals were, your brain had to do some work to perform the conversion.  Imagine that you had to process dozens, even hundreds, of those problems.  That’s a lot of Roman numerals that you have to process.  That makes for a lot of work.

Likewise, imagine that a computer system has to convert millions of data types.  That’s a lot of extra work that your computer has to do.  That can make for a very inefficient system.  One of the big issues that a DBA or a developer often has to deal with is a slow system.  Using proper data types can often help with trying to fix a slow system.

What kinds of data types are there?

In a word, many.  A number of web sites list data types for various systems, including MSDN and W3Schools.  Since I’m writing primarily about SQL Server, I’ll focus on SQL Server data types.  I won’t talk about all SQL Server data types (I’ll leave you to look at the links for that); rather, I’ll just discuss a few that are relevant to our conversation.

Let’s start first with the varchar data type, since that’s what’s defined in our tables.  Varchar is for variable character.  What that means is the data type accepts character strings up to a length that you specify.  A data type of varchar(50) means that the field accepts strings up to 50 characters.

There is also a char data type.  Like the varchar type, you specify the number of characters allowed by the data column.  Unlike the varchar type, however, it is a fixed length string.  A column of type char(50) means that the data will always have 50 characters.  If data that is fewer than 50 characters is stored, it pads the rest with spaces.  This is important to consider, because those characters (even the space characters) take up space.  If you’re storing millions of rows of data, and you’re using char data types for data that takes up fewer than the number of characters you specify, that’s a lot of wasted space.

So, let’s revisit our Batting table.

BaseballBatting

When we built our primary key, we created a composite primary key from the playerID, yearID, and stint columns.  Can we improve upon these columns?

Let’s look at playerID.  Do we need to reserve fifty characters for this column?  To find out, I ran the following query.

select distinct playerID, len(playerID) from batting
order by len(playerID) desc

To translate this query into English*, this means “give me each distinct playerID, along with the length of the playerID, from the batting table, and give me the results in descending order of length.”

(*For those of you who don’t speak T-SQL, yes, I do intend to write an article about it.)

The results looked like this:

PlayerIDLength

So what does this mean?  It means the longest playerID is nine characters.  Could it be longer than nine?  Possibly.  Will it ever get up to fifty?  Unlikely.  So while the varchar(50) will definitely accommodate this, my thinking is that it’s overkill.  I’m going to shorten the column.  I’m going to cut it down to 25 from 50.

Now, let’s look at the yearID column.  I ran the same query as earlier, but this time, I replaced playerID with yearID.  Every column was four characters long.

SQL Server provides us with a number of date and time data types.  My first thought was to use one of these data types, but at that point, I stopped to think.  What, exactly, is the purpose of the yearID column?  It identifies a season.  That’s it.  Will I perform any calculations with it?  It’s possible, but unlikely.  Will it ever change from four characters/digits?  No.  I’m going to make this a fixed four-character field.  For that, we’ll use char(4).

Now, let’s look at stint.  What is this column?  It keeps track of a player’s stint with a team in a season.  So, let’s see how this works.

I went into the database and looked to see what stint values were in the Batting table.  There were five.  I looked up the player who had five stints, and found a man named Frank Huelsman, who played in the early 1900s.  I ran a query that gave me this.

FrankHuelsmanStint

In 1904, he had five stints with four different teams, having played for the White Sox (twice), Tigers, Browns, and Senators.

So, what data type should we use for stint?  Let’s think about this.  We could use a char(1), but I don’t think we should.  Is it possible that a player could have ten or more stints in a year?  Unlikely, but yes.  So we’d need at least two characters.  Let’s try char(2) — or should we?

I say no.  I say that we should use a tinyint data type for stint.  Why?

I say it for a couple of reasons.  First, there the matter of ordering.  Let’s order the numerals 1-10.  How would you expect them to be ordered if they were stored as characters?  If you say 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, you’d be wrong.  If it was defined as a character data type, the order would look like this: 1, 10, 2, 3, 4, 5, 6, 7, 8, 9.  When we’re talking about characters, 10 comes after 1.

Second, there’s a matter of memory.  tinyint uses only 1 byte, as opposed to 4 bytes for int. The memory used by char and varchar data types are dependent on their sizes (i.e. how much is allocated).  Also, tinyint accommodates values from 0 to 255, which is plenty for our purposes.

So, let’s start by redefining the data types for our composite primary key.  Because T-SQL in SQL Server won’t let us alter multiple columns at once, we need to run them as separate queries.  I wrote three statements like this:

alter table Batting
alter column playerID varchar(25)

alter table Batting
alter column yearID char(4)

alter table Batting
alter column stint tinyint

This should do the trick — or so I thought.  When I ran the queries, SSMS gave me error messages that looked like this:

Msg 5074, Level 16, State 1, Line 1
The object 'PK_playerID_yearID_stint' is dependent on column 'playerID'.
Msg 4922, Level 16, State 9, Line 1
ALTER TABLE ALTER COLUMN playerID failed because one or more objects 
access this column.

What’s going on here?  Keep in mind what we’re trying to change: three columns that are part of a primary key.  So before we can alter these columns, let’s drop our primary key.

alter table Batting
drop constraint PK_playerID_yearID_stint

Now when we run our alter table statements, we should be set.

Or are we?  I tried to recreate the primary key (I decided to rename the primary key, while I was at it).

alter table Batting
add constraint PK_Batting primary key (playerID, yearID, stint)

Instead, it gave me the following error:

Msg 8111, Level 16, State 1, Line 1
Cannot define PRIMARY KEY constraint on nullable column in table 'Batting'.
Msg 1750, Level 16, State 0, Line 1
Could not create constraint or index. See previous errors.

Now what’s going on?  Apparently, I’d forgotten to include the NOT NULL statement when I altered the columns.  So this time, I’ll rewrite the statements like this and rerun them.

alter table Batting
alter column playerID varchar(25) not null

alter table Batting
alter column yearID char(4) not null

alter table Batting
alter column stint tinyint not null

After running these statements (which ran successfully), I retried my statement that added the primary key.  This time, it worked.

So, our primary key columns are now altered.  Let’s look at the rest of our table.

Our first two columns are teamID (for “team ID”) and lgID (for “league ID”).  I checked the data in these two columns (I’ll save you the trouble of looking at the T-SQL I used).  I found that they are never null, and they are always 3 characters and 2 characters, respectively.  They are also not unique.  For our table, I’ll redefine them as char(3) and char(2).

Now let’s look at the rest of the table.  For the benefit of anyone who doesn’t understand baseball, let’s write out what these columns are.

G Games
AB At Bats
R Runs (Scored)
H Hits
2B Doubles
3B Triples
HR Home Runs
RBI Runs Batted In
SB Stolen Bases
CS Caught Stealing
BB Base On Balls (Walks)
SO Strikeouts
IBB Intentional Base On Balls (Walks)
HBP Hit By Pitch
SH Sacrifice Hits
SF Sacrifice Flies
GIDP Grounded Into Double Play

They’re all statistics, and they’re all integer values.  For now, we’ll make them all integer values.  Do we make them NOT NULL?  Not necessarily.  Why?  Because for some records, certain statistics didn’t exist during their time, so they might not exist.  Some of these numbers may very well be NULL.  So we won’t add the NULL constraint.

Here’s what I wrote for my T-SQL:

alter table Batting alter column [teamID] char(3) not null
alter table Batting alter column [lgID] char(2) not null
alter table Batting alter column [G] int
alter table Batting alter column [AB] int
alter table Batting alter column [R] int
alter table Batting alter column [H] int
alter table Batting alter column [2B] int
alter table Batting alter column [3B] int
alter table Batting alter column [HR] int
alter table Batting alter column [RBI] int
alter table Batting alter column [SB] int
alter table Batting alter column [CS] int
alter table Batting alter column [BB] int
alter table Batting alter column [SO] int
alter table Batting alter column [IBB] int
alter table Batting alter column [HBP] int
alter table Batting alter column [SH] int
alter table Batting alter column [SF] int
alter table Batting alter column [GIDP] int

The query, which took 20 seconds to run, ran successfully.

Note that I only dealt with a few data types in this article.  I didn’t talk about decimal data types or computed values.  (What if, for example, we wanted to keep track of batting averages?)  I’ll cover those in a later article.  Hopefully, at this point, you’ll have a better idea of what data types are and what they can do.

Two days until SQL Saturday in Albany!

Hear ye, hear ye!  This Saturday — that’s two days from today — the Capital Area SQL Server User Group will be hosting SQL Saturday #513 at the University at Albany!  The event is free (there is a nominal fee for lunch), and anyone, whether you’re new to SQL Server or are a seasoned database veteran, is welcome to come!  (Just make sure you register!)

I’ll be giving a presentation about how to talk to non-technical people.  Come and check it out!

Want to know what SQL Saturday is about?  Check out my ‘blog post from Monday, where I talk about my path to becoming a SQL Saturday speaker!

Hope to see you there!