The evolution of statistics

During my lunch break, I was perusing the ESPN website and stumbled across this article. It contemplates whether or not a .300 hitter (in baseball, for those of you who are sports-challenged) is meaningful anymore. As a baseball fan, the article caught my attention. I didn’t read through the entire article (it ended up being a much longer read than I expected — too long for me to read while on a lunch break at work), but from what little I did glean from it, a couple of things struck me.

First, they talk about Mickey Mantle’s batting average and how important hitting .300 was to him. That struck me a little funny, because (as far as I know — as I said, I didn’t get through the entire article) there was no mention of the fact that he actually finished with a batting average under .300. His career batting average was .298.

The second thing that struck me was (Yankees’ first baseman) Luke Voit saying that he “feel[s] like batting average isn’t a thing now.” Indeed, baseball is a much different game than it was, say, ten, twenty, or thirty years ago. Analytics are a big part of statistics these days. A lot of stats that are prevalent now — WAR (wins above replacement), exit velocity, OPS (on-base plus slugging), etc. — didn’t even exist when I was a kid growing up, closely following my Yankees. Back when I was eating and sleeping baseball, hitting was about the triple-crown statistics — batting average, home runs, and runs batted in (RBIs). But now, we have “slash lines,” on-base percentage, slugging percentage, and so on. As big a baseball fan as I am, I haven’t a clue about many of these “new age” stats. I still have no idea what WAR represents, I’m not completely sure what the numbers in a slash line are, and I don’t know what constitutes a respectable OPS.

That got me thinking about how statistics have changed over the years, and whether that applies to statistics outside of baseball (or sports, for that matter). Maybe people who study data analytics for a living would know this better than I do, but what business statistics have a different meaning now than they did ten or twenty years ago? Are there any numbers from way back when that I should now take with a grain of salt?

I’m sure there are many examples of this outside of sports, but I struggled to come up with any. Off the top of my head, I remember how a company where I once worked made a big deal out of perfect attendance — to the point that they gave out perfect attendance awards at the end of the year. However, that policy encouraged people to come to work even when they were sick. Do you really want someone who’s sick coming into work? These days, workplaces do not want sick people in the office, and with the advent of work-at-home provisions, perfect attendance isn’t so meaningful anymore. (By the way, my understanding is that the company no longer recognizes or rewards “perfect” attendance.)

So I suppose the takeaway is, how well do statistics age? Can they be compared with the same statistics now? What needs to be considered when analyzing statistics from years ago? It’s true that numbers often tell a story, but in order to get the full picture, you also need to understand the full context.

Monthly CASSUG meeting — May 2019

Greetings, data enthusiasts!

This is a reminder that our May CASSUG meeting will take place on Monday, May 13, 5:30 pm, in the Datto (formerly Autotask) cafeteria!

Our guest speaker is Mike Jones! His talk is entitled: “Using Pure ActiveCluster for SQL High Availability.”

For more information, and to RSVP, go to our Meetup link at http://meetu.ps/e/GBP2c/7fcp0/f

Thanks to our sponsors, Datto/Autotask, Capital Tech Search, and CommerceHub for making this event possible!

Monthly CASSUG meeting — April 2019

Greetings, data enthusiasts!

This is a reminder that our April CASSUG meeting will take place on Monday, April 8, 5:30 pm, in the Datto (formerly Autotask) cafeteria!

Our guest speaker is Monica Rathbun! Her talk is entitled: “Performance Tuning, Getting the Biggest Bang for Your Buck!”

Note that our meeting location has moved! Datto/Autotask is now located at 33 Tech Valley Drive. Please do not rely on online maps for this address at this time, as most have not yet been updated. Drive past the old building all the way to the end, where the new building is located! Refer to the map below for the new location.

For more information, and to RSVP, go to our Meetup link at http://meetu.ps/e/FWkVd/7fcp0/f

Thanks to our sponsor, Datto/Autotask, for making this event possible!

SQL Saturday #855 Albany announced!

The Capital Area SQL Server User Group (CASSUG) is pleased to announce that, for the sixth time, we will host SQL Saturday #855, Albany on July 20!

For additional information, to register for the event, or to submit a presentation, click the link above!

I’ve already submitted presentations, but I will be there, regardless of whether or not I’m picked to speak!

Hope to see you there!

Monthly CASSUG meeting — January 2019

Starting with this ‘blog post, I’m beginning a new habit. Since I’m responsible for communications for my SQL user group (CASSUG — “Capital Area SQL Server User Group”), I will start announcing monthly meetings here in my ‘blog. This is my first such entry for this year.

Our next meeting — our first of the new year — is on Monday, January 14. Andy Mallon will give his presentation titled “Demystifying Data Compression.”

For more information and to RSVP, go to our Meetup link!

Hope to see you there!

#BI101: Introduction to data warehousing

This is part of a series of articles in which I’m trying to teach myself about BI.  Any related articles I write are preceded with “#BI101” in the title.

Because this is a new (to me) topic, it’s possible that what I write might be inaccurate.  I invite you to correct me in the comments, and I will make it a point to edit this article for accuracy.

As part of my personal education about business intelligence, I looked into the SkillSoft training made available to me through my employer.  It turned out that BI was one of the training options, so I kicked off the course and began my training.  My strategy is to take the course, perform some supportive research, and write about what I learn.

The initial topic discussed the concept of data warehousing.  The course began with a pre-test — a proverbial “how much do I really know?”  There were a few subtopics that were familiar to me — normalization and denormalization, for example — so my initial thought was, how much do I have to learn?  As it turned out, the answer was, a lot.

What is a data warehouse, and what does it do?

The short and simple answer is that a data warehouse is, as the name implies, a storage repository for data.

That’s the short answer.  The longer answer gets a little more complicated.

Data warehouse architecture (source: Wikipedia)

I learned that a lot of the concepts behind a data warehouse pretty much break a lot of what I thought I knew about relational databases.  For starters, I’ve always been under the impression that all relational databases needed to be normalized.  However, I learned that, in the case of a data warehouse, that might not necessarily be true.  While a normalized database is designed to minimize data redundancy, a denormalized data table might have redundant data.  Although it occupies more space, having redundant data reduces the number of table joins, thus reducing query time (as the SkillSoft lesson put it, sacrificing storage space for speed).  When a data warehouse is storing millions of rows of data, reducing the number of query joins could be significant.
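
To make that storage-for-speed trade-off concrete, here’s a toy sketch in Python using SQLite (my own made-up example — the table names and data are not from the SkillSoft course).  The same report runs against a normalized pair of tables and a denormalized one; the denormalized table repeats the customer name on every row, but the query needs no join:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized: the customer name is stored once and referenced by key
cur.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE sale_n (customer_id INTEGER, amount REAL)")
cur.execute("INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex')")
cur.executemany("INSERT INTO sale_n VALUES (?, ?)",
                [(1, 100.0), (1, 250.0), (2, 75.0)])

# Denormalized: the name is repeated on every row -- more storage,
# but the report below requires no join
cur.execute("CREATE TABLE sale_d (customer_name TEXT, amount REAL)")
cur.executemany("INSERT INTO sale_d VALUES (?, ?)",
                [("Acme", 100.0), ("Acme", 250.0), ("Globex", 75.0)])

# The same report, two ways
normalized = cur.execute(
    "SELECT c.name, SUM(s.amount) FROM sale_n s "
    "JOIN customer c ON c.id = s.customer_id "
    "GROUP BY c.name ORDER BY c.name").fetchall()
denormalized = cur.execute(
    "SELECT customer_name, SUM(amount) FROM sale_d "
    "GROUP BY customer_name ORDER BY customer_name").fetchall()

print(normalized)    # [('Acme', 350.0), ('Globex', 75.0)]
print(denormalized)  # same result, without the join
```

With three rows the join cost is invisible, of course; the point is that at millions of rows, eliminating joins like that one can matter.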

Populating a data warehouse

How does data get into a data warehouse?  It turns out that a data warehouse can have multiple data sources — SQL Server, Oracle, Access, Excel, flat files, and so on.  So how does all of this disparate data get combined?  This is where the integration layer comes in.

Because the SkillSoft course I took was SQL Server-specific, it focused on SSIS.  SSIS provides tools that allow it to connect to multiple data sources, as well as tools for ETL (Extract, Transform, Load).  ETL is the process of obtaining data from the various sources and processing it for data warehouse storage.  A big part of ETL is data cleansing — formatting data so that it is usable and consistent.  For example, imagine that several data sources use different formats for the same Boolean data field.  One uses “1” and “0”, another uses “T” and “F”, another uses “Yes” and “No”, and so on.  Data cleansing converts these fields into a single, consistent format so that the data is usable by the data warehouse.  Because there are often many such fields, ETL can be a very long and involved process.
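
Here’s a minimal Python sketch of that Boolean-cleansing step (my own toy illustration — the function name and mappings are made up, and this isn’t anything SSIS-specific):

```python
# Accepted source formats for "true" and "false", lowercased for comparison
TRUE_VALUES = {"1", "t", "true", "y", "yes"}
FALSE_VALUES = {"0", "f", "false", "n", "no"}

def cleanse_boolean(raw):
    """Map the many source formats ('1'/'0', 'T'/'F', 'Yes'/'No', ...)
    onto one consistent representation for the warehouse."""
    value = str(raw).strip().lower()
    if value in TRUE_VALUES:
        return True
    if value in FALSE_VALUES:
        return False
    # Anything unrecognized is flagged rather than silently guessed at
    raise ValueError(f"Unrecognized Boolean value: {raw!r}")

print([cleanse_boolean(v) for v in ("1", "F", "Yes", "no")])
# [True, False, True, False]
```

Multiply that by every inconsistent field across every source, and you can see why the ETL phase gets long and involved.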

Just the facts, ma’am…

I had heard of fact and dimension tables before, but I wasn’t entirely sure about their application until I started diving into BI.  In a nutshell, facts are raw data, while dimensions provide context to the facts.

To illustrate facts and dimensions, let me go back to one of my favorite subjects: baseball.  How about Derek Jeter’s hitting statistics?  Clicking “Game Log” displays a list of game-by-game statistics for a given season (the link provided defaults to the 2014 season, Jeter’s last).  Looking at his last home game on Sept. 25, Jeter accumulated two hits, including a double, driving in three runs, scoring once, and striking out once in five at-bats.  These raw, single-game numbers are the facts, and a fact table stores these individual game statistics.  Dimensions provide context to these numbers.  Some dimensions might include total accumulated statistics for a season, batting average, and some (non-statistical) information about Jeter himself (name, hometown, birth date, etc.).  This derived and contextual information would be stored in dimension tables.
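
Here’s a rough sketch (in Python with SQLite, just because it’s easy to run anywhere) of how that Jeter example might split into a fact table and a dimension table.  The schema and column names are my own guesses at a plausible design, not taken from any real warehouse:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension: context about the player (changes rarely)
cur.execute("""CREATE TABLE dim_player (
    player_id INTEGER PRIMARY KEY,
    name TEXT, hometown TEXT, birth_date TEXT)""")
cur.execute("INSERT INTO dim_player VALUES "
            "(2, 'Derek Jeter', 'Pequannock, NJ', '1974-06-26')")

# Fact: the raw per-game numbers (one row per player per game)
cur.execute("""CREATE TABLE fact_batting (
    player_id INTEGER REFERENCES dim_player(player_id),
    game_date TEXT, at_bats INTEGER, hits INTEGER,
    doubles INTEGER, rbi INTEGER, runs INTEGER, strikeouts INTEGER)""")
# The Sept. 25, 2014 game from the example: 2-for-5, a double, 3 RBI, 1 run, 1 K
cur.execute("INSERT INTO fact_batting VALUES (2, '2014-09-25', 5, 2, 1, 3, 1, 1)")

# A derived number (batting average over the stored games) comes from
# aggregating the facts, joined to the dimension for context
row = cur.execute("""
    SELECT p.name, SUM(f.hits) * 1.0 / SUM(f.at_bats)
    FROM fact_batting f
    JOIN dim_player p ON p.player_id = f.player_id
    GROUP BY p.name""").fetchone()
print(row)  # ('Derek Jeter', 0.4)
```

With a full season loaded, the same aggregate query would produce the season batting average — exactly the kind of derived figure that ends up in a dimension table.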

If you’re still confused about fact and dimension tables, I found this question on StackOverflow that does a pretty good job of answering it.  I also came across this article that does a good job of describing fact and dimension tables.

Stars and snowflakes

Star schema (source: Wikipedia)

A fact table usually maintains a relationship with a number of dimension tables.  The fact table connects to the dimension tables through a foreign key relationship.  The visual table design usually resembles a star, in which the fact table is at the center and the dimension tables branch out from the center.  For this reason, this structure is called a star schema.

A snowflake schema is related to the star schema.  The main difference is that the dimension tables are further normalized.  The resulting foreign key relationships result in a schema that visually resembles a snowflake.

Snowflake schema (source: Wikipedia)
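
To see the difference in code form, here’s a minimal sketch (my own illustrative Python/SQLite example — all names are made up) of the same dimension modeled both ways.  In the snowflake version, the city is normalized out into its own table, which costs an extra join to retrieve:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Star: the dimension carries its attributes directly
cur.execute("CREATE TABLE dim_team_star ("
            "team_id INTEGER PRIMARY KEY, name TEXT, city TEXT)")

# Snowflake: the same dimension, normalized one level further --
# the city becomes its own table, referenced by key
cur.execute("CREATE TABLE dim_city (city_id INTEGER PRIMARY KEY, city TEXT)")
cur.execute("CREATE TABLE dim_team_snow ("
            "team_id INTEGER PRIMARY KEY, name TEXT, "
            "city_id INTEGER REFERENCES dim_city(city_id))")

cur.execute("INSERT INTO dim_city VALUES (1, 'New York')")
cur.execute("INSERT INTO dim_team_snow VALUES (10, 'Yankees', 1)")

# Reaching the city now takes one more join than in the star version
row = cur.execute("""
    SELECT t.name, c.city FROM dim_team_snow t
    JOIN dim_city c ON c.city_id = t.city_id""").fetchone()
print(row)  # ('Yankees', 'New York')
```

That’s the trade-off in miniature: the snowflake removes redundancy from the dimensions, and pays for it with additional joins — the mirror image of the denormalization trade-off discussed earlier.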

Attention, data-mart shoppers…

A data mart can probably be considered either a smaller version of a data warehouse, or a subset of a data warehouse (a “dependent” data mart, according to Wikipedia).  As I understand it, a data warehouse stores data for an entire corporation, organization, or enterprise, whereas a data mart stores data for a specific business unit, division, or department.  Each business unit utilizes data marts for various information purposes, such as reporting, forecasting, and data mining.  (I’ll likely talk about these functions in a separate article; for our purposes, they go outside the scope of this article.)

So, this wraps up what I’ve learned about data warehousing (so far).  Hopefully, you’ve learned as much reading this article as I have writing it.  And hopefully, you’ll keep reading along as I continue my own education into BI.  Enjoy the ride.

#BI101: An introduction to BI using baseball

Edit: This is the first of a series of articles (I hope!) in which I’m trying to teach myself about BI.  Any articles I write that are related to this, starting with this one, will be preceded with “#BI101” in the title.

As I stated in a previous article, one topic about which I’m interested in learning more is business intelligence (BI).  For those of you who are new to BI, it is a broad topic.  In a nutshell, it can probably be described as “consuming and interpreting data so it can be used for business decisions and/or applications.”

I’ll admit that I don’t know a lot about BI (at least the fine details, anyway).  I did work a previous job where I touched upon it; I was tasked with performing some data analysis, and I was introduced to concepts such as OLAP cubes and pivot tables.  I’ve gotten better at creating pivot tables — I’ve done a few of them using MS Excel — but I’ll admit that I’m still not completely comfortable with building cubes.  I suppose that’ll come as I delve further into this.

A while back, my friend, Paresh Motiwala, suggested that I submit a presentation for Boston SQL Saturday BI edition.  At the time, I said to him, “the only thing I know about BI is how to spell it!”  He said to me (something like), “hey, you know how to spell SQL, don’t you?”  Looking back at the link, I might have been able to submit (I didn’t realize, at the time, that they were running a professional development track).  That said, Paresh did indeed have a point.  As I often tell people, I am not necessarily a SQL expert — I know just enough SQL to be dangerous — but that does not stop me from applying to speak at SQL Saturday.  Likewise, as I dive further into this topic, I’m finding that I probably know more about BI than I’ve led myself to believe.  Still, there is always room for improvement.

To tackle this endeavor, I once again decided to jump in using a subject that I enjoy immensely: baseball.  Baseball is my favorite sport, and it is a great source of data for stat-heads, mathematicians, and data geeks.  I’ve always been of the opinion that if I’m going to learn something new, I should make it fun!

Besides, the use of statistical analysis in baseball has exploded.  Baseball analytics is a big deal, ever since Bill James introduced sabermetrics (there is some debate as to whether James has enhanced or ruined baseball).  So what better way to introduce myself to BI concepts?

For starters, I came across some articles (listed below, for my own reference as much as anything else):

I also posted a related question in the SSC forums.  We’ll see what kind of responses (if any) I get to my query.

Let’s start with the basics — what is BI?

Since I’m using baseball to drive this concept, let’s use a baseball example to illustrate this.

Let’s say you’re (NY Yankees manager) Aaron Boone.  You’re down by a run with two outs in the bottom of the 9th.  You have Brett Gardner on first, Aaron Judge at bat, and you’re facing Craig Kimbrel on the mound.

What do you do?  How does BI come into play here?

Let’s talk a little about what BI is.  You have all these statistics available — Judge’s batting average, Kimbrel’s earned run average, Gardner’s stolen base percentage, and so on.  In years BS — “before sabermetrics” — a manager likely would have “gone with his gut,” decided that Judge is your best bet to hit the game-winning home run, and let him swing away.  But is this the best decision to make?

Let’s put this another way.  You have a plethora of data available at your fingertips.  BI represents the ability to analyze all this data and provide information that allows you to make a good decision.

If Aaron Boone (theoretically) had this data available at his fingertips (to my knowledge, Major League Baseball bans the use of electronic devices in the dugout during games), he could use the data to consider Kimbrel’s pitching tendencies, Judge’s career numbers against Kimbrel, and so on.  BI enables Boone to make the best possible decision based upon the information he has at hand.

I do want to make one important distinction.  In the above paragraphs, I used the words data and information.  These two words are not interchangeable.  Data refers to the raw numbers generated by the players.  Information refers to the interpretation of that data.  Therein lies the heart of what BI is — it is the process of generating information from data.
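
A toy example of that distinction, with made-up numbers: the per-game at-bats and hits are the data; the batting average derived from them is the information.

```python
# Raw data: (at_bats, hits) for four hypothetical games
games = [
    (4, 2), (5, 1), (3, 0), (4, 3),
]

# Derived information: batting average = total hits / total at-bats
at_bats = sum(ab for ab, _ in games)
hits = sum(h for _, h in games)
batting_average = hits / at_bats

print(f"{batting_average:.3f}")  # 0.375
```

None of the individual rows tells you whether this hitter is having a good stretch; the derived .375 does.  That step from raw rows to a usable number is BI in miniature.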

What’s there to know about BI?

I’ve already mentioned some buzzwords, including OLAP, cubes, and pivot tables.  That’s just scratching the surface.  There are also KPIs, reporting services, decision support systems, data mining, data warehousing, and a number of others that I haven’t thought of at this point (if you have any suggestions, please feel free to add them in the comments section below).  Other than including the Wikipedia definition links, I won’t delve too deeply into them now, especially when I’m still trying to learn about these myself.

So why bother learning about BI?

I have my reasons for learning more about BI.  Among other things…

  • It is a way to keep myself technically relevant.  I’ve written before about how difficult it is to stay up-to-date with technology.  (For further reading regarding this, I highly recommend Eugene Meidinger’s article about keeping up with technology; he also has a related SQL Saturday presentation that I also highly recommend.)  I feel that BI is a subject I’m able to grasp, learn about, and contribute to.  By learning about BI, I can continue making myself technically valuable, even as my other technical skills become increasingly obsolete.  Speaking of which…
  • It’s another adjustment.  Again, I’ve written before about making adjustments to keep myself professionally relevant.  If there’s one thing I’ve learned, it’s that if you want to survive professionally, you need to learn to adjust to your environment.
  • It is a subject that interests me.  I’m sure that many of you, as kids, had “imaginary friends.”  (I’ll bet some adults have, too — just look at Lieutenant Kije and Captain Tuttle.)  When I was a kid, I actually had an imaginary baseball team.  I went as far as to create an entire roster full of fictitious ballplayers, even coming up with full batting and pitching statistics for them.  My star player was a power-hitting second baseman who had won MVP awards in both the National and American leagues, winning several batting titles (including a Triple Crown) and leading my imaginary team to three World Series championships.  I figured, if my interest in statistics went that far back, there must be something behind it.  Granted, now that I’ve grown older, I’m not as passionate about baseball statistics as I was as a kid, but some level of interest remains, nevertheless.
  • It is a baseline for learning new things.  I’ve seen an increasing number of SQL Saturday presentations related to BI, such as PowerBI, reporting services, and R.  I’m recognizing that these potentially have value for my workplace.  But before I learn more about them, I also need to understand the fundamental baseline that they support.  I feel that I need to learn the “language” of BI before I can learn about the tools that support it.

So, hopefully, this article makes a good introduction (for both you and myself) for talking about BI.  I’ll try to write more as I learn new things.  We’ll see where this journey goes, and I hope you enjoy coming along for the ride.