DataOps 101
Evolution of the Modern Data Stack: Demystifying Data’s Notorious “Black Box”
“What in the heck is a Modern Data Stack and why should I care?”
It’s a common question we get from companies just starting out on their data journey, and frankly, it makes a lot of sense. The term can be both confusing and daunting for first-timers (and even for some data vets out there).
For those who don’t know, a data stack is a collection of tools centered on the gathering, manipulation, and use of data. As technology has developed since the first days of the data stack, the stack has grown to include more tools that make the process more efficient and the end results more useful. However, if data stacks have been evolving for decades, why do we settle on such a static naming convention: the “Modern” Data Stack? If we’ve learned anything from the silly naming of artistic eras like “Modernism,” one might question what we’ll have to name the next era of data tools. Do we resign ourselves to calling it the “Postmodern” Data Stack (and then, even worse, the “Post-Postmodern” Data Stack)?
Fortunately, the term “modern” may be less indicative of an era and more indicative of a core competency of today’s data stacks: their ability to stay relevant in an evolving data landscape by folding new software and technical advancements into existing frameworks. Almost paradoxically, for data stacks to remain truly “modern,” the systems we use, and the data they hold, must keep evolving with the needs of today’s companies and their customers.
Here we’ll take a closer look at the history behind the data stack and the developments that have kept our data systems up to date and “modern,” even within an era of rapid technological development.
The Stack: An Origin Story
It might be impossible to imagine now, but the original digital data storehouse of the 1950s was recorded on old-school punch cards. Saving any data took a great deal of labor, plus even more space to house and maintain the absolute beasts that early computers are famous for being. Cards had to be fed into machines by hand, and afterward the data was stored in its physical form by teams of people who organized and watched over these gigantic stacks of cards, much like librarians and their countless volumes.
Thankfully, humanity would soon evolve beyond this laborious era of floors filled with punch cards and teams to manually feed them and into the era of the mainframe and even the personal computer—an era of technology that in some ways still resembles the technology of today.
1970s: The ETA for ETL
Many of the foundational principles and advancements in the history of data management were introduced in the 1970s. With the development of relational databases, companies could now extract and integrate data from multiple sources, giving rise to the principle of Extract, Transform, Load, or ETL: pull data out of its source systems, convert it into a consistent format, and load it into a shared destination. Still used by many professionals to this day, this simple yet powerful idea laid the groundwork for many new advancements in the field throughout later decades.
With data now being stored on magnetic tapes or disks, computers were becoming more capable of efficiently handling large volumes of data. The birth of the query language SQL in the 1970s also led to clearer, more streamlined data queries that could be used by more people. Combined with new hardware developments and the relative ease of communicating with databases through SQL, ETL was set to flourish in the following years, starting with the emergence of the data warehouse.
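To make the ETL idea concrete, here is a minimal sketch in Python, not a production pipeline. Everything in it is an assumption made up for illustration: the two tiny CSV "sources," the table names, and the helper functions `extract`, `transform`, `load`, `run_pipeline`, and `total_spend`. It uses Python's built-in `sqlite3` module as a stand-in for a relational database, and finishes with the kind of SQL query that made the loaded data easy to ask questions of.

```python
import csv
import io
import sqlite3

# Hypothetical source data: two "systems" that record customers differently.
CRM_CSV = "id,name,signup\n1, Ada Lovelace ,1975-03-01\n2,grace hopper,1976-11-15\n"
BILLING_CSV = "customer_id,amount_cents\n1,1250\n2,990\n1,300\n"


def extract(text):
    """Extract: read raw rows from a source as dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(customers, payments):
    """Transform: tidy up names and convert cents to dollars for consistency."""
    clean_customers = [
        (int(row["id"]), row["name"].strip().title(), row["signup"])
        for row in customers
    ]
    clean_payments = [
        (int(row["customer_id"]), int(row["amount_cents"]) / 100)
        for row in payments
    ]
    return clean_customers, clean_payments


def load(conn, customers, payments):
    """Load: write the cleaned rows into relational tables."""
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, signup TEXT)"
    )
    conn.execute("CREATE TABLE payments (customer_id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO customers VALUES (?, ?, ?)", customers)
    conn.executemany("INSERT INTO payments VALUES (?, ?)", payments)
    conn.commit()


def run_pipeline():
    """Run extract, transform, and load against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    customers, payments = transform(extract(CRM_CSV), extract(BILLING_CSV))
    load(conn, customers, payments)
    return conn


def total_spend(conn):
    """Query the loaded data with SQL: total spend per customer."""
    return conn.execute(
        "SELECT c.name, SUM(p.amount) FROM customers c "
        "JOIN payments p ON p.customer_id = c.id "
        "GROUP BY c.id, c.name ORDER BY c.id"
    ).fetchall()


if __name__ == "__main__":
    for name, amount in total_spend(run_pipeline()):
        print(name, amount)
```

Keeping the three steps as separate functions mirrors the acronym itself: each source gets its own extract, the transform step is where inconsistencies between sources are ironed out, and only clean rows reach the database.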
1980s: IBM, The Warehouse Powerhouse
Believe it or not, it wasn’t until the 1980s that the current vision of what a data warehouse should look like really took form.
In the midst of IBM’s era of market domination over the computing space, researchers at the company began experimenting with better ways to centralize and consolidate data from different sources to generate reports. Though computers had decreased in size by this point, allowing widespread use across a variety of job roles, IT was still the undisputed overlord of data, and its work was the definition of “siloed.” While databases had progressed to the point of having separate repositories specifically for analytics, all inquiries had to go through IT from beginning to end. Those outside the department grew antsy for a way to access data themselves without needing an IT degree of their own.
2000s: Data Demands Digital
Thanks to the internet boom and bust of the 90s and early 2000s, the sheer volume of data generated by trackable online activity accelerated rapidly. Data experts struggled to keep up with the needs of big corporations and the potential to profit from this “digital gold.” Since the methods of old could no longer keep pace, up-and-comer Amazon looked to the clouds for an answer.
Amazon Web Services and others like Microsoft Azure and Google Cloud Platform revolutionized how these mountains of data could be processed and stored by sending them to the cloud. With easily expandable digital storage, companies had the potential to scale their data solutions faster and with more flexibility, becoming less reliant on large, hardware-intensive in-house IT teams. However, it would take a next-generation tool to fill in the gaps still left by these first-generation cloud data solutions, including their unimpressive speed and a price point that only appealed to huge-scale corporations. It was time for Redshift to make data possible for the “Average Joe.”
2010s: “Redshifting” into the MDS
Despite the major leap that cloud data processing represented, many companies did not have the resources or infrastructure to fully take advantage of this movement until 2012. That’s when Amazon Redshift changed everything. Among the first data warehouses built to run natively in the cloud, Redshift addressed many of the problems faced by previous cloud solutions, namely their cost and ease of use. Now more companies than ever could get real data analytics without having the deepest of pockets. Small to medium-sized companies finally had a realistic solution for managing their growing data needs. Not only that, Redshift’s innovative architecture made data processing dramatically faster than its predecessors.
The effects of Redshift on data management were so far-reaching that the era following Redshift’s release later became known as the “First Cambrian Explosion.” With Redshift leading the charge, many other vendors rode this wave to research and release innovations of their own. Well-known products that rose to prominence in this period include Snowflake, dbt, and BigQuery.
The Aftermath and Defining the Next Great Era
Many believe that the wave of innovation ushered in by Redshift solidified our definition of the “Modern” Data Stack as we see it today. We generally consider the following features when deciding if a data stack is truly “modern”:
- It’s cloud-based
- It’s both modular and customizable
- It’s best-of-breed first (choosing the best tool for each specific job, rather than an all-in-one solution)
- It’s driven by metadata