Data Ingestion 101: Defining the First Mile Problem in Data
How explosive growth in data sources, volumes, and velocity created a data monster
We recently asked our customers why it is harder to accelerate data ingestion today than it was five years ago. The answer we got was not surprising. Data drives everything we do. As technology leaders, we’re constantly adding new applications to the operational workflow. This explosion in data sources has made data ingestion so complex that hopes of streamlining data supply chain management seem far-fetched.
In this series, we’ll explore why businesses continue to struggle with data ingestion. We’ll dive deep into how we got here, revisiting the evolution of data management and cloud innovations that have changed the landscape. If you’re a business leader trying to streamline operations and maximize efficiency, this series is a must-read.
Executive Summary:
- If your company is to successfully deliver your product’s value, you must invest more time and effort in solving the first-mile problem.
- What is the first-mile problem? The lack of a universal standard for data exchange creates a scenario in which every business receives data in varying formats and schemas, making ingesting clean data a huge operational challenge.
- While business leaders have long recognized the importance of collecting data, their teams struggle to seamlessly ingest and onboard clean, usable data – placing an unnecessary burden on engineering and dev teams.
- More on how to solve that problem in Blog #2 in this series.
What is the first mile of data?
The first mile problem of data ingestion is the challenge of pulling data from a variety of sources, cleaning it up, and ingesting it into your data warehouse.
When Mike Tuchen, the former CEO of Talend, introduced the concept in 2018 during a Mad Money interview, he described the first mile of data ingestion as “pulling the data together, cleaning it up, and making it right.”
The “cleaning it up and making it right” part is easier said than done. In practice, this might look like using Fivetran to bring LinkedIn ads data into Snowflake for transformation with dbt or using Airbyte to bring Google Analytics data into BigQuery and later transforming it with SQL.
2010 to 2015: The Introduction of Cloud Computing Leads to an Explosion of ETL Solutions
The 2010s saw an explosion of new cloud and SaaS solutions as companies started moving workloads off-premises and cloud giants emerged.
The three biggest cloud providers were all underway by 2010. Amazon Web Services (AWS) rolled out in 2006, followed by Google's App Engine in 2008 and Microsoft's Azure Cloud Services in 2010. Their primary focus was on offering platforms for hosting and building web applications.
And cloud adoption took off. According to Gartner, global spending on public cloud is expected to reach $600 billion in 2023, almost eight times the roughly $77 billion it started the decade at. Software-as-a-service (SaaS) accounts for over 30% of that spending, with the SaaS segment alone reaching $208 billion in 2023.
Companies like Netflix, Dropbox, and Uber all exist because of the cloud. Cloud-native services give businesses near-instant access to capabilities that would have taken months to build on-premises. The number of cloud tools went through the roof as thousands of new SaaS, gaming, and tech companies were born in the cloud.
Engineering teams needed a way to combine all these new cloud-based data sources, such as ads, ERP, CRM, and payment systems, to calculate metrics like customer acquisition cost. With the shift away from on-premises infrastructure, adoption of ETL (extract-transform-load) solutions accelerated, as building and maintaining data pipelines in-house made less and less sense.
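To make that concrete, here is a minimal sketch of the kind of cross-source metric teams were after: joining ad spend with CRM signups to compute customer acquisition cost. The source names, columns, and numbers are hypothetical placeholders, not from the original article.

```python
# Hypothetical sketch: join ad spend (from an ads platform export) with new
# customers (from a CRM export) to compute customer acquisition cost (CAC).
import pandas as pd

# Monthly ad spend pulled from an ads source
ad_spend = pd.DataFrame({
    "month": ["2015-01", "2015-02"],
    "spend_usd": [12_000.0, 15_500.0],
})

# New customers per month pulled from a CRM export
new_customers = pd.DataFrame({
    "month": ["2015-01", "2015-02"],
    "new_customers": [40, 62],
})

# CAC = ad spend / new customers acquired, per month
cac = ad_spend.merge(new_customers, on="month")
cac["cac_usd"] = cac["spend_usd"] / cac["new_customers"]
print(cac)
```

Simple as it looks, every input above comes from a different vendor's export format, which is exactly where the ingestion pain starts.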
ETL Process
This new paradigm gave businesses deeper insight into their legacy data, enabled visualization on top of a unified view, and gave rise to the data analyst, who enjoyed this new playground of sanitized data.
Advantages of ETL
There are many use cases for ETL in modern applications where security, customization, and data quality are priorities.
ETL is ideal for applications where anonymization and security are key. Healthcare, government, and finance organizations all benefit since they’re subject to compliance regulations like HIPAA or GDPR.
ETL is a good fit when destinations like databases require data to align to a specific schema. That’s because prior to loading, transformations clean and align the incoming data by ironing out the wrinkles brought on by incompatibility.
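As a minimal illustration, the "transform" step in ETL often amounts to coercing every incoming record to the destination schema before anything is loaded. The schema and field names below are hypothetical.

```python
# Hypothetical destination schema; the "transform" step coerces every incoming
# record to it so the load never sees incompatible data.
from datetime import date

DESTINATION_SCHEMA = {
    "customer_id": int,                 # source may send it as a string
    "email": str,
    "signup_date": date.fromisoformat,  # parse "YYYY-MM-DD" strings
}

def conform(record: dict) -> dict:
    """Keep only known columns, coerce types, and fail fast on missing fields."""
    out = {}
    for column, cast in DESTINATION_SCHEMA.items():
        if column not in record:
            raise ValueError(f"missing required column: {column}")
        out[column] = cast(record[column])
    return out

# Incoming rows vary by source: extra fields are dropped, types are normalized.
raw = {"customer_id": "42", "email": "a@example.com",
       "signup_date": "2016-03-01", "utm_source": "newsletter"}
clean = conform(raw)  # ready to load into the destination table
```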
ETL vs ELT vs Reverse ETL
Additionally, ETL is great for operationalizing transformed data (aka Reverse ETL), for example, pushing data to your BI tools for analysis, building ML algorithms, and sending data to SaaS systems like Salesforce.
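In practice, a reverse-ETL push is often just an authenticated API call per changed row. The sketch below is hypothetical: the endpoint, auth scheme, and payload fields are placeholders rather than any specific vendor's API.

```python
# Hypothetical reverse-ETL push: send a metric computed in the warehouse back
# to an operational SaaS tool. The URL, auth header, and fields are placeholders.
import requests

def push_account_score(account_id: str, health_score: float, api_token: str) -> None:
    """PATCH a computed account health score into a (hypothetical) CRM REST API."""
    resp = requests.patch(
        f"https://crm.example.com/api/accounts/{account_id}",
        json={"health_score": health_score},
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=10,
    )
    resp.raise_for_status()

# Typically driven by a query over warehouse tables, one call per changed row:
# push_account_score("acct_123", 0.87, api_token="...")
```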
The Drawbacks of ETL
However, ETL has some drawbacks. It's inherently customizable, which gives you more control over data quality, but at the expense of time and effort: an ETL implementation can take months to put into place. There's also the ongoing overhead of a dedicated engineering team to maintain the pipelines and keep up with changing requirements.
A few other limitations to consider:
- ETL requires data analysts to know the end result beforehand so they can stitch together the data before loading it into the data warehouse.
- Building ETL pipelines frequently goes beyond the technical capabilities of data analysts. Hence, the rise in the number of analytics engineers.
- It is not ideal for near real-time or on-demand data access, where fast response is required.
- Analysts only see the ‘clean data’ and not the raw data.
2015 to 2020: ETL Leaders Make Moves, the Growth of Modern Cloud Data Warehouses, and the ELT Paradigm
The enterprise data integration landscape changed and expanded considerably during this period, for a number of reasons. First, the market leaders made major moves:
- Tibco was taken private by private equity firm Vista Equity Partners in December 2014 for $4.3 billion.
- Informatica was taken private by Canada Pension Plan Investment Board and Permira in a $5.3 billion deal in April 2015.
- Talend went public in July 2016. They snapped up Stitch, a fast-growing, self-service data integration company, in a $60 million deal in 2018.
- Alteryx went public in March 2017.
The Growth of Modern Cloud Data Warehouses
Since 2015, two significant trends have coalesced: the rise of modern cloud data warehouses and the emergence of the ELT paradigm. Cloud data warehouses such as BigQuery, Redshift, and Snowflake quickly became the best place to consolidate data, offering significantly lower computation and storage costs than traditional data warehouses. AWS, for example, had dropped its prices 107 times as of August 2021.
ELT paradigm
In an effort to modernize data storage and analytics processes, data ingestion shifted to ELT solutions. With ELT, data transformation happens after the data has been loaded into a data warehouse or data lake.
The biggest change underpinning ELT is the move toward commodity servers with commodity compute and memory. The price of commodity servers has dropped by over 50% since 2015, giving cloud data warehouses a huge boost: you can now run many smaller data warehouses on a single server at a fraction of the cost of a traditional data warehouse. These systems are also becoming more powerful and efficient, so they can handle very large amounts of data.
The most recent innovation in ELT technology is the use of GPUs, both to accelerate deep learning workloads and to parallelize computation over large datasets. Both techniques are taking off in cloud data warehouses, but only recently have they been combined into single algorithms that reduce cost and increase performance. These algorithms run sparse-grid deep neural networks across multiple GPUs or CPU cores, allowing them to process multiple datasets at once without slowing down or requiring more than one core per dataset, which yields higher throughput and better performance than other available methods.
So what does this all mean?
It means that cloud data warehouses can now compete with traditional systems on every side of the trade-off between cost, performance, and scalability: they match high-performance traditional systems on workloads involving large volumes of data (e.g., searching databases), offer low-cost but highly scalable options for workloads involving small volumes (e.g., statistical and machine learning algorithms), and deliver high performance where the workload is less demanding (e.g., deep learning).
The ELT paradigm makes a lot of sense, especially when dealing with large volumes of data. Cloud environments offer significant computing power at a lower cost, enabling data analysts to transform data as needed within the data warehouse.
Popular solutions have emerged to enable this new architecture. These include:
- Fivetran and Airbyte for EL
- dbt for T
- BigQuery, Redshift, and Snowflake for the data warehouse
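Here is a minimal, self-contained sketch of how these pieces fit together, with sqlite3 standing in for the cloud warehouse and hypothetical table names; in practice an EL tool like Fivetran or Airbyte performs the load and dbt runs the SQL.

```python
# ELT in miniature: land the raw rows first, then transform them with SQL
# inside the warehouse. sqlite3 stands in for Snowflake/BigQuery/Redshift,
# and the table names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# E + L: land the source data as-is, no cleanup yet (the EL tool's job)
conn.execute("CREATE TABLE raw_ad_events (campaign TEXT, clicks TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO raw_ad_events VALUES (?, ?, ?)",
    [("brand", "10", "2021-01-01"),
     ("brand", "12", "2021-01-02"),
     ("retargeting", "7", "2021-01-01")],
)

# T: a dbt-style model, expressed as SQL that runs in the warehouse after loading
conn.execute("""
    CREATE TABLE daily_clicks AS
    SELECT campaign, day, SUM(CAST(clicks AS INTEGER)) AS clicks
    FROM raw_ad_events
    GROUP BY campaign, day
""")

print(conn.execute("SELECT * FROM daily_clicks ORDER BY campaign, day").fetchall())
```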
The stage was set for a paradigm shift in operational efficiency and an explosion of new cloud-native companies.
Matt Turck, a VC at FirstMark, dubbed this explosive growth of Machine Learning, AI, and Data solutions the MAD landscape (see image below).
2020 to Present: More M&A and the Big Shift in First-Mile Data Ingestion
The data ingestion market has grown significantly more active in recent years. Big moves and acquisitions punctuate the landscape, and still the number of ETL players continues to grow. The G2 Grid® for ETL Tools is a prime example of the market as it stands today.
Recent examples include:
- Thoma Bravo took Talend private in March 2021
- Matillion raised $150M series E funding at $1.5B valuation in September 2021
- Informatica returned to the public market in October 2021
- Vista Equity Partners started exploring a sale of Tibco in June 2021
- Alteryx acquired Trifacta in February 2022
- Airbyte acquired Grouparoo in April 2022
The Modern Data Stack in the AI Era
There is a problem in supply chain management known as the first-mile/last-mile problem. The first mile is the distance raw materials must travel from where they are extracted to where they are processed in your supply chain. The last mile, on the other hand, describes the challenge of getting the finished product from the shipping depot to the customer. Your data supply chain is plagued by the same issues.
As noted earlier, Talend's then CEO Mike Tuchen described the first mile of data ingestion as "pulling the data together, cleaning it up, and making it right" during a 2018 Mad Money interview. That has remained the general definition of the first-mile problem: pulling data from a variety of sources, cleaning it up (if using ETL), and ingesting it into your data warehouse. In practice, this might look like using Fivetran to bring LinkedIn ads data into Snowflake for transformation with dbt.
Put another way, the first mile problem focuses on getting data out of operational data sources and into the data warehouse for analysis.
The First Mile Problem Shifts Upstream
A variety of operational data sources, such as ERPs, CRMs, databases, Google Sheets, CSVs, 3rd-party data, and so on, get piped into the data warehouse. But what's the process for ingesting data into these operational sources in the first place? For example, what's the process for ingesting customer data into your product to power your app?
The challenge here is the real "first-mile problem": getting data into your operational sources and translating it into a meaningful, usable form.
Thus, the first-mile problem shifted upstream. Ingesting clean data into your operational sources is now the first-mile problem. The previous first mile is now the middle mile, and the last mile is activating your transformed data in your SaaS application (reverse ETL).
No company operates in a data silo. The average organization employs more than 130 applications, a figure that is increasing by 30% year over year. If your company is to successfully deliver your product’s value, you must invest more time and effort in solving the first-mile problem, especially when you're ingesting hundreds to thousands of external data sources (e.g., customer and partner data) to power your products.
The industry hasn't paid as much attention to this first-mile issue because it’s painful to solve at scale. It’s plagued by schema changes, volume anomalies, and late deliveries, which then spread to your downstream warehouse tables and business processes. These external data sources represent potential points of failure that are beyond the control and scope of a data team - typically falling more into the hands of engineering as a product infrastructure problem.
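To give a sense of what solving this at scale involves, here is a minimal sketch of the checks a first-mile pipeline needs to run on every external delivery: schema drift, volume anomalies, and late arrival. The thresholds, column names, and file layout are hypothetical.

```python
# Hypothetical first-mile checks on an inbound partner CSV: schema drift,
# volume anomalies, and late delivery. Thresholds and columns are placeholders.
import csv
from datetime import datetime, timedelta, timezone
from pathlib import Path

EXPECTED_COLUMNS = {"customer_id", "email", "plan", "signup_date"}
EXPECTED_MIN_ROWS = 1_000                 # volume floor based on past deliveries
MAX_DELIVERY_AGE = timedelta(hours=24)    # partner promises a daily drop

def validate_delivery(path: Path) -> list[str]:
    """Return a list of problems found in an inbound external data file."""
    problems = []

    with path.open(newline="") as f:
        reader = csv.DictReader(f)
        columns = set(reader.fieldnames or [])
        if columns != EXPECTED_COLUMNS:
            problems.append(f"schema drift: got {sorted(columns)}")
        row_count = sum(1 for _ in reader)

    if row_count < EXPECTED_MIN_ROWS:
        problems.append(f"volume anomaly: only {row_count} rows")

    modified = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    if datetime.now(tz=timezone.utc) - modified > MAX_DELIVERY_AGE:
        problems.append("late delivery: file is more than 24 hours old")

    return problems  # anything here should block the load and alert the owner
```

Multiply these checks across hundreds of external sources, each with its own format and cadence, and the scale of the first-mile burden on engineering becomes clear.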
Learnings from this article:
Despite substantial investments in on-premises systems, infrastructure clouds, and application clouds, data silo issues still plague businesses, preventing them from realizing the full value of their data. Data ingestion now takes up roughly one-quarter of our time, and first-mile data ingestion receives far less attention than it deserves.
In future posts, we’ll discuss the steps to solving this new first-mile problem and why current solutions aren’t well-suited to handle it. But first, part 2 of our series investigates the most common sources of data and the one data source you should never take for granted.
Read Part 2 in this series: Data Ingestion 101: The Most Overlooked Data Source Impacting Business Growth
Should You Build or Buy a Data Importer?
But before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.