
How I Learned to Stop Worrying and Love the CSV

Or, What does the Berlin Airlift have to do with data ingestion?

Written by 
Kirat Pandya
June 28, 2024

The last decade of my career has been spent working with small and large-scale data infrastructure of all types, from startups to hyperscale infrastructure providers.

Two things have been consistent across all of it:

1) Every data system at some point either ingests or outputs CSVs.
2) Engineers constantly complain about what a terrible format the CSV is.

Technical outcomes in the real world that seem suboptimal from a narrow view often make sense once you take a step back to understand why we arrived at them. Why did the CSV win against technically better solutions like APIs?

The problem

The need for ever greater, faster data exchange between organizations can be traced back decades, to before computers became fundamental to our lives. Understanding the technological journey that led to the systems powering our world today provides some fascinating insight into how society tries to solve problems and what it optimizes for.

Despite all our efforts, we remain constrained by our inability to make the flow of information faster and smoother. We spend $1.9T annually on just manually cleaning up the data mess we create for ourselves as data crosses organizational boundaries (McKinsey Global Institute. A future that works: Automation, Employment and Productivity. January 2017).

Let’s look at an average modern US company. They probably work with hundreds of other companies, receiving new data from partners, vendors, and customers every day. Pick a single type of data (e.g., warehouse product inventory), and even within that scope, the schema of the data they receive from each partner is different. This data probably does not come in over a neat, engineered API but rather as a bunch of CSV and Excel files that humans (or really brittle Python scripts) then try to reshape, clean, and ingest into operational systems.

To understand why, we need to rewind the clock about 75 years.

To build a better future, we must learn from our past

Line-drawn vector image of a WW2 Douglas C-54 Skymaster airplane.

The Berlin Airlift

Our story starts back in 1948.

The Berlin Blockade was a major crisis during the Cold War. During the occupation of post-war Germany, the Soviets blocked rail, road, and canal access to the parts of Berlin controlled by the Western powers.

The Berlin Airlift (Operation Vittles, Jun 24, 1948 – May 12, 1949) was the Allies’ answer: fly supplies into the parts of Berlin that were inaccessible on the ground through the Soviet occupation zone. In all, the airlift flew more than 250,000 sorties and delivered some 2.3 million tons of food and supplies. At its peak, one aircraft landed in West Berlin every 30 seconds.

“But tracking the cargo, which had to be loaded and unloaded at top speed, was next to impossible with shipping manifests in different forms and, sometimes, different languages,” – Frank Hayes, in The Story So Far.

Enter U.S. Army Master Sgt. Edward A. Guilbert and his colleagues. They created a standard manifest system that could be transmitted by telex, radio-teletype, or telephone. This system was used at scale to operate the logistics of the airlift.

Guilbert later joined DuPont, where he helped develop a set of standardized electronic cargo messages for communicating with tank truck carrier Chemical Leaman Tank Lines. This was the early 1960s, and the transportation industry was bogged down with paper, paper, and more paper. Guilbert’s work was a breakthrough.

The first such messages were sent in 1965 when the Holland-American steamship line sent trans-Atlantic shipping manifests using telex messages, which could send a full page of information in roughly 2 minutes. These messages were then converted into tape that could be loaded onto computers.

EDI: The progenitor

As industries began adopting electronic messaging, everything depended entirely on proprietary implementations at both ends. This was, of course, untenable at industry scale. Things advanced rapidly from there.

EDI was standardized as a structured, flat file format. Different industries had different needs. Competitors within the same industry ran their businesses differently, and so their data exchange needs were different. So the actual meaning of the contents of an EDI file was itself standardized into many different standards (13 as of today; see https://en.wikipedia.org/wiki/Electronic_data_interchange#Standards).

Image of a long string of EDI text. One can easily notice the standardization at the beginning of each line. Codes like N3, N4, PO1, and PO4 stand out.
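
To make that a little more concrete, here is a minimal Python sketch that splits a simplified, X12-style EDI string into segments and elements. The sample data is invented for illustration and is not a faithful rendering of any particular EDI standard, though segment IDs like N3/N4 (address lines) loosely follow X12 conventions.

    # A simplified, X12-style EDI fragment (invented for illustration).
    # Segments end with "~" and elements are separated by "*".
    raw = "N1*ST*ACME WAREHOUSE~N3*123 MAIN ST~N4*SEATTLE*WA*98101~PO1*1*48*CA*12.50~"

    # Split the flat string into segments, then each segment into elements.
    segments = [seg.split("*") for seg in raw.split("~") if seg]

    for seg_id, *elements in segments:
        print(seg_id, elements)

    # N1 ['ST', 'ACME WAREHOUSE']
    # N3 ['123 MAIN ST']
    # ...and so on. The shape is trivial to parse, but the meaning of every
    # position lives in the out-of-band standard, not in the file itself.

Parsing the structure is easy; knowing what the third element of a PO1 segment means requires the specification document, which is exactly the tension the rest of this post is about.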

The underlying tension that got us here is rooted in humanity itself. We all want to be different, be unique. This percolates through our thinking, languages, and eventually, our organizations. However, our computer systems don’t operate that way, and so there is a deep desire for more flexible standards. These standards should enable predictability and automation without completely trading away our ability to be different.

Age of the API

The dot-com boom brought with it the rise of APIs. Every organization built an API... server. While EDI took 70 years to end up with a dozen or so standards, the API world has millions upon millions of standards. We just call them “API specifications” instead of “standards” now. And every company has one.

The challenge is that APIs are extremely one-sided. Most organizations have only one API surface, their own, to maintain, but they are staring at the prospect of writing hundreds or even thousands of API clients for other organizations’ APIs.

Imagine you are the CTO of a company that operates in the supply chain. You work with hundreds of suppliers and channel partners. Back in 2015, you probably built yourself an awesome "supplier portal," complete with its own API. Everyone was excited to be an API-first business where all those suppliers and channel partners would integrate with your APIs, and everything would be magical.

That future never came, and in hindsight, it's pretty clear why. You have 2000 partners. Of those, 1000 have their own APIs. Writing an API client for each of your partners' APIs has an ROI so negative that you'd be laughed out of the room for even suggesting it. Of course, each of your partners faces the same dilemma.

Every company has an API, but nobody has the resources to write all the API clients they want.

Celebrating diversity

While the varieties of data being exchanged between orgs are, for all intents and purposes, infinite, there are some core things every business relies on. ERP (Enterprise Resource Planning) is one such system.

With the top 3 ERP Software Vendors controlling 70% of market share, one would assume that the core data models are the same. After all, these systems all hold and exchange the same logical data.

Humans and our organizations are complex, however. Even with such close overlap, our data workloads end up looking drastically different. Accepting this fundamental piece of human nature is important to pushing things forward.

Powering humanity with the humble CSV - Desire paths

The term “desire paths” refers to the trails created by people or animals as they navigate from point A to point B, deviating from constructed routes and paved walkways. These paths illustrate a fundamental principle: systems often fail to anticipate or accommodate actual user behavior, resulting in unplanned and organic solutions that better meet the needs of those using the system.

Desire paths tend to be controversial. One group sees them as disrespectful, a desecration of beautiful, well-laid-out architecture. The other side sees them as function winning out over form.


Image of a green lawn where people have forged their own path between two sidewalks cutting directly through the grass. An eScooter sits by the dirt path.
Licensed under Creative Commons CC https://commons.wikimedia.org/wiki/File:Desire_path_-_52849400711.jpg

In software systems, desire paths usually show up as users “using it incorrectly” or “holding it wrong”. Oftentimes, users will choose to solve problems with more clicks and more complex workflows rather than following the happy “user journeys” that the designers of systems have laid out. It leaves software teams baffled and frustrated.

As technologists, we should not be fighting these desire paths. We should be embracing them. They are the clearest and loudest signals from our users about what they want the system to do and how.

The humble CSV is a desire path. It is a rejection of attempts to trade off flexibility in the pursuit of structure and automation. Let’s understand why.

Learning to love the CSV

Image of a table comparing the features of CSVs, the API as a modern alternative, and EDI as the progenitor. CSVs obviously out-feature APIs and EDIs.

As a data exchange format, the CSV is the most accessible and inclusive option, enabling everyone, not just engineers, to work with data. It gets out of your way and lets you solve your complex problems in a manner that works best for you.

However, there is no such thing as a free lunch, and working with CSVs can bring very real pain. Let’s explore why and how:

  • CSVs must be read sequentially, and reads cannot easily be parallelized in the general case. You can’t just split the file into chunks at newline characters because you cannot tell whether a newline is indeed a line ending for a row or part of a quoted string field in the middle of a row (see the sketch after this list). Pre-defined schemas and sorted data can work around this, but that doesn’t really help when schemas are defined by many parties in data exchange scenarios.
  • While there are standards (RFC 4180 proposes a specification for the CSV format, with more W3C guidance out there, too), they mostly function as conventions that are frequently ignored. For example, it is common in Germany and Brazil to use semicolons as delimiters in files with a .csv extension. This makes automated machine ingestion difficult.
  • Values are limited to strings, so you lose type information (dates, booleans, etc.) when data is converted to CSV.
  • CSVs are not fully self-describing. Looking at the headers of a file doesn’t usually tell you anything about the actual schema. Things like nullability and data types are left as an inference exercise for the reader, either by looking at the data in the file or through information communicated out of band. While this isn’t ideal, it’s not that different from formats like JSON, and it’s far better than formats like EDI, whose field names cannot be interpreted without an out-of-band specification document.
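
To make the first two pitfalls concrete, here is a small Python sketch (the column names and values are made up): a naive split on newlines tears a quoted field apart, while a proper CSV parser has to scan the file sequentially to track quoting state, and even then hands every value back as an untyped string.

    import csv, io

    # A tiny CSV where one field contains an embedded newline and a comma.
    raw = 'sku,description,qty\nA-100,"widget, 2-pack\n(backordered)",5\nB-200,gadget,12\n'

    # Naive approach: split on newlines, then on commas.
    naive_rows = [line.split(",") for line in raw.splitlines()]
    print(len(naive_rows))   # 4 "rows": the quoted newline produced a phantom row

    # A real CSV parser tracks quoting state, which is why it must read sequentially.
    rows = list(csv.reader(io.StringIO(raw)))
    print(len(rows))         # 3 rows: header + 2 records
    print(rows[1])           # ['A-100', 'widget, 2-pack\n(backordered)', '5']

    # Note that every value, including qty, comes back as a string; the file says
    # nothing about types or nullability.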

All these drawbacks have made CSV ingestion extremely difficult to automate. When your organization’s partners, vendors, customers, etc., send you data, you are faced with an impossible number of source schemas that need to be ingested into your organization’s representative golden schema.
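
As a rough sketch of what that burden looks like, imagine maintaining a hand-written mapping from every partner’s column names into your golden schema. The partner names, columns, and golden schema below are all hypothetical, and real files are far messier, but the shape of the problem is the same:

    # Every partner names (and shapes) the same logical data differently, so each
    # one needs its own hand-maintained mapping into your golden schema, whose
    # columns here are "sku", "quantity", and "warehouse".
    PARTNER_MAPPINGS = {
        "acme":   {"Item #": "sku", "Qty On Hand": "quantity", "Site": "warehouse"},
        "globex": {"SKU": "sku", "units": "quantity", "DC Code": "warehouse"},
        # ...times hundreds of partners, each of which changes its files over time.
    }

    def normalize(partner: str, row: dict) -> dict:
        """Rename one partner's columns into the golden schema."""
        mapping = PARTNER_MAPPINGS[partner]
        return {mapping[col]: value for col, value in row.items() if col in mapping}

    print(normalize("acme", {"Item #": "A-100", "Qty On Hand": "5", "Site": "PDX-1"}))
    # {'sku': 'A-100', 'quantity': '5', 'warehouse': 'PDX-1'}

Multiply that by every partner, every file type, and every silent schema change, and it is clear why so much of this work is still done by hand.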

We at Osmos believe we have the solution for this.


Kirat Pandya

CEO & Co-founder