Using AI to solve data ingestion in a CSV-powered world
The exchange of semi-structured data, especially the CSV, is the fuel that powers global commerce (My thoughts on why this is so: How I Learned to Stop Worrying and Love the CSV). Ingesting all this data into your organization is painful, and the tools that exist require technical expertise that only a small fraction of the organization has.
Our mission at Osmos is to build a data ingestion solution for the rest of us.
Osmos is built to solve data ingestion in a world powered by CSVs (TSVs, XLSX, etc.). It was clear to us that existing tools and technologies are woefully inadequate for a world increasingly moving towards more flexible, customizable data operations, with the expectation that AI will keep getting better at handling this dynamism.
Solving semi-structured data ingestion is critical to growing any business, yet the tools for it are inadequate and painful to use. Most organizations are forced to rely on humans manually cleaning the data, or on trying to automate hundreds or thousands of different file ingestions, one Python script at a time. For the people doing the work, it is an exhausting, thankless job.
Here are a few common perspectives on the data ingestion challenge:
- “It’s just Excel files”: The organizational perception outside the data team is that the data is simple enough to open in Excel, so how hard can it be to automate its ingestion?
- “Just write some Python”: Individually, each type of structured data (e.g., a CSV product catalog from a supplier) is simple enough, so junior developers can just write a Python script to pull it into the database, right?
- “We need to 2x our customer base this year”: That single Python script has now become 200, but as an IT leader, your budget did not even 2x. So how are you going to build and maintain all those scripts?
- “IT is writing crappy code”: Partners change something on their end every so often, breaking a non-trivial percentage of the Python scripts your team has managed to build. But as far as the rest of the organization can tell, “our IT team really struggles” because “they can’t even automate pulling in a simple Excel file.”
- “Let's just use <no-code solution here>”: IT is trying valiantly to maintain the quality bar for data ingestion during this struggle, but there are constant attempts to work around IT. The rest of the org is waiting on their tickets to get picked up, IT can’t keep up, nobody is happy, and data ingestion remains a chronic problem.
At Osmos, we are implementing technologies to solve the end-to-end data ingestion problem, making everyone in the organization happier and more effective.
Our strategy is simple:
- Turn IT from doers to facilitators: Free IT from the Sisyphean task of automating ingestion from each source. Instead, IT can deploy Osmos for the entire organization, empowering each person to own and automate their step of the data lifecycle.
- Enhance data quality through code guardrails: Give IT the tools to build reliable, testable, and scalable infrastructure that protects mission-critical systems from bad data, regardless of who builds the automation that ingests it.
- Allow anyone to automate data ingestion: With the guardrails in place, we can now give everyone in the organization the tools to clean, transform, and ingest data into production systems without needing yet another ticket for IT to resolve. When things break for any reason, the same people can easily and quickly fix the automation they built.
The current bar for every other data product in the market is the “just build better no-code tools and UI” approach. Our vision is much bigger than that. With advances in AI (generative or otherwise), complex transformation and cleaning that would have required code or tremendous manual effort can be outsourced to AI.
Osmos’ approach is to solve key points in the data ingestion process with AI, taking the human toil out of the equation (Excel, Python, or otherwise).
The technology we’ve developed directly addresses the everyday challenges people face when dealing with messy, complex data.
Common scenarios we’ve solved:
- Automatically mapping hundreds of fields: The incoming data has hundreds or thousands of fields that must be mapped to your system’s fields. Osmos can do that automatically (a minimal sketch of the idea follows this list). For example, if you are receiving a warehouse stock information CSV:
- “Product ID” maps to “SKU”
- “Ph. Num” maps to “Phone”
- “Sell Price” maps to “Cost”
- Automatically mapping data to a list of acceptable values: That same warehouse stock information CSV may have a column indicating the color of a product. You need to map their color names to names your system uses. Osmos can automatically understand the semantics of your data and map categorical values from the 3rd party’s data to your categorical values.
- (“Cerulean,” “Midnight,” “Mars”) from the supplier map to (“Blue,” “Black,” “Red”) in your system
- Extracting information semantically: Your partner is sending you messy data with delivery information in the notes field and the delivery address in a single field. Osmos can automatically understand this and extract the data for you.
- From “Deliver notes: Our office is Suite 405B,” extract “Street address 2: Suite 405B.”
- From “Address: 123 main st ste 42, Seattle, WA98101” extract
- Street 1: 123 Main Street
- Street 2: Suite 42
- City: Seattle
- State: Washington
- ZIP: 98101
- Cleaning messy data: Automatically fix the messy data inside each cell. Fix typos and all the confusing ways people send dates, phone numbers, abbreviations, etc. (A hand-coded sketch of two such fixes follows this list.)
- “5-Jun-24” -> “2024-06-05”
- “14253335577” -> “+1 (425) 333-5577”
- “Minsota” -> “Minnesota”
- Reshaping semi-structured data files automatically: Understand the structure and meaning of a file in relation to your business and then automatically reshape it to fit your internal schemas as defined by IT.
- A helpfully formatted (for human aesthetics) Excel invoice with logos, pivot tables, etc. -> An operational table in your Data Lakehouse.
- Understanding relational data: Automatically understand the relationships across multiple related data sets, correlate them to IT-owned internal relational schemas, and automatically reshape, normalize, and ingest the data.
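To give a feel for the field-mapping scenario above, here is a minimal sketch in Python. It is an illustration under assumptions, not Osmos’ implementation: `complete()` is a hypothetical placeholder for whatever model endpoint you use, and the prompt and JSON contract are made up for the example. The same pattern extends to mapping categorical values, such as supplier color names onto your own.

```python
import json

# Hypothetical stand-in for a model call; wire this to whatever LLM provider you use.
def complete(prompt: str) -> str:
    raise NotImplementedError("connect this to your model endpoint")


def propose_field_mapping(source_headers: list[str], target_fields: list[str]) -> dict[str, str]:
    """Ask a model to map third-party column headers onto internal schema fields.

    Returns a dict such as {'Product ID': 'SKU', 'Ph. Num': 'Phone', 'Sell Price': 'Cost'}.
    """
    prompt = (
        "Map each source column header to the closest target field.\n"
        f"Source headers: {source_headers}\n"
        f"Target fields: {target_fields}\n"
        'Respond with only a JSON object of the form {"source_header": "target_field"}, '
        "using null when no target field applies."
    )
    raw = complete(prompt)
    mapping = json.loads(raw)

    # Guardrail: never trust the model blindly. Drop hallucinated headers or targets
    # and leave anything unmapped for a human to review.
    return {
        src: dst
        for src, dst in mapping.items()
        if src in source_headers and dst in target_fields
    }


# Example: a warehouse stock CSV whose headers don't match the internal schema.
# propose_field_mapping(["Product ID", "Ph. Num", "Sell Price"], ["SKU", "Phone", "Cost"])
```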
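The per-cell cleanups in the “Cleaning messy data” scenario are exactly the kind of logic teams hand-code today, one rule at a time. For reference (and again, not Osmos’ implementation), here is what just two of those normalizations look like as hand-written Python. Multiply this by every date format, phone convention, and typo a partner can invent, and the case for handing the job to AI becomes clear.

```python
import re
from datetime import datetime


def normalize_date(value: str) -> str:
    """Normalize a day-month-year string like '5-Jun-24' to ISO format '2024-06-05'."""
    return datetime.strptime(value.strip(), "%d-%b-%y").strftime("%Y-%m-%d")


def normalize_us_phone(value: str) -> str:
    """Normalize digit soup like '14253335577' to '+1 (425) 333-5577'."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # drop the leading country code
    if len(digits) != 10:
        raise ValueError(f"unexpected phone number: {value!r}")
    return f"+1 ({digits[0:3]}) {digits[3:6]}-{digits[6:10]}"


assert normalize_date("5-Jun-24") == "2024-06-05"
assert normalize_us_phone("14253335577") == "+1 (425) 333-5577"
```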
Our end goal is for people to be able to assign data ingestion tasks to the machine, walk away, trust that it will do everything correctly, and verify its work quickly and easily regardless of the size, shape, or messiness of the data.
Osmos is making AI trustworthy for fully automated data operations.
Achieving this goal will require deep, sustained technical innovation, something we are investing in at every layer of the stack. Very often, we meet customers who have tried building this on their own and run into steep challenges at every step:
- LLMs on their own can’t do it: they hallucinate and make mistakes silently and confidently. The full solution requires more than accurate models. The AI must be wrapped in infrastructure that can identify where it is making mistakes, pause the data flow, and notify a human to intervene, supervise, and correct. (A sketch of this pattern follows this list.)
- Larger models are smarter and make fewer mistakes, but they are unaffordably expensive at this scale (see some math here: https://www.osmos.io/blog/using-gpt-4-for-data-transformation). You can’t just send every row of data through trillion-parameter frontier models from OpenAI, Google, or Anthropic.
- Using smaller models makes them affordable, but also dumber and more prone to mistakes. Advanced steering techniques (SFT, RLHF, RLAIF, DPO, ORPO, etc.) must be applied in specific ways to scale model sizes down while increasing accuracy.
- The specific data needed to train these models is scarce. You need correctly correlated sets of “messy source data,” “a natural language description of what was done to the source,” and “the clean output data that came out.” This doesn’t just exist in public or private datasets.
- Inference infrastructure (hardware and software) is extremely immature.
- “Model as a service” providers have all built their APIs with human-machine chat interactions in mind. Their quota limits (and system constraints) expect a low request-per-minute rate. This doesn’t work for data ingestion, where millions of rows of data need to be processed through the models as fast as possible.
- The software inference stacks (including Triton + TensorRT-LLM, vLLM, and Google JetStream) are still very immature, lacking features such as configurable prompt token caching.
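To make the “wrap the AI with guardrails” point concrete, here is a rough sketch of the shape such a wrapper can take. It is an illustration under assumptions, not Osmos’ architecture: `transform_row` and `notify_reviewer` are hypothetical placeholders for the model call and the alerting hook, and the confidence threshold is arbitrary.

```python
from dataclasses import dataclass, field

# Hypothetical placeholders: a provider-specific model call and an alerting hook.
def transform_row(row: dict) -> tuple[dict, float]:
    """Return (cleaned_row, confidence) from the model."""
    raise NotImplementedError


def notify_reviewer(rows: list[dict]) -> None:
    """Ask a human to intervene, supervise, and correct."""
    raise NotImplementedError


@dataclass
class IngestResult:
    accepted: list[dict] = field(default_factory=list)
    quarantined: list[dict] = field(default_factory=list)


def guarded_ingest(rows: list[dict], min_confidence: float = 0.9) -> IngestResult:
    """Load only rows the model is confident about; pause the rest for human review."""
    result = IngestResult()
    for row in rows:
        cleaned, confidence = transform_row(row)
        if confidence >= min_confidence:
            result.accepted.append(cleaned)
        else:
            result.quarantined.append(row)  # keep the original row untouched
    if result.quarantined:
        notify_reviewer(result.quarantined)  # pause here instead of loading bad data
    return result
```

Only high-confidence rows reach production systems; everything else pauses and waits for a person, which is what makes fully automated ingestion trustworthy rather than hopeful.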
Osmos is innovating across every part of this problem space. We are, of course, building our own technologies, but this is larger than just us. We are also partnering deeply with industry leaders to help drive investment, innovation, and integration toward the technologies needed to solve these challenges.
We’ve partnered with the Google TPU team to help drive advancements in the inference software stack, further enabling high request rates, high token throughput, and reliable inference at the scale we need. (Here is the announcement from Google Next 2024: Accelerate AI Inference with Google Cloud TPUs and GPUs - Blog and Session Video).
We aim to solve data ingestion in a way that is accessible to everyone in the organization while keeping the data quality bar high. We know this is an extremely complex technological challenge, and solving it is going to be a game changer. Osmos is up for the task.
Contact an Osmos expert to learn how we can transform your data ingestion workflows.
Should You Build or Buy a Data Importer?
Before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.
View the guide.