Generative AI is finally here to tackle data ingestion’s most challenging problems
On June 12, 2017, a group of Google researchers released their seminal research paper, “Attention Is All You Need.” The paper introduced the transformer architecture and laid the foundation for revolutionary products like ChatGPT. The result was a wave of excitement and innovation not seen since the heady days of the dot-com revolution. We live in a magical time in which we have an engine at our disposal that can reason and be creative. Its impact is wide-ranging, with AI being applied in areas from agriculture to pharmacology.
Over the course of my 30+ year career, I have run into the need to move data across organizational boundaries again and again. At Intel, I naively suggested onboarding a cheaper source for a touch screen component, and supply chain team members walked me through the complexity of collecting data from thousands of global suppliers. During my time at Microsoft, I implemented a program that gathered interoperability test results from global car manufacturers. While working for Amazon, I collected feedback from our developer ecosystem. These disparate use cases had one thing in common: the data came to my organization as spreadsheets, whether CSVs or Excel files. Despite best efforts to provide clear instructions, the incoming data was a mess! In my Microsoft and Amazon examples, it took more effort to clean the data than it did to implement the program or analyze the results.
Inefficient and low-quality data ingestion impedes the flow of data across enterprises, resulting in a wide range of problems, such as slower revenue realization and incorrect forecasting. When I was presented with the opportunity to join Osmos, I jumped in with both feet! I saw that it was finally viable to solve this problem: with the rapid advancements in generative AI, the technical foundation was in place.
At the heart of data ingestion lies the challenge of transforming messy, imperfect data from a source into a format that maintains integrity and conforms to a destination system’s schema and quality requirements. This task is relatively straightforward when dealing with a single source. However, it becomes daunting when handling thousands of customers supplying large files in varying formats. This situation forces organizations to divert their engineering talent away from building innovative products and toward keeping data operations running.
Osmos breaks the data ingestion process into four distinct but interrelated categories of activities:
- Preprocessing: Read and tabularize the source data, making it ready for processing.
- Mapping: Map the preprocessed data to the destination schema.
- Transformation: Perform necessary transformations and cleanup to meet validation and business requirements.
- Operational Management: Monitor, notify, manage, and report on the data flows.
Each of these steps can become very complicated, especially when dealing with hundreds or thousands of sources. Osmos takes an AI-assisted approach to each of these steps, automating the process where possible with human oversight and continuously learning and adapting. Let’s zoom into each of these stages to understand how!
AI Preprocessing (In Early Evaluation)
Before validation and transformation can begin, data specialists must first organize a messy file into a tabular format. At Osmos, we refer to this step as "preprocessing." Preprocessing can involve choosing delimiters, merging the header section (e.g., invoice number, shipping address) with the data table (e.g., line items), removing empty rows, and unpivoting data. These tasks are typically unique to every file type and are often done manually.
Osmos leverages generative AI to automate these preprocessing tasks, reducing the need for manual intervention while improving efficiency and accuracy across diverse data sources.
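To make the manual version of these tasks concrete, here is a minimal Python sketch of what a data specialist might script by hand for one file layout. It is not how Osmos implements preprocessing; the sample file, the "blank line separates header from table" rule, and the function names are all assumptions made for illustration. It guesses the delimiter, merges the key/value header section into every line item, and drops empty rows (unpivoting is left out).

```python
import csv
import io

CANDIDATE_DELIMITERS = [",", ";", "\t", "|"]

# Hypothetical export: a key/value header section, a blank line, then line items.
SAMPLE = (
    "Invoice Number;INV-1001\n"
    "Ship To;Portland\n"
    "\n"
    "SKU;Qty;Price\n"
    "A-1;2;9.50\n"
    "\n"
    "B-7;1;20.00\n"
)

def guess_delimiter(text):
    # Crude heuristic: pick the candidate character that occurs most often.
    return max(CANDIDATE_DELIMITERS, key=text.count)

def is_empty(row):
    return not any(cell.strip() for cell in row)

def tabularize(text):
    """Return one flat record per line item, with header fields merged in."""
    rows = list(csv.reader(io.StringIO(text), delimiter=guess_delimiter(text)))
    # Assumption: the first blank row separates the header section from the table.
    split_at = next(i for i, row in enumerate(rows) if is_empty(row))
    header_fields = {row[0].strip(): row[1].strip() for row in rows[:split_at]}
    # Remove empty rows from the table section.
    table = [row for row in rows[split_at:] if not is_empty(row)]
    columns, line_items = table[0], table[1:]
    return [{**header_fields, **dict(zip(columns, item))} for item in line_items]

records = tabularize(SAMPLE)
```

Every rule in this script is specific to one layout, which is why hand-written preprocessing does not scale to thousands of sources.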
Automated Column Mapping
Once a file has been preprocessed and its data converted into tabular form, the next step is to map the fields to the internal "golden schema." This task is challenging because each entity and system stores information in its own way: field names may be non-explanatory, the data might not map 1:1, and field names can contain typos and inconsistencies. This process can take days because the data team may need to consult with customers or business teams to understand the data and get their assistance in mapping. The complexity grows rapidly when dealing with hundreds or even thousands of columns.
To automate this process, the underlying engine needs to understand the semantics of not just the field names but also the values in the field. The goal is to automate this process as much as possible, minimizing the need for data specialists to map data manually. Files often have hundreds of columns that need to be mapped, and this is not just a matter of matching similar names; understanding the semantics is crucial. Generative AI is perfectly positioned to automate this time-consuming and laborious process.
Osmos uses large language models (LLMs) to map the source schema to the destination schema. For example, a semantic understanding of the information allows Osmos to determine that a source column named Drug Code maps to the NDC field (NDC stands for National Drug Code).
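The Drug Code example shows why name matching alone falls short. The sketch below is a hand-written heuristic, not the LLM-based mapping Osmos uses: it scores a source column against each destination field by name similarity and by how many sample values fit an expected pattern. The destination schema, the sample values, and the function names are hypothetical, and the NDC pattern assumes the hyphenated 10-digit layouts (4-4-2, 5-3-2, 5-4-1).

```python
import re
from difflib import SequenceMatcher

# Hypothetical destination schema: field name -> pattern its values should match.
DESTINATION_SCHEMA = {
    "NDC": re.compile(r"^(\d{4}-\d{4}-\d{2}|\d{5}-\d{3}-\d{2}|\d{5}-\d{4}-\d)$"),
    "Quantity": re.compile(r"^\d+$"),
}

def name_score(source_name, target_name):
    # Character-level similarity between the two field names (0.0 to 1.0).
    return SequenceMatcher(None, source_name.lower(), target_name.lower()).ratio()

def value_score(sample_values, pattern):
    # Fraction of sample values that fit the destination field's pattern.
    if not sample_values:
        return 0.0
    return sum(bool(pattern.match(v)) for v in sample_values) / len(sample_values)

def suggest_mapping(source_name, sample_values):
    """Return (best destination field, score) using the stronger of the two signals."""
    scores = {
        target: max(name_score(source_name, target), value_score(sample_values, pattern))
        for target, pattern in DESTINATION_SCHEMA.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]

# Made-up values in an NDC-like format, for illustration only.
field, score = suggest_mapping("Drug Code", ["0002-1433-80", "50242-040-62"])
```

Here the names "Drug Code" and "NDC" barely resemble each other, so only the values reveal the match. A regex per field does not scale to an arbitrary schema, which is the gap a model with semantic understanding is meant to close.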
Transformation
Once fields have been mapped, the next step is to clean and transform the data to ensure it meets business and quality requirements. Transformation can take many forms, such as cleaning up individual values within a field, applying formulas to make programmatic changes to data in a column, performing lookups to fetch data from a related dataset, or joining and aggregating data from multiple sources. These processes are essential to prepare the data for accurate analysis and reporting.
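As a rough illustration of those four forms of transformation, the following Python sketch cleans a value, applies a formula, performs a lookup against a related dataset, and aggregates the result. The datasets, field names, and functions are invented for this example and do not reflect Osmos internals.

```python
from collections import defaultdict

# Hypothetical source rows and a related lookup dataset.
orders = [
    {"sku": " a-1 ", "qty": "2", "unit_price": "9.50"},
    {"sku": "B-7", "qty": "1", "unit_price": "20.00"},
    {"sku": "A-1", "qty": "3", "unit_price": "9.50"},
]
catalog = {"A-1": "Cables", "B-7": "Adapters"}

def transform(row):
    sku = row["sku"].strip().upper()                    # clean an individual value
    total = int(row["qty"]) * float(row["unit_price"])  # formula over two columns
    category = catalog.get(sku, "UNKNOWN")              # lookup in a related dataset
    return {"sku": sku, "category": category, "total": round(total, 2)}

def total_by_category(rows):
    # Aggregate the transformed rows.
    totals = defaultdict(float)
    for row in rows:
        totals[row["category"]] += row["total"]
    return dict(totals)

summary = total_by_category([transform(row) for row in orders])
```

A production version would likely use a decimal type for currency; floats keep the sketch short.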
Osmos already offers a range of capabilities to support data management:
- SmartFill: An AI agent that learns from user examples to perform real-time data transformations.
- QuickFixes: Pre-built, one-click transformations for common scenarios.
- Formulas: Excel-like formulas for lightweight scripting needs.
Today, I am thrilled to announce the launch of two new features: AI AutoClean and AI Value Mapping. Let’s look at how these new features work!
AI AutoClean (NEW) - A Gen-AI agent that interprets simple English instructions to transform data.
With AI AutoClean, users can provide simple instructions using everyday language to tell Osmos how to transform the data. For example, this can be as simple as, “Convert to US phone number format in (XXX) XXX-XXXX format.” Here’s an example of a more sophisticated transformation:
AI AutoClean truly democratizes the transformation process, making it viable to transition the work from specialized data experts to everyday users!
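For a sense of what the one-line phone number instruction replaces, here is the kind of function someone would otherwise write by hand. It is a sketch under my own assumptions (strip non-digits, tolerate a leading US country code, return None for anything that is not ten digits) and is not the code AI AutoClean produces.

```python
import re

def format_us_phone(raw):
    """Format a phone number as (XXX) XXX-XXXX, or return None if it cannot be."""
    digits = re.sub(r"\D", "", raw)  # keep digits only
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]          # drop a leading US country code
    if len(digits) != 10:
        return None                  # leave ambiguous values for a human to review
    return f"({digits[:3]}) {digits[3:6]}-{digits[6:]}"

cleaned = [format_us_phone(v) for v in ["415.555.0132", "+1 415 555 0132", "555-0132"]]
```

Even this small case needs decisions about country codes and invalid input, which everyday users should not have to encode themselves.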
AI Value Mapping (NEW) - A Gen-AI agent that automatically maps category values to a predefined list.
AI Value Mapping automatically maps category values to a predefined set, making it easier to standardize data across different sources.
Check out this example where AI Value Mapping accurately maps store department types from source data to a list of acceptable department types in the destination schema.
Manually mapping each of these one by one would be both painful and error-prone. Achieving this programmatically would be equally hard, as it would require incorporating all variations of a field value and any spelling mistakes.
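To illustrate why the programmatic route is hard, consider a simple baseline built on string similarity from Python's standard library. The department list, the cutoff, and the function name are assumptions for this sketch; it is not how AI Value Mapping works.

```python
from difflib import get_close_matches

# Hypothetical list of acceptable department types in the destination schema.
ALLOWED_DEPARTMENTS = ["Bakery", "Produce", "Pharmacy", "Electronics"]

def map_value(raw, allowed=ALLOWED_DEPARTMENTS, cutoff=0.75):
    """Return the closest allowed value by spelling, or None if nothing is close."""
    lookup = {name.lower(): name for name in allowed}
    matches = get_close_matches(raw.strip().lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None

typo = map_value("Pharmcy")          # a misspelling is close enough to match
synonym = map_value("Fruits & Veg")  # a synonym shares almost no characters
```

The baseline recovers the misspelling but returns no match for "Fruits & Veg", even though a person would file it under Produce. Covering cases like that means maintaining a synonym table for every field, which is exactly the effort described above.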
Keeping humans in the loop - SaaS application users expect consistency and correctness, and we know that AI can sometimes make strange recommendations, like suggesting glue on your pizza. Osmos protects users from such mishaps by keeping humans in the loop, making it easy to verify and adjust any output. Our approach to AI saves users hours of work while providing tools to efficiently assess output.
As shown in the example below, users are presented with the output of AI’s attempt at mapping values, clearly indicating when a match couldn't be found. If anything is miscategorized, users can easily identify problems and manually override the system. The AI will adapt and learn from human overrides.
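One simple way to model this review step is a list in which human overrides always take precedence over automated suggestions and unmatched values are flagged explicitly. The sketch below is a hypothetical data structure for illustration only; it stores overrides but does not show how the AI learns from them.

```python
def build_review(source_values, suggestions, overrides):
    """Combine automated suggestions with human overrides into a review list.

    Each entry is (source value, mapped value, status).
    """
    review = []
    for value in source_values:
        if value in overrides:
            # A human decision always wins over the automated suggestion.
            review.append((value, overrides[value], "overridden"))
        elif suggestions.get(value) is not None:
            review.append((value, suggestions[value], "suggested"))
        else:
            # Surface missing matches instead of guessing.
            review.append((value, None, "needs review"))
    return review

review = build_review(
    ["Pharmcy", "Fruits & Veg", "Misc"],
    suggestions={"Pharmcy": "Pharmacy", "Fruits & Veg": None},
    overrides={"Fruits & Veg": "Produce"},
)
```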
Operational Management
Operational efficiency is central to the use of Osmos, facilitating ease of operations throughout the user lifecycle. This encompasses everything from learning how to use Osmos without relying on complex support documentation to implementing Osmos for various use cases and monitoring the operational performance of these use cases.
Today, I am thrilled to announce the launch of Osmos Chat. Powered by GPT, Osmos Chat offers a simple conversational interface that helps users quickly find the insights they need. Instead of providing superficial search results, Osmos Chat delivers deep, context-aware responses that truly answer user questions.
For example, here’s a rich and thoughtful response to a user who asked, “What are Osmos connectors?”
Users can also use Osmos Chat to get the status of their pipelines and jobs, as shown in the example below:
Osmos Chat enables rich analytics and reporting by using function calling to integrate various Osmos APIs, providing users with the most relevant and comprehensive responses to their questions.
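In general terms, function calling means the model replies with a structured request naming a function and its arguments, and the application executes that function and returns the result to the model. The sketch below shows only the dispatch step, with no model call. The function name, its arguments, and the canned statuses are hypothetical and are not the actual Osmos APIs.

```python
import json

def get_pipeline_status(pipeline_id):
    # Hypothetical stand-in for a real API call; returns canned data.
    statuses = {"p-123": "succeeded", "p-456": "failed"}
    return {"pipeline_id": pipeline_id, "status": statuses.get(pipeline_id, "unknown")}

# Registry of functions the model is allowed to request.
TOOLS = {"get_pipeline_status": get_pipeline_status}

def dispatch(tool_call_json):
    """Run the function named in a model-produced tool call and return its result."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Example of the JSON a model might emit when asked about pipeline p-456.
result = dispatch('{"name": "get_pipeline_status", "arguments": {"pipeline_id": "p-456"}}')
```

Restricting calls to an explicit registry keeps the model from invoking anything the application has not deliberately exposed.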
Agility and Commitment Are Key
The Osmos architecture is designed to seamlessly integrate the latest advancements in generative AI while maintaining a best-in-class user experience.
To learn more about our vision for the future, please read the recent blog post by our CEO, Kirat Pandya, about the Osmos AI Strategy.
Discover a better path to clean data
To learn more, contact an Osmos data ingestion expert today.