Data Transformation

Osmos Makes Data Ingestion Even Easier with Enhanced AutoMap AI

Written by 
Kirat Pandya
January 19, 2024

Everyone is rushing to add some form of chat-based, generative AI experience to their products. Most of what you see out there today is the same old product, but now with a chatbot that isn’t very useful yet. 

Osmos is a company that has grounded its product in AI since day one, with cutting-edge features such as SmartFill (built on program synthesis, a form of generative AI we have used since 2019). So today, I’m proud to announce our enhanced AutoMap AI, our first step towards revolutionizing data ingestion with generative AI.

Of course, we aren’t implementing generative AI just to say we have generative AI (ahem, chatbots for the sake of chatbots). We believe that generative AI can offer huge productivity gains without forcing the user to sit and type out instructions for it, one chat message at a time. Our new AutoMap AI makes column mapping, typically a manual process, faster and more accurate by automatically suggesting mappings.

Why You Need Osmos and Our New AutoMap AI

If you’ve been mapping columns without Osmos, you’re no stranger to the misery of manually mapping 100+ columns from many different sources to a destination schema. It could take a couple of hours just to map a single file. Some of our customers even have files with 500+ columns!

Our AutoMap AI automates that task, mapping your columns quickly and more accurately. It maps things like “MSRP” to “Price,” “Energy” to “Calories,” “Vehicle Identification Number” to “VIN,” or “patient_ID” to “insured_ID” with zero clicks. Your team saves time without losing control, since users can easily adjust any suggestion the AI makes. The human is always the supervisor.
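AutoMap’s actual suggestions come from a fine-tuned LLM, which Osmos has not published. As a purely illustrative stand-in, here is how a naive fallback might combine a hypothetical alias table with string similarity to suggest column mappings:

```python
from difflib import SequenceMatcher

# Hypothetical alias hints -- AutoMap's real knowledge lives in its model,
# not a static table. These entries mirror the examples in the text.
SYNONYMS = {
    "msrp": "price",
    "energy": "calories",
    "vehicle identification number": "vin",
    "patient_id": "insured_id",
}

def suggest_mapping(source_cols, dest_cols):
    """Suggest a destination column for each source column."""
    suggestions = {}
    for src in source_cols:
        key = src.strip().lower()
        # 1) Known alias? 2) Otherwise, closest string match.
        alias = SYNONYMS.get(key)
        best = max(
            dest_cols,
            key=lambda d: SequenceMatcher(None, alias or key, d.lower()).ratio(),
        )
        suggestions[src] = best
    return suggestions
```

A real system would still surface each suggestion for review, matching the human-as-supervisor workflow described above.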

Good News for Osmos Customers

Our new AutoMap AI uses generative AI to significantly reduce the manual column mapping required during data ingestion. In a recent test, it automatically mapped 96% of columns across a series of files.

The Technical Scoop

Osmos has trained its own large language model on a huge amount of domain-specific and industry-specific data. This data comes from many sources and is further enriched by the many people who use Osmos every day, feeding it new information about how messy source data relates to clean destination data.

Behind the scenes, we train our model to take in a variety of information, including:

  • Source metadata
  • Source schema
  • Source data sample
  • Destination schema
  • User information
  • Organization information
  • Date and time

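Osmos has not published its actual wire format, but conceptually these signals can be bundled into a single structured payload for the model. A minimal sketch, with illustrative field names only:

```python
import json
from datetime import datetime, timezone

def build_model_input(source_meta, source_schema, sample_rows,
                      dest_schema, user, org):
    """Bundle the signals listed above into one structured payload.

    Field names here are illustrative assumptions, not Osmos's actual format.
    """
    return json.dumps({
        "source_metadata": source_meta,      # e.g. filename, delimiter
        "source_schema": source_schema,      # column names and inferred types
        "source_sample": sample_rows,        # a handful of representative rows
        "destination_schema": dest_schema,
        "user": user,
        "organization": org,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    }, indent=2)
```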
All of this information enables the model to accurately predict source-to-destination field mappings.

An interesting challenge that follows is that a single set of rows may not be a good representative sampling of the problem space. To work around this problem, we take a many-shot approach by sampling various parts of the source data and then scoring the model output to calculate the final mapping. This builds on research from Microsoft that found that model performance can be significantly improved with a many-shot approach.
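The sampling-and-scoring loop described above can be sketched as follows. Here `propose` stands in for the model call, and majority voting is one simple assumption for how the per-sample outputs might be scored into a final mapping:

```python
import random
from collections import Counter

def many_shot_map(rows, source_cols, propose, n_shots=5, sample_size=20):
    """Run the mapping model on several row samples and keep, per source
    column, the destination that wins the most votes.

    `propose` is a placeholder for the real LLM: it takes a sample of rows
    and returns a {source_col: dest_col} mapping.
    """
    votes = {c: Counter() for c in source_cols}
    for _ in range(n_shots):
        # Each shot sees a different slice of the source data, so no single
        # unrepresentative sample dominates the result.
        sample = random.sample(rows, min(sample_size, len(rows)))
        for src, dst in propose(sample).items():
            votes[src][dst] += 1
    # Score the candidates and pick the consensus winner per column.
    return {src: counter.most_common(1)[0][0]
            for src, counter in votes.items() if counter}
```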

[Figure: study results showing GPT-4 performance with many-shot prompting]
A recent research study from Microsoft found that a high quality generalist model can outperform a topic-specific model if a statistical approach is taken to many-shot prompting. In their case, they were able to make GPT-4 outperform many other large models trained on medical information.

By combining fine-tuning, both Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO), with a statistical approach to sampling and prompting, we are able to drive up accuracy while eliminating model hallucinations.
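Osmos hasn’t published its training setup, but the DPO objective itself is standard. As a generic sketch (not Osmos’s implementation), the per-pair loss rewards the policy for preferring the chosen mapping over the rejected one, relative to a frozen reference model:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    logp_* are the policy's log-probabilities of the chosen/rejected output;
    ref_* are the frozen reference model's log-probabilities (standard DPO).
    """
    margin = beta * ((logp_chosen - ref_chosen)
                     - (logp_rejected - ref_rejected))
    # -log(sigmoid(margin)): small when the policy clearly prefers "chosen".
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy and reference agree exactly, the margin is zero and the loss is log 2; the loss shrinks as the policy learns to favor the preferred mapping.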

We decided to build our own models, rather than use one from a foundation model provider, because we believe in using the right tool for the job. Existing models make great conversational assistants, but they fall flat when it comes to data processing.

For our new AutoMap AI, however, we need the large language model (LLM) to be a hybrid machine that understands human language, code, and schema structure, all at low latency. This is possible with heavy prompt engineering, but the prompts are huge and the big proprietary models lack much of the industry-specific knowledge needed. The result is lower accuracy, higher latency, lower throughput, and higher cost.

By comparison, we can fine-tune and align our own models for the exact problem at hand. This lets us use significantly smaller models and give our customers a much faster AutoMap experience than a larger, more generally capable model could provide. Plus, these models can be regularly retrained to keep learning as more people use Osmos every day.

Most importantly, we can guarantee our customers that no third-party model provider ever sees their data. Customers can completely opt out of having their data used for training (potentially giving up some of the improvements that come from learning), or have Osmos build custom models for them: versions of our shared models fine-tuned to their data but used only in their private tenants. The choice is yours.

What's Next

Building AutoMap AI brings us one step closer to reimagining how generative AI can improve every facet of data ingestion, including mundane data transformation tasks. Our goal is to make complex data ingestion easier and faster for teams working with incoming data. We’re excited to bring the latest and greatest in AI advancements to our customers.

Ready to start accelerating your data ingestion processes? Contact us to try it out!


Kirat Pandya

CEO & Co-founder