FTP to BigQuery: How to Automate Data Ingestion in Minutes
We are seeing a generational change in how companies collaborate and work with one another to raise the tide for everyone. This collaborative working model brings with it the need for ongoing data sharing between the parties involved.
Let’s be honest—dealing with external data is a pain. You have no control over the data, its formats, or its types. Worst of all, it’s never a one-time cleanup; it requires ongoing engineering and operational effort.
At Osmos, we provide solutions that simplify the intake of external data into your operational systems. Osmos Pipelines is designed to automate that ingestion: our no-code data transformation engine lets you validate, map, and clean up messy data, making it simple to bring in cleaner, fresher data.
Let’s take a look at one of the most common scenarios companies deal with on a regular basis: a customer or partner sends you CSV data via FTP. You need to ingest it into your operational system, but the data is in a very different format than your BigQuery schema. You also need to handle null characters, remove duplicates, and pre-process the file before reading it.
In this blog, you'll see how this scenario can be handled using Osmos Pipelines, in five minutes or less.
The scenario: SFTP to BigQuery
This is the setup for this use case:
- Data is in CSV format in an SFTP folder
- The operational database is Google BigQuery
- Field names and types are as indicated in Figure A
Step 1: Define the SFTP Source
The first step is to build a connector to the source system, in this case the SFTP server. After selecting the source, you fill in the connector name and the details needed to connect to the SFTP folder, such as the SFTP host, port number, and so on.
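For context, here is roughly what pulling files from an SFTP drop folder looks like if scripted by hand in Python with paramiko; the host, credentials, and folder path below are placeholders. Once the connector is configured, Osmos handles this step for you.

```python
# Hypothetical hand-rolled equivalent of the SFTP source connector.
# Host, credentials, and paths are placeholders -- substitute your own.
import paramiko

SFTP_HOST = "sftp.partner.example.com"   # placeholder host
SFTP_PORT = 22
SFTP_USER = "ingest_user"                # placeholder credentials
SFTP_PASS = "********"
REMOTE_DIR = "/exports"                  # folder the partner drops CSVs into

transport = paramiko.Transport((SFTP_HOST, SFTP_PORT))
transport.connect(username=SFTP_USER, password=SFTP_PASS)
sftp = paramiko.SFTPClient.from_transport(transport)

# Download every CSV in the drop folder for local processing.
for name in sftp.listdir(REMOTE_DIR):
    if name.lower().endswith(".csv"):
        sftp.get(f"{REMOTE_DIR}/{name}", name)

sftp.close()
transport.close()
```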
You can follow the instructions in the right pane of the page, or on our docs page, to create the connector.
Dealing with CSVs often means dealing with edge cases—files without headers, files with special or null characters, or CSVs that are zipped. With Osmos, handling these edge cases is as simple as configuring the connector for each situation.
More information about these advanced options can be found on our docs pages. Once all the details are provided, test and save the connector.
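To give a sense of what the connector absorbs on your behalf, here is a minimal hand-rolled sketch of those edge cases in pandas; the file names, column names, and cleanup choices are assumptions for illustration only.

```python
# A minimal sketch of common CSV edge cases: null bytes, missing headers,
# duplicate rows, and zipped files. Names below are hypothetical.
import io
import pandas as pd

def read_messy_csv(path: str, has_header: bool, columns: list[str]) -> pd.DataFrame:
    # Strip NUL bytes that sometimes appear in partner exports; the CSV
    # parser can choke on them otherwise.
    with open(path, "rb") as f:
        cleaned = f.read().replace(b"\x00", b"")

    df = pd.read_csv(
        io.BytesIO(cleaned),
        header=0 if has_header else None,
        names=None if has_header else columns,
    )
    # Drop exact duplicate rows, another common cleanup step.
    return df.drop_duplicates()

# Zipped CSVs: pandas infers compression from the .zip extension,
# provided the archive contains a single CSV file.
# df = pd.read_csv("partner_export.zip")
```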
Step 2: Define the BigQuery Destination
Your data needs a destination. In this example, you want to save the data in a table within the Google BigQuery data warehouse. Osmos offers two ways to connect to BigQuery: OAuth and a service account.
After selecting the destination, you provide the project ID, dataset, and table to write the data to. Once all the details are provided, test and save the connector. Another benefit of Osmos Pipelines is that it writes data into existing tables, unlike other solutions that create a new table each time.
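For comparison, appending rows to an existing BigQuery table by hand with the google-cloud-bigquery client looks roughly like the sketch below; the project, dataset, table, and key-file names are placeholders, and a service account is assumed.

```python
# Rough sketch of loading a CSV into an *existing* BigQuery table.
# Project, dataset, table, and file names are placeholders.
from google.cloud import bigquery
from google.oauth2 import service_account

credentials = service_account.Credentials.from_service_account_file("service-account-key.json")
client = bigquery.Client(project="my-project", credentials=credentials)

table_id = "my-project.my_dataset.orders"  # existing destination table

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,                                   # skip the header row
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # append, don't recreate
)

with open("clean_export.csv", "rb") as f:
    job = client.load_table_from_file(f, table_id, job_config=job_config)

job.result()  # wait for the load job to finish
```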
All instructions to create the BigQuery connector can be found in the side panel or in the docs.
Step 3: Data Mapping and Cleanup
You have the source and the destination configured. Now it's time for data mapping. When assigning data to fields in the destination system, you need to ensure both structural integrity (the structure of the data is accurate) and type integrity (the data type is accurate).
Structural integrity can be accomplished in two ways with Osmos:
- Direct source to destination column mapping
- Transforming the data first, and then mapping the transformed output to the destination
Data transformation can further be accomplished in two ways within Osmos:
- Transforming the data using simple formulas before mapping the columns
- Transforming the data using examples, so the system can auto-generate the necessary transformations
Data type integrity can be ensured using in-path data validations. Osmos natively understands data types from the destination system. For example, if the user tries to enter a string in a number field, Osmos warns the user and prompts them to enter only acceptable values. Osmos also understands the date and time formats of the underlying systems and attempts to auto-transform dates and times into the format accepted by the destination system.
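As a rough illustration of the kind of checks this replaces, a hand-written validation pass might look like the following; the column names and formats are assumptions, whereas Osmos derives the real ones from your destination schema.

```python
# Hypothetical validation pass: flag non-numeric values and normalize dates.
# Column names ("quantity", "order_date") are assumptions for illustration.
import pandas as pd

def validate_types(df: pd.DataFrame) -> pd.DataFrame:
    # Numeric field: surface rows that are not numbers instead of loading them blindly.
    df["quantity"] = pd.to_numeric(df["quantity"], errors="coerce")
    bad_rows = df[df["quantity"].isna()]
    if not bad_rows.empty:
        print(f"{len(bad_rows)} rows have a non-numeric quantity and need review")

    # Date field: coerce assorted input formats to the destination's expected format.
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce").dt.strftime("%Y-%m-%d")
    return df
```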
In the video above, you can see these methods in action (a hand-rolled equivalent is sketched after this list):
- Extracting the EAN value from a combined EAN-UPC string and assigning it to the numeric productID field in the destination system, using examples
- Extracting the first name from the full name using just one example
- Extracting and transforming dates using the datetime() formula
- And so on...
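For reference, here is approximately what those transformations look like when written by hand in pandas; the combined EAN-UPC layout, column names, and date format are assumptions, since with Osmos they are inferred from the examples you provide.

```python
# Hand-rolled versions of the example-driven transformations above.
# The "EAN|UPC" layout and column names are hypothetical.
import pandas as pd

df = pd.DataFrame({
    "ean_upc": ["4006381333931|036000291452"],   # assumed "EAN|UPC" layout
    "full_name": ["Jane Smith"],
    "order_date": ["03/15/2023"],
})

# Pull the EAN out of the combined string and cast it for a numeric productID field.
df["productID"] = df["ean_upc"].str.split("|").str[0].astype("int64")

# First name from full name.
df["first_name"] = df["full_name"].str.split().str[0]

# Normalize dates to an ISO format the destination accepts.
df["order_date"] = pd.to_datetime(df["order_date"], format="%m/%d/%Y").dt.strftime("%Y-%m-%d")
```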
For more details on transformation methods and supported formulas, please refer to our documentation.
Step 4: Schedule and Automate the Pipeline
With data mapping complete, you can now choose to save the pipeline as a “run-on-demand” pipeline or a “scheduled” pipeline.
- If set up as a “run-on-demand” pipeline (the “No scheduled runs” option), the pipeline can be manually triggered at any time from the pipeline details page.
- If set up as a “scheduled” pipeline, you can choose the time, day, and frequency at which you want the pipeline to run. The pipeline will then run automatically on that schedule.
Step 5: Save the Pipeline
That’s it! You just automated the process of ingesting messy data into your database without writing a line of code, in less than five minutes.
Simplify Your External Data Ingestion into BigQuery with Osmos Pipelines
Using data onboarding solutions to ingest external data does more than simply save time. It creates exponential value for businesses by cutting down the cost of managing data, freeing up your technical team, and allowing more time to transform data into insights that can impact your business.
Get hands-on. Explore how Osmos Pipelines can help your teams ingest data from external customers and partners, without writing a line of code.
Should You Build or Buy a Data Importer?
But before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.
View the guide