Ultimate Data Glossary | 121 Must-Know Data Terms | Osmos
The data landscape is growing, and so is our terms glossary. We started with the top 101 terms, and we’ve expanded the list to include 121 must-know data terms.
It's critical that teams learn how to speak each other's language. While internal teams have some shared vocabulary, plenty of data terms get thrown around in meetings that leave people scratching their heads. We thought it was time to create a blog post that could serve as a holistic data terms glossary: one that not only defines each term but also offers some helpful resources in case you want to learn about them in more depth.
The data world is full of terms that are critical to your success. These are just 121 of the most used data terms at the moment. As new trends emerge, we'll continue to add new terms so you can keep learning.
Must-know Data Terminology
API - An application programming interface (API) is a set of defined rules and protocols that explain how applications talk to one another.
Artificial Intelligence - Artificial intelligence (AI) is a wide-ranging branch of computer science concerned with building smart machines capable of performing tasks commonly associated with intelligent beings.
Batch Processing - the running of high-volume, repetitive data jobs that run without manual intervention and are typically scheduled to run as resources permit. A batch process has a beginning and an end.
Big Data - refers to data that is so large, fast-moving, or complex that it's difficult or impossible to process using traditional methods, and that keeps growing exponentially over time.
BigQuery - a fully-managed, serverless data warehouse that enables scalable analysis over petabytes of data by enabling super-fast SQL queries using the processing power of Google's infrastructure.
Business Intelligence (BI) - Business intelligence is a technology-driven process that consists of data collection, data exchange, data management, and data analysis. Uses include business analytics, data visualization, reporting, and dashboarding that deliver actionable information so organizations can make better data-driven decisions.
CSV - A Comma-Separated Values (CSV) file is a delimited text file in which information is separated by commas, and is common in spreadsheet apps.
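For illustration, here's a minimal sketch using Python's built-in csv module to parse a small CSV document held in a string; the column names and values are made up for the example.

```python
import csv
import io

# A tiny CSV document: the first row holds column headers,
# and each subsequent row is one comma-separated record.
raw = "name,email,plan\nAda,ada@example.com,pro\nGrace,grace@example.com,free\n"

# DictReader maps each row to a dict keyed by the header row.
for row in csv.DictReader(io.StringIO(raw)):
    print(row["name"], row["plan"])
```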
CSV Import - A CSV import is the process of importing data from one application to another in CSV (comma-separated values) format.
CSV Import Automation: CSV import automation is a time-saving measure wherein a CSV file is automatically pulled from a designated location and exported to another application.
Customer Data Onboarding - the process of ingesting online and offline customer data into a product’s operational system(s) in order to successfully use that product.
Customer Data Platform (CDP) - software that consolidates, integrates, and structures customer data from a variety of touchpoints into one single database, creating a unified customer view so marketing teams have the relevant insights they need to run campaigns.
Data Analytics: Data analytics is the science of analyzing raw data to identify trends and patterns, answer questions, and glean insights with the goal of making data-driven decisions.
Data Architecture - defines how information flows in an organization for both physical and logical data assets, and governs how they are controlled. The goal of data architecture is to translate business needs into data and system requirements and to manage data and its flow through the enterprise.
Data Augmentation - a technique to artificially increase the amount of training data from existing training data without actually collecting new data.
Data Capture - the process of collecting information and then converting it into data that can be read by a computer.
Data Catalog - a neatly detailed inventory of all data assets in an organization. It uses metadata to help data professionals quickly find, access, and evaluate the most appropriate data for any analytical or business purpose.
Data Center - a large group of networked computer servers typically used by organizations to centralize their shared IT operations and equipment like mainframes, servers and databases.
Data Cleanroom - a secure, protected environment where PII (Personally Identifiable Information) data is anonymized, processed, and stored to give teams a secure place to bring data together for joint analysis based on defined guidelines and restrictions.
Data Cleansing - the process of preparing data for analysis by amending or removing incorrect, corrupted, improperly formatted, duplicated, irrelevant, or incomplete data within a dataset.
Data Collaboration - Data collaboration is the practice of using data to enhance customer, partner, and go-to-market relationships and create new value. Anytime two or more organizations are in a business relationship, data sharing and data collaboration can be seen in action.
Data Democratization - the principle that data should be accessible to the average user without gatekeepers or barriers in the way.
Data Dashboard: A data dashboard is the visual expression of related data sets used to track and analyze data.
Data Ecosystem: A data ecosystem is a complex network of infrastructure, applications, tools, platforms, and processes used to collect, aggregate, and analyze information.
Data Enrichment - the process of enhancing, appending, refining, and improving collected data with relevant third-party data.
Data Exchange - the process of taking data from one file or database format and transforming it to suit the target schema.
Data Extensibility - the capacity to extend and enhance an application with data from external sources such as other applications, databases, and one-off datasets.
Data Extraction - the process of collecting or retrieving data from a variety of sources for further data processing, storage or analysis elsewhere.
Data Fabric - a single environment consisting of a unified architecture with services or technologies running on that architecture to enable frictionless access and sharing of data in a distributed data environment.
Data Governance - the system for defining the people, processes, and technologies needed to manage, organize, and protect a company’s data assets.
Data Health - the state of company data and its ability to support effective business objectives.
Data Hygiene - the ongoing processes employed to ensure data is clean and ready to use.
Data Import - the process of moving data from external sources into another application or database.
Data Ingestion - the process of transporting data from multiple sources into a centralized database, usually a data warehouse, where it can then be accessed and analyzed. This can be done in either a real-time stream or in batches.
Data Integration - the process of consolidating data from different sources to achieve a single, unified view.
Data Integrity - the overall accuracy, consistency, and trustworthiness of data throughout its lifecycle.
Data Intelligence - the process of analyzing various forms of data in order to improve a company’s services or investments.
Data Interoperability - the ability of different information technology systems and software applications to create, exchange, and consume data in order to use the information that has been exchanged.
Data Joins - combining multiple data tables based on a common field between them, or "key." The most common join types are inner, left outer, right outer, full outer, cross, and self joins.
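As a quick illustration, the sketch below uses pandas (a common third-party Python library) to join two small, made-up tables on a shared key:

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Ada", "Grace", "Linus"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "total": [20.0, 35.5, 12.0]})

# Inner join: keep only customers that appear in both tables.
inner = customers.merge(orders, on="customer_id", how="inner")

# Left outer join: keep every customer, filling missing order data with NaN.
left = customers.merge(orders, on="customer_id", how="left")

print(inner)
print(left)
```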
Data Lake - a centralized storage repository that holds large amounts of data in its natural/raw format.
Data Lineage - tells you where data originated. It’s the process of understanding, recording, and visualizing data as it flows from origin to destination.
Data Loading - Data loading is the "L" in "ETL" or "ELT." After data is retrieved and combined from multiple sources (extracted), then cleaned and formatted (transformed), it is packed up and moved into a designated data warehouse.
Data Lookup: Data lookup is the process of retrieving specific information from a structured data file. Using a data lookup is a way to populate information based on predetermined rules.
Data Manipulation - the process of organizing data to make it easier to read or more structured.
Data Mapping - the process of matching data fields from one or multiple source files to a data field in another source.
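As a simple illustration, here's a minimal Python sketch of field-level mapping; the source and destination field names are hypothetical.

```python
# Map source field names to the destination schema's field names.
FIELD_MAP = {
    "First Name": "first_name",
    "E-mail Address": "email",
    "Company": "account_name",
}

def map_record(source_row: dict) -> dict:
    """Rename the fields of one source record to match the target schema."""
    return {dest: source_row.get(src) for src, dest in FIELD_MAP.items()}

print(map_record({"First Name": "Ada", "E-mail Address": "ada@example.com", "Company": "Acme"}))
```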
Data Masking - a data security technique in which a dataset is copied but with sensitive data obfuscated. Also referred to as data obfuscation.
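A minimal Python sketch of the idea, using made-up field names:

```python
def mask_email(email: str) -> str:
    """Obfuscate the local part of an email address, keeping the domain."""
    local, _, domain = email.partition("@")
    return local[0] + "***@" + domain if local else email

record = {"name": "Ada Lovelace", "email": "ada@example.com"}
masked = {**record, "email": mask_email(record["email"])}
print(masked)  # {'name': 'Ada Lovelace', 'email': 'a***@example.com'}
```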
Data Mesh - a highly decentralized data architecture in which domain-oriented teams own and serve their data as products, in contrast to a centralized, monolithic architecture built around a data warehouse and/or a data lake. It ensures that data is highly available, easily discoverable, secure, and interoperable with the applications that need access to it.
Data Migration - the process of transferring internal data between different types of file formats, databases, or storage systems.
Data Mining - the process of discovering anomalies, patterns, and correlations within large volumes of data to solve problems through data analysis.
Data Modeling - the process of visualizing and representing data elements and the relationships between them.
Data Munging - the preparation process for transforming data and cleansing large data sets prior to analysis.
Data Onboarding - the process of bringing clean external data into applications and operational systems.
Data Orchestration - the process of gathering, combining, and organizing data to make it available for data analysis tools.
Data Pipeline - the series of steps required to move data from one system (source) to another (destination).
Data Platform: A data platform is an integrated set of technologies used to collect and manage data, usually comprised of hardware and software tools used to deliver reporting and business insights.
Data Portability - the ability to move data among different applications, programs, computing environments or cloud services. For example, it lets a user take their data from a service and transfer or “port” it elsewhere.
Data Privacy - a branch of data security concerned with the proper handling of data: consent, notice, and regulatory obligations. It relates to how a piece of information, or data, should be handled, with a focus on compliance with data protection regulations.
Data Replication - the process of storing your data in more than one location to improve data availability, reliability, redundancy, and accessibility.
Data Quality - a measure of how well a data set serves the specific needs of an organization, based on factors such as accuracy, completeness, consistency, reliability, and whether it's up to date.
Data Science - a multidisciplinary approach to extracting actionable insights from the large and ever-increasing volumes of data collected and created by today’s organizations.
Data Scientist - a professional who uses technology for collecting, analyzing and interpreting large amounts of data. Their results create data-driven solutions to business challenges.
Data Scrubbing - the procedure of modifying or removing data from a database that is incomplete, incorrect, inaccurately formatted, repeated, or outdated.
Data Security - the practice of protecting data from unauthorized access, theft, or data corruption throughout its entire lifecycle.
Data Sharing - the ability to distribute the same sets of data resources with multiple users or applications while maintaining data fidelity across all entities consuming the data.
Data Silo - a collection of information within an organization that is scattered, not integrated, and/or isolated from one another, and generally not accessible by other parts of the organization.
Data Source: A data source is the origin of a set of information. It can be a location, system, database, document, flat file, or any other readable digital format.
Data Stack - a suite of tools used for data loading, warehousing, transformation, analytics, and business intelligence.
Data Structure: A data structure is a specialized format for storing, processing, organizing, and retrieving data. This is often a collection of data values, their relationships, functions, and operations that can be applied to the data.
Data Supply Chain: The data supply chain encompasses the four phases of the chain of custody of data. It begins with the data source, where your data is created and collected; moves to data exchange, where data is transferred from its original location; continues to data management, where data is cleaned, transformed, enriched, and moved to its intended destination; and ends with the data being leveraged as business intelligence (BI).
Data Transfer - the movement of information between systems or organizations, whether transferred or shared.
Data Transformation - the process of converting the format, structure, or values of data to another, typically from the format of a source system into the required format of a destination system.
Data Upload - the transmission of a file from one computer system to another.
Data Validation - ensuring the accuracy and quality of data against defined rules before using, importing or otherwise processing data.
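For example, here's a minimal Python sketch of rule-based validation run before data is imported; the rules and field names are purely illustrative.

```python
def validate(record: dict) -> list:
    """Return a list of rule violations for one incoming record."""
    errors = []
    if not record.get("email") or "@" not in record["email"]:
        errors.append("email must contain '@'")
    if not isinstance(record.get("quantity"), int) or record["quantity"] < 0:
        errors.append("quantity must be a non-negative integer")
    return errors

print(validate({"email": "ada@example.com", "quantity": 3}))  # []
print(validate({"email": "not-an-email", "quantity": -1}))    # two violations
```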
Data Virtualization - the process of aggregating data across disparate systems to develop a single, logical and virtual view of information so that it can be accessed by business users in real time.
Data Warehouse - a repository for structured, filtered data that has already been processed for a specific purpose. Examples include BigQuery, Redshift, & Snowflake.
Data Wrangling - the process of restructuring, cleaning, and enriching raw data into a desired format for easy access and analysis.
Data Workflows - a sequence of tasks that must be completed and the decisions that must be made to process a set of data.
Database - an organized collection of structured information, or data, typically stored electronically in a computer system so that it can be easily accessed, managed and updated. Examples include MySQL, PostgreSQL, Microsoft SQL Server, MongoDB, Oracle Database, and Redis.
Database Schema - the collection of metadata that describes the relationships between objects and information in a database. It's the blueprint or architecture of how your data will look.
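For illustration, the sketch below uses Python's built-in sqlite3 module to define a tiny two-table schema with a foreign-key relationship; the table and column names are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema is the blueprint: tables, columns, types, and relationships.
conn.executescript("""
CREATE TABLE customers (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customers(id),
    total       REAL
);
""")
print([row[0] for row in conn.execute("SELECT name FROM sqlite_master WHERE type='table'")])
```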
Dataflows - the paths data takes as it moves from one part of an information system to another.
DataOps - the practice of operationalizing data management, used by analytics and data teams to develop data analytics, improve its quality, and reduce its cycle time.
Dataset - a structured collection of individual but related items that can be accessed and processed individually or as a unit.
Data Stewardship: The management and oversight of an organization's data assets to ensure data quality, compliance, and proper usage throughout its lifecycle.
Deep Learning - a subfield of machine learning that trains computers via algorithms to do what comes naturally to humans such as speech recognition, image identification and prediction making.
Dev-Driven Data Ingestion: Dev-driven data ingestion is a data ingestion process that is either initiated and owned by people on the development team or cannot be completed without developer involvement.
Dummy Data - mock data that has the same content and layout as real data in a testing environment.
Electronic Data Interchange (EDI) - the intercompany exchange of business documents in a standard electronic format between business partners.
ELT - stands for Extract, Load, and Transform. The data is extracted and loaded into the warehouse directly without any transformations. Instead of transforming the data before it’s written, ELT takes advantage of the target system to do the data transformation.
ETL - stands for Extract, Transform, and Load. ETL collects and processes data from various sources into a single data store making it much easier to analyze.
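To make the order of operations concrete, here's a minimal Python sketch of an ETL-style flow (extract from a CSV string, transform in memory, load into SQLite); in an ELT flow, the raw rows would be loaded first and transformed inside the warehouse. All names here are illustrative.

```python
import csv, io, sqlite3

# Extract: read raw rows from a source (a CSV string stands in for a real file or API).
raw = "name,amount\nAda, 20\nGrace, 35 \n"
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: clean and normalize before loading.
cleaned = [(r["name"].strip(), float(r["amount"])) for r in rows]

# Load: write the transformed records into the destination store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE payments (name TEXT, amount REAL)")
conn.executemany("INSERT INTO payments VALUES (?, ?)", cleaned)
print(conn.execute("SELECT * FROM payments").fetchall())
```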
First Mile of Data: The first mile of data refers to the challenge of pulling data from a variety of sources, cleaning it up, and ingesting it into your operational systems.
First-Party Data - the information you collect directly from your owned sources about your audience or customers.
Flat File: A flat file is a collection of data stored in a two-dimensional database in which similar yet discrete strings of information are stored as records in a table. The columns of the table represent one dimension of the database, while each row is a separate record.
FTP - File Transfer Protocol (FTP) is a standard communication protocol that governs how computers transfer files between systems over the internet.
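As an example, the sketch below uses Python's built-in ftplib to download a file; the server, credentials, and file name are placeholders, not a real host.

```python
from ftplib import FTP

# Hypothetical server and credentials, for illustration only.
with FTP("ftp.example.com") as ftp:
    ftp.login("user", "password")
    with open("report.csv", "wb") as f:
        ftp.retrbinary("RETR report.csv", f.write)  # stream the remote file to disk
```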
JSON - JavaScript Object Notation (JSON) is a text-based, human-readable data interchange format for storing and transporting data.
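For example, here's a small JSON document parsed and re-serialized with Python's standard json module (the keys are invented):

```python
import json

doc = '{"customer": {"name": "Ada", "plan": "pro"}, "active": true}'
data = json.loads(doc)             # parse text into Python dicts/lists
print(data["customer"]["name"])    # -> Ada
print(json.dumps(data, indent=2))  # serialize back to formatted JSON text
```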
Master Data Management (MDM) - the technology, tools, and processes used to ensure that the organization's data on customers, products, suppliers, and other business partners is consistent, uniform, and accurate.
Metadata - data that describes and provides information about other data.
Modern Data Stack (MDS): A set of technologies and tools used to collect, process, and analyze data. The tools in the modern data stack range from turnkey solutions to customizable products designed to solve complex data situations.
MySQL - pronounced "My S-Q-L" or "My Sequel," it is an open-source relational database management system (RDBMS) that’s backed by Oracle Corp. Fun fact: it’s named after My, the daughter of Michael Widenius, one of the product’s originators.
No-code ETL - the ETL process is performed using software that has automation features and a user-friendly interface (UI) with various functionalities to create and manage the different data flows.
NoSQL - a non-relational database that stores and retrieves data without needing to define its structure first, an alternative to the more rigid relational database model. Instead of storing data in rows and columns like a traditional database, a NoSQL database stores each item individually with a unique key.
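Conceptually, a key-value or document store keeps each item whole under a unique key rather than splitting it across rows and columns. The Python sketch below only illustrates the shape of the data, not a real NoSQL engine:

```python
# Each record is stored as a complete document under one key.
store = {
    "user:1001": {"name": "Ada", "plan": "pro", "tags": ["beta", "api"]},
    "user:1002": {"name": "Grace", "plan": "free"},  # documents need not share a schema
}
print(store["user:1001"]["tags"])
```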
Ops-Driven Data Ingestion: Ops-driven data ingestion is a data ingestion process initiated and owned by, or that can be independently completed by, people on the implementation and operations side of an organization.
PostgreSQL - a free and open-source object-relational database management system emphasizing extensibility and SQL compliance. It’s the fourth most popular database in the world.
Product Catalog Data: A structured data file that contains product details from your business or a seller, vendor, or partner. This may include images, descriptions, specifications, availability, pricing, shipping info, etc.
Program Synthesis - the task of automatically constructing a program that provably satisfies a given high-level formal specification. It’s the idea that computers can write programs automatically if we just tell them what we want. Bonus: Program Synthesis is what’s behind the Osmos AI-powered data transformation engine that lets end users easily teach the system how to clean up data.
Raw Data - a set of information that’s gathered by various sources, but hasn’t been processed, cleaned, or analyzed.
Relational Database (RDBMS) - a type of database where data is stored in the form of tables and connects the tables based on defined relationships. The most common means of data access for the RDBMS is SQL.
RESTful APIs - APIs that conform to Representational State Transfer (REST), an architectural style for designing networked applications that provides a convenient, consistent approach to requesting and modifying data.
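As an example, the sketch below calls a hypothetical REST endpoint with Python's standard library; the URL and response shape are placeholders, not a real service.

```python
import json
import urllib.request

# Hypothetical endpoint: GET a customer resource by id.
url = "https://api.example.com/v1/customers/42"
with urllib.request.urlopen(url) as resp:        # sends an HTTP GET
    customer = json.loads(resp.read().decode())  # REST APIs commonly return JSON
print(customer)
```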
Reverse ETL - As the name suggests, Reverse ETL flips around the order of operations within the traditional ETL process. It’s the process of moving data from a data warehouse into third party systems to make data operational. For example, extracting data from a database and loading it into sales, marketing, and analytics tools. Examples include Census and Hightouch.
Rust (programming language) - a statically typed, multi-paradigm, memory-efficient, open-source programming language that is focused on speed, security, and performance. Rust blends the performance of languages such as C++ with friendlier syntax, a focus on code safety, and a well-engineered set of tools that simplify development. The annual Stack Overflow Developer Survey has ranked Rust as the “most loved” programming language for 5 years running. The code-sharing site GitHub says Rust was the second-fastest-growing language on the platform in 2019, up 235% from the previous year. And Osmos is built using Rust.
Stream Processing - a technique of ingesting data in which information is analyzed, transformed, and organized as it is generated.
Streaming Data - the continuous flow of data generated by various sources to a destination to be processed and analyzed in near real-time.
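As a rough illustration, the Python sketch below processes records one at a time as they arrive rather than waiting for a complete batch; the event source is simulated.

```python
def event_stream():
    """Simulated source that yields events as they are generated."""
    for i in range(5):
        yield {"sensor": "temp-1", "reading": 20 + i}

# Each record is transformed and acted on as soon as it arrives.
for event in event_stream():
    fahrenheit = event["reading"] * 9 / 5 + 32
    print(event["sensor"], fahrenheit)
```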
Structured Data - data that has been organized into a predefined format or model before being placed in data storage.
Third-Party Data - any information collected by a company that does not have a direct relationship with the user the data is being collected on. It includes anonymous information about interests, purchase intent, favorite brands, or demographics.
TSV - A Tab-Separated Values (TSV) file stores raw data as text with fields separated by tabs, and is commonly used by spreadsheet applications and databases to exchange data.
Unstructured Data - datasets (typically large collections of files) that are not arranged in a predetermined data model or schema.
Usable Data - data that can be understood and used without additional information, so that it can be applied to meet the goals defined in the corporate strategy.
Webhook - a way for apps and services to send a web-based notification to other applications, triggered by specific events. Also called a web callback or HTTP API.
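Here's a minimal sketch of the receiving side using Python's standard library: the sending service POSTs a JSON payload to this URL whenever the triggering event occurs. The port and payload shape are arbitrary for the example.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the event payload pushed by the sending service.
        length = int(self.headers.get("Content-Length", 0))
        event = json.loads(self.rfile.read(length) or b"{}")
        print("received event:", event.get("type"))
        self.send_response(200)
        self.end_headers()

HTTPServer(("localhost", 8080), WebhookHandler).serve_forever()
```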
XML (Extensible Markup Language) - a simple, very flexible markup language designed to store and transport data. It describes the text in a digital document.
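For example, here's a small XML document parsed with Python's built-in ElementTree module (the element names are invented):

```python
import xml.etree.ElementTree as ET

doc = """<order id="42">
  <customer>Ada</customer>
  <item sku="A-100" qty="2"/>
</order>"""

root = ET.fromstring(doc)               # parse the markup into an element tree
print(root.attrib["id"])                # -> 42
print(root.find("customer").text)       # -> Ada
print(root.find("item").attrib["qty"])  # -> 2
```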
XSL (Extensible Stylesheet Language) - a language for expressing style sheets. It is used for transforming and presenting XML documents.
Summary
The data world is full of terms that are critical to your success. These are just 121 of the most used data terms at the moment. New trends are emerging all the time, and we'll continue to add new terms so you can keep learning.
Hopefully, as you get familiar with these common data terms and how to apply them to your business, you'll grow more confident using these data terms in meetings and daily conversations.
Should You Build or Buy a Data Importer?
But before you jump headfirst into building your own solution, make sure you consider these eleven often overlooked and underestimated variables.
view the GUIDE