To analyze a body of data, that data must first be loaded into a data warehouse; that is, it must be copied from one or more systems, converted into a uniform format, and written to the new destination. This process is commonly referred to as extract, transform, and load (ETL). ETL provides the means to combine disparate data from multiple sources and create a homogenous data set that can be analyzed in order to extract business intelligence from it.
During extraction, data is read from one or more sources and held in temporary storage for transformation and loading. An organization may extract data from its own internal systems, such as a transaction processing system that records all order activities or from external sources, such as data it purchases or obtains for free from other organizations.
Extraction is commonly broken down into two logical extractions methods:
Extraction is also broken down into two physical extraction methods:
During the transform stage, data is processed to make all data consistent in structure and format so that it all conforms to a uniform schema. A schema provides the structure and rules for organizing data in a relational database. The source and target database systems may use different schemas; for example, the source database may store shipping information in a Customer table, whereas the target database stores shipping information in a separate Shipping table. Or, the source table may have dates in the MM/DD/YYYY format, whereas the target uses the DD/MM/YYYY format. To successfully copy data from the source to the target, certain transformations must be made to ensure that the source data is in an acceptable format.
Transformations can be handled in two ways:
During the load operation, all newly transformed data is written to the target data warehouse for storage. Various mechanisms can be used to load data into the target warehouse, including the following:
Variations on the Theme
ETL is commonly described as a three-step process primarily to make it easier to understand. In practice, ETL is not a series of clearly defined steps but more of a single process. As such, the sequence of events may vary. Depending on the approach, ETL may be more like one of the following:
The ETL Bottleneck
Given the increasing volumes of data that organizations must capture and integrate into their data warehouses, ETL often becomes a major bottleneck. Database administrators need to constantly revise their ETL procedures to accommodate variations in the data arriving from different sources. In addition, the volume and velocity of data can overwhelm an organization’s existing data warehouse storage and compute capabilities, leading to delays in producing time-sensitive reports and business intelligence. ETL operations often compete for the same storage and compute resources needed to handle data queries and analytics.
Fortunately, data warehousing technology has evolved to help reduce or eliminate the impact of the ETL bottleneck. For example, cloud data warehousing provides virtually unlimited storage and compute resources, so that ETL does not need to compete with queries and analytics for limited resources. In addition, data warehouse frameworks such as Hadoop take advantage of distributed, parallel processing to distribute work-intensive tasks such as ETL over multiple servers to complete jobs faster.
With the right tools and technologies in place, organizations can now stream data from multiple sources into their data warehouses and query and analyze that data in near real time for