Data Warehouse - Doug Enterprises

One of the biggest challenges your data science team is likely to encounter is gaining access to all of the organization’s data. Many organizations have data silos— data repositories managed by different departments and isolated from the rest of the organization.

The term “silo” is borrowed from agriculture. Farmers typically store grain in tall, hollow towers called silos, each of which is an independent structure. Silos typically protect the grain from the weather and isolate different stores of grain, so if one store is contaminated by pests or disease all of the grain isn’t lost. Data silos are similar in that each department’s database is separate; data from one department isn’t mixed with data from another.

Data silos develop for various reasons. Often they result from common practice — for example, human resources (HR) creates its own database, because it can’t imagine anyone else in the organization needing its data or because it needs to ensure that sensitive employee data is secure. Data silos may also arise due to office politics — one team doesn’t want to share its data with another team that it perceives to be a threat to its position in the organization.

If your data science team encounters a data silo, it needs to find a way to access that data. Gaining access to data is one of the primary responsibilities of the project manager on the data science team. After the data analyst identifies the data sets necessary for the team to do its job, the project manager needs to figure out how to gain access to that data.

The Problems with Data Silos

Although data silos may be useful for protecting sensitive data from malware and from unauthorized access, they also cause a number of problems, including the following:

With data silos, an organization has no single source of truth. Data from various departments must be collected and combined prior to analysis.
If two or more departments are storing the same data, figuring out which department has the most accurate and current data can be a major challenge.
The chance of overwriting new data with old data is increased.
Data sharing may be more difficult and less efficient.
Data security may be more challenging if the organization needs to secure multiple sources of data, as opposed to having data in only one location to secure.

I once worked for an organization that was trying to migrate all its data to a central data warehouse. They felt that the organization wasn’t getting enough insight from its data. The organization had just gone through a data governance transformation and wanted to govern how the data was controlled in the organization.

When they finally got into their data, they realized how much was locked away in silos that no one knew about. Over many years, each department had created its own processes, schemas, and security procedures. The organization wanted to get value from this data, but that data was stored on different servers across the entire company. To compound the problem, the various departments were reluctant to share their data. It was as if the project manager was asking them to share toothbrushes.

Breaking Down Data Silos

One of the first steps toward becoming a data-driven organization is to break down the data silos:

Migrate all of the organization’s data to a secure data warehouse. A cloud data warehouse may be the most economical, because you can outsource data warehouse management and security to a third-party vendor that has the technology and expertise to provide superior performance and security.
Assign each user a unique username, and require a secure password from each user to log in. This enables IT to grant unfettered access to all data they may need to do their jobs, while restricting unauthorized access to any sensitive data.
Provide users with the tools and training they need to query and analyze the data.

By breaking down data silos, you give everyone in your organization self-serve access to the data they need to do their jobs better.

Words of Advice for Project Managers

If you’re a project manager on a data science team, try to keep the following key points in mind:

Don’t underestimate the difficulty of gaining access to data stored in silos. It may take a long time, so get started before the team actually needs the data.
Migrating an organization’s data to a centralized data warehouse requires the entire organization to be on board. You’ll need executive buy-in to make any progress. You can also expect to have to sell each department on the idea. Expect push-back from some departments that are highly protective of their data or that think “if it ain’t broke, why fix it?” You may have to entice them by explaining that with central data storage they’ll be able to create more complex reports or use newer visualization tools.
Provide access to your team’s reports. You may have an easier time breaking down silos if you can show the value of company-wide reporting and insights. Build interest in your information system by sharing your team’s wins. When others in the company see the value in the data and the BI, they’ll be eager to adopt.
Do your best to protect the data science team from any meetings about breaking down the data silos. You want the rest of your team focusing on exploration and discovery, while you focus on getting them access to the data.

Democratizing data involves making it available to personnel throughout an organization and providing them with the tools and training needed to query and analyze that data. In this post, I discuss the potential benefits and drawbacks of data democratization and provide some general guidance for democratizing data.

Benefits of Data Democratization

Distributing data and business intelligence throughout an organization delivers the following benefits:
Having more people examine the data from different perspectives increases the organization’s knowledge through discovery and learning, leading to greater innovation.
Democratization is a self-service model, which removes some of the burden of analyzing data and producing reports from the data science or IT team, so they can focus on higher level tasks.
Enabling personnel to access data and analytics on-demand increases the pace of business. Personnel can ask and answer questions and make decisions on their own (when appropriate) without necessarily having to go through the data gatekeepers.
Everyone in the organization has the resources to make data-based decisions, which, at least theoretically, leads to better, faster decisions.

Potential Drawbacks of Data Democratization

Nearly every organization that democratizes its data properly reports that the benefits of doing so far outweigh any potential drawbacks. However, organizations do need to address the following concerns:

Increased risks to data security, privacy, and integrity. For example, storing employee records in a central data warehouse makes those records more susceptible to security breaches.
Misinterpretation of data or analytics by untrained personnel can lead to bad data-based decisions.
Duplication of effort across different teams may be more costly than having a central data science team.
With different teams performing their own analytics, an organization may end up with multiple versions of the truth, which can lead to diminished confidence in the analytics system.

Drivers of Democratization

Traditionally, the IT department has owned the data and was in charge of extracting meaning from it and presenting the information to executives and managers. The development of various technologies, tools, and techniques is driving the movement toward greater democratization of data:

Central storage, particularly cloud storage: Instead of having each department manage its own data, which results in the creation of data silos, all data is stored in a central data warehouse, so an organization has a “single source of truth.”
Business intelligence (BI) dashboards: BI dashboards facilitate the process of selecting, aggregating, and analyzing data, often displaying the results graphically in the form of tables, graphs, maps, and so on. These visualizations make the data easier to understand.
Enterprise agility: To remain competitive, organizations have an increasing need to achieve enterprise agility — the ability to quickly adapt to market changes and capitalize on emerging opportunities. Democratizing data increases enterprise agility by empowering teams at all levels across an organization.

Democratizing Data in Your Organization

Democratizing data is not a simple matter of providing everyone in the organization unfettered access to all of the organization’s data, especially if the organization stores sensitive data. To democratize data safely and effectively, consider the following guidelines:

Establish strong governance to protect privacy, security, and data integrity. Each person given access to the data and BI tools should have a username and password, so IT can assign an access level to each user. For example, human resource managers may be the only ones who have access to personnel records.
Train personnel on how to use the data and tools. Technology is no replacement for education and skills. Everyone using the information system should know how to use it and for what purposes. Preferably, any training will focus on applications relevant to the person’s position and job duties.
Encourage personnel to ask questions and use the system to answer them. Don’t assume that “If you build it, they will come,” to borrow a phrase from the movie Field of Dreams. Just because you place a new, powerful technology in the hands of users doesn’t mean they will embrace it. Make sure the C-level executives actively promote the information system. Celebrate any “wins” that can be attributed to the information system. Once people experience the value of data and BI for themselves, they’ll be hooked, but you need to grease the wheels to get them moving.
Shift the data science team’s responsibilities. When you democratize data, you’re distributing the data science team’s analytical burden throughout the organization. This frees up the team to explore data at a deeper level. However, the team should allocate some of that additional time to finding better tools, sharing them with other teams in the organization, and providing training on how to use them. In other words, the data science team should put some effort into empowering personnel across the organization to become better data scientists.

If your organization currently places the power of its data in the hands of a few, I hope this article encourages you to strongly consider the possibility of democratizing your organization’s data. By placing the power of data and analytics in the hands of the many, you’re likely to be surprised by the resulting increase in innovation and agility. Your organization will be much better equipped to adapt in a competitive landscape that’s constantly changing.

Businesses and other organizations typically have two types of database management systems (DBMSs) — one for online transactional processing (OLTP) and another for online analytical processing (OLAP):

Online transactional processing (OLTP): A type of information system that captures and stores daily operational data; for example, order information, inventory transactions, and customer relationship management (CRM) details. OLTP systems are commonly used for online banking, booking flights or rooms online, ordering products online, and so on.

Online analytical processing (OLAP): A type of information system that supports business intelligence (BI) applications, such as tools for generating reports, visualizing data (with tables, graphs, maps, and so on), and conducting predictive “what if” analyses. OLAP systems are commonly used for planning, solving problems, supporting decision-making, and automating tasks (as with machine learning applications).

Traditional databases are optimized for OLTP, where the emphasis is on capturing transactional data in real time, securing transactional data, maintaining data integrity, and processing queries as quickly as possible. On the other hand, enterprise data warehouses (EDWs) are optimized for OLAP, where the emphasis is on capturing and storing large volumes of historical data, aggregating that data, and mining it for business knowledge and insights to support data-driven decision-making.

The following table highlights the differences between OLTP and OLAP

OLTP and OLAP in Action

Suppose you want to sell running shoes online. You hire a database administrator (DBA) who creates dozens of different tables and relationships. You have a table for customer addresses, a table for shoes, a table for shipping options, and so on. The web server uses structured query language (SQL) statements to capture and store the transaction data. When a customer buys a pair of shoes, her address is added to the Customer Address table, the Shoes table is updated to reflect a change in inventory, the customer’s desired shipping method is captured, and so on. You want this database to be fast, accurate, and efficient. This is OLTP.

You also ask your DBA to create a script that uploads each day’s data to your EDW. You have a data analyst create a report to see whether customer addresses are related in any way to the shoes they buy. You find that people in warmer areas are more likely to buy brightly colored shoes. You use this information to change your website, so customers from warmer climates see more brightly colored shoes at the top of the page. This is an example of OLAP. While you don’t need real-time results, you do need to be able to aggregate and visualize data to extract meaning and insight from it.

Copying Data from OLTP to OLAP

Most organizations have separate OLTP and OLAP systems, and they copy data from their OLTP system to their OLAP system via a process referred to as extract, transform, and load (ETL):

Extract: Data is read from one or more sources and held in temporary storage for the next two stages of the ETL process.
Transform: Data is processed to make all data consistent in structure and format so that it all conforms to a uniform schema. For example, if some dates are formatted MM/DD/YYYY and others are formatted DD/MM/YYYY, a transformation would be necessary to format all dates consistently.
Load: All data is loaded into the data warehouse for storage.

For more about ETL, see my previous article Grasping Extract, Transform, and Load (ETL) Basics.

The Best of Both Worlds

Some newer database designs attempt to combine OLTP and OLAP into a single solution, commonly referred to as a translytical database. However, OLTP systems are highly normalized to reduce redundancy, while OLAP reduces the required degree of normalization to achieve optimal performance for analytics.

Normalization is a process of breaking down data into smaller tables to reduce or eliminate the need to repeat fields in different tables. If you have the same field entries in different tables, when you update an entry in one table, you have to update it in the other; failing to do so results in a loss of data integrity. With normalization, when you need to change a field entry, such as a customer’s phone number, you have only one table in which you need to change it.

Because OLTP and OLAP differ in the degree to which data must be structured, combining the two is a major challenge. However, organizations are encountering an increasing need to analyze transactional data in real time, so the benefits of a translytical database model are likely to drive database and data warehousing technology in that direction.

To analyze a body of data, that data must first be loaded into a data warehouse; that is, it must be copied from one or more systems, converted into a uniform format, and written to the new destination. This process is commonly referred to as extract, transform, load (ETL). ETL provides the means to combine disparate data from multiple sources and create a homogenous data set that can be analyzed in order to extract business intelligence from it.

Extract

During extraction, data is read from one or more sources and held in temporary storage for transformation and loading. An organization may extract data from its own internal systems, such as a transaction processing system that records all order activities or from external sources, such as data it purchases or obtains for free from other organizations.

Extraction is commonly broken down into two logical extractions methods:

Full: All data currently available in the source system is extracted.
Incremental: Only data that has been added to the source system after a full extraction is extracted. The purpose of incremental extractions is to keep the data in the data warehouse up to date.

Extraction is also broken down into two physical extraction methods:

Online: Data is extracted directly from the source system; for example, streaming data from websites or online games.
Offline: Data is extracted from files created by the source system; for example, via a save or export operation.

Transform

During the transform stage, data is processed to make all data consistent in structure and format so that it all conforms to a uniform schema. A schema provides the structure and rules for organizing data in a relational database. The source and target database systems may use different schemas; for example, the source database may store shipping information in a Customer table, whereas the target database stores shipping information in a separate Shipping table. Or, the source table may have dates in the MM/DD/YYYY format, whereas the target uses the DD/MM/YYYY format. To successfully copy data from the source to the target, certain transformations must be made to ensure that the source data is in an acceptable format.

Transformations can be handled in two ways:

Multistage: The transformation process is broken down into distinct steps, each of which results in a separate staging table prior to data being inserted into the target table.
Pipelined: Staging tables are eliminated to streamline the process in the event that the target database is a more integral part of the ETL operation.

Load

During the load operation, all newly transformed data is written to the target data warehouse for storage. Various mechanisms can be used to load data into the target warehouse, including the following:

Structured query language (SQL)
External tables
OnCommand Insight (OCI) databases and Direct-Path application programming interfaces (APIs)
Export/import features of the source and target databases

Variations on the Theme

ETL is commonly described as a three-step process primarily to make it easier to understand. In practice, ETL is not a series of clearly defined steps but more of a single process. As such, the sequence of events may vary. Depending on the approach, ETL may be more like one of the following:

Extract, load, and transform (ELT): Data is loaded into the target warehouse and then transformed. ELT is likely to increase performance when the target data warehouse has greater compute capacity than would otherwise be available for the transformation engine required for an ETL operation.
Extract and then transform while loading: With a pipelined approach to transforming data, the transformation and loading steps are often combined.

The ETL Bottleneck

Given the increasing volumes of data that organizations must capture and integrate into their data warehouses, Extract Transform Load often becomes a major bottleneck. Database administrators need to constantly revise their ETL procedures to accommodate variations in the data arriving from different sources. In addition, the volume and velocity of data can overwhelm an organization’s existing data warehouse storage and compute capabilities, leading to delays in producing time-sensitive reports and business intelligence. ETL operations often compete for the same storage and compute resources needed to handle data queries and analytics.

Fortunately, data warehousing technology has evolved to help reduce or eliminate the impact of the ETL bottleneck. For example, cloud data warehousing provides virtually unlimited storage and compute resources, so that ETL does not need to compete with queries and analytics for limited resources. In addition, data warehouse frameworks such as Hadoop take advantage of distributed, parallel processing to distribute work-intensive tasks such as ETL over multiple servers to complete jobs faster.

With the right tools and technologies in place, organizations can now stream diverse data from multiple sources into their data warehouses and query and analyze that data in near real time. If you or your team is in charge of procuring a new data warehouse solution for your organization, look for a solution that provides unlimited concurrency, storage, and compute, to avoid contention issues between ETL processes and people in the organization who need to use the same system to run queries and conduct analysis. Also look for a system that can live-stream data feeds and process structured, semi-structured, and unstructured data quickly and easily without complicated and costly ETL or ELT processes. In most cases, the ideal solution will be data warehouse built for the cloud.