Bill Inmon and vendors like Teradata define a warehouse as a separate database for decision support, one that typically contains vast amounts of information. Richard Hackathorn defines a warehouse as “a collection of data objects that have been inventoried for distribution to a business community”. In our own modest definition, a warehouse is an active intelligent store of data that can manage and aggregate information from many sources, distribute it where needed, and activate business policies.
The vast amounts of information a warehouse contains are generated by modern institutions in the course of conducting their everyday business. The computerized production systems that collect and consume this data are called OLTP systems; these are true data factories that run around the clock.
The Elements of Data Warehousing
Data warehousing is too ad hoc and customer-specific to be provided as a single shrink-wrapped solution. Instead, hundreds of vendors provide constituent pieces that contribute to a warehouse solution.
The first step on the road to data warehousing Nirvana is to understand the constituent elements that make up a solution. Almost all data warehousing systems provide the following four elements:
1. The data replication manager controls the copying and distribution of data across databases, as defined by the information hound. The hound defines the data that needs to be copied, the source and destination platforms, the frequency of updates, and the data transforms. Refresh involves copying over the entire data source; update only propagates the changes. Everything can be automated or done manually. Data can be obtained from relational or non-relational sources. Note that almost all external data is transformed and cleansed before it’s brought into the warehouse (a sketch of the refresh/update distinction follows).
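To make the refresh-versus-update distinction concrete, here is a minimal sketch in Python. The two in-memory SQLite databases stand in for a production source and the warehouse target; the sales table, its columns, and the timestamp-based change detection are our own illustrative assumptions, not the behavior of any particular replication product.

    import sqlite3

    # Hypothetical production source and warehouse target, modeled as two
    # in-memory SQLite databases so the sketch is self-contained.
    source = sqlite3.connect(":memory:")
    target = sqlite3.connect(":memory:")
    for db in (source, target):
        db.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY,"
                   " amount REAL, updated_at TEXT)")

    source.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                       [(1, 100.0, "2024-01-01"), (2, 250.0, "2024-01-15")])

    def refresh():
        # Refresh: wipe the warehouse copy and recopy the entire source table.
        rows = source.execute("SELECT id, amount, updated_at FROM sales").fetchall()
        target.execute("DELETE FROM sales")
        target.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
        target.commit()

    def update(since):
        # Update: propagate only the rows that changed since the last run.
        rows = source.execute("SELECT id, amount, updated_at FROM sales"
                              " WHERE updated_at > ?", (since,)).fetchall()
        target.executemany("INSERT OR REPLACE INTO sales VALUES (?, ?, ?)", rows)
        target.commit()

    refresh()              # full copy of the data source
    update("2024-01-10")   # incremental copy of later changes only

A real replication manager layers scheduling, transforms, and non-relational sources on top of this basic copy step; the sketch shows only the core refresh/update choice.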
2. The informational database is a relational database that organizes and stores copies of data from multiple data sources in a format that meets the needs of information hounds. Think of it as the decision-support server that transforms, aggregates, and adds value to data from various production sources. It also stores metadata (or data about data) that describes the contents of the informational database. System-level metadata describes the tables, indexes, and source extracts to a database administrator (DBA); semantic-level metadata describes the contents of the data to an information hound. The informational database can be a personal database on a PC, a medium-sized database on a local server, or a massively parallel database on an enterprise server. Most of the major SQL database engines can be used as informational databases.
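As a sketch of how an informational database can carry both kinds of metadata alongside the data itself, the following Python/SQLite fragment is illustrative only; the monthly_sales table and the two metadata tables are hypothetical names, not a standard layout.

    import sqlite3

    # Hypothetical informational database: an aggregated copy of production
    # data plus metadata tables that describe it. All names are illustrative.
    info_db = sqlite3.connect(":memory:")
    info_db.executescript("""
    CREATE TABLE monthly_sales (region TEXT, month TEXT, total REAL);

    -- System-level metadata: tables, indexes, and source extracts for the DBA.
    CREATE TABLE system_metadata (object TEXT, kind TEXT, source_extract TEXT);

    -- Semantic-level metadata: what the data means to an information hound.
    CREATE TABLE semantic_metadata (object TEXT, description TEXT);
    """)

    info_db.execute("INSERT INTO system_metadata VALUES (?, ?, ?)",
                    ("monthly_sales", "table",
                     "nightly extract from the OLTP orders system"))
    info_db.execute("INSERT INTO semantic_metadata VALUES (?, ?)",
                    ("monthly_sales",
                     "sales totals aggregated by region and month"))
    info_db.commit()

The same pattern scales from a personal database on a PC to a massively parallel enterprise server; only the engine underneath changes.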
3. The information directory combines the functions of a technical directory, business directory, and information navigator. Its primary purpose is to help the information hound find out what data is available on the different databases, what format it’s in, and how to access it. It also helps the DBAs manage the data warehouse. The information directory gets its metadata by discovering which databases are on the network and then querying their metadata repositories. It tries to keep everything up-to-date.
DBAs use the information directory to access system-level metadata and to keep track of data sources, data targets, cleanup rules, transformation rules, and details about predefined queries and reports.
Examples of information/metadata directories include Prism’s Directory Manager, IBM’s DataGuide, HP’s Information Warehouse Guide, Minerva’s Info-harvester, and BrownStone’s Data Dictionary.
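The following toy directory shows the basic harvesting idea behind such products: poll each database you know about and read its own catalog. SQLite’s sqlite_master table stands in here for a real engine’s metadata repository; a commercial directory would also merge in business descriptions and keep the result up-to-date.

    import sqlite3

    def build_directory(databases):
        # databases: a dict mapping a database name to an open connection.
        # For each one, harvest the list of tables from its catalog so an
        # information hound can see what data lives where.
        directory = {}
        for name, conn in databases.items():
            tables = [row[0] for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table'")]
            directory[name] = tables
        return directory

    # Two stand-in databases on the "network".
    sales_db = sqlite3.connect(":memory:")
    sales_db.execute("CREATE TABLE monthly_sales (region TEXT, total REAL)")
    hr_db = sqlite3.connect(":memory:")
    hr_db.execute("CREATE TABLE employees (id INTEGER, name TEXT)")

    print(build_directory({"sales": sales_db, "hr": hr_db}))
    # {'sales': ['monthly_sales'], 'hr': ['employees']}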
4. EIS/DSS tool support is provided via SQL. Most vendors support ODBC as well as other protocols. Some vendors (for example, Red Brick) provide extended SQL dialects for fast queries and joins. The tools are more concerned with sequential access to large quantities of data than with access to a single record. This means that you must tune the table indexes for queries and sequential reads as opposed to updates.
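Here is what that access path typically looks like from Python, assuming the pyodbc binding and a hypothetical warehouse DSN; the table and column names are carried over from the earlier sketches and are not real.

    import pyodbc  # one common Python binding for ODBC

    # Connect through ODBC, the lowest common denominator most vendors
    # support. The DSN, user, and password are placeholders.
    conn = pyodbc.connect("DSN=warehouse;UID=hound;PWD=secret")
    cursor = conn.cursor()

    # A typical decision-support query: it scans and aggregates many rows
    # rather than fetching a single record, which is why the indexes are
    # tuned for reads rather than updates.
    cursor.execute("""
        SELECT region, SUM(total) AS yearly_total
        FROM monthly_sales
        WHERE month BETWEEN '2024-01' AND '2024-12'
        GROUP BY region
        ORDER BY yearly_total DESC
    """)
    for region, yearly_total in cursor.fetchall():
        print(region, yearly_total)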
In summary, data warehousing is the process of automating EIS/DSS systems using the four elements we just described. You must be able to assemble data from different sources, replicate it, cleanse it, store it, catalog it, and then make it available to EIS/DSS tools. The trick is to be able to automate the process and have it run like clockwork.
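To suggest what “running like clockwork” means in practice, here is a toy end-to-end cycle that chains the four elements together; every function is a stub standing in for the corresponding component described above.

    import time

    def extract_from_sources():                  # assemble and replicate
        return [("east", "2024-01", 100.0), ("west", "2024-01", -1.0)]

    def cleanse(rows):                           # transform/cleanse external data
        return [r for r in rows if r[2] >= 0]    # drop obviously bad rows

    def load_informational_db(rows):             # store
        print("loaded", len(rows), "rows")

    def refresh_information_directory():         # catalog
        print("directory refreshed")

    def nightly_cycle():
        load_informational_db(cleanse(extract_from_sources()))
        refresh_information_directory()          # EIS/DSS tools query next

    for _ in range(2):       # a real scheduler would drive this nightly
        nightly_cycle()
        time.sleep(1)        # stand-in for the 24-hour wait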