Before you send your data warehouse administrator to the basement with a hatchet, realize that you will always have some dormant data. In fact, says Bill Inmon, chief technology officer of Pine Cone Systems, founder of the data warehouse concept and a prolific author, “a certain amount of dormant data is proper for the data warehouse/decision support system environment. Just because a unit of data has never been accessed does not mean it is not useful and will never be accessed.”
He recommends a ratio of one unit of accessed data per one of dormant. In other words, about half of your data warehouse should be dormant. But for reducing true clutter, Inmon makes the following recommendations:
– Look at historical patterns of DSS use. These patterns tell exactly how the data warehouse is being used and can point to data that has not been used and is unlikely to be used in the future.
– Don’t just throw away all dormant data. Move it to other, less expensive storage places where it can be accessed if needed, but where it isn’t congesting the primary data warehouse.
– Devise a strategy for identifying dormant data in the future. Maybe you’ll want to throw away all data over two years old as a matter of operating procedure, as well as six-month old summary data and entire columns that have never been accessed in a certain time.
“Dormant data creeps into the data warehouse over time naturally, for many reasons,” Inmon says. “The data warehouse administrator needs to identify the dormant data and remove it to other forms of storage, where it can be held and managed much less expensively.”
Using an activity monitor can help find dormant data. “Look at the standard query language as it enters the database management system, and the result set as the data is returned to end users,” Inmon explains. “From the information produced by the activity monitor, you can determine what data is and is not being used.”
Dormant data can be either discarded or moved to storage. Or, Inmon says, “you can also transform it in various ways that add value to the process.” To that end he recommends the following steps:
– Aggregate detailed data with other data of the same type, or with detailed data of another type.
– Move dormant data to a nearline environment, compressing it and creating summaries. A nearline environment is one where data can be retrieved but remains separate from the main data warehouse – it’s not searched on queries, but data is stored there for when it’s needed.
– Resequence data.
– Create indexes of the data as it exists in nearline storage. Store metadata about the nearline data.
– Create restoration guidelines, so that when nearline data must be restored, the transition can be executed cleanly and easily. Establish release information about the software; information about the operating system; and metrics about how much data is stored, and where it can be stored. This will make the restoration process run much more smoothly.
“Adding a nearline component to the data warehouse can help drive down costs and make the performance of the active data remaining on disk storage rise significantly,” Inmon advises.
Bill Inmon is a featured speaker at DCI’s Data Warehouse World.