I’m currently taking a grad class on data warehousing. As I’ve mentioned before, I’m intrigued by useful data analysis, which is exactly the reasoning behind data warehousing.
A month or so into the class, I wonder how companies and large organizations can compete without a data warehouse. Do you know why Wal-mart keeps beating out its competitors while K-Mart doesn’t? Wal-mart maintains one of the largest and most efficient data warehouses in the business, with analysts and pattern recognition scripts to spot trends based on products, location, cost, date, time, day of the week, and a million other things. Amazon.com is another prime example. I believe they’re currently generating around $3 billion per year in sales, and I bet it’s because of how well they watch customer habits and trends. Personally, I’m not one to be coaxed into buying the “recommended” books, or “some who’ve bought [product A], have also bought [product B],” but I’m sure some people bite. Actually, they’re pretty good recommendations. I’m a penny-pincher for the most part, and even I have considered them at times.
The size of these databases is another amazing facet of data warehousing. For instance, Amazon.com runs a data warehouse around 10TB in size; AT&T’s is around 90TB; Yahoo! has one exceeding 100TB; and the mighty Google is analyzing a couple of petabytes’ worth of data (which is almost incomprehensible). For perspective, a petabyte is 1,024TB, or roughly 21,000 PCs with 50GB of disk each.
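For anyone who wants to check the petabyte comparison, here is a quick back-of-the-envelope sketch in Python. It assumes binary units (1,024GB per TB and 1,024TB per PB); the function name `pcs_needed` is just something I made up for the illustration.

```python
import math

# Binary units: 1 PB = 1,024 TB and 1 TB = 1,024 GB.
TB_PER_PB = 1024
GB_PER_TB = 1024
PC_DISK_GB = 50  # per-PC disk size used in the comparison


def pcs_needed(petabytes: float, pc_disk_gb: int = PC_DISK_GB) -> int:
    """Number of PCs whose combined disks could hold the given petabytes."""
    total_gb = petabytes * TB_PER_PB * GB_PER_TB
    return math.ceil(total_gb / pc_disk_gb)


if __name__ == "__main__":
    print(pcs_needed(1))  # 20972
```

So a single petabyte works out to just under 21,000 of those 50GB machines, and "a couple of petabytes" doubles that.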
Designing a data warehouse contradicts pretty much everything I know about database design. It’s almost the exact opposite: de-normalize instead of normalize, optimize for querying rather than transactional processing, duplicate data freely, flatten the relational tables (the goal is to remove the JOIN), and so on. It’s definitely a different way of thinking, but interesting nonetheless.
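To show what "removing the JOIN" might look like, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is hypothetical: the table names (`product`, `store`, `sale`, `sale_wide`), the columns, and the sample rows are invented for illustration, and a real warehouse would be far larger and likely organized differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (transactional) design: each fact is stored in exactly one place.
cur.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE store   (store_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE sale    (sale_id INTEGER PRIMARY KEY, product_id INTEGER,
                      store_id INTEGER, sale_date TEXT, amount REAL);
INSERT INTO product VALUES (1, 'Widget', 'Hardware'), (2, 'Novel', 'Books');
INSERT INTO store   VALUES (1, 'Austin'), (2, 'Boston');
INSERT INTO sale    VALUES (1, 1, 1, '2005-03-01', 10.0),
                           (2, 2, 1, '2005-03-01', 15.0),
                           (3, 2, 2, '2005-03-02', 15.0);
""")

# De-normalized design: product and store attributes are copied onto every
# sale row, so the data is duplicated but analysis queries need no JOIN.
cur.execute("""
CREATE TABLE sale_wide AS
SELECT s.sale_id, s.sale_date, s.amount,
       p.name AS product_name, p.category, st.city
FROM sale s
JOIN product p ON p.product_id = s.product_id
JOIN store st  ON st.store_id = s.store_id
""")


def revenue_by_category_normalized(cur):
    """Revenue per category from the normalized tables (requires a JOIN)."""
    return cur.execute("""
        SELECT p.category, SUM(s.amount)
        FROM sale s JOIN product p ON p.product_id = s.product_id
        GROUP BY p.category ORDER BY p.category
    """).fetchall()


def revenue_by_category_wide(cur):
    """Same question against the de-normalized table (no JOIN)."""
    return cur.execute("""
        SELECT category, SUM(amount)
        FROM sale_wide
        GROUP BY category ORDER BY category
    """).fetchall()


if __name__ == "__main__":
    print(revenue_by_category_normalized(cur))
    print(revenue_by_category_wide(cur))
```

Both queries return the same totals. The trade-off is that the wide table repeats the category and city on every row, which is exactly the duplication a transactional design works to avoid.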





