Competitive advantage requires abilities. Abilities are built through knowledge. Knowledge comes from data. The process of extracting knowledge from data is called Data Mining.
Data mining, the extraction of hidden predictive information from large databases, is an advanced technique that helps companies highlight the most important information in their data warehouses. Data mining tools predict future trends and behaviors. They can answer business questions that traditionally were too time consuming to resolve. Data mining techniques can be implemented rapidly on existing software and hardware platforms to enhance the value of existing information resources, and can be integrated with new products and systems as they are brought online.
A data warehouse is a platform that holds all of an organization’s data in one place, in a centralized and normalized form, for deployment to users. It serves needs ranging from simple reporting to complicated analysis, decision support, and executive-level reporting and archiving. Physically, a data warehouse is a repository of information that businesses need to thrive in the information age. Analytically, a data warehouse is a modern reporting environment that provides users direct access to their data. In the information age, data warehousing is a powerful strategic weapon. Not only does it let organizations compete across time, it is also a rising-tide strategy that can elevate the strategic acumen of all employees in a field.
This paper presents an overview of data mining and data warehousing: their basic definitions, how they are implemented, and their pros and cons.
In today’s competitive global business environment, it is crucial for organisations to understand and manage enterprise-wide information in order to make timely decisions and respond to changing business conditions. With the receding economy, enterprises have shifted their business focus towards customer orientation to remain competitive. Consequently, customer relationship management (CRM) tops their agenda, and many companies are realizing the business advantage of leveraging one of their key assets: data.
Many research reports indicate that the amount of data in a given organization doubles every five years. The successful functioning of a business enterprise depends above all on the decisions taken by its management, and those decisions rest on business-critical information. This information is reliable and accurate only if all business-related data is properly analyzed, and a thorough analysis is possible only if all the data affecting the enterprise is available in one place. The solution is a data warehouse.
A data warehouse is a single, complete and consistent store of data obtained from a variety of different sources and made available to end users in a form they can understand and use in a business context. Today, data warehousing is one of the most talked-about business technologies in the corporate world.
Data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behavior of their customers and potential customers. It discovers information within the data that queries and reports can’t effectively reveal.
The amount of raw data stored in corporate databases is exploding. From trillions of point-of-sale transactions and credit card purchases to pixel-by-pixel images of galaxies, databases are now measured in gigabytes and terabytes. Raw data by itself, however, does not provide much information. In today’s fiercely competitive business environment, companies need to rapidly turn these terabytes of raw data into significant insights into their customers and markets to guide their marketing and investment.
Fig: Data Explosion
Data mining, or knowledge discovery, is the computer-assisted process of digging through and analyzing enormous sets of data and then extracting the meaning of the data. Data mining tools predict behaviors and future trends, allowing businesses to make proactive, knowledge-driven decisions. Data mining tools can answer business questions that traditionally were too time consuming to resolve. They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations.
Data mining derives its name from the similarities between searching for valuable information in a large database and mining a mountain for a vein of valuable ore. Both processes require either sifting through an immense amount of material, or intelligently probing it to find where the value resides.
Frequently, the data to be mined is first extracted from an enterprise data warehouse into a data mining database or data mart. The data mining database may be a logical rather than a physical subset of the data warehouse.
A data warehouse (DW) is a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision making. A data warehouse is typically built on a relational database management system (RDBMS), which offers organizations the ability to gather and store enterprise information in a single conceptual enterprise repository, and it is designed specifically to meet the needs of decision support rather than those of transaction processing systems. Data warehousing deals with organizing and collecting data into a database that can be searched and mined for information through the use of business intelligence solutions.
2. CHARACTERISTICS OF A DATA WAREHOUSE
Subject-oriented: The data in the database is organized so that all the data elements relating to the same real-world event or object are linked together;
Time-variant: The changes to the data in the database are tracked and recorded so that reports can be produced showing changes over time;
Non-volatile: Data in the database is never over-written or deleted. Once committed, the data is static and read-only, but retained for future reporting; and
Integrated: The database contains data from most or all of an organization’s operational applications, and this data is made consistent.
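The non-volatile and time-variant properties can be illustrated with a minimal sketch. The record layout (customer_id, city, valid_from) and the class names are hypothetical and chosen only for illustration; a real warehouse would implement the same idea inside the database rather than in application code.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class CustomerVersion:
    """One immutable snapshot of a customer record (hypothetical schema)."""
    customer_id: int
    city: str
    valid_from: date


class CustomerHistory:
    """Append-only store: new versions are added, old ones never change."""

    def __init__(self):
        self._rows = []

    def record(self, customer_id, city, valid_from):
        # Non-volatile: append a new version instead of overwriting the old one.
        self._rows.append(CustomerVersion(customer_id, city, valid_from))

    def as_of(self, customer_id, when):
        # Time-variant: return the version that was current on the given date.
        candidates = [r for r in self._rows
                      if r.customer_id == customer_id and r.valid_from <= when]
        return max(candidates, key=lambda r: r.valid_from, default=None)


history = CustomerHistory()
history.record(1, "Pune", date(2020, 1, 1))
history.record(1, "Mumbai", date(2022, 6, 1))
```

Because earlier versions are retained, a report can ask what was true on any past date, for example history.as_of(1, date(2021, 1, 1)).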
3. ARCHITECTURE OF DATA WAREHOUSE
The architecture for a data warehouse is given below. Building this architecture requires four basic steps:
1) Data are extracted from the various internal and external source system files and databases. In a large organization there may be dozens or even hundreds of such files and databases.
2) The data from the various source systems are transformed and integrated before being loaded into the data warehouse. Transactions may be sent to the source systems to correct errors discovered in data staging.
3) The data warehouse is a database organized for decision support. It contains both detailed and summary data.
4) Users access the data warehouse by means of a variety of query languages and analytical tools. Results (e.g. predictions, forecasts) may be fed back to the data warehouse and operational databases.
Information integrated in advance
Stored in warehouse for direct querying and analysis
Fig: Architecture of a typical data warehouse, and the querying and data-analysis support
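The four architectural steps can be sketched in a few lines of Python using the standard sqlite3 module as a stand-in for the warehouse database. The two source systems, their field names, and the table names are assumptions made for this example only.

```python
import sqlite3

# Hypothetical source systems: two operational stores with different layouts.
store_a = [("2024-01-05", "widget", "12.50"), ("2024-01-06", "gadget", "7.00")]
store_b = [{"day": "2024-01-05", "item": "Widget", "price": 20.0}]


def extract():
    """Step 1: pull raw rows from each source system."""
    for day, item, amount in store_a:
        yield {"day": day, "item": item, "amount": amount, "source": "A"}
    for row in store_b:
        yield {"day": row["day"], "item": row["item"],
               "amount": row["price"], "source": "B"}


def transform(rows):
    """Step 2: integrate the rows into one consistent format."""
    for r in rows:
        yield (r["day"], r["item"].lower(), float(r["amount"]), r["source"])


def load(conn, rows):
    """Step 3: store detailed rows and a summary table in the warehouse."""
    conn.execute(
        "CREATE TABLE sales (day TEXT, item TEXT, amount REAL, source TEXT)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)", rows)
    conn.execute("CREATE TABLE sales_by_item AS "
                 "SELECT item, SUM(amount) AS total FROM sales GROUP BY item")


def query(conn):
    """Step 4: an analyst's query against the summary data."""
    return dict(conn.execute("SELECT item, total FROM sales_by_item"))


warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract()))
totals = query(warehouse)
```

The sketch keeps both detailed data (the sales table) and summary data (sales_by_item), mirroring step 3 of the architecture.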
Architecture in Conceptual View
Every data element is stored once only
Real-time + derived data
Most commonly used approach in industry today
Transformation of real-time data to derived data requires two steps
4. ISSUES IN BUILDING A WAREHOUSE
1) When and how to gather data –
In a source-driven architecture for gathering data, the data sources transmit new information. In a destination-driven architecture, the data warehouse periodically sends requests for new data to the data sources.
2) What Schema To Use –
Data sources that have been constructed independently are likely to have different schemas. Part of the task of a data warehouse is to perform schema integration and to convert data to the integrated schema before they are stored. As a result, the data stored in the warehouse are not just a copy of the data at the sources.
3) Data Cleansing –
The task of correcting and preprocessing data is called data cleansing. Data sources often deliver data with numerous minor inconsistencies that can be corrected.
4) How To Propagate Updates –
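Updates on relations at the data sources must be propagated to the data warehouse. If the relations at the data warehouse are exactly the same as those at the data source, propagation is straightforward; otherwise, propagating updates is essentially the problem of maintaining a view over the source data.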
5) What To Summarize –
The data generated by the transaction-processing system may be too large to store online. Instead, we can maintain a summary of the data obtained by aggregation on a relation.
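Two of the issues above, schema integration and data cleansing, can be illustrated with a short sketch. The source field names, the integrated schema (name, city), and the table of spelling corrections are all hypothetical; real cleansing rules depend on the data at hand.

```python
# Hypothetical source schemas: each source names its fields differently.
SOURCE_A_MAP = {"cust_name": "name", "town": "city"}
SOURCE_B_MAP = {"customer": "name", "city_name": "city"}

# Assumed lookup table of known spelling variants for the cleansing step.
CITY_FIXES = {"bombay": "Mumbai", "mumbai": "Mumbai", "new dehli": "New Delhi"}


def to_integrated_schema(record, field_map):
    """Schema integration: rename source fields to the warehouse's names."""
    return {field_map[k]: v for k, v in record.items() if k in field_map}


def cleanse(record):
    """Data cleansing: normalize whitespace and case, fix known misspellings."""
    name = " ".join(record["name"].split()).title()
    city = record["city"].strip()
    city = CITY_FIXES.get(city.lower(), city)
    return {"name": name, "city": city}


raw_a = {"cust_name": "  asha   rao ", "town": "Bombay"}
raw_b = {"customer": "ASHA RAO", "city_name": "mumbai "}

clean_a = cleanse(to_integrated_schema(raw_a, SOURCE_A_MAP))
clean_b = cleanse(to_integrated_schema(raw_b, SOURCE_B_MAP))
```

After both steps the two records, which looked different at their sources, are identical in the warehouse, which is what makes later analysis across sources possible.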
5. DATA WAREHOUSE MODEL
Data warehousing is the process of extracting and transforming operational data into informational data and loading it into a central data store or warehouse. Once the data is loaded it is accessible via desktop query and analysis tools by the decision makers.
The data warehouse model is illustrated in the following figure.
The materialized views contain summary data compiled from several data sources. The auxiliary views in the picture are not mandatory, and are used to contain additional information needed to support the synchronization of the materialized views with the data sources.
Fig: Data warehouse model
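The idea of a materialized view that is kept synchronized with its data sources can be sketched as follows. This is a simplified illustration under the assumption that the summary is a per-product total and that only insertions occur at the source; the class and function names are hypothetical.

```python
from collections import defaultdict


class MaterializedTotal:
    """A stored (materialized) summary: total sales amount per product."""

    def __init__(self, source_rows):
        self.totals = defaultdict(float)
        for product, amount in source_rows:
            self.totals[product] += amount

    def apply_insert(self, product, amount):
        # Incremental maintenance: adjust the stored total, no full recompute.
        self.totals[product] += amount


def recompute(source_rows):
    """Reference result: rebuild the summary from scratch."""
    totals = defaultdict(float)
    for product, amount in source_rows:
        totals[product] += amount
    return dict(totals)


source = [("tea", 4.0), ("coffee", 6.0), ("tea", 1.5)]
view = MaterializedTotal(source)

# A new row arrives at the source; propagate the change to the view.
source.append(("tea", 2.5))
view.apply_insert("tea", 2.5)
```

The incrementally maintained view and a full recomputation give the same totals, but the incremental update touches only one stored value.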
The data within the actual warehouse itself has a distinct structure with the emphasis on different levels of summarization as shown in the figure below.
Fig: Structure of data warehouse
6. STAGES IN IMPLEMENTATION
A DW implementation requires the integration of many products. The steps of implementation are as follows:
Step1: Collect and analyze the business requirements.
Step2: Create a data model and physical design for the DW.
Step3: Define the Data sources.
Step4: Choose the DBMS and software platform for DW.
Step5: Extract the data from the operational data sources, transform it, clean it and load it into the DW model or data mart.
Step6: Choose the database access and reporting tools.
Step7: Choose the database connectivity software.
Step8: Choose the data analysis and presentation software.
Step9: Keep refreshing the data warehouse periodically.
7. DATA MARTS
A data warehouse is the sum of all its data marts. A data mart is a complete “pie-wedge” of the overall data warehouse pie, a restriction of the data warehouse to a single business process or to a group of related business processes targeted toward a particular business group. Data marts can be customized for the end users, and can present data in different formats for the end users’ benefit. Data marts can employ online analytical processing (OLAP), an approach to organizing and indexing data that speeds up access, especially for queries that view the data from many different aspects.
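A data mart as a restriction of the warehouse to one business process can be sketched with a database view. The example uses Python's standard sqlite3 module; the table, view, and column names are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE warehouse_facts (process TEXT, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO warehouse_facts VALUES (?, ?, ?)",
    [("sales", "north", 100.0), ("sales", "south", 50.0),
     ("shipping", "north", 30.0), ("billing", "south", 20.0)],
)

# The mart is a logical subset: a view restricted to one business process.
conn.execute("CREATE VIEW sales_mart AS "
             "SELECT region, amount FROM warehouse_facts "
             "WHERE process = 'sales'")

mart_rows = conn.execute(
    "SELECT region, amount FROM sales_mart ORDER BY region").fetchall()
```

Here the mart is logical rather than physical: no data is copied, and users of sales_mart see only the rows that belong to the sales process.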
Data Mining, or Knowledge Discovery in Databases (KDD) as it is also known, is the nontrivial extraction of implicit, previously unknown, and potentially useful information from data.
Data mining refers to “using a variety of techniques to identify nuggets of information or decision-making knowledge in bodies of data, and extracting these in such a way that they can be put to use in the areas such as decision support, prediction, forecasting and estimation. The data is often voluminous, but as it stands of low value as no direct use can be made of it; it is the hidden information in the data that is useful”.
Data mining is also defined as “A new discipline lying at the interface of statistics, database technology, pattern recognition, and machine learning, and concerned with secondary analysis of large databases in order to find previously unsuspected relationships, which are of interest or value to their owners.”
The data mining process can be divided into four steps:
Fig: Process used in data mining
While large-scale information technology has been evolving separate transaction and analytical systems, data mining provides the link between the two. Data mining software analyzes relationships and patterns in stored transaction data based on open-ended user queries. Several types of analytical software are available: statistical, machine learning, and neural networks. Generally, any of four types of relationships are sought:
Classes: Stored data is used to locate data in predetermined groups. For example, a restaurant chain could mine customer purchase data to determine when customers visit and what they typically order. This information could be used to increase traffic by having daily specials.
Clusters: Data items are grouped according to logical relationships or consumer preferences. For example, data can be mined to identify market segments or consumer affinities.
Associations: Data can be mined to identify associations. The beer-diaper example is an example of associative mining.
Sequential patterns: Data is mined to anticipate behavior patterns and trends. For example, an outdoor equipment retailer could predict the likelihood of a backpack being purchased based on a consumer’s purchase of sleeping bags and hiking shoes.
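Association mining of the beer-diaper kind is commonly quantified with two measures, support and confidence. The sketch below computes both over a small, invented set of shopping baskets; the data and function names are illustrative assumptions, not results from a real retailer.

```python
# Invented transaction data: each set is one customer's basket.
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing the antecedent, the share that also
    contain the consequent."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


pair_support = support({"beer", "diapers"}, baskets)
rule_confidence = confidence({"diapers"}, {"beer"}, baskets)
```

In this toy data, two of the five baskets contain both items, and two of the three baskets with diapers also contain beer. A rule is usually reported only when both measures exceed thresholds chosen by the analyst.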
4. MODELS RELATED TO DATA MINING
There are two types of model or modes of operation, which may be used to discover information of interest to the user.
1) Verification Model:
The verification model takes input from the user and tests the validity of it against the data. The emphasis is with the user who is responsible for formulating the hypothesis and issuing the query on the data to affirm or negate the hypothesis.
2) Discovery Model:
The discovery model differs in its emphasis in that the system, rather than the user, automatically discovers important information hidden in the data. The data is sifted in search of frequently occurring patterns, trends and generalizations about the data without intervention or guidance from the user.
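The contrast between the two models can be made concrete with a small sketch. In the verification function the user supplies the hypothesis; in the discovery function the system enumerates item pairs on its own. The purchase data, the thresholds, and the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations

# Invented purchase data for an outdoor equipment retailer.
purchases = [
    {"tent", "sleeping bag", "backpack"},
    {"sleeping bag", "backpack"},
    {"tent", "stove"},
    {"sleeping bag", "backpack", "stove"},
]


def verify(hypothesis, data):
    """Verification model: the user states a hypothesis, the system tests it."""
    if_item, then_item, min_share = hypothesis
    relevant = [p for p in data if if_item in p]
    if not relevant:
        return False
    share = sum(then_item in p for p in relevant) / len(relevant)
    return share >= min_share


def discover(data, min_count):
    """Discovery model: the system searches for frequent pairs unaided."""
    counts = Counter(pair for p in data
                     for pair in combinations(sorted(p), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# User-formulated hypothesis: sleeping bag buyers also buy backpacks
# in at least 90% of cases.
confirmed = verify(("sleeping bag", "backpack", 0.9), purchases)

# No hypothesis given: report every pair bought together at least 3 times.
frequent = discover(purchases, min_count=3)
```

Both calls reach the same pattern in this data, but only the first required the user to know what to look for.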
5. TECHNIQUES USED IN DATA MINING
Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.
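Of the techniques listed above, the nearest neighbor method is the simplest to show in code. The following is a minimal sketch, assuming numeric records compared by Euclidean distance and a majority vote among the k nearest; the historical records and labels are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(query, historical, k=3):
    """Classify a record by majority vote of its k most similar records."""
    nearest = sorted(historical,
                     key=lambda rec: math.dist(query, rec[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Invented historical dataset: (age, income in thousands) -> offer outcome.
historical = [
    ((25, 20), "declined"), ((30, 25), "declined"), ((35, 30), "declined"),
    ((45, 80), "accepted"), ((50, 90), "accepted"), ((55, 85), "accepted"),
]

prediction = knn_classify((48, 82), historical, k=3)
```

In practice the attributes would usually be scaled to comparable ranges first, since a raw distance is dominated by whichever attribute has the largest values.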
6. TWO STYLES OF DATA MINING
There are two styles of data mining. Directed data mining is a top-down approach, used when we know what we are looking for. This often takes the form of predictive modeling, where we know exactly what we want to predict. Undirected data mining is a bottom-up approach that lets the data speak for itself. Undirected data mining finds patterns in the data and leaves it up to the user to determine whether or not these patterns are important.
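Undirected data mining can be illustrated with clustering, where no target variable is given and the groups emerge from the data. The sketch below is a simplified one-dimensional k-means, offered as one possible example of the undirected style; the spending figures and starting centers are invented.

```python
def kmeans_1d(values, centers, rounds=10):
    """Group values around the given centers without any target label."""
    for _ in range(rounds):
        groups = [[] for _ in centers]
        for v in values:
            # Assign each value to its closest center.
            nearest = min(range(len(centers)),
                          key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Move each center to the mean of its group (keep it if empty).
        centers = [sum(g) / len(g) if g else c
                   for g, c in zip(groups, centers)]
    return centers, groups


# Invented monthly spending figures for six customers.
spend = [10, 12, 11, 95, 100, 105]
centers, groups = kmeans_1d(spend, centers=[0.0, 50.0])
```

The algorithm separates low and high spenders on its own; whether that split is important is left to the user, as the text describes.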
7. POTENTIAL APPLICATIONS
Data mining has many and varied fields of application, some of which are listed below.
Marketing: Identify buying patterns of customers and perform market basket analysis.
Banking: Detect patterns of fraudulent credit card use and identify loyal customers.
Insurance and Health Care: Analyze claims, predict which customers will buy new policies, and identify fraudulent behavior.
Transportation: Determine distribution schedules and analyze loading patterns.
Organizations today are under tremendous pressure to compete in an environment of tight deadlines and reduced profits. Legacy business processes that require data to be extracted and manipulated prior to use will no longer be acceptable. Instead, enterprises need rapid decision support based on the analysis and forecasting of predictive behavior. Data-warehousing and data-mining techniques provide this capability.
A data warehouse is a modern reporting environment that provides users direct access to their data. A data warehouse is the sum of all its data marts. A data warehousing strategy allows organizations to move from a defensive to an offensive decision-making position. The purpose of a data warehouse is to consolidate and integrate data from a variety of sources and to format those data in a context for making accurate business decisions.
Data mining offers firms in many industries the ability to discover hidden patterns in their data, patterns that can help them understand customer behavior and market trends. The advent of parallel processing and new software technology enables customers to capitalize on the benefits of data mining more effectively than had been possible previously.