The purpose of this report is to outline the major developments and issues surrounding data mining. Data mining is considered a very contentious issue, and the controversy has grown with advances in technology. Various scholarly sources, including journal articles and books, have been used to write this report. The report begins with a brief introduction to the topic and a brief history that explains how data mining became a science. The various types of data mining techniques, as well as a comparison between some of them, are then outlined. The major ethical and legal issues associated with data mining are also discussed. The report ends with a conclusion and recommendations on how to tackle the privacy concerns linked to data mining. It is concluded that even though data mining gives firms and organizations major competitive advantages and helps to tackle security issues such as terrorism, it also has the power to harm people and put their lives and information at risk.
Data mining is a field of science that has been in existence for quite a long time. With continuing developments in technology and changes in the economy and in demand, most firms have realized the importance of data mining, which is considered a major competitive advantage. There are various data mining techniques that will be discussed in this report. Alongside its several advantages, data mining also raises major ethical, legal and social concerns, most of which relate to the privacy of individuals. It is recommended that a strong balance be struck between the use and analysis of data and the ability to safeguard the privacy and information of people.
Data mining can be briefly defined as ‘the computational process through which large or big data is explored and uncovered’ (Mining, 2013). More formally, data mining refers to the process of discovering patterns in large data sets using methods at the intersection of database systems, statistics and machine learning (Groth, 2008). The term ‘data mining’ came into use during the 1990s. Early methods of identifying patterns in data include Bayes’ theorem in the 1700s and regression analysis in the 1800s. The increasing ubiquity and spread of computer technology have enhanced the collection, storage and manipulation of data. As data have grown in size and complexity, direct hands-on analysis has increasingly been supplemented by indirect, automatic data processing. Other discoveries such as clustering, neural networks, decision trees and genetic algorithms, among others, have aided the establishment of data mining as a science.
The identification of data mining as a step within, or sub-process of, Knowledge Discovery in Databases (KDD) also contributed to its popularity. The sub-processes of KDD include generating a target set of data, pre-processing and data cleaning, and interpreting the mined patterns, among others. Through the various conferences held in the 1990s, the rapid advance of technology, and the growing data storage and processing capabilities of computers, data mining became even more popular (Mining, 2013). It became easier for individuals and organizations to save their data in readable form and to process large data volumes on desktop computers. The public became even more familiar with data mining after customer loyalty cards were introduced. These cards enabled organizations to record the data and purchases of customers, and the data was mined to detect patterns in customer purchasing. The same still applies today. On a personal level, loyalty cards are used regularly to earn loyalty points and, in some cases, even to pay for goods and services in the supermarket.
Most applications of data mining have the scientific goals of prediction, interpretation and detection of quantitative or qualitative data patterns (Ramakrishnan & Grama, 2015). To describe and assess data patterns, data mining algorithms make use of various models from statistics, machine learning and AI, among others (Han, Pei & Kamber, 2011). These techniques borrow from mathematical approaches such as dynamical systems and approximation theory. Data mining is also largely based on scientific perspectives such as induction, querying, search and compression. These are among the reasons for the shift of data mining from serendipity to a science (Ramakrishnan & Grama, 2015).
How Data Mining is Used Today
Data mining is still considered to be in its infancy as an industry, but it has been adopted by various firms and organizations (Han, Pei & Kamber, 2011). It is used in healthcare, aerospace, finance, retail, manufacturing and transportation, among other industries. Data mining enables analysts to discover important facts, patterns, relationships and anomalies that are useful to businesses and might otherwise have gone unnoticed. Some of the ways in which data mining is currently used include the following:
Fraud detection: It is used to identify transactions that are likely to be fraudulent (Groth, 2008).
Market segmentation: It is used to detect the common features shared by most customers who purchase goods from the company. This function was also in use during the 1990s. It can help to determine the preferences of customers.
Direct marketing: Data mining detects the prospects that need to be included in the mailing list in order to acquire the highest rate of response (Groth, 2008).
Trend analysis: It reveals the differences between the behavior of the typical customer in one month and the next. This helps to determine a change in preference that may result from an increase or decrease in price. Data mining has therefore been useful in explaining various economic theories related to demand.
Market basket analysis: It identifies the services and products that customers commonly purchase together, such as diapers and beer (Groth, 2008). This is also useful in economics, as it explains the behavior of customers and may help to shape advertising that attracts more customers.
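As an illustration of market basket analysis, the following minimal Python sketch counts how often pairs of items are bought together (their support) and estimates how often one item accompanies another (confidence). The transactions and the function names `pair_support` and `confidence` are invented for this example and do not come from the cited sources.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions, invented purely for illustration.
transactions = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
    {"diapers", "beer", "chips"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair has one canonical ordering.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


def confidence(baskets, antecedent, consequent):
    """Estimate how often `consequent` appears in baskets containing `antecedent`."""
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return 0.0
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)


support = pair_support(transactions)
print(support[("beer", "diapers")])                  # share of baskets with both items
print(confidence(transactions, "diapers", "beer"))   # share of diaper baskets that also hold beer
```

In practice, retailers would apply a dedicated association-rule algorithm to far larger data sets; the sketch only shows the two basic measures.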
The Future of Data Mining
With advancing technology and the dynamic nature of a business world that has become globalized, data mining is expected to gain new algorithms and tools. According to Doug Laney, the three V’s in data management are:
Volume: More data is available than in the past, and its size continues to increase at a higher rate than the ability of data mining tools to process it (Fan & Bifet, 2013).
Velocity: Data now arrives continuously in streams, and it is important to extract the significant information from them.
Variety: There are also various data types such as sensor data, graph, audio and text, among others (Fan & Bifet, 2013).
More recently, two more V’s have been added: Variability and Value. Variability refers to changes in the structure of data and in the way users want to interpret it. Value refers to the business value that data provides, giving firms a compelling competitive advantage because they can base decisions on answers to questions that were formerly considered unanswerable.
Even though data mining is said to have great potential in the future, it faces other challenges besides those mentioned above. According to Marakas (2014), data quality can make or break a data mining effort. For data to be mined, firms have to integrate, convert and clean it. Additionally, to get value from the data mining process, organizations need to alter their mode of operation and sustain this effort. Lastly, there are growing concerns related to privacy. This is more common in countries like the US, which value privacy and fear that information technology generally puts people’s privacy at risk. With increasing globalization, this should be a concern to almost all countries.
Types of Data Mining
Data mining is classified based on the kind of data source to be mined, the data model it draws on, the kind of knowledge discovered and the mining techniques used. Classification by data type groups systems according to types such as multimedia data, spatial data, text data, time series data and the World Wide Web, among others (Garg & Sharma, 2013). Classification by data model covers models such as object-oriented databases, relational databases, transactional databases and data warehouses, among others. Kinds of knowledge include discrimination, classification, characterization, clustering and association, among others (Mining, 2013). Mining techniques include machine learning, neural networks, visualization and data-warehouse-oriented approaches, among others (Garg & Sharma, 2013). A comprehensive system is one that offers a range of data mining techniques to fit various options and situations. It should also provide varying degrees of user interaction.
The data mining techniques that can help to produce optimal results include regression analysis, clustering analysis, association rule learning, outlier or anomaly detection, and classification analysis, also known as supervised learning (Garg & Sharma, 2013). Supervised learning normally relies on predefined groups, or classes, and on models that predict which group a record belongs to. For example, airport security maintains sets of metrics that are used to predict whether an individual is a potential terrorist. Regression analysis can be either logistic or linear and is used to estimate or predict future trends and values based on past occurrences. An example is investment in a pension fund. In this case, one can calculate one’s yearly income and attempt to predict what one may need upon retirement. Based on the present income and the required income, one can then make a decision on investment.
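The regression idea described above can be sketched with a simple least-squares line fitted to past values and extended one step ahead. The income figures and the function name `fit_line` are hypothetical and serve only as an illustration; real forecasts would need more data and a check of how well the line fits.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical yearly income in thousands, invented for illustration.
years = [1, 2, 3, 4, 5]
income = [40.0, 42.0, 44.0, 46.0, 48.0]

slope, intercept = fit_line(years, income)
predicted_year_6 = slope * 6 + intercept  # extend the fitted trend by one year
print(slope, intercept, predicted_year_6)
```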
In time series analysis, every attribute value is recorded at a specific time interval. For instance, when purchasing stocks, one can take three companies, say X, Y and Z, determine their monthly performance and use it to predict their growth over the next year. Prediction is associated with time series, though it is not bound by time. It estimates values based on current and past data. For instance, the flow of water in a river is measured by different monitors at different levels and time intervals. This information can be used to predict the flow of water in the future.
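One simple way to turn past readings into a forecast is simple exponential smoothing, shown in the sketch below. This technique is chosen here only for illustration and is not named in the cited sources; the river-flow readings, the function name and the smoothing factor `alpha` are assumptions.

```python
def exponential_smoothing_forecast(series, alpha=0.5):
    """Return a one-step-ahead forecast using simple exponential smoothing."""
    if not series:
        raise ValueError("series must not be empty")
    level = series[0]
    for value in series[1:]:
        # Blend the newest reading with the running level.
        level = alpha * value + (1 - alpha) * level
    return level


# Hypothetical river-flow readings taken at equal time intervals.
flow = [10.0, 12.0, 14.0]
forecast = exponential_smoothing_forecast(flow, alpha=0.5)
print(forecast)
```

A larger `alpha` gives more weight to recent readings, while a smaller one produces a smoother forecast.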
Clustering is also commonly known as unsupervised learning (Garg & Sharma, 2013). It is similar to classification, except that it does not have predefined groups. Instead, the groups are defined by the data itself. For instance, a supermarket may hold various customer details such as purchase amount, job and age. Customers can be grouped by age and by job, and the share of purchases in each group can then inform business decisions that target a particular group.
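To show how groups can emerge from the data itself, the following sketch applies k-means, one common clustering algorithm, to a single attribute. The customer ages, the starting centroids and the function name `k_means_1d` are hypothetical; the cited sources do not prescribe this particular algorithm.

```python
def k_means_1d(values, centroids, iterations=10):
    """Group 1-D values around centroids, refining the centroids each round."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            # Assign each value to its nearest centroid.
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        new_centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Hypothetical customer ages, invented for illustration.
ages = [21, 23, 25, 58, 60, 62]
centroids, clusters = k_means_1d(ages, centroids=[20, 30])
print(centroids, clusters)
```

No group labels are supplied in advance; the two age groups are discovered from the values alone.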
Outlier or anomaly detection means identifying data items in a set that fail to match an expected behavior or pattern (Garg & Sharma, 2013). Anomalies can offer actionable and critical information. The technique can be used in various domains such as fraud detection, system health monitoring and intrusion detection, among others. In statistics, anomalies may or may not be useful, so it is important to find out whether they are meaningful in a given situation.
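A minimal sketch of anomaly detection is given below. It uses the modified z-score based on the median absolute deviation (MAD), a robust statistic chosen here for illustration rather than taken from the cited sources. The transaction amounts and the function name `mad_outliers` are hypothetical, and the cut-off of 3.5 is a commonly used convention, not a fixed rule.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return values whose modified z-score exceeds the threshold."""
    center = median(values)
    mad = median([abs(v - center) for v in values])
    if mad == 0:
        # The spread is too small to score the values, so flag nothing.
        return []
    return [v for v in values if abs(0.6745 * (v - center) / mad) > threshold]


# Hypothetical transaction amounts, invented for illustration.
amounts = [52, 48, 50, 51, 49, 50, 47, 53, 50, 500]
print(mad_outliers(amounts))
```

Whether a flagged value is fraud or simply an unusual but legitimate purchase still has to be judged in context.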
Under classification, there are various techniques such as Bayesian classification, decision trees and neural networks that are very useful in data mining. The tables below compare the classification algorithms and the decision-tree algorithms.
Algorithm: Decision Tree
Pros:
• It offers fast outcomes in the classification of unknown records.
• It is able to manage both discrete and continuous data.
• It operates well with attributes that are redundant.
• It also performs well with data that is numeric.
• It does not require preparation techniques such as normalization.
Cons:
• It offers results that are prone to error whenever several classes are used.
• It cannot predict values of continuous class features.
• Any small change in the data can alter the decision tree.
• Unrelated and unimportant attributes can negatively affect the construction of the decision tree (Wang, 2003).

Algorithm: Naïve Bayesian
Pros:
• It offers a high level of speed and accuracy on voluminous databases.
• It is very easy to comprehend.
• It can effectively manage streaming data.
• It has a minimum rate of error when compared to other classifiers.
• It is also able to manage both discrete and real values.
Cons:
• It assumes that the various features are independent and may therefore offer less accuracy.

Algorithm: Neural Networks
Pros:
• They have very high tolerance of noisy data.
• They can classify patterns on which they have not been trained.
• They are best suited to data that is continuous.
Cons:
• They are very difficult to interpret.
• They need longer periods of training.

Source: Garg, S., & Sharma, A. K. (2013). Comparative Analysis of Various Data Mining Techniques on Educational Datasets. International Journal of Computer Applications, 74(5).
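The independence assumption listed for the Naïve Bayesian classifier can be made concrete with a small sketch. Each feature contributes its own probability, and the contributions are simply multiplied (added in log space) for every class. The training records, labels and function names are hypothetical, and add-one smoothing is an assumption made here so that unseen values do not produce zero probabilities.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-feature value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (feature index, label) -> value counts
    feature_values = defaultdict(set)     # feature index -> distinct values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            feature_values[i].add(value)
    return class_counts, value_counts, feature_values


def predict(model, row):
    """Return the class with the highest log-posterior under add-one smoothing."""
    class_counts, value_counts, feature_values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)
        for i, value in enumerate(row):
            # Features are treated as independent given the class.
            numerator = value_counts[(i, label)][value] + 1
            denominator = count + len(feature_values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical records: (is a student, income level) -> purchase decision.
rows = [("yes", "high"), ("yes", "low"), ("no", "low"),
        ("no", "high"), ("yes", "high"), ("no", "low")]
labels = ["buy", "buy", "skip", "skip", "buy", "skip"]

model = train_naive_bayes(rows, labels)
print(predict(model, ("yes", "low")))
print(predict(model, ("no", "low")))
```

Training requires only one pass of counting, which is consistent with the speed advantage noted in the table.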
Comparison between the Decision Tree Algorithms
Algorithm: C4.5
Pros:
• It can use data that is continuous.
• It enhances computational efficiency.
• It avoids overfitting the data.
• It manages training data that have numeric and missing values.
Cons:
• It requires that the target attribute have only discrete values.

Algorithm: CART
Pros:
• It is considered non-parametric.
• It can easily handle outliers.
• It does not require variables to be selected in advance.
Cons:
• It may not produce a stable decision tree.
• It can only split on a single variable.

Algorithm: J48
Pros:
• It is able to manage both numeric and nominal values.
• It is able to handle missing values.

Source: Garg, S., & Sharma, A. K. (2013). Comparative Analysis of Various Data Mining Techniques on Educational Datasets. International Journal of Computer Applications, 74(5).
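Decision-tree algorithms differ partly in how they score a candidate split. C4.5 is usually described as using entropy-based measures such as information gain, while CART is usually described as using Gini impurity; the comparison above does not spell this out, so the following sketch is offered only as background. The labels and the split are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    total = len(parent)
    weighted = sum(len(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted


# Hypothetical class labels before and after a candidate split.
parent = ["yes"] * 4 + ["no"] * 4
split = [["yes", "yes", "yes", "no"], ["yes", "no", "no", "no"]]

print(entropy(parent), gini(parent))
print(information_gain(parent, split))
```

A tree builder would evaluate such a score for every candidate split and choose the best one, which is also why a small change in the data can alter the resulting tree.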
Current and Prospective Legal Issues
Given the rate at which technology is advancing, the laws that safeguard against abuses may fail to keep up. However, certain current laws apply to particular situations. After the 9/11 attacks, the US felt an urgent need to ensure security for the nation and at the same time protect the privacy of individuals (Cook & Cook, 2013). These two goals may be difficult to achieve together. One of the laws protecting people is the Fair Credit Reporting Act (FCRA) (Cook & Cook, 2013). Various individuals have raised concerns about the use of credit cards online because of the fear of information theft. Credit histories can also be obtained easily. With the increasing use of data mining, customers fear that their credit history and credit information will become more exposed (Cook & Cook, 2013). The Act was passed to prevent inaccurate or false information in people’s credit reports from being released. This means that companies need to adhere to the FCRA or face possible lawsuits.
Another law protecting individuals’ rights is the Right to Financial Privacy Act (Cook & Cook, 2013). It safeguards people’s financial information and requires that certain procedures be followed before organizations such as banks and credit card firms give any information to a federal agency. Even though most people want to keep their financial information private, the growth of data mining techniques and advancements may challenge this privacy act.
Current and Prospective Ethical Issues
The use of personal data by companies has major ethical implications. Most organizations face an ethical quandary about whether they should make individuals aware that their personal information is stored to enable prospective data mining (Cook & Cook, 2013). By granting people the option to have their data deleted from the system, a company may lose competitive advantage in the market. It is important for firms to determine whether a lack of concern for ethics will reduce customer goodwill or provoke a backlash from consumers. Data mining can also promote discrimination against people on the basis of religious orientation, sex and race (Milne, 2000). This is both illegal and unethical.
The ethical issues related to data mining are also discussed here in relation to privacy. Current developments in IT have made it easy to collect and process voluminous personal data such as shopping habits, driving records, criminal records and medical history, among others (Brankovic & Estivill-Castro, 1999). This information is very useful in national security, law enforcement and medical research. Privacy in this case can be defined as individuals’ right to have control over information about themselves (Brankovic & Estivill-Castro, 1999). The overall privacy issues related to Knowledge Discovery and Data Mining (KDDM) are the secondary use of personal information, granulated access to personal information and the handling of misinformation (Brankovic & Estivill-Castro, 1999). These issues show that current privacy policies and laws lag behind developments in technology and cannot provide full protection.
Most companies and individuals highly value their data. For companies, data is considered the most significant asset. KDDM has various privacy-related challenges such as stereotypes, the combination of patterns and the protection of personal data from KDDM researchers (Brankovic & Estivill-Castro, 1999). It is the duty of firms to protect the private data of their clients. Individuals would also like to keep their lives and information private. However, through the use of social media such as Instagram, most people do not really have private lives because they constantly post photos of where they are and what they are doing. In 1989, California’s Department of Motor Vehicles collected more than $16 million through the sale of driver-license data on roughly 19.5 million California residents (Brankovic & Estivill-Castro, 1999). This policy was revised following Robert Bardo’s killing of Rebecca Schaeffer. Bardo obtained data through the department’s services and followed Schaeffer to her apartment, where she was killed. In most criminal investigations, personal and often sensitive data about individuals and organizations are obtained. This provokes various reactions from individuals who fear for their privacy and security. It can be concluded that it is unethical for data mining researchers to offer or volunteer personal information to organizations for investigative purposes without following legal procedure. However, doing so may also be the best way to protect the individual. It is important for individuals and organizations to consider ways of securing their data.
In sum, there are several benefits associated with the use of data mining. However, there are also major ethical, legal and social concerns associated with it. It is important to establish a middle ground that protects consumers from any threat or embarrassment. The same data mining that poses a threat to the privacy of consumers should be used to protect them, for example by being strictly guided by local and international privacy laws. This should be done without unnecessarily restricting the power of data mining, which gives companies competitive advantage. Even though it may be difficult to strike an optimal balance between the two, it is still important to ensure that the privacy and security of individuals and the nation are safeguarded from the ‘wrong people.’
• Privacy protection cannot be attained simply by restricting the collection of data or the use of computers and networks (Milne, 2000). One option would be to deny researchers who use KDDM tools access to personal data; however, this would render the data useless.
• A balance should be obtained between the need to obtain knowledge discovery and privacy.
• Individuals need to be made aware of how their personal data is used and should receive an explanation of why it is important for this information about them to be held. This will help to prevent negative reactions.
Brankovic, L., & Estivill-Castro, V. (1999, July). Privacy issues in knowledge discovery and data mining. In Australian Institute of Computer Ethics Conference (pp. 89-99).
Cook, J. S., & Cook, L. L. (2013). Social, ethical, and legal issues of data mining (pp. 395-420). Idea Group Publishing.
Fan, W., & Bifet, A. (2013). Mining big data: Current status and forecast to the future. ACM SIGKDD Explorations Newsletter, 14(2), 1-5.
Garg, S., & Sharma, A. K. (2013). Comparative analysis of various data mining techniques on educational datasets. International Journal of Computer Applications, 74(5), 34.
Groth, R. (2008). Data mining: A hands on approach. New Jersey: Prentice Hall.
Han, J., Pei, J., & Kamber, M. (2011). Data mining: Concepts and techniques. Elsevier.
Marakas, G. M. (2014). Modern data warehousing, mining, and visualization: Core concepts (pp. 100-101). Upper Saddle River, NJ: Prentice Hall.
Milne, G. R. (2000). Privacy and ethical issues in database/interactive marketing and public policy: A research framework and overview of the special issue. Journal of Public Policy & Marketing, 19(1), 1-6.
Mining, W. I. D. (2013). Data mining: Concepts and techniques. Massachusetts: Morgan Kaufmann.
Ramakrishnan, N., & Grama, A. (2015). Data mining: From serendipity to science. Available at http://weber.itn.liu.se/~aidvi/courses/06/dm/papers/gei.pdf
Wang, J. (Ed.). (2003). Data mining: Opportunities and challenges. New Haven: Universities Press.