Data Mining
Data mining is the discovery of structures and patterns in large and complex data sets. There are two aspects to data mining: model building and pattern detection. Model building in data mining is very similar to statistical modeling, although new problems arise because of the large sizes of the data sets and the fact that data mining is often secondary data analysis. Pattern detection seeks anomalies or small local structures in data, with the vast mass of the data being irrelevant. Indeed, one view of many large‐scale data mining activities is that they primarily constitute filtering and data reduction. Although some subdisciplines of statistics have examined special cases of this problem, the bulk of the work on pattern detection to date has been computational, with an emphasis on algorithms.
 Principles of Data Mining
Data mining is the discovery of interesting, unexpected or valuable structures in large datasets. As such, it has two rather different aspects. One of these concerns large-scale, ‘global’ structures, and the aim is to model the shapes, or features of the shapes, of distributions. The other concerns small-scale, ‘local’ structures, and the aim is to detect these anomalies and decide if they are real or chance occurrences. In the context of signal detection in the pharmaceutical sector, most interest lies in the second of these two aspects; however, signal detection occurs relative to an assumed background model, so some discussion of the first aspect is also necessary. This paper gives a lightning overview of data mining and its relation to statistics, with particular emphasis on tools for the detection of adverse drug reactions.
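As a minimal sketch of what detecting a signal relative to an assumed background model can look like, the Python below compares an observed drug-event report count with the count expected if drug and event were independent, then asks how surprising the excess is under a Poisson model. The independence background, the Poisson tail test, the function names, and all counts are illustrative assumptions rather than methods or data taken from the paper.

```python
import math

def expected_count(drug_total, event_total, grand_total):
    """Expected drug-event reports if drug and event were independent
    (the assumed background model)."""
    return drug_total * event_total / grand_total

def poisson_upper_tail(observed, mean):
    """P(X >= observed) for X ~ Poisson(mean)."""
    below = sum(math.exp(-mean) * mean ** k / math.factorial(k)
                for k in range(observed))
    return max(0.0, 1.0 - below)

# Hypothetical counts from a spontaneous-report database (illustrative only).
n_drug_event = 12    # reports mentioning both the drug and the event
n_drug = 400         # reports mentioning the drug
n_event = 150        # reports mentioning the event
n_total = 20000      # all reports

expected = expected_count(n_drug, n_event, n_total)
ratio = n_drug_event / expected          # observed-to-expected ratio
p_value = poisson_upper_tail(n_drug_event, expected)

print(f"expected={expected:.1f} ratio={ratio:.1f} p={p_value:.2e}")
```

A large observed-to-expected ratio with a small tail probability flags a local anomaly worth examining; whether it is real or a chance occurrence still depends on how well the background model fits the bulk of the data.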
 Data Mining in Education
Applying data mining (DM) in education is an emerging interdisciplinary research field also known as educational data mining (EDM). It is concerned with developing methods for exploring the unique types of data that come from educational environments. Its goal is to better understand how students learn and to identify the settings in which they learn, in order to improve educational outcomes and to gain insights into and explain educational phenomena. Educational information systems can store a huge amount of potential data from multiple sources, in different formats and at different granularity levels. Each particular educational problem has a specific objective with special characteristics that require a different treatment of the mining problem. These issues mean that traditional DM techniques cannot be applied directly to these types of data and problems. As a consequence, the knowledge discovery process has to be adapted and some specific DM techniques are needed. This paper introduces and reviews key milestones and the current state of affairs in the field of EDM, together with specific applications, tools, and future insights.
 Application of Data Mining Techniques to Audiometric Data among Professionals in India
Aims: Noise-induced hearing loss (NIHL) is among the principal occupational health hazards. The study aims to illustrate that data mining techniques are essential tools for enriching the database on audiometric status and for rapidly disseminating the resulting knowledge base.
Study Design: A cross-sectional study design was used.
Place and Duration of Study: Pure tone audiometric data were recorded for both ears of drivers with 10 years of working experience and of office workers from Kolkata City, India.
Methodology: The data were subjected first to unsupervised and then to supervised learning techniques, in order to train a classifier that determines the cluster for newly generated cases. The Expectation Maximization (EM), k-means, Learning Vector Quantization (LVQ), and Self-Organizing Map (SOM) unsupervised learning techniques were used.
Results: Silhouette Plot (SP) validation showed that 93.3% of the considered cases for the left ear and 85.8% for the right ear were correctly classified. The clustered data were then subjected to a supervised learning algorithm, with each cluster bearing its class label, to achieve a high rate of correct classification. The Naïve Bayes Classifier (NBC) recorded an accuracy of 98.8% for both left and right ears. The high accuracy of the supervised learning algorithms, validated with 10-fold cross-validation, suggests that they can predict the class of newly introduced audiometric data.
Conclusion: Using machine learning and data classification models on audiometric data is feasible and would be an effective tool in hearing conservation programs for individuals exposed to noisy environments in their workplaces.
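The two-stage workflow described in this abstract (cluster first, then train a classifier on the cluster labels and check it with 10-fold cross-validation) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: it uses only k-means and a hand-written Gaussian naive Bayes, omits EM, LVQ, SOM and the silhouette validation, and runs on synthetic two-feature data standing in for audiometric thresholds. All function names and parameters are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm (assumes k >= 2); returns one cluster index per point."""
    # Deterministic start: spread initial centroids over points sorted by coordinate sum.
    ordered = sorted(points, key=sum)
    centroids = [ordered[i * (len(ordered) - 1) // (k - 1)] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties out
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def fit_nb(X, y):
    """Gaussian naive Bayes: per-class log prior, feature means and variances."""
    model = {}
    for c in set(y):
        rows = [x for x, label in zip(X, y) if label == c]
        means = [sum(col) / len(rows) for col in zip(*rows)]
        variances = [sum((v - m) ** 2 for v in col) / len(rows) + 1e-9
                     for col, m in zip(zip(*rows), means)]
        model[c] = (math.log(len(rows) / len(X)), means, variances)
    return model

def predict_nb(model, x):
    """Return the class with the highest log posterior for feature vector x."""
    def log_posterior(c):
        log_prior, means, variances = model[c]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances))
    return max(model, key=log_posterior)

def cross_validate(X, y, folds=10, seed=0):
    """k-fold cross-validation accuracy of the naive Bayes classifier."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = set(idx[f::folds])
        train = [i for i in idx if i not in test]
        model = fit_nb([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict_nb(model, X[i]) == y[i] for i in test)
    return correct / len(X)

# Synthetic stand-in for hearing thresholds (dB) at two frequencies; not study data.
rng = random.Random(42)
points = ([(rng.gauss(20, 5), rng.gauss(25, 5)) for _ in range(100)] +
          [(rng.gauss(60, 5), rng.gauss(65, 5)) for _ in range(100)])

cluster_labels = kmeans(points, k=2)                # unsupervised stage
accuracy = cross_validate(points, cluster_labels)   # supervised stage, 10-fold CV
print(f"10-fold CV accuracy on cluster labels: {accuracy:.3f}")
```

Because the classifier is trained on labels produced by the clustering step, its cross-validated accuracy measures how reproducibly new cases can be assigned to the discovered clusters, not agreement with a clinical diagnosis.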
 Novel Data Mining Techniques for Incomplete Clinical Data in Diabetes Management
An important part of health care involves the upkeep and interpretation of medical databases containing patient records for clinical decision making, diagnosis and follow-up treatment. Missing clinical entries make it difficult to apply data mining algorithms for clinical decision support. This study demonstrates that higher predictive accuracy is possible using conventional data mining algorithms if missing values are dealt with appropriately. We propose a novel algorithm using a convolution of sub-problems to stage a super problem, where classes are defined by the Cartesian product of the class values of the underlying problems, and Incomplete Information Dismissal and Data Completion techniques are applied for reducing features and imputing missing values. Predictive accuracies using Decision Branch, Nearest Neighborhood and Naïve Bayesian classifiers were compared for predicting diabetes, cardiovascular disease and hypertension. Data were derived from the Diabetes Screening Complications Research Initiative (DiScRi), conducted at a regional Australian university and involving more than 2400 patient records with more than one hundred clinical risk factors (attributes). The results show substantial improvements in the accuracy achieved with each classifier for an effective diagnosis of diabetes, cardiovascular disease and hypertension, compared with the accuracy achieved without substituting missing values. The improvement is 7% for diabetes, 21% for cardiovascular disease and 24% for hypertension, and our integrated novel approach has resulted in more than 90% accuracy for the diagnosis of any of the three conditions. This work advances data mining research towards achieving an integrated and holistic management of diabetes.
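The two ideas named in this abstract, reducing and completing incomplete records and defining super-problem classes as the Cartesian product of the underlying class values, can be illustrated with a short Python sketch. The abstract does not specify how Incomplete Information Dismissal or Data Completion work, so the code swaps in generic stand-ins: dropping features that are mostly missing and filling the remaining gaps with column means. The records, feature names, threshold and function names are all hypothetical.

```python
from itertools import product

FEATURES = ["glucose", "bmi", "sbp", "hdl"]
TARGETS = ["diabetes", "hypertension"]

# Hypothetical records; None marks a missing clinical entry.
records = [
    {"glucose": 5.1, "bmi": 24.0, "sbp": 118.0, "hdl": None, "diabetes": 0, "hypertension": 0},
    {"glucose": 7.9, "bmi": None, "sbp": 150.0, "hdl": None, "diabetes": 1, "hypertension": 1},
    {"glucose": None, "bmi": 31.5, "sbp": 142.0, "hdl": None, "diabetes": 0, "hypertension": 1},
    {"glucose": 8.4, "bmi": 29.0, "sbp": None, "hdl": 1.1, "diabetes": 1, "hypertension": 0},
]

def keep_dense_features(rows, features, max_missing=0.5):
    """Stand-in for feature reduction: drop features that are mostly missing."""
    return [f for f in features
            if sum(r[f] is None for r in rows) / len(rows) <= max_missing]

def impute_mean(rows, features, targets):
    """Stand-in for data completion: fill gaps with the column mean of observed values."""
    means = {}
    for f in features:
        observed = [r[f] for r in rows if r[f] is not None]
        means[f] = sum(observed) / len(observed)
    return [{**{f: (means[f] if r[f] is None else r[f]) for f in features},
             **{t: r[t] for t in targets}} for r in rows]

kept = keep_dense_features(records, FEATURES)   # "hdl" is 75% missing, so it is dropped
completed = impute_mean(records, kept, TARGETS)

# Super problem: one composite class per combination of the underlying class values.
class_values = {t: sorted({r[t] for r in records}) for t in TARGETS}
super_classes = list(product(*(class_values[t] for t in TARGETS)))
super_labels = [tuple(r[t] for t in TARGETS) for r in completed]

print(kept)
print(super_classes)
print(super_labels)
```

A single classifier trained on `super_labels` predicts all conditions jointly, at the cost of a class space that grows multiplicatively with each added condition.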
 Hand, D.J. and Adams, N.M., 2014. Data mining. Wiley StatsRef: Statistics Reference Online, pp.1-7.
 Hand, D.J., 2007. Principles of data mining. Drug safety, 30(7), pp.621-622.
 Romero, C. and Ventura, S., 2013. Data mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 3(1), pp.12-27.
 Majumder, J. and Sharma, L.K., 2014. Application of data mining techniques to audiometric data among professionals in India. Journal of Scientific Research and Reports, pp.2960-2971.
 Jelinek, H.F., Yatsko, A., Stranieri, A. and Venkatraman, S., 2014. Novel data mining techniques for incomplete clinical data in diabetes management. Current Journal of Applied Science and Technology, pp.4591-4606.