Data Mining Tutorial – A Complete Guide

  1. What is Data Mining?
  2. Data Mining History and Current Advances
  3. Open-Source Software for Data Mining
  4. Advantages of Data Mining
  5. Disadvantages of Data Mining
  6. Data Mining Tools
  7. Types of Databases
  8. The Process of Data Mining
  9. Data Mining – Tasks
  10. Data Mining – Issues
  11. Data Mining – Evaluation
  12. Data Warehousing (OLAP) to Data Mining (OLAM)
  13. Data Mining – Terminologies
  14. Data Mining – Knowledge Discovery
  15. Data Mining – Systems
  16. Data Mining – Query Language
  17. Data Mining – Decision Tree Induction
  18. Data Mining – Bayesian Classification
  19. Data Mining – Cluster Analysis
  20. Data Mining – Mining Text Data
  21. Data Mining – Mining World Wide Web
  22. Data Mining – Themes
  23. Data Mining and Collaborative Filtering

What is Data Mining?

Data mining operates in combination with predictive analytics, a branch of statistical science that employs sophisticated algorithms designed to handle particular kinds of problems. Predictive analytics finds trends in vast volumes of data, and data mining extends those trends into projections and forecasts. Data mining has a specific aim: to identify correlations in datasets for a set of problems that belong to a particular context.

Data mining is the process of scanning massive data sets for trends and patterns that cannot be identified with basic analysis techniques. It uses sophisticated computational algorithms to analyse data and then estimate the probability of future events on the basis of the results. It is often referred to as knowledge discovery in data (KDD). It takes a range of forms, including visual data mining, text mining, web mining, social network mining, and video and audio mining, among others.

Data mining can be thought of as data analytics carried out in a particular scenario, on a specific data set, with a defined objective. The method draws on a number of approaches, such as text mining, web mining, video and audio mining, graphic data mining, and social media mining, carried out with software that ranges from basic to highly specialised. By outsourcing data mining, all of this analysis can be performed more efficiently and at lower running cost.

The main data mining task is the automatic or semi-automatic analysis of vast volumes of data to extract previously unknown, interesting patterns such as groups of records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining). This usually involves the use of database techniques such as spatial indices. The patterns found can then serve as a kind of summary of the input data and be used in further analysis or, for instance, in machine learning and predictive modelling.

The distinction between data analysis and data mining is that data analysis is used to test models and hypotheses on a dataset, e.g. analysing the success of a marketing strategy, regardless of the amount of data, whereas data mining uses machine learning and statistical models to uncover hidden or previously unknown patterns in a vast amount of data.

Word Origin and its Background

In the 1960s, statisticians and economists used terms such as data fishing or data dredging to refer to what they regarded as the bad practice of analysing data without an a-priori hypothesis. The term “data mining” was used in a similarly critical way by the economist Michael Lovell in a 1983 article in the Review of Economic Studies. Lovell suggested that the practice “masquerades under a number of labels, ranging from ‘fishing’ to ‘snooping’.”

The term data mining appeared in the database community around 1990, usually with positive connotations. For a brief time in the 1980s the phrase “database mining” was used, but after it was trademarked by HNC, a San Diego-based company, to pitch their Database Mining Workstation, researchers switched to data mining. Other terms used include data archaeology, information harvesting, information discovery, and information retrieval.

Data Mining History and Current Advances

The practice of searching through data to uncover hidden connections and predict future trends has a long history. Also known as “knowledge discovery in databases,” the term “data mining” was not coined until the 1990s. Its foundation, however, comprises three intertwined scientific disciplines: statistics (the numeric study of data relationships), artificial intelligence, and machine learning. What was old is new again, as data mining technology keeps evolving to keep pace with the limitless potential of big data and affordable computing power.

Over the last decade, advances in processing power and speed have enabled a move from manual, time-consuming practices to fast, easy, and automated data analysis. The more complex the data sets collected, the more potential there is to uncover relevant insights.

Open-Source Software for Data Mining

The following applications are available as open-source software, which means their source code is openly accessible.

  • ELKI
  • GATE
  • Carrot2
  • MEPX
  • KNIME
  • Chemicalize.org
  • Massive Online Analysis (MOA)
  • Mlpack
  • OpenNN
  • scikit-learn 
  • UIMA
  • Torch 
  • ML-Flex
  • NLTK (Natural Language Toolkit)
  • Orange
  • R
  • Weka

Advantages of Data Mining

  • Data mining is a fast process that enables even novice users to analyse large volumes of data in a short time.
  • Data mining technology allows companies to obtain knowledge-based information.
  • Compared with other statistical data applications, data mining is cost-effective.
  • Data mining helps companies make significant improvements in their operations and production.
  • It enables the automated discovery of hidden patterns as well as the prediction of trends and behaviours, which supports an organisation's decision-making process.
  • It can be implemented in new systems as well as on existing platforms.

Disadvantages of Data Mining

  • Many data mining software tools are difficult to operate and require specialised training and certification.
  • Data mining techniques are not 100% accurate and may lead to serious consequences in certain conditions.
  • Companies may sell the valuable customer data they collect to other organisations for revenue.
  • Different data mining tools work in different ways due to the different algorithms used in their design, so selecting the best data mining software is a difficult task.

Data Mining Tools

A data mining tool is a framework that lets us run different algorithms, such as clustering or classification, on a data set and visualise the results ourselves. It gives us deeper insight into our data and into the phenomena the data represents.

Below are the most common tools for data mining:

1. Rattle:

Rattle is a GUI-based data mining tool built on the R statistical programming language. It exposes R's statistical power by providing significant data mining functionality through the interface. Although Rattle has a comprehensive and well-developed user interface, it also has an integrated log tab that records the R code generated for every GUI operation.

You can review and edit the data set and the code produced by Rattle. Rattle also allows you to inspect the generated code, reuse it for other purposes, and extend it without any restriction.

2. SAS Data Mining:

SAS stands for Statistical Analysis System. It is a product of the SAS Institute created for analytics and data management. SAS can mine, alter, and manage data from different sources and perform statistical analysis. It offers a graphical user interface for non-technical users.

SAS Data Miner allows users to analyse big data and derive accurate insight for timely decision-making. SAS has a distributed, highly scalable in-memory processing architecture and is well suited for data mining, optimisation, and text mining.

3. DataMelt Data Mining:

  • DataMelt is a computation and visualisation environment that offers an interactive framework for data analysis and visualisation. It is mainly intended for academics, scientists, and engineers. It is also known as DMelt.
  • DMelt is a Java-based, multi-platform tool that can run on almost any operating system compatible with the JVM (Java Virtual Machine). It consists of mathematics and science libraries.

DMelt can be used for large data volume processing, data mining, and data analysis. It is widely used in the natural sciences, financial markets, and engineering.

4. Orange Data Mining:

Orange is an open-source software suite for machine learning and data mining. It supports data visualisation and is a component-based application written in Python, developed by the bioinformatics laboratory of the Faculty of Computer and Information Science at the University of Ljubljana, Slovenia.

Since it is component-based software, Orange's components are called “widgets.” These widgets range from data pre-processing and visualisation to the evaluation of algorithms and predictive modelling.

Widgets offer essential functions, like:

  • Visualization of Data Components
  • Showing the data table and choosing features
  • Predictors for training and the comparison of algorithms
  • Reading statistics

5. Rapid Miner:

RapidMiner is one of the most popular predictive analytics tools, developed by the company of the same name. It is written in the Java programming language and provides an integrated environment for text mining, machine learning, deep learning, and predictive analytics.

The tool can be used for a wide range of applications, including business and commercial applications, research, education, training, application development, and machine learning.

RapidMiner is based on a client/server model, with the server offered on-premises as well as in public or private clouds. It provides template-based frameworks that enable rapid delivery with fewer errors than manual coding typically introduces.

Types of Databases

Relational Database:

A relational database is a collection of data sets formally organised into tables, records, and columns, from which data can be accessed in various ways without needing to reorganise the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organisation.

Data warehouses:

A data warehouse is the technology that gathers data from multiple sources within an organisation to provide meaningful business insights. The large volume of data comes from different areas such as marketing and finance. The extracted data is used for analytical purposes and supports decision-making in a business enterprise. The data warehouse is designed for data analysis and storage rather than transaction processing.

Data Repositories:

Generally speaking, a data repository refers to a destination where data is stored. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organisation has stored various kinds of data.

Object-Relational Database:

An object-relational model is considered a hybrid of an object-oriented database model and a relational database model. It supports inheritance, classes, objects, and so on.

One of the primary objectives of the object-relational data model is to bridge the gap between the relational database and the object-oriented practices widely used in programming languages such as Java, C++, and C#.

Transactional Database:

A transactional database refers to a DBMS (database management system) that can undo a database transaction if it is not performed correctly. Although this was a unique capability a long time ago, today most relational database systems support transactional database operations.

The Process of Data Mining

Several steps are involved before the actual mining of data can occur. They are:

Stage 1: Business Understanding – Before you begin, you need a full picture of your organisation's priorities, available resources, and current situation in relation to its requirements. This helps create a detailed data mining roadmap that meets the goals of the business efficiently.

Stage 2: Data Quality Checks – As data is gathered from multiple sources, it has to be checked and matched to ensure there are no bottlenecks in the data integration process. Quality checks detect underlying anomalies in the data, such as missing values that need interpolation, keeping the data in top condition before it is mined.

Stage 3: Data Cleaning – Selecting, cleaning, encoding, and anonymizing data before mining is estimated to take up to 90 percent of the total time.

Stage 4: Data Transformation – Consisting of five sub-stages, this step prepares the data into final data sets. It entails:

  • Data Smoothing: Noise is removed from the data here.
  • Data Summary: Data sets are aggregated in this step.
  • Data Generalisation: The data is generalised by replacing low-level data with higher-level concepts.
  • Data Normalization: Data is scaled into fixed ranges here (see the sketch after Stage 5 below).
  • Data Attribute Construction: The data sets are required to be in the attribute set before data mining.

Stage 5: Data Modelling: Several mathematical models are applied to the data set, depending on various conditions, to identify data patterns.
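As a small illustration of the normalization sub-stage described in Stage 4, the sketch below rescales a hypothetical numeric column into the fixed range [0, 1] with scikit-learn's MinMaxScaler; the column name and values are invented for the example.

# Minimal sketch of min-max normalization (Stage 4), assuming scikit-learn is installed.
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical annual income values in USD.
income = np.array([[25_000], [48_000], [300_000], [72_500]])

scaler = MinMaxScaler(feature_range=(0, 1))
normalized = scaler.fit_transform(income)

print(normalized.ravel())   # every value now falls in the fixed range [0, 1]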

Data Mining – Tasks

Data mining deals with the kind of patterns that can be mined. On the basis of the kind of data to be mined, there are two categories of functions involved in data mining −

  • Descriptive
  • Classification and Prediction

Descriptive Function

The descriptive function deals with the database’s general data properties. Here is a list of descriptive functions −

  • Definition of Class/Concept
  • Mining of Frequent Patterns
  • Mining of Associations
  • Mining of Correlations
  • Mining of Clusters

Definition of class/concept 

Class/Concept refers to the data being associated with classes or concepts. For example, in a company the classes of items for sale include computers and printers, and the concepts of customers include big spenders and budget spenders. Such descriptions of a class or a concept are called class/concept descriptions. These descriptions can be derived in the following two ways –

Data Characterization – This refers to summarizing the data of the class under study. The class under study is called the Target Class.

Data Discrimination – This refers to comparing or mapping the target class with a predefined group or class.

Mining of Frequent Patterns

Frequent patterns are patterns that occur frequently in transactional data. Here is the list of kinds of frequent patterns –

• Frequent Item Set – This refers to a set of items that frequently appear together, for example, milk and bread.

• Frequent Subsequence – This refers to a sequence of patterns that occur frequently, such as purchasing a camera being followed by purchasing a memory card.

• Frequent Sub Structure – Substructure refers to different structural forms, such as graphs, trees, or lattices, which may be combined with item sets or subsequences.

Mining of associations

In retail sales, associations are used to identify products that are frequently purchased together. This function is the process of discovering the relationship between data items and determining association rules.

For example, a retailer generates an association rule showing that 65% of the time milk is sold with bread and only 35% of the time biscuits are sold with bread.
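To make the retailer example concrete, the short sketch below computes support and confidence for a rule of the form "bread ⇒ milk" over a handful of invented transactions; the numbers here are illustrative and are not the 65%/35% figures above.

# Minimal sketch: support and confidence for the association rule "bread => milk".
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
    {"bread", "milk"},
]

n = len(transactions)
both = sum(1 for t in transactions if {"bread", "milk"} <= t)   # transactions with bread AND milk
bread = sum(1 for t in transactions if "bread" in t)            # transactions with bread

support = both / n          # fraction of all transactions containing bread and milk together
confidence = both / bread   # of the transactions with bread, the fraction that also contain milk

print(f"support={support:.2f}, confidence={confidence:.2f}")    # support=0.60, confidence=0.75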

Mining of Correlations

This is a kind of additional analysis performed to uncover interesting statistical correlations between associated-attribute-value pairs or between two item sets, in order to determine whether they have a positive, negative, or no effect on each other.
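A commonly used correlation measure for two item sets is lift, P(A and B) divided by P(A)·P(B): a value above 1 suggests a positive effect, below 1 a negative one, and close to 1 no effect. A minimal sketch with invented probabilities:

# Minimal sketch of lift as a correlation measure between two items.
p_bread, p_milk = 0.8, 0.8      # invented P(bread) and P(milk)
p_both = 0.6                    # invented P(bread and milk)

lift = p_both / (p_bread * p_milk)   # > 1 positive, < 1 negative, close to 1 no correlation
print(round(lift, 2))                # 0.94 here, i.e. a slightly negative correlation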

Mining of Clusters

A cluster refers to a collection of objects of a related nature. Analysis of clusters refers to the creation of a group of objects which are very close to each other but very distinct from the objects of other clusters.

Classification and Prediction

Classification is the process of finding a model that describes the data classes or concepts. The purpose is to use this model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of sets of training data. The derived model can be presented in the following forms −

• Classification rules

• Decision Trees 

• Mathematical Formulae 

• Neural Networks

The list of functions involved in these processes is as follows −

• Classification − It predicts the class of objects whose class label is unknown. Its objective is to find a derived model that describes and distinguishes data classes or concepts. The derived model is based on the analysis of a set of training data, i.e. data objects whose class labels are well known.

• Prediction − It is used to predict missing or unavailable numerical data values rather than class labels. Regression analysis is generally used for prediction. Prediction can also be used to identify distribution trends based on the available data.

• Outlier Analysis − Outliers may be defined as data objects that do not comply with the general behaviour or model of the available data.

• Evolution Analysis − Evolution analysis refers to the description and modelling of regularities or trends for objects whose behaviour changes over time.

Data Mining Task Primitives

• We can specify a data mining task in the form of a data mining query.

• This query is input to the system.

A data mining query is defined in terms of data mining task primitives. These primitives allow the user to communicate with the data mining system in an interactive manner. Here is the list of Data Mining Task Primitives −

1. Collection of task-relevant data that is to be mined

This is the portion of the database in which the user is interested. This portion includes the following −

  • Database attributes
  • Data warehouse dimensions of interest

2. Kind of knowledge to be mined

This specifies the kind of functions to be performed. These functions are −

  • Characterization
  • Discrimination
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Clustering
  • Outlier Analysis
  • Evolution Analysis

3. Background knowledge

Background knowledge allows data to be mined at multiple levels of abstraction. For example, concept hierarchies are one form of background knowledge that allows data to be mined at multiple levels of abstraction.

4. Interestingness measures and thresholds for pattern evaluation

This is used to evaluate the patterns discovered by the knowledge discovery process. Different measures of interestingness are needed for different kinds of knowledge.

5. Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. These representations may include the following −

• Graphs

• Decision Trees

• Rules

• Cubes

• Charts

• Tables

Data Mining – Issues

Data mining is not an easy task: the algorithms used can be very complex, and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors give rise to a number of issues. Here we will address the major issues regarding −

• Performance Issues

• Diverse Data Types Issues

• Mining Methodology and User Interaction

Performance Issues

There can be performance-related issues such as the following −

• Efficiency and scalability of data mining algorithms − In order to effectively extract information from huge amounts of data in databases, data mining algorithms must be efficient and scalable.

• Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms update databases without mining the data again from scratch.

Diverse Data Types Issues

Handling of relational and complex types of data − The database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.

Mining information from heterogeneous databases and global information systems − The data is available from different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.

Mining Methodology and User Interaction Issues

It refers to the following kinds of issues:

• Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.

• Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive, allowing users to focus the search for patterns and to provide and refine data mining requests based on the returned results.

• Incorporation of background knowledge − Background knowledge may be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.

• Data mining query languages and ad hoc data mining − A data mining query language should be integrated with a data warehouse query language and optimised for efficient and flexible data mining, allowing the user to describe ad hoc mining tasks.

• Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations that are easily understandable.

• Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without them, the accuracy of the discovered patterns will be poor.

• Pattern evaluation − The patterns discovered should be interesting; patterns that merely represent common knowledge or lack novelty add little value.

Data Mining – Evaluation

Data Warehouse

Features of a Data Warehouse

A data warehouse has the following features that facilitate the decision-making phase of management.

• Integrated − A data warehouse is constructed by integrating data from heterogeneous sources, such as flat files, relational databases, etc. This integration enhances the effective analysis of data.

• Non-volatile − Non-volatile means the existing data is not replaced when new data is added. The data warehouse is kept separate from the operational database, and therefore frequent changes in the operational database are not reflected in the data warehouse.

• Time-Variant − The data collected in a data warehouse is identified with a particular time period. The data in a data warehouse provides information from a historical point of view.

• Subject-Oriented − A data warehouse is subject-oriented because it provides information around a subject rather than the organisation's ongoing operations.

Data Warehousing

Data warehousing is the process of constructing and using a data warehouse. A data warehouse is constructed by integrating data from multiple heterogeneous sources. It supports analytical reporting, structured and/or ad hoc queries, and decision making.

Data warehousing involves data cleaning, data integration, and data consolidation. We have the following two approaches to integrating heterogeneous databases −

• Query Driven Approach

• Update Driven Approach


Query-Driven Approach

This is the traditional approach to integrating heterogeneous databases. This approach is used to build wrappers and integrators on top of multiple heterogeneous databases. These integrators are also known as mediators.

Process of Query Driven Approach

• When a query is issued to the client side, a metadata dictionary translates the query into queries appropriate for the individual heterogeneous site involved.

• These queries are now mapped and sent to the local query processor.

• The results from the heterogeneous sites are integrated into a global answer set.

This technique has the following drawbacks-

• The query-driven approach requires complex integration and filtering processes.

• It is very inefficient and very expensive for frequent queries.

• This approach is also expensive for queries that require aggregation.

Update-Driven Approach

Instead of the traditional approach discussed before, today’s data warehouse systems follow an update-driven approach. The information from different heterogeneous sources is combined in advance and maintained in a warehouse in the update-driven strategy. For direct querying and reviewing, this information is accessible. 

Advantages of the Update-Driven Approach

This approach has the following advantages −

• It provides high performance.

• The data can be copied, processed, integrated, annotated, summarized, and restructured in the semantic data store in advance.

• Query processing does not require an interface to process data at local sources.

Data Warehousing (OLAP) to Data Mining (OLAM)

Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP) with data mining and mining knowledge in multidimensional databases.

OLAM’s Significance

OLAM is essential for the following factors –

• High quality of data in data warehouses − Data mining tools are required to work on integrated, consistent, and cleaned data. These steps are very costly in the pre-processing of data. The data warehouses constructed by such pre-processing are valuable sources of high-quality data for OLAP and data mining as well.


• Available information processing infrastructure surrounding data warehouses − The information processing infrastructure refers to accessing, integration, consolidation, and transformation of multiple heterogeneous databases, web access and service facilities, and reporting and OLAP analysis tools.

• OLAP-based exploratory data analysis − Exploratory data analysis is required for effective data mining. OLAM provides facilities for data mining on various subsets of data and at different levels of abstraction.

• Online selection of data mining functions − Integrating OLAP with multiple data mining functions and online analytical mining provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.

Data Mining – Terminologies

Data Mining

Data mining is defined as extracting information from an enormous set of data; see the earlier sections for a full explanation. This information can be used for any of the following applications −

• Review of the market

• Prevention of Fraud

• Retention of customers

• Quality Management

• Discovery in science 

Data Mining Engine

The data mining engine is the true core of the data mining architecture. It consists of instruments and software used to obtain insights and knowledge from data collected from various data sources and stored within the data warehouse.

The data mining engine is essential to the data mining process. It consists of a set of functional modules that perform the following tasks −

• Association and Study of Correlation  

• Predicting

• Classification

• Analysis for clusters

• Outlier analysis  

• Study of Evolution

• Characterization

Knowledge Base

This is the domain knowledge. This knowledge is used to guide the search or evaluate the interestingness of the resulting patterns.

Knowledge Discovery

Some individuals regard data mining as knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −

• Cleaning of data

• Data Integration

• Selection of Data

• Transformation of data

• Data Mining

• Pattern Assessment

• Presentation of Information

User interface

The user interface is a data mining system module that enables user-to-data mining system interaction. The user interface supports the following features:

  • Interact with the system by specifying a data mining query task.
  • Provide information to help focus the search.
  • Perform mining based on intermediate data mining results.
  • Browse database and data warehouse schemas or data structures.
  • Evaluate mined patterns.
  • Visualize the patterns in different forms.

Data Integration

Data integration is a data pre-processing technique that merges data from multiple heterogeneous data sources into a coherent data store. Data integration may involve inconsistent data and therefore needs data cleaning.

Data Cleaning

Data cleaning is a technique applied to remove noisy data and correct inconsistencies in data. Data cleaning involves transformations to correct wrong data. Data cleaning is performed as a data pre-processing step while preparing data for a data warehouse.
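A minimal pandas sketch of the kind of cleaning described here, using a hypothetical table with a missing value and an inconsistent category label; the column names and fix-up rules are invented for illustration.

# Minimal data cleaning sketch with pandas: impute a missing value and fix an inconsistency.
import pandas as pd

df = pd.DataFrame({
    "age": [34, None, 29],           # one missing value
    "country": ["US", "US", "U.S"],  # inconsistent spelling of the same category
})

df["age"] = df["age"].fillna(df["age"].median())       # impute the missing age with the median
df["country"] = df["country"].replace({"U.S": "US"})   # harmonise the category labels

print(df)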

Data Selection

Data selection is the process where data relevant to the analysis task is retrieved from the database. Sometimes data transformation and consolidation are performed before the data selection process.

Clusters

A cluster refers to a collection of objects of a related nature. Analysis of clusters refers to the creation of a group of objects which are very close to each other but very distinct from the objects of other clusters.

Data Transformation  

In this step, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations.

Data Mining – Knowledge Discovery

What is Knowledge Discovery?

Some individuals do not differentiate data mining from knowledge discovery, while others view data mining as an essential step in the process of knowledge discovery. Here is the list of steps involved in the knowledge discovery process −

  • Data Cleaning − Noise and inconsistent data are removed in this step.
  • Data Integration − Multiple data sources are combined in this step.
  • Data Selection − Data relevant to the analysis task is retrieved from the database.
  • Data Transformation − Data is transformed or consolidated into forms appropriate for mining.
  • Data Mining − Intelligent methods are applied in order to extract data patterns.
  • Pattern Evaluation − Data patterns are evaluated in this step.
  • Knowledge Presentation − Knowledge is represented in this step.

Data Mining – Systems

A great variety of data mining systems is available. Data mining systems may integrate techniques from the following −

  • Business
  • Pattern Recognition
  • Information Retrieval
  • Spatial Data Analysis
  • Image Analysis
  • Signal Processing
  • Bioinformatics
  • Computer Graphics
  • Internet Technologies

Classification for Data Mining System

A data mining system can be classified according to the following criteria −

  • Database technology
  • Statistics
  • Machine learning
  • Information science
  • Visualization
  • Other disciplines

Apart from these, a data mining system can also be classified based on the kind of databases mined, the knowledge mined, the techniques utilised, and the applications adapted.

Classification Based on Mined Databases

We can classify a data mining system according to the kind of databases mined. A database system can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.

For instance, if we classify a database according to the data model, we may have a relational, transactional, object-relational, or data warehouse mining system.

Classification Dependent on the type of mined knowledge

We can classify a data mining system according to the kind of knowledge mined. This means the data mining system is classified on the basis of functionalities such as −

  • Characterization
  • Discrimination
  • Association and Correlation Analysis
  • Classification
  • Prediction
  • Outlier Analysis
  • Evolution Analysis

Classification based on techniques used

We can classify a data mining system according to the kind of techniques used. These techniques can be described according to the degree of user interaction involved or the methods of analysis employed.

Classification Based on the Applications Adapted

We can classify a data mining system according to the applications adapted. This means the data mining system is classified on the basis of application domains such as −

  • Finance
  • Telecommunications
  • DNA
  • Exchanges in stocks 
  • E-mail

Integrating the framework for data mining with a DB/DW system

If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the available data sets.

The Integration Schemes list is as follows –

  • Semi-tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and, in addition, efficient implementations of a few data mining primitives can be provided in the database.
  • No Coupling − In this scheme, the data mining system does not use any of the functions of a database or data warehouse. It fetches the data from a particular source and processes that data using some data mining algorithms. The data mining result is stored in another file.
  • Tight Coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem is treated as one functional component of the information system.
  • Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from a data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or a data warehouse.

Data Mining – Query Language

The Data Mining Query Language (DMQL) was proposed by Han, Fu, Wang, et al. for the DBMiner data mining system. The Data Mining Query Language is actually based on SQL (Structured Query Language). Data Mining Query Languages can be designed to support ad hoc and interactive data mining. This DMQL provides commands for specifying primitives. The DMQL can work with databases and data warehouses as well, and can be used to define data mining tasks.

Syntax for Task-Relevant Data Specification

Here is the DMQL syntax for specifying task-relevant data −

use database db_name

or 

use data warehouse dw_name
in relevance to att_or_dim_list
from relation(s)/cube(s) [where condition]
order by order_list
group by grouping_list

Syntax to Specify the Kind of Knowledge:

The syntax for characterization, discrimination, association, classification, and prediction is discussed below.

Characterization

The syntax is as follows−

mine characteristics [as pattern_name]
   analyze  {measure(s) }

The analyze clause specifies aggregate measures, such as count, sum, or count%.

For example −

A description describing customer purchasing habits.

Discrimination

The syntax is as follows −

mine comparison [as {pattern_name}]
for {target_class} where {target_condition}
{versus {contrast_class_i}
where {contrast_condition_i}}
analyze {measure(s)}

For example, a user may define big spenders as customers who purchase items that cost an average of $300 or more, and budget spenders as customers who purchase items at an average cost of less than $300. The mining of discriminant descriptions for customers from each of these categories can be specified in DMQL as −

mine comparison as purchaseGroups
for bigSpenders where avg(I.price) ≥ $300
versus budgetSpenders where avg(I.price) < $300
analyze count

Association

The syntax is as follows −

mine associations [ as {pattern_name} ]
{matching {metapattern} }

Example −

mine associations as buyingHabits

matching P(X:customer,W) ^ Q(X,Y) ⇒ buys(X,Z)

Here, X is a key of the customer relation; P and Q are predicate variables; and W, Y, and Z are object variables.

Classification

The syntax is as follows −

mine classification [as pattern_name]
analyze classifying_attribute_or_dimension

For example, to mine patterns classifying customer credit rating, where the classes are determined by the attribute credit_rating, the task can be specified as classifyCustomerCreditRating −

mine classification as classifyCustomerCreditRating
analyze credit_rating

Prediction

The syntax is as follows −

mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i= value_i}}

Concept Hierarchy Specification Syntax

Use the following syntax to specify concept hierarchies −

use hierarchy <hierarchy> for <attribute_or_dimension>

There are various kinds of hierarchies, such as −

set-grouping hierarchies

define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level3: {40, ..., 59} < level1: middle_aged
level4: {60, ..., 89} < level1: senior

schema hierarchies

define hierarchy time_hierarchy on date as [date, month, quarter, year]

operation-derived hierarchies

define hierarchy age_hierarchy for age on customer as:

{age_category(2), ..., age_category(6)} 
:= cluster(default, age, 6) < all(age)

rule-based hierarchies

define hierarchy profit_margin_hierarchy on item as:

level_1: low_profit_margin < level_0: all
   if (price - cost) < $40

level_1: medium_profit_margin < level_0: all
   if ((price - cost) > $40) and ((price - cost) ≤ $350)

level_1: high_profit_margin < level_0: all
   if (price - cost) > $350

Interestingness Measures Specification Syntax

The user can specify interestingness measures and thresholds with the statement −

with <interest_measure_name>  threshold = threshold_value

Example −

with support threshold = 0.03
with confidence threshold = 0.5

Pattern Presentation and Visualization Specification Syntax

We have a syntax that allows users to specify the display of discovered patterns in one or more forms −

display as <result_form>

Example −

display as table

Full Specification of DMQL

Suppose, as a marketing manager of a company, you would like to characterize the buying habits of customers who can purchase items priced at no less than $300, with respect to the customer's age, the type of item purchased, and the place where the item was purchased. You would like to know the percentage of customers having that characteristic. In particular, you are only interested in purchases made in Canada and paid for with an American Express credit card. You would like to view the resulting descriptions in the form of a table.

use database BoldElectronics_db
use hierarchy location_hierarchy for B.address
mine characteristics as customerPurchasing
analyze count%
in relevance to C.age,I.type,I.place_made
from customer C, purchase P,  item I, branch B, items_sold S  
where I.item_ID = S.item_ID and P.cust_ID = C.cust_ID and
P.method_paid = "AmEx" and B.address = "Canada" and I.price ≥ 300
with noise threshold = 4%
display as table

Data Mining Languages Standardization

Standardizing the data mining languages would serve the following purposes −

• It supports the systematic development of data mining solutions.

• It improves interoperability among multiple data mining systems and functions.

• It promotes education and rapid learning.

• It promotes the use of data mining systems in industry and society.

Data Mining – Classification & Prediction

There are two forms of data analysis that can be used to extract models describing important classes or to predict future data trends. These two forms are as follows −

• Classification

• Prediction

Classification models predict categorical class labels, whereas prediction models predict continuous-valued functions. For example, we can build a classification model to categorize bank loan applications as either safe or risky, or a prediction model to predict the expenditures in dollars of potential customers on computer equipment given their income and occupation.

Classification

Following are examples of cases where the data analysis task is classification −

• A bank loan officer wants to analyze the data in order to know which customers are risky and which are safe.

• A marketing manager at a company needs to analyze which customers with a given profile will buy a new computer.

In both of these examples, a model or classifier is constructed to predict the categorical labels. These labels are risky or safe for loan application data and yes or no for marketing data.

Prediction

Following are examples of cases where the data analysis task is prediction −

Suppose a marketing manager needs to predict how much a given customer will spend during a sale at his company. In this case we need to predict a numeric value, so the data analysis task is an example of numeric prediction. Here, a model or a predictor will be constructed that predicts a continuous-valued function or ordered value.

Regression analysis is a statistical methodology that is most often used for numeric prediction.

Working of the Classification 

Let us understand the working of classification with the help of the bank loan application discussed above. The data classification process involves two steps −

• Building the classifier or model

• Using the classifier for classification

Building the Classifier or Model

• This step is the learning step or the learning phase.

• In this step, the classification algorithms build the classifier.

• The classifier is built from the training set made up of database tuples and their associated class labels.

• Each tuple that constitutes the training set is referred to as a category or class. These tuples can also be referred to as samples, objects, or data points.

Using the Classifier for Classification

In this step, the classifier is used for classification. Here the test data is used to estimate the accuracy of the classification rules. If the accuracy is considered acceptable, the classification rules can be applied to new data tuples.

Issues in Classification and Prediction 

The major issue is preparing the data for classification and prediction. Preparing the data involves the following activities −

• Data Cleaning − Data cleaning involves removing noise and treating missing values. The noise is removed by applying smoothing techniques, and the problem of missing values is solved by replacing a missing value with the most commonly occurring value for that attribute.

• Relevance Analysis − The database may also contain irrelevant attributes. Correlation analysis is used to determine whether any two given attributes are related.

• Data Transformation and Reduction − The data can be transformed by any of the following methods:

1. Normalization − The data is transformed using normalization. Normalization involves scaling all values for a given attribute so that they fall within a small specified range. Normalization is used when neural networks or methods involving distance measurements are used in the learning step.

2. Generalization − The data can also be transformed by generalizing it to higher-level concepts. For this purpose we can use concept hierarchies.

Classification and Prediction Process Comparison

Here are the criteria for comparing the methods of classification and prediction −

• Accuracy − Accuracy of a classifier refers to its ability to predict the class label correctly; accuracy of a predictor refers to how well a given predictor can estimate the value of the predicted attribute for new data.

• Speed − This refers to the computational cost involved in generating and using the classifier or predictor.

• Robustness − This refers to the ability of the classifier or predictor to make correct predictions from given noisy data.

• Scalability − This refers to the ability to construct the classifier or predictor efficiently given a large amount of data.

• Interpretability − This refers to the extent to which the classifier or predictor can be understood.

Data Mining – Decision Tree Induction

A decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.

The advantages of using a decision tree are as follows –

• No domain awareness is needed.

• It is easy to learn.

A decision tree’s learning and classification steps are simple and quick.

Algorithm of Decision Tree Induction 

A computer researcher named J. Ross Quinlan developed a decision tree algorithm known as ID3 (Iterative Dichotomiser) in 1980. He later presented C4.5, the successor of ID3. ID3 and C4.5 adopt a greedy approach: there is no backtracking in these algorithms, and the trees are constructed in a top-down recursive divide-and-conquer manner.

Generating a decision tree from the training tuples of data partition D

Algorithm: Generate_decision_tree

Input:

Data partition D, which is a set of training tuples and their associated class labels.

attribute_list, the set of candidate attributes.

Attribute_selection_method, a procedure to determine the splitting criterion that best partitions the data tuples into individual classes. This criterion includes a splitting_attribute and possibly either a split-point or a splitting subset.

Output:

A Decision Tree

Method:

create a node N;

if all the tuples in D belong to the same class C, then
   return N as a leaf node labeled with class C;

if attribute_list is empty, then
   return N as a leaf node labeled with
   the majority class in D;   // majority voting

apply Attribute_selection_method(D, attribute_list)
   to find the best splitting_criterion;
label node N with splitting_criterion;

if splitting_attribute is discrete-valued and
   multiway splits are allowed, then   // not restricted to binary trees
   attribute_list = attribute_list - splitting_attribute;   // remove the splitting attribute

for each outcome j of splitting_criterion

   // partition the tuples and grow subtrees for each partition

   let Dj be the set of data tuples in D satisfying outcome j;   // a partition

   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by
      Generate_decision_tree(Dj, attribute_list) to node N;
end for
return N;
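In practice, decision trees are usually induced with a library rather than by hand-coding the loop above. The following is a minimal sketch using scikit-learn's DecisionTreeClassifier on its bundled Iris data set; note that scikit-learn implements an optimised CART-style algorithm rather than ID3/C4.5 exactly, although the entropy criterion mirrors ID3's information gain.

# Minimal decision tree induction sketch with scikit-learn (CART-style, not exactly ID3/C4.5).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)  # entropy mirrors ID3's information gain
tree.fit(X_train, y_train)

print("test accuracy:", tree.score(X_test, y_test))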

Tree Pruning

Tree pruning is performed in order to remove anomalies in the training data caused by noise or outliers. The pruned trees are smaller and less complex.

Approaches:

A tree can be pruned in two ways −

  • Pre-pruning − The tree is pruned by halting its construction early.
  • Post-pruning − This approach removes a sub-tree from a fully grown tree.

Cost Complexity

The cost complexity is measured by the following two parameters −

  • the number of leaves in the tree, and
  • the error rate of the tree.

Data Mining – Bayesian Classification

The Bayesian classification relies on the theorem of Bayes. The statistical classifiers are Bayesian classifiers. Class membership probabilities can be predicted by Bayesian classifiers, such as the probability that a given tuple corresponds to a particular class.

Bayes’ Theorem  

Bayes' Theorem is named after Thomas Bayes. There are two types of probabilities −

  • Posterior probability [P(H|X)]
  • Prior probability [P(H)]

where X is a data tuple and H is some hypothesis.

According to Bayes' Theorem,

P(H|X) = P(X|H) P(H) / P(X)
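A small worked example with invented numbers: suppose 30% of customers are high spenders (P(H) = 0.3), 80% of high spenders buy a computer (P(X|H) = 0.8), and 40% of all customers buy a computer (P(X) = 0.4). A minimal sketch of the calculation:

# Minimal worked example of Bayes' theorem, using invented probabilities.
p_h = 0.3           # prior P(H): the customer is a high spender
p_x_given_h = 0.8   # likelihood P(X|H): a high spender buys a computer
p_x = 0.4           # evidence P(X): any customer buys a computer

p_h_given_x = p_x_given_h * p_h / p_x   # posterior P(H|X)
print(round(p_h_given_x, 2))            # 0.6: a computer buyer is a high spender with probability 0.6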

Bayesian Belief Network

Bayesian Belief Networks specify joint conditional probability distributions. They are also known as Belief Networks, Bayesian Networks, or Probabilistic Networks. A Belief Network allows class conditional independencies to be defined between subsets of variables. It provides a graphical model of causal relationships on which learning can be performed. We can use a trained Bayesian Network for classification.

A Bayesian Belief Network has two components that define it.

  • Directed acyclic graph (DAG)
  • Set of conditional probability tables

Directed Acyclic Graph (DAG)

In mathematics, particularly graph theory, and in computer science, a directed acyclic graph is a directed graph with no directed cycles. It consists of vertices and edges (also called arcs), with each edge directed from one vertex to another, such that there is no way to start at any vertex v and follow a consistently directed sequence of edges that eventually loops back to v again.

Data Mining – Rule Based Classification

IF-THEN Rules

A rule-based classifier uses a set of IF-THEN rules for classification. A rule can be expressed in the following form −

IF condition THEN conclusion

Let us take into account Rule R1,

R1: IF age = kid AND student = yes 

THEN purchase_computer = yes

Points that should be noted −

  • The IF part of the rule is called the rule antecedent or precondition.
  • The THEN part of the rule is called the rule consequent.
  • The antecedent part of the condition consists of one or more attribute tests, which are logically ANDed. A small code sketch of rule R1 follows below.
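A minimal sketch of rule R1 expressed as code, with the attribute values taken from the rule above and the behaviour for non-matching tuples invented for illustration.

# Minimal sketch of rule R1: IF age = kid AND student = yes THEN purchase_computer = yes.
def rule_r1(age, student):
    if age == "kid" and student == "yes":   # rule antecedent: attribute tests ANDed together
        return "purchase_computer = yes"    # rule consequent
    return None                             # the rule does not fire for this tuple

print(rule_r1("kid", "yes"))     # rule fires
print(rule_r1("senior", "yes"))  # rule does not fire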

Rule Extraction

Here we will learn how to build a rule-based classifier by extracting IF-THEN rules from a decision tree.

Points to remember −

To extract a rule from a decision tree −

  • One rule is created for each path from the root to a leaf node.
  • Each splitting criterion along a given path is logically ANDed to form the rule antecedent.
  • The leaf node holds the class prediction, forming the rule consequent.

Rule Induction Using Sequential Covering Algorithm

The Sequential Covering Algorithm can be used to extract IF-THEN rules from the training data. We do not need to generate a decision tree first. In this algorithm, each rule for a given class covers many of the tuples of that class.

Some of the sequential covering algorithms are AQ, CN2, and RIPPER. As per the general strategy, the rules are learned one at a time. Each time a rule is learned, the tuples covered by the rule are removed, and the process continues for the remaining tuples. By contrast, the path to each leaf in a decision tree corresponds to a rule, so decision tree induction can be considered as learning a set of rules simultaneously.

The following is a sequential learning algorithm where rules are learned for one class at a time. When learning a rule from a class Ci, we want the rule to cover all the tuples from class Ci only and no tuples from any other class.

Algorithm for Sequential Covering

Input: 

D, a data set of class-labelled tuples,

Att_val, the set of all attributes and their possible values.

Output:

A set of IF-THEN rules.

Method:

Set_rule = { };   // the initial set of learned rules is empty

for each class c do

   repeat
      Rule = Learn_One_Rule(D, Att_val, c);
      remove the tuples covered by Rule from D;
      Set_rule = Set_rule + Rule;   // add the new rule to the rule set
   until termination condition;

end for
return Set_rule;

Rule Pruning

The rule is pruned for the following reason −

The quality assessment is made on the original set of training data. The rule may perform well on training data but less well on subsequent data. That is why rule pruning is required.

The rule is pruned by removing a conjunct. Rule R is pruned if the pruned version of R has greater quality, as assessed on an independent set of tuples.

FOIL is one of the simple and effective methods for rule pruning. For a given rule R,

FOIL_Prune = (pos - neg) / (pos + neg)

where pos and neg are the number of positive and negative tuples covered by R, respectively.
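A one-line sketch of the measure, with invented counts for a hypothetical rule R:

# Minimal sketch of the FOIL_Prune measure for a rule R.
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)   # higher is better; prune if the pruned rule scores higher

print(foil_prune(pos=40, neg=10))      # 0.6 for a rule covering 40 positive and 10 negative tuples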

Miscellaneous Classification Methods

Here we will discuss other classification methods, such as Genetic Algorithms, the Rough Set Approach, and the Fuzzy Set Approach.

Genetic Algorithms

The idea of the genetic algorithm is derived from natural evolution. In a genetic algorithm, an initial population is created first. This initial population consists of randomly generated rules. We can represent each rule by a string of bits.

For example, suppose that in a given training set the samples are described by two Boolean attributes, A1 and A2, and that this training set contains two classes, C1 and C2.

The rule IF A1 AND NOT A2 THEN C2 can be encoded as the bit string 100, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class.

Similarly, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded as 001.

Points to recall-

• Based on the notion of the survival of the fittest, a new population is formed that consists of the fittest rules in the current population, as well as offspring of these rules.

• The fitness of a rule is assessed by its classification accuracy on a set of training samples.

• Genetic operators such as crossover and mutation are applied to create offspring.

• In crossover, substrings from pairs of rules are swapped to form new pairs of rules; in mutation, randomly selected bits in a rule's string are inverted. A small sketch of these operators follows below.
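A minimal sketch of the crossover and mutation operators on the 3-bit rule encodings from the example above; the crossover point and mutation position are chosen arbitrarily for illustration.

# Minimal sketch of crossover and mutation on 3-bit rule encodings such as "100" and "001".
def crossover(rule_a, rule_b, point):
    # Swap the substrings after the crossover point to form two offspring rules.
    return rule_a[:point] + rule_b[point:], rule_b[:point] + rule_a[point:]

def mutate(rule, position):
    # Flip a single bit (the position is passed in explicitly instead of being chosen at random).
    flipped = "1" if rule[position] == "0" else "0"
    return rule[:position] + flipped + rule[position + 1:]

print(crossover("100", "001", point=1))   # ('101', '000')
print(mutate("100", position=2))          # '101'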

Rough Set Approach

We can use the rough set approach to discover structural relationships within imprecise and noisy data.

Remember that this approach can only be applied to discrete-valued attributes. Therefore, continuous-valued attributes must be discretized before it is used.

Rough Set Theory is based on the establishment of equivalence classes within the given training data. The tuples that form an equivalence class are indiscernible, which means the samples are identical with respect to the attributes describing the data.

In the given real-world data, there are some classes that cannot be distinguished in terms of the available attributes. We can use rough sets to roughly define such classes.

Fuzzy Set Approaches

Fuzzy Set Theory is also called Possibility Theory. This theory was proposed by Lotfi Zadeh in 1965 as an alternative to two-valued logic and probability theory. It allows us to work at a high level of abstraction and provides a means of dealing with imprecise measurements of data.

Fuzzy set theory also allows us to deal with vague or inexact facts. For example, membership of a set of high incomes is inexact (if $60,000 is high, then what about $58,000 and $56,000?). Unlike the traditional CRISP set, where an element either belongs to S or to its complement, in fuzzy set theory an element can belong to more than one fuzzy set. A small sketch of a fuzzy membership function follows below.
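A minimal sketch of a fuzzy membership function for a "high income" set; the $50,000–$60,000 transition band is invented for illustration.

# Minimal sketch of a fuzzy membership function for the set "high income".
def high_income_membership(salary):
    # Full membership above $60k, no membership below $50k, linear ramp in between (assumed band).
    if salary >= 60_000:
        return 1.0
    if salary <= 50_000:
        return 0.0
    return (salary - 50_000) / 10_000

for s in (56_000, 58_000, 60_000):
    print(s, high_income_membership(s))   # 0.6, 0.8, 1.0: degrees of membership, not a crisp yes/no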

Data Mining – Cluster Analysis

A cluster is a group of objects that belong to the same class. In other words, similar objects are grouped in one cluster and dissimilar objects are grouped in another cluster.

What is Clustering?

Clustering is the process of making a group of abstract objects into classes of similar objects.

Points to be remembered

• A cluster of data objects can be treated as one group.

• While doing cluster analysis, we first partition the set of data into groups based on data similarity and then assign labels to the groups.

• The main advantage of clustering over classification is that it is adaptable to changes and helps single out useful features that distinguish different groups.

Applications

• Cluster analysis is commonly used in several fields such as market analysis, pattern recognition, statistical analysis, and image analysis.

• Clustering can also help marketers discover distinct groups in their customer base, and they can characterize their customer segments based on purchasing patterns.

• In the field of biology, it can be used to derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into structures inherent in populations.

• Clustering also helps in identifying areas of similar land use in an earth observation database, and in identifying groups of houses in a city according to house type, value, and geographic location.

• Clustering also helps in classifying documents on the web for information discovery.

• Clustering is also used for outlier identification applications such as the detection of credit card fraud.

• As a data mining function, cluster analysis serves as a tool to gain insight into the distribution of data and to observe the characteristics of each cluster.

Clustering Requirements in Data Mining

The subsequent pointers below shed light on why clustering is needed in data mining −

• Scalability − Highly scalable clustering algorithms are needed to deal with large databases.

• Ability to deal with different kinds of attributes − Algorithms should be capable of being applied to any kind of data, such as interval-based (numeric) data, categorical data, and binary data.

• Discovering the clusters with attribute shape − The clustering algorithm should be able to detect arbitrary shape clusters. They should not be bound only to distance measurements that tend to find small spherical clusters.

• High dimensionality − The clustering algorithm should not only be able to handle low dimensional data, but also high dimensional space.

• Interpretability − The results of clustering should be interpretable, comprehensible, and usable.

Clustering Methods

Clustering approaches can be categorized into the following groups −

• Partitioning method

• Hierarchical method

• Density-based method

• Grid-based method

• Model-based method

• Constraint-based method

Partitioning Method

Suppose we are given a database of ‘n’ objects, and a partitioning method constructs ‘k’ partitions of the data, where k ≤ n. Each partition represents a cluster. This means the method classifies the data into k groups that satisfy the following requirements −

• Each group must contain at least one object.

• Each object must belong to exactly one group.

Points to be noted –

  • For a given number of partitions (say k), the partitioning method creates an initial partitioning.
  • It then uses an iterative relocation technique to improve the partitioning by moving objects from one group to another (see the k-means sketch below for a concrete example).
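
The best-known partitioning algorithm is k-means. The following minimal sketch assumes the scikit-learn library (not something this tutorial prescribes) and partitions six 2-D points into k = 2 clusters:

import numpy as np
from sklearn.cluster import KMeans

# Small 2-D dataset with two visually separated groups.
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.1]])

# Partition the n = 6 objects into k = 2 clusters (k <= n).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("labels:", kmeans.labels_)
print("centroids:", kmeans.cluster_centers_)

Each object ends up in exactly one group, and the centroids are iteratively relocated until the partitioning stops improving.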

Hierarchical Methods

This method creates a hierarchical decomposition of the given set of data objects. We can classify hierarchical methods on the basis of how the hierarchical decomposition is formed. There are two approaches −

• Agglomerative Approach

• Divisive approach

Agglomerative Approach

This approach is also known as the bottom-up approach. Here, we start with each object forming a separate group. It keeps on merging the objects or groups that are close to one another, and continues doing so until all of the groups are merged into one, or until the termination condition holds.
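
A minimal bottom-up clustering sketch, assuming scikit-learn's AgglomerativeClustering (a library choice for illustration, not part of this tutorial):

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [5.0, 5.1], [5.2, 4.9]])

# Bottom-up (agglomerative) clustering: each point starts as its own cluster
# and the closest clusters are repeatedly merged until n_clusters remain.
model = AgglomerativeClustering(n_clusters=2, linkage="average").fit(X)
print("cluster labels:", model.labels_)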

Divisive Approach

This approach is also known as the top-down approach. Here, we start with all the objects in the same cluster. In each successive iteration, a cluster is split into smaller clusters. This goes on until each object is in its own cluster or the termination condition holds. This method is rigid, i.e. once a merge or split is done, it can never be undone.

Approaches to Improve Quality of Hierarchical Clustering

Here are the two methods used to boost the efficiency of hierarchical clustering −

  • Perform careful analysis of object linkages at each hierarchical partitioning.
  • Integrate hierarchical agglomeration by first using a hierarchical agglomerative algorithm to group objects into micro-clusters, and then performing macro-clustering on the micro-clusters.

Density-based Method

This method is based on the notion of density. The basic idea is to keep growing a cluster as long as the density in its neighborhood exceeds some threshold, i.e. for each data point within a given cluster, the neighborhood of a given radius must contain at least a minimum number of points.
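
DBSCAN is a widely used density-based algorithm. The sketch below (again assuming scikit-learn, with invented data) grows clusters only where the neighborhood of radius eps contains at least min_samples points and marks isolated points as noise:

import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point that should be flagged as noise.
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],
              [6.0, 6.0], [6.1, 5.9], [5.9, 6.1],
              [3.0, 9.0]])

# eps is the neighborhood radius; min_samples is the minimum number of points
# required inside that radius for the cluster to keep growing.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)
print(labels)   # -1 marks the noise point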

Grid-based Method

In this method, the objects together form a grid. The object space is quantized into a finite number of cells that form a grid structure.

Advantages

• Fast processing time is the key benefit of this approach.

• The processing time depends only on the number of cells in each dimension of the quantized space, not on the number of data objects.
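
A hedged illustration of the grid idea using NumPy (the data are random placeholders): the object space is quantized into cells, and further processing works on the cell counts rather than on the individual objects:

import numpy as np

# Random 2-D objects in the unit square.
rng = np.random.default_rng(0)
points = rng.random((200, 2))

# Quantize the object space into a 4 x 4 grid and count objects per cell.
counts, x_edges, y_edges = np.histogram2d(points[:, 0], points[:, 1], bins=4)
print(counts)   # dense neighboring cells can then be merged into clusters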

Model-based Method

In this method, a model is hypothesized for each cluster to find the best fit of the data to the given model. It also provides a way to automatically determine the number of clusters based on standard statistics, taking outliers or noise into account, and therefore yields robust clustering methods.
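
One common model-based technique is to fit a mixture of Gaussians and let a model-selection score suggest the number of clusters. The sketch below assumes scikit-learn and uses invented data; it is an illustration of the idea, not the only model-based method:

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two Gaussian blobs standing in for the data to be clustered.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(6, 1, (50, 2))])

# Pick the number of clusters that minimizes the BIC score.
scores = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
          for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
print(scores, "-> chosen k =", best_k)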

Constraint-based Method

In this method, clustering is performed by incorporating user- or application-oriented constraints. A constraint refers to the user's expectations or the properties of the desired clustering results. Constraints provide an interactive way of communicating with the clustering process. They can be specified by the user or by the application requirement.

Data Mining – Mining Text Data

Text databases consist of a huge collection of documents. They collect this information from several sources such as news articles, books, digital libraries, e-mail messages, web pages, and so on. Due to the increase in the amount of information generated, text databases are growing rapidly. In many text databases, the data is semi-structured.

For example, a document may contain a few structured fields, such as title, author, publishing date, and so on. But along with the structured data, the document also contains unstructured text components, such as the abstract and contents. Without knowing what the documents contain, it is difficult to formulate effective queries for analyzing and extracting useful information from the data. Users require tools to compare documents and rank their importance and relevance. Text mining has therefore become a popular and essential theme in data mining.

Information Retrieval

Information retrieval deals with retrieving information from a large number of text-based documents. Some of the database systems are not usually present in information retrieval systems because both handle different kinds of data. Examples of information retrieval systems include −

• Online library catalogue systems

• Online document management systems

• Web Search Services and so on.

The main problem in an information retrieval system is to locate relevant documents in a document collection based on a user's query. This kind of user query consists of some keywords describing an information need.

In such search problems, the user takes the initiative to pull relevant information out of a collection. This is appropriate when the user has an ad-hoc, short-term information need. But if the user has a long-term information need, the retrieval system can also take the initiative to push any newly arrived information item to the user.

This method of access to information is called Information Filtering. The related systems are known as Filtering Systems or Recommender Systems.

Basic Measures for Text Retrieval

We need to check the accuracy of a system when it retrieves a number of documents in response to a user's query. Let the set of documents relevant to a query be denoted as {Relevant} and the set of retrieved documents as {Retrieved}. The set of documents that are both relevant and retrieved can then be denoted as {Relevant} ∩ {Retrieved}.

There are three main metrics for determining the efficiency of text retrieval −

• Precision

• Recall 

• F-score value

Precision

Precision is the percentage of retrieved documents that are actually relevant to the query. Precision can be defined as −

Precision= |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

Recall

Recall is the percentage of documents that are relevant to the query and were actually retrieved. Recall is defined as −

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|

F-score

F-score is the commonly used trade-off between the two, since an information retrieval system often needs to trade precision for recall or vice versa. It is defined as the harmonic mean of precision and recall:

F-score = (2 × Precision × Recall) / (Precision + Recall)
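
All three measures can be computed directly from the two sets. In the sketch below, the document identifiers are invented placeholders:

def retrieval_metrics(relevant, retrieved):
    # Precision, recall, and F-score for a set-based retrieval evaluation.
    hits = len(relevant & retrieved)              # |{Relevant} ∩ {Retrieved}|
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

relevant = {"d1", "d2", "d3", "d4"}
retrieved = {"d2", "d3", "d5"}
print(retrieval_metrics(relevant, retrieved))   # roughly (0.67, 0.5, 0.57)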

The World Wide Web includes a large array of material that offers a rich source for data mining.

Data Mining – Mining World Wide Web

Challenges in Web Mining

The web poses great challenges for resource and knowledge discovery, based on the following observations −

• The web is too huge − The size of the web is very large and rapidly increasing. This implies that the web is too huge for data warehousing and data mining.

• Complexity of web pages − Web pages do not have a unifying structure. They are far more complex than traditional text documents, and there is a huge number of documents in the digital library of the web.

• The web is a dynamic information source − The information on the web is rapidly updated. Data such as news, stock prices, weather, sports, and shopping are updated regularly.

• Diversity of user communities − The user community on the web is rapidly expanding. These users have different backgrounds, interests, and usage purposes. More than 100 million workstations are connected to the web, and the number is still increasing rapidly.

• Relevancy of information − It is considered that a particular person is generally interested in only a small portion of the web, while the rest of the web contains information that is not relevant to the user and may swamp the desired results.

Mining Web page layout structure

The basic structure of a web page is based on the Document Object Model (DOM). The DOM structure refers to a tree-like structure in which the HTML tags on the page correspond to nodes in the DOM tree. We can segment a web page by using predefined tags in HTML. However, the HTML syntax is flexible, and many web pages do not follow the W3C specifications. Failure to follow the W3C specifications may cause errors in the DOM tree structure.

The DOM structure was initially introduced for presentation in the browser, not for describing the semantic structure of the web page. The DOM structure therefore cannot correctly identify the semantic relationships between the different parts of a web page.
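
To see the DOM tree idea in practice, the sketch below uses Python's standard html.parser on an invented HTML snippet and prints each tag indented by its depth in the tree; it illustrates the tag structure only, not the VIPS algorithm described next:

from html.parser import HTMLParser

class TagTreePrinter(HTMLParser):
    # Print HTML tags with indentation that mirrors the DOM tree depth.
    def __init__(self):
        super().__init__()
        self.depth = 0
    def handle_starttag(self, tag, attrs):
        print("  " * self.depth + tag)
        self.depth += 1
    def handle_endtag(self, tag):
        self.depth -= 1

page = "<html><body><h1>Title</h1><div><p>Some text</p></div></body></html>"
TagTreePrinter().feed(page)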

Vision-based page segmentation (VIPS)

• The purpose of VIPS is to extract the semantic structure of a web page based on its visual presentation.

• Such a semantic structure corresponds to the structure of the tree. Each node corresponds to a block in this tree.

• Each node is assigned a value, called the Degree of Coherence. This value indicates how coherent the content within the block is, based on visual perception.

• The VIPS algorithm first extracts all the suitable blocks from the HTML DOM tree. After that, it finds the separators between these blocks.

• Here, separators refer to the horizontal or vertical lines on a web page that visually do not cross any block.

• The semantic structure of the web page is then constructed on the basis of these blocks.

Data mining is widely used in a number of areas. Many commercial data mining systems are available today, yet there are still many challenges in this field. In this section, we will discuss the applications and trends of data mining.

Data Mining Applications

Here is the list of areas where data mining is widely used −

• Financial data analysis

• Retail industry

• Telecommunication industry

• Biological data analysis

• Other scientific applications

• Intrusion detection

Financial Data Analysis

Financial data in the banking and financial industry is generally reliable and of high quality, which facilitates systematic data analysis and data mining. Some typical cases are as follows −

• Design and construction of data warehouses for multidimensional data analysis and data mining.

• Loan payment prediction and customer credit policy analysis.

• Classification and clustering of customers for targeted marketing.

• Detection of money laundering and other financial crimes.

Retail Industry

Data mining has a great deal of application in the retail industry because it collects a large amount of data on sales, customer purchasing history, goods transportation, consumption, and services. It is natural that the quantity of data collected will continue to expand rapidly because of the increasing ease, availability, and popularity of the web.

• Design and construction of data warehouses based on the benefits of data mining.

• Multidimensional analysis of sales, customers, products, time, and region.

• Analysis of the effectiveness of sales campaigns.

• Customer Retention.

• Product recommendation and cross-referencing of items.

Telecommunication Industry

Today the telecommunications industry is one of the most rapidly growing industries, providing various services such as fax, pager, cellular phone, internet messenger, images, e-mail, and web data transmission. Owing to the development of new computer and communication technologies, the industry is expanding rapidly. This is why data mining has become very important in helping to understand the business.

Data mining in the telecommunications industry helps in identifying telecommunication patterns, catching fraudulent activities, making better use of resources, and improving the quality of service. Here is a list of examples where data mining improves telecommunication services −

• Multidimensional analysis of telecommunications information.

• Fraudulent pattern analysis.

• Identification of unusual patterns.

• Multidimensional correlation and study of sequential patterns.

• Wireless telephone networks.

• Use of visualization techniques in the study of telecommunications results.

Biological Data Analysis

In recent years, we have seen tremendous growth in the field of biology, including genomics, proteomics, functional genomics, and biomedical research. Biological data mining is a very important part of bioinformatics. The following are areas in which data mining contributes to biological data analysis −

• Semantic integration of heterogeneous distributed genomics and proteomics databases.

• Alignment, indexing, similarity search, and comparative analysis of multiple nucleotide sequences.

• Identification of structural patterns and study of genetic networks and protein pathways. 

• Association and path analysis.

• Visualization tools in genetic data analysis.

Intrusion Detection

Intrusion refers to any kind of action that threatens the integrity, confidentiality, or availability of network resources. Security has become a major issue in this world of connectivity. With the increased usage of the internet and the availability of tools and tricks for intruding and attacking networks, intrusion detection has become a critical component of network administration. Here is the list of areas in which data mining technology may be applied for intrusion detection −

• Development of data mining algorithms for intrusion detection.

• Association and correlation analysis, and aggregation to help select and build discriminating attributes.

• Analysis of stream data.

• Distributed data mining activities.

• Tools for visualization and query.

Data Mining System Products

There are many data mining system products and domain-specific data mining applications. New data mining systems and applications are being added to the previous systems, and efforts are being made to standardize data mining languages.

Choosing a Data Mining System

The selection of a data mining system depends on the following features −

• Data types − The data mining system may handle formatted text, record-based data, and relational data. The data could also be in ASCII text, relational database data, or data warehouse data. Therefore, we should check what exact format the data mining system can handle.

• System issues − We must consider the compatibility of a data mining system with different operating systems. One data mining system may run on only one operating system or on several. There are also data mining systems that provide web-based user interfaces and allow XML data as input.

• Data Sources − Data sources refer to the data types in which the data mining system runs. Some data mining programs can operate only on ASCII text files, while others on several relational sources.

• Data mining functions and methodologies − Some data mining systems provide only one data mining function, such as classification, while others provide multiple functions such as concept description, discovery-driven OLAP analysis, association mining, linkage analysis, classification, prediction, clustering, outlier analysis, statistical analysis, similarity search, and more.

• Coupling data mining with database or data warehouse systems − Data mining systems need to be coupled with a database or a data warehouse system. The coupled components are integrated into a uniform information processing environment. The types of coupling are listed below −

o No coupling

o Loose Coupling 

o Semi-tight coupling

o Tight coupling

• Scalability − There are two scalability issues in data mining −

o Row (database size) scalability − A data mining system is considered row scalable if, when the number of rows is enlarged 10 times, it takes no more than 10 times as long to execute a query.

o Column (dimension) scalability − A data mining system is considered column scalable if the mining query execution time increases linearly with the number of columns.

Data mining concepts are still evolving, and here are the latest trends in this field −

• Flexible and immersive tools for data mining.

• Standardization of the query language for data mining.

• Visual data mining activities.

• Application exploration.

• Preservation of privacy and confidentiality of information in data mining.

• Incorporating data mining with information networks, data management systems, and online database systems.

• Mining of biological data.

Data Mining – Themes

Theoretical Foundations of Data Mining

The theoretical foundations of data mining include the following concepts −

Data Reduction − The basic idea of this theory is to reduce the data representation, trading accuracy for speed in response to the need to obtain quick approximate answers to queries on very large databases. Some of the data reduction techniques are as follows (see the SVD sketch after this list) −

• Regression

• Wavelets

• Log-linear models

• Clustering

• Sampling

• Singular value decomposition

• Histograms

• Construction of index trees
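
As one concrete example of data reduction, singular value decomposition can compress a numeric table into a handful of components. The sketch below uses NumPy on random placeholder data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 8))          # 100 records, 8 numeric attributes

# Keep only the top-2 singular components as a reduced representation.
U, s, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
X_reduced = U[:, :2] * s[:2]      # 100 x 2 approximation of the original data
print(X.shape, "->", X_reduced.shape)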

Data Compression − The basic idea of this theory is to compress the given data by encoding it in terms of the following −

• Clusters

• Bits

• Association rules

• Decision Trees

Pattern Detection − The basic idea of this theory is to discover patterns occurring in the database. Following are the areas that contribute to this theory −

• Machine Learning  

• Neural networks

• Association mining

• Sequential pattern matching

• Clustering

Visual Data Mining

Visual data mining uses data and/or knowledge visualization techniques to discover implicit knowledge in large data sets. It can be viewed as an integration of the following two disciplines −

• Visualization of data

• Data Mining   

Visual data mining is tightly connected to the following −

• Computer graphics

• Multimedia systems

• Human-computer interaction

• Pattern recognition

• High-performance computing

Generally, data visualization and data mining can be integrated in the following ways −

Data Visualization − Data in a database or data warehouse can be presented in a variety of visual ways described below −

• Boxplots 

• 3-D cubes

• Distribution charts of data

• Curves

• Surfaces

• Link graphs, etc.

Data Mining Result Visualization − This is the presentation of data mining results in visual form. These visual forms could be scatter plots, box plots, etc.

Data Mining Process Visualization − This presents the several processes of data mining in visual form. It helps users see how the data is extracted, and from which database or data warehouse it is cleaned, integrated, preprocessed, and mined.

Audio Data Mining

Audio data mining uses audio signals to indicate patterns in the data or the features of data mining results. By transforming patterns into sound and music, we can listen to pitches and tunes, instead of watching pictures, in order to identify anything interesting.

Data Mining and Collaborative Filtering

Consumers today encounter a wide variety of goods and services while shopping. During live customer transactions, a recommender system helps the consumer by making product recommendations. The collaborative filtering approach is generally used for recommending products to customers. These recommendations are based on the opinions of other customers.
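
A minimal sketch of user-based collaborative filtering, with an invented ratings matrix: an unseen item is scored for a user by weighting other users' ratings with cosine similarity.

import numpy as np

# Rows = users, columns = items; 0 means "not yet rated".
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 0, 1],
    [1, 1, 5, 4],
], dtype=float)

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

target_user, target_item = 0, 2   # predict user 0's rating for item 2
others = [u for u in range(len(ratings))
          if u != target_user and ratings[u, target_item] > 0]
weights = np.array([cosine(ratings[target_user], ratings[u]) for u in others])
neighbour_ratings = np.array([ratings[u, target_item] for u in others])
prediction = float(weights @ neighbour_ratings / weights.sum())
print("predicted rating:", round(prediction, 2))

Real recommender systems also mean-center the ratings and restrict the neighborhood to the most similar users, but the weighting idea is the same.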

Hope this tutorial was helpful in understanding the basics of data mining. For a more detailed course experience, visit Great Learning Academy, where you will find elaborate courses on Data Science for free.
