What Is Clustering?

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar, while data points in different clusters are very different.

What Is K-Means Clustering?

K-Means is one of the most popular unsupervised machine learning algorithms used for solving clustering problems. It segregates unlabeled data into groups, called clusters, based on similar features and common patterns.

Algorithm Steps of K-Means

The working of the K-Means algorithm is explained in the steps below:

Step-1: Select the value of K to decide the number of clusters to be formed.

Step-2: Select K random points, which will act as the initial centroids.

Step-3: Assign each data point to its nearest centroid, based on its distance from each centroid. This forms the K clusters.

Step-4: Place a new centroid for each cluster, computed as the mean of the points assigned to that cluster.

Step-5: Repeat Step-3, reassigning each data point to the new closest centroid.

Step-6: If any reassignment occurred, go to Step-4; otherwise go to Step-7.

Step-7: Finish.
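The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than a production implementation: the function name kmeans, the max_iter and seed parameters, and the sample points are all assumptions made for the example, and the initial centroids are simply drawn from the data points.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Step 2: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(max_iter):
        # Steps 3 and 5: assign every point to its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 6: stop once no point changes cluster.
        if new_assignments == assignments:
            break
        assignments = new_assignments

        # Step 4: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))

    return centroids, assignments


# Made-up 2-D points forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centroids, assignments = kmeans(points, k=2)
print(assignments)
```

Which cluster gets index 0 and which gets index 1 depends on the random initial centroids, so only the grouping itself is meaningful, not the label numbers.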

Applications of K-Means Clustering:

  • Academic performance:

Based on their scores, students are categorized into grades such as A, B, or C.

  • Diagnostic systems:

The medical profession uses k-means in creating smarter medical decision support systems, especially in the treatment of liver ailments.

  • Search engines:

Clustering forms a backbone of search engines. When a search is performed, the search results need to be grouped, and search engines very often use clustering to do this.

  • Wireless sensor networks:

The clustering algorithm is used to find the cluster heads, which collect all the data in their respective clusters.

Use Cases in the Security Domain


An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management (SIEM) system. Anomaly detection is one approach to intrusion detection. Current anomaly detection is often associated with high false alarm rates and only moderate accuracy and detection rates, because it is unable to detect all types of attacks correctly.

To overcome this problem, K-Means clustering can be used to group the data into corresponding clusters before a classifier is applied, with the aim of keeping the false alarm rate reasonable. This approach has resulted in high accuracy and good detection rates, but with a moderate false alarm rate on novel attacks.
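The cluster-then-classify idea can be sketched roughly as below. This is a simplified illustration, not the method of any particular study: the centroids are assumed to come from an earlier K-Means run, the two features (connection duration and kilobytes sent) and every value are made up, and the "classifier" is just a majority vote over labeled records in each cluster.

```python
import math
from collections import Counter

# Hypothetical centroids from a previous K-Means run on (duration_s, kilobytes_sent).
CENTROIDS = [(1.0, 2.0), (30.0, 500.0)]


def nearest_cluster(record):
    """Return the index of the centroid closest to `record`."""
    return min(range(len(CENTROIDS)), key=lambda c: math.dist(record, CENTROIDS[c]))


def label_clusters(labeled_records):
    """Give each cluster the most common label among its labeled records."""
    votes = {}
    for record, label in labeled_records:
        votes.setdefault(nearest_cluster(record), Counter())[label] += 1
    return {cluster: counts.most_common(1)[0][0] for cluster, counts in votes.items()}


def classify(record, cluster_labels):
    """Classify a new record by the label of the cluster it falls into."""
    return cluster_labels.get(nearest_cluster(record), "unknown")


# Made-up labeled traffic records, for illustration only.
training = [
    ((0.8, 1.5), "normal"),
    ((1.2, 2.5), "normal"),
    ((28.0, 480.0), "attack"),
    ((33.0, 520.0), "attack"),
]
cluster_labels = label_clusters(training)
print(classify((1.1, 2.2), cluster_labels))
print(classify((31.0, 510.0), cluster_labels))
```

In practice the per-cluster step would usually be a real classifier trained on the records of that cluster; the majority vote here only keeps the sketch short.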


Malware detection refers to the process of detecting the presence of malware on a host system or of determining whether a specific program is malicious or benign. Malware detection techniques play a vital role in detecting malware attacks that can have a high impact on the cyber world. Using clustering, unsupervised machine learning can detect a malware attack by identifying the behavior of the malware.

A clustering detection model can use the K-Means approach to detect malware behavior based on the features of the malware. Clustering techniques that use unsupervised machine learning algorithms play an important role in grouping similar malware characteristics by studying malware behavior. As a result, the model is capable of clustering normal and suspicious data into two separate groups with a high detection rate, reported at more than 90 percent accuracy.


Electronic mail (email) has become essential for Internet users. Unwanted emails are known as spam. These emails are sent in bulk to a large number of recipients. The increased volume of spam makes maintaining an email inbox a common problem. Spam is a major issue for the Internet community because it wastes resources and also pollutes our environment. To prevent these adverse effects, spam filtering is an essential task.

K-Means clustering is an effective way of identifying spam. It works by looking at the different sections of the email (header, sender, and content). The emails are then grouped together, and these groups can be classified to identify which are spam. Including clustering in the classification process has been reported to improve the accuracy of the filter to 97%.

Analyzing Logs from Proxy Server and Captive Portal Using K-Means Clustering Algorithm

Traffic on the World Wide Web is rapidly increasing, and an enormous amount of data is generated by users' various interactions with websites. Thus, web data has become one of the most valuable resources for information retrieval and knowledge discovery. The study used logs from the Proxy Server and Captive Portal database and applied Web Usage Mining to discover useful and interesting patterns in the web data. Moreover, the k-means clustering algorithm was used to group user access patterns, specifically by the number of user sessions and the websites accessed by the network users. Based on the results, it was found that, most of the time, users are heavily engaged in using the internet.

1. Operational Framework

The framework consists of five stages: 1) data collection, 2) preprocessing, 3) data transformation, 4) pattern discovery, and 5) pattern analysis.

1.1. Data Collection

Data collection is the first step in the web usage mining process. Proxy servers are employed to improve navigation speed through caching, and they collect data from users accessing large groups of web servers. The web logs from the proxy server contain the actual HTTP requests from multiple clients to multiple web servers.

1.2. Data Preprocessing

The purpose of preprocessing is to transform the unstructured raw data into a set of user profiles. It has three major tasks: 1) data cleaning, 2) user identification, and 3) session identification. Data cleaning is the removal of irrelevant data. User identification determines which user made each session, while session identification delimits sessions using the login and logoff activities of the users.
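The three preprocessing tasks can be sketched as below. The log record layout (timestamp, user, action, URL), the list of irrelevant file extensions, and all sample values are assumptions made for illustration; real proxy server and captive portal logs will look different.

```python
from collections import defaultdict

# Hypothetical log records: (timestamp_seconds, user, action, url).
RAW_LOG = [
    (100, "alice", "login", ""),
    (105, "alice", "get", "http://example.com/index.html"),
    (106, "alice", "get", "http://example.com/logo.png"),
    (110, "bob", "login", ""),
    (120, "bob", "get", "http://example.org/news"),
    (200, "alice", "logoff", ""),
    (250, "bob", "logoff", ""),
]

IRRELEVANT_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")  # assumed list


def clean(records):
    """Data cleaning: drop requests for images, stylesheets, and scripts."""
    return [r for r in records if not r[3].lower().endswith(IRRELEVANT_SUFFIXES)]


def identify_sessions(records):
    """User and session identification: pair each login with the next logoff."""
    sessions = defaultdict(list)  # user -> list of finished sessions
    open_sessions = {}            # user -> session currently in progress
    for timestamp, user, action, url in sorted(records):
        if action == "login":
            open_sessions[user] = {"start": timestamp, "urls": []}
        elif action == "logoff" and user in open_sessions:
            session = open_sessions.pop(user)
            session["end"] = timestamp
            sessions[user].append(session)
        elif user in open_sessions:
            open_sessions[user]["urls"].append(url)
    return dict(sessions)


sessions = identify_sessions(clean(RAW_LOG))
print(sessions)
```

Requests made outside a login/logoff pair are ignored in this sketch; whether that is the right choice depends on how the captive portal records activity.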

1.3. Data Transformation

In this stage, data from the user sessions database is extracted and transformed into a comma-separated values (CSV) file. This file contains the dataset needed for discovering session patterns. The CSV file is then used by MATLAB to generate the clusters using k-means.
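A sketch of this transformation step is shown below. It assumes the per-user sessions have already been identified and that the two features of interest are the number of sessions and the number of distinct websites accessed; the sessions structure and the column names are hypothetical.

```python
import csv
import io
from urllib.parse import urlparse

# Hypothetical output of session identification: user -> list of sessions.
sessions = {
    "alice": [{"urls": ["http://example.com/a", "http://example.com/b"]},
              {"urls": ["http://example.org/"]}],
    "bob": [{"urls": ["http://example.org/news"]}],
}


def sessions_to_csv(sessions, out):
    """Write one row per user: user, session count, distinct websites accessed."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["user", "num_sessions", "num_websites"])
    for user in sorted(sessions):
        hosts = {urlparse(u).netloc for s in sessions[user] for u in s["urls"]}
        writer.writerow([user, len(sessions[user]), len(hosts)])


buffer = io.StringIO()  # stands in for a real file opened with newline=""
sessions_to_csv(sessions, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

The resulting file has one numeric row per user, which is the shape a k-means routine expects as input.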

1.4. Pattern Discovery

Once all user transactions have been identified, a variety of data mining techniques are applied for pattern discovery in web usage mining, and one of those is clustering. Clustering techniques are widely used in web usage mining (WUM) to capture similar trends and interests among users accessing a website. Clustering aims to divide a data set into groups, or clusters, such that inter-cluster similarities are minimized while intra-cluster similarities are maximized. The knowledge discovered from clustering may be used to analyze the session patterns of the users. The k-means clustering algorithm is one of the most commonly used methods for partitioning data. It is a method of cluster analysis that aims to partition n observations into k clusters, in which each observation belongs to the cluster with the nearest mean, with Euclidean distance used as the metric. The main advantages of the algorithm are its simplicity and speed, which allow it to run on large datasets. K-means is iterative: in each pass it reassigns every object to its nearest cluster mean, and it stops once the assignments are stable. The steps it follows are the ones listed under the algorithm steps earlier in this article.
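In symbols, the nearest-mean rule and the quantity k-means tries to reduce can be written as below. The notation is an assumption introduced here for illustration: x_1, ..., x_n are the observations, C_j is the set of observations in cluster j, and mu_j is the mean of that cluster.

```latex
c(x_i) = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert_2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i,
\qquad
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert_2^2
```

The first expression is the assignment step, the second is the centroid update, and J is the within-cluster sum of squared Euclidean distances, which does not increase from one iteration to the next.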


The study analyzed the logs coming from the proxy server and captive portal traffic in the network. Based on the results, it was found that, most of the time, users are heavily engaged in using the internet. The approach can also be used to identify when students and employees are actively browsing the internet and the number of websites they access. The study recommends exploring clustering algorithms other than k-means for identifying and grouping web user patterns.

Example: this approach can be applied to test data in the school management domain.

Cyber Profiling

The idea of cyber profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene. Profiling is more specifically based on what is known and not known about the criminal.

A profile is information about an individual or group of individuals that is accumulated, stored, and used for various purposes, such as monitoring their behavior through their internet activity.

The cyber profiling process can be directed toward:

1. Identification of users of computers that have been used previously.

2. Mapping the subject's family, social life, work, or network-based organizations, including those for which he/she worked.

3. Provision of information about the user regarding their ability, level of threat, and vulnerability to threats.

4. Identification of the suspected abuser.

The way in which cyber profiling works:

I Hope You Like It…

Thank You For Reading…

Arnab Saha
