What Is Clustering?

Clustering is one of the most common exploratory data analysis techniques, used to get an intuition about the structure of the data. It can be defined as the task of identifying subgroups in the data such that data points in the same subgroup (cluster) are very similar while data points in different clusters are very different.

What Is K-Means Clustering?

K-Means is one of the most popular unsupervised machine learning algorithms used for solving clustering problems. K-Means segregates unlabeled data into groups, called clusters, based on similar features and common patterns.

Algorithm Steps of K-Means

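The working of the K-Means algorithm is explained in the steps below:

  • Step 1: Choose the number of clusters, K.
  • Step 2: Select K points as the initial centroids (for example, K randomly chosen data points).
  • Step 3: Assign each data point to its nearest centroid, which forms K clusters.
  • Step 4: Recompute each centroid as the mean of the points assigned to it.
  • Step 5: Repeat steps 3 and 4 until the assignments no longer change.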
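The standard K-Means procedure can be sketched in a few lines of plain Python. This is only a minimal illustration, not a tuned implementation: the function name `kmeans`, its parameters, and the sample points are assumptions made for this example, and the initial centroids are simply K randomly chosen data points.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into k groups."""
    rng = random.Random(seed)
    # Pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid (Euclidean distance).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop once no assignment changes between two iterations.
        if new_labels == labels:
            break
        labels = new_labels
        # Move each centroid to the mean of the points assigned to it.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster went empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Made-up sample data: two visually separate groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the random starting centroids, so in practice the algorithm is often run several times with different seeds and the best run is kept.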

Applications of K-Means Clustering:

  • Academic Performance

Use-Cases in Security Domain


An intrusion detection system (IDS) is a device or software application that monitors a network for malicious activity or policy violations. Any malicious activity or violation is typically reported or collected centrally using a security information and event management system. Anomaly detection is one type of intrusion detection. Current anomaly detection is often associated with a high false-alarm rate and only moderate accuracy and detection rates, because it is unable to detect all types of attacks correctly.


Malware detection refers to the process of detecting the presence of malware on a host system or of distinguishing whether a specific program is malicious or benign. Malware detection techniques play a vital role in detecting malware attacks, which can have a high impact on the cyber world. With clustering, unsupervised machine learning can detect a malware attack by identifying the behavior of the malware.


Electronic mail (email) has become essential for Internet users. Unwanted emails are known as spam. These emails are sent in bulk to a large number of recipients. The growing volume of spam has made maintaining an email inbox a common problem. Spam is a major issue for the internet community because it wastes resources and also pollutes our environment. To prevent these adverse effects, spam filtering is an essential task.

Analyzing Logs from Proxy Server and Captive Portal Using K-Means Clustering Algorithm

Traffic on the World Wide Web is rapidly increasing, and users' various interactions with websites generate an enormous amount of data. Web data has therefore become one of the most valuable resources for information retrieval and knowledge discovery. The study used the logs from the Proxy Server and Captive Portal database and applied Web Usage Mining to discover useful and interesting patterns in the web data. The k-means clustering algorithm was then used to group the user access patterns, specifically by the number of user sessions and the websites accessed by the network users. The results showed that, most of the time, users are actively engaged in using the internet.

1. Operational Framework

The operational framework has five stages: 1) data collection, 2) preprocessing, 3) data transformation, 4) pattern discovery, and 5) pattern analysis.

1.1. Data Collection

Data collection is the first step in the web usage mining process. Proxy servers are employed to improve navigation speed through caching, and they collect data from users accessing large groups of web servers. The web logs from the proxy server contain the actual HTTP requests from multiple clients to multiple web servers.

1.2. Data Preprocessing

The purpose of preprocessing is to transform the unstructured raw data into a set of user profiles. It has three major tasks: 1) data cleaning, 2) user identification, and 3) session identification. Data cleaning is the removal of irrelevant data. User identification determines which user made each session, while session identification is based on the login and logoff activities of the users.
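The three preprocessing tasks can be sketched as follows. The log layout is not described in the study, so everything here is an assumption for illustration: the `(user_id, unix_time, event)` record format, the `LOGIN`/`LOGOFF` markers, the sample entries, and the list of file suffixes treated as irrelevant.

```python
from collections import defaultdict

# Hypothetical log records: (user_id, unix_time, event), where event is
# "LOGIN", "LOGOFF", or a requested URL. The real log layout is not given.
RAW_LOG = [
    ("alice", 100, "LOGIN"),
    ("alice", 110, "http://example.com/index.html"),
    ("alice", 111, "http://example.com/logo.png"),
    ("bob", 120, "LOGIN"),
    ("alice", 150, "http://example.org/news"),
    ("bob", 130, "http://example.com/index.html"),
    ("alice", 200, "LOGOFF"),
    ("bob", 400, "LOGOFF"),
    ("alice", 500, "LOGIN"),
    ("alice", 520, "http://example.net/mail"),
    ("alice", 600, "LOGOFF"),
]

# Assumed definition of "irrelevant data": requests for embedded resources.
IRRELEVANT_SUFFIXES = (".png", ".jpg", ".gif", ".css", ".js")


def clean(records):
    """Data cleaning: drop requests whose URL ends in an irrelevant suffix."""
    return [r for r in records if not r[2].lower().endswith(IRRELEVANT_SUFFIXES)]


def build_sessions(records):
    """User and session identification: one session per LOGIN..LOGOFF pair."""
    # User identification: group the time-ordered events by user id.
    by_user = defaultdict(list)
    for user, ts, event in sorted(records, key=lambda r: r[1]):
        by_user[user].append((ts, event))

    # Session identification: a session opens at LOGIN and closes at LOGOFF.
    sessions = []
    for user, events in by_user.items():
        current = None
        for ts, event in events:
            if event == "LOGIN":
                current = {"user": user, "start": ts, "urls": []}
            elif event == "LOGOFF" and current is not None:
                current["end"] = ts
                sessions.append(current)
                current = None
            elif current is not None:
                current["urls"].append(event)
    return sessions


sessions = build_sessions(clean(RAW_LOG))
for s in sessions:
    print(s["user"], s["end"] - s["start"], len(s["urls"]))
```

Each resulting session records who logged in, how long the session lasted, and which pages were requested, which is the kind of user profile the later stages can work from.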

1.3. Data Transformation

In this stage, data from the user sessions database is extracted and transformed into a comma-separated values (CSV) file. This file contains the dataset needed for discovering session patterns. The CSV file serves as the input that the Matlab software uses to generate the clusters with k-means.
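A transformation of this kind could look like the sketch below. The study used Matlab; this example uses Python's standard `csv` module instead, and the session records, column names, and per-user aggregation are assumptions, since the actual database schema is not given.

```python
import csv
import io

# Hypothetical session records; the real database schema is not given.
sessions = [
    {"user": "alice", "urls": ["example.com", "example.org"]},
    {"user": "alice", "urls": ["example.com"]},
    {"user": "bob", "urls": ["example.net"]},
]

# Aggregate one row per user: session count and distinct websites accessed.
stats = {}
for s in sessions:
    row = stats.setdefault(s["user"], {"sessions": 0, "sites": set()})
    row["sessions"] += 1
    row["sites"].update(s["urls"])

# Write the rows as CSV (to an in-memory buffer here instead of a file).
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["user", "num_sessions", "num_websites"])
for user in sorted(stats):
    writer.writerow([user, stats[user]["sessions"], len(stats[user]["sites"])])

csv_text = buffer.getvalue()
print(csv_text)
```

The two numeric columns correspond to the features mentioned in the study (number of user sessions and websites accessed) and are what a clustering step would read.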

1.4. Pattern Discovery

Once all user transactions have been identified, a variety of data mining techniques can be applied for pattern discovery in web usage mining, and one of those is clustering. Clustering techniques are widely used in web usage mining (WUM) to capture similar trends and interests among users accessing a website. Clustering aims to divide a data set into groups or clusters such that inter-cluster similarities are minimized while intra-cluster similarities are maximized. The knowledge discovered from the clustering may be used to analyze the session patterns of the users. The k-means clustering algorithm is one of the most commonly used methods for partitioning data, and it is well suited to large datasets. k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Euclidean distance is used as the metric. The main advantages of this algorithm are its simplicity and speed, which allow it to run on large datasets. The k-means algorithm is iterative: it repeats the assignment for every object until the assignments are stable, at which point it has converged. k-means clustering is simple: it alternates between assigning each object to the cluster with the nearest mean and recomputing each cluster mean.
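The nearest-mean rule and the quantity k-means tries to reduce (the within-cluster sum of squared Euclidean distances) can be illustrated with a short sketch. The helper names `nearest_mean` and `within_cluster_ss`, the sample rows, and the two cluster means are all made up for this example.

```python
import math


def nearest_mean(point, means):
    """Return the index of the cluster mean closest to `point` (Euclidean)."""
    return min(range(len(means)), key=lambda j: math.dist(point, means[j]))


def within_cluster_ss(points, labels, means):
    """Sum of squared distances from each point to its assigned cluster mean."""
    return sum(math.dist(p, means[lab]) ** 2 for p, lab in zip(points, labels))


# Hypothetical (num_sessions, num_websites) rows and two candidate means.
points = [(2, 3), (3, 4), (20, 15), (22, 18)]
means = [(2.5, 3.5), (21.0, 16.5)]

labels = [nearest_mean(p, means) for p in points]
total = within_cluster_ss(points, labels, means)
print(labels, total)
```

A lower total means the points sit closer to their cluster means, which is the sense in which intra-cluster similarity is maximized.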


The study analyzed the logs coming from the proxy server and captive portal traffic in the network. The results showed that, most of the time, users are actively engaged in using the internet. The approach can also be used to identify when students and employees are active in browsing the internet and the number of websites they access. The study also recommends exploring clustering algorithms other than k-means for identifying and grouping web user patterns.

Cyber Profiling

The idea of cyber profiling is derived from criminal profiling, which provides information to the investigation division so that it can classify the types of criminals who were at the crime scene. Profiling is based more specifically on what is known and not known about the criminal.

I Hope You Like It…

Thank You For Reading…
