Big Data — how it’s managed and measured

Arnab Saha
6 min read · Sep 17, 2020

We live in a world of information, with data everywhere. There is a massive amount of data all around us.

How it came into existence:

With the proliferation of online services and mobile technologies, the world has stepped into a multimedia big data era. In the past few years, the fast and widespread use of multimedia data, including image, audio, video, and text, as well as the ease of access and availability of multimedia sources, have resulted in a big data revolution in multimedia management systems.

Currently, multimedia sharing websites, such as Yahoo, Flickr, and YouTube, and social networks such as Facebook, Instagram, and Twitter, are considered inimitable and valuable sources of multimedia big data. For example, to date, Instagram users have uploaded over 20 billion photos, YouTube users upload over 100 hours of video every minute, and Twitter’s 255 million active users send approximately 500 million tweets every day.

Problem

But here’s the problem: with a population of 7.8 billion growing at about 1.05% per year, the world now has to deal with data measured in zettabytes and yottabytes.
(A little refresher:
8 bits (each a 0 or 1) = 1 Byte (B)
1024 B = 1 Kilobyte (KB)
1024 KB = 1 Megabyte (MB)
1024 MB = 1 Gigabyte (GB)
1024 GB = 1 Terabyte (TB)
1024 TB = 1 Petabyte (PB)
1024 PB = 1 Exabyte (EB)
1024 EB = 1 Zettabyte (ZB)
1024 ZB = 1 Yottabyte (YB)
)
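
As a quick sanity check on this ladder, here is a small Python helper (purely illustrative, not from any particular library) that converts a raw byte count into the largest sensible binary unit:

```python
# Climb the 1024x ladder from the refresher above until the
# value drops below 1024, then print it with the matching unit.
UNITS = ["B", "KB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def human_readable(num_bytes: float) -> str:
    """Convert a raw byte count into the largest sensible binary unit."""
    for unit in UNITS:
        if num_bytes < 1024 or unit == UNITS[-1]:
            return f"{num_bytes:.2f} {unit}"
        num_bytes /= 1024

print(human_readable(1_500_000_000_000))  # ~1.36 TB
print(human_readable(1024 ** 7))          # 1.00 ZB
```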

Big Data

The big data concept is essentially used to describe extremely large datasets. However, different scientists and technology enterprises define the term in different ways. Bryant et al. coined the term “Big-Data Computing” in 2008. Finally, in 2010, Apache Hadoop defined big data as “datasets which could not be captured, managed, and processed by general computers within an acceptable scope.”
Let’s see some interesting statistics about data
Internet

In 2014, there were 2.4 billion internet users. That number grew to 3.4 billion by 2016, and in 2017, another 300 million internet users were added. As of June 2019, there were over 4.4 billion internet users. This is an 83% increase in the number of people using the internet in just five years!
A Day of Data

500 million tweets are sent
294 billion emails are sent
4 petabytes of data are created on Facebook
4 terabytes of data are created from each connected car
65 billion messages are sent on WhatsApp. WhatsApp had 450 million daily active users in Q2 2018.
Google gets over 3.5 billion searches daily.
Google remains the dominant player in the search engine market, holding 87.35% of the global search engine market share as of January 2020. Big Data stats for 2020 show that this translates into 1.2 trillion searches yearly, and more than 40,000 search queries per second.
By 2025, it’s estimated that 463 exabytes of data will be created each day globally — that’s the equivalent of 212,765,957 DVDs per day!

Do you know how big YouTube is?
Let me give you a picture,
The total number of people who use YouTube — 1,300,000,000.
300 hours of video are uploaded to YouTube every minute!
Almost 5 billion videos are watched on YouTube every single day.
YouTube gets over 30 million visitors per day.

Now let’s look at the problems Big Data poses

Volume

You’re not really in the big data world unless the volume of data is petabytes, exabytes, or more. Big data giants like Amazon, Shopify, and other e-commerce platforms receive real-time, structured, and unstructured data, ranging from terabytes to zettabytes, every second from millions of customers, especially smartphone users, across the globe. They process this data in near real time and then run machine learning algorithms to analyze it.

Velocity

Imagine a machine learning service that is constantly learning from a stream of data, or a social media platform with billions of users posting and uploading photos 24x7x365. Every second, millions of transactions occur, which means petabytes of data are being transferred from millions of devices to data centers every second. This rate of high-volume data inflow per second defines the velocity of data.
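
To make “velocity” concrete, here is a minimal sketch that measures inflow as bytes ingested per second. The event generator below is made up for illustration; it stands in for a real feed such as a Kafka topic or a socket:

```python
# Illustrative sketch: velocity = bytes ingested per second.
import os
import random
import time

def simulated_event_stream():
    """Yield fake events of 200 to 2000 random bytes each
    (a stand-in for a real message feed)."""
    while True:
        yield os.urandom(random.randint(200, 2000))

ingested = 0
start = time.monotonic()
for event in simulated_event_stream():
    ingested += len(event)
    elapsed = time.monotonic() - start
    if elapsed >= 1.0:
        # bytes per second over the measurement window
        print(f"velocity ~ {ingested / elapsed:,.0f} bytes/sec")
        break
```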

Big Data Analytics:

Big Data analytics largely involves collecting data from different sources, munging it into a form that analysts can consume, and finally delivering data products useful to the organization’s business.
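
As a toy illustration of that collect, munge, and deliver flow, here is a plain-Python sketch. The raw_records sample and its schema are invented for the example; a real pipeline would use tools like Spark or Hive:

```python
# Collect -> munge -> deliver, in miniature.
raw_records = [
    "2020-09-17,mobile,IN,199.00",
    "2020-09-17,web,US,49.50",
    "bad row",                      # real feeds always contain junk
    "2020-09-18,mobile,US,25.00",
]

def munge(record: str):
    """Parse one CSV-ish line; drop rows that don't fit the schema."""
    parts = [p.strip() for p in record.split(",")]
    if len(parts) != 4:
        return None
    date, channel, country, amount = parts
    return {"date": date, "channel": channel,
            "country": country, "amount": float(amount)}

# Munge the collected records, keeping only clean rows.
clean = [r for r in map(munge, raw_records) if r is not None]

# Deliver a simple "data product": revenue per channel.
revenue = {}
for row in clean:
    revenue[row["channel"]] = revenue.get(row["channel"], 0) + row["amount"]
print(revenue)  # {'mobile': 224.0, 'web': 49.5}
```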

The solution to this problem of enormous amounts of Big Data is “Distributed Storage.”

We are now reaching a tipping point at which the traditional approach to storage, the use of a stand-alone, specialized storage box, no longer works, for both technical and economic reasons. We need not just faster drives and networks but a new approach, a new concept of data storage. At present, the best approach to satisfying current demands for storing data seems to be distributed storage.

Here is what a typical distributed storage architecture looks like.

We have one master node at the center and eight slave nodes sharing their resources with it. The master node therefore commands the processing power and storage of all eight nodes.

This distributed storage solves our problems of volume and velocity. Any data stored here is divided across the slave nodes, so retrieval is superfast, since blocks can be read from many nodes in parallel.
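
Here is a toy sketch of that idea: splitting data into fixed-size blocks and scattering them round-robin across the eight slave nodes. The block size and node names are made up, and real systems like HDFS use far larger blocks plus replication:

```python
# Split data into blocks and scatter them across slave nodes.
BLOCK_SIZE = 4                                 # bytes, tiny on purpose;
                                               # HDFS defaults to 128 MB
NODES = [f"slave-{i}" for i in range(1, 9)]    # the 8 slave nodes

def distribute(data: bytes):
    """Return a block map: which node holds which block."""
    block_map = {}
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        node = NODES[(i // BLOCK_SIZE) % len(NODES)]
        block_map.setdefault(node, []).append(block)
    return block_map

print(distribute(b"hello distributed storage!"))
```

Because each node holds only a fraction of the blocks, a read can pull from all the nodes at once, which is where the speedup comes from.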

Technologies:

Some of the technologies we use for handling, migrating, and manipulating such large data are

  1. Apache Hadoop
  2. Cassandra
  3. MongoDB
  4. Apache Hive

and the list goes on

The underlying job of all these technologies is the same: “to process the data faster and store it effectively with minimum cost and the fastest response.”
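
To give a flavor of that job, here is the classic word-count pattern in the Hadoop Streaming style: a mapper and a reducer written as ordinary Python scripts that read stdin and write stdout. The file names are mine for illustration:

```python
# mapper.py -- emit "word<TAB>1" for every word in the input.
# Hadoop runs many copies of this in parallel across the cluster.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- sum the counts per word.
# Hadoop's shuffle guarantees input arrives sorted by key.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")
```

Hadoop launches many mappers in parallel, shuffles identical words to the same reducer, and lets the reducers sum the counts. That divide-and-distribute pattern is the common thread running through all the tools above.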

If you liked the article, please give it a clap and connect with me on LinkedIn | GitHub
