Abstract - “Data is an ocean of Universal Facts”. Big data, once an emergent technology of study, is in its prime with immense potential for future technological advancements. A formal study in the attributes of data is essential to build robust systems of the future. Data scientists need a basic foothold when studying data systems and their applications in various domains. This paper intends to be the go-to resource for every student and professional wishing to enter the field of Big Data. This paper has two focus areas. The first is a detailed account of the 5 V attributes of data, i.e. Volume, Variety, Velocity, Veracity and Value. Second, we endeavor to present a domain-wise independent as well as comparative analysis of the correlation between the 5 V’s of Big Data. We have researched and collected information from various market watchdogs and carried out comparatives, which are highlighted in this publication. The domains we will mention are Wholesale Trade Domain, Retail Domain, Utilities Domain, Education Domain, Transportation Domain, Banking and Securities Domain, Communication and Media Domain, Manufacturing Domain, Government Domain, Healthcare Domain, etc. This is invaluable information for Big Data system designers as well as future researchers.
Keywords: 5V’s, Big Data, Data Attribute, Quantity, Reliability, Volume, Variety, Velocity, Veracity, Value

Before we deep dive into the taxonomy of the attributes of big data, it is necessary to get a brief idea about the flow of data to information leading to informed decision making.
The statement “Data is an Ocean of Universal Facts” defines that all facts of the universe can be said to be data. Decisions that need to be taken based on the facts must ensure that the inferential value of the relevant facts is appropriately computed or ascertained. This can be done by carrying out data analytics. It is only after the data is processed that one can arrive at intelligent and wise decisions. The rift between the increasing quantity of data and the information derived from it is getting wider, deeper and more complex with every passing minute. Industry experts no longer even call it data and are forced to call it “BIG DATA”.
The objective of business houses is always to take decisions which will lead to profitable growth in business and improved customer satisfaction. The transaction cycle always starts at data, which leads to data analytics, and it always ends in decisions based on analyzing patterns and behaviors of both humans and systems. The journey from collecting data and setting objectives to taking wise decisions by building a knowledge quotient is a sequence of steps, as in Figure-1.

Figure-1: The sequence of steps involved in data analysis which lead to taking ‘wise’ decisions.

The start has to be very well defined and articulated. To elaborate the definition of Big Data Analytics: it is a pipeline of acquisition and recording; extraction, cleaning and annotation; integration, aggregation and representation; analysis and modeling; and interpretation. This leads us to conclude that a study of the attributes of data is critical and the most important aspect of the complete life cycle. It is the foundation of all things which can be classified as knowledge. Also, no amount of analysis can bypass or justify any loss or lack of understanding of the data leading to decisions. Knowing the attributes of data is therefore very important, and these attributes are key imperatives in the world of Big Data.
There are various proposed attributes of Big Data based on various thought processes. Different companies have deployed various ideations for defining the key attributes of data. This survey is centered around exploring these attributes with respect to their definition in different business domains, their importance and inferential value. In addition, it also compares the importance and challenges of the attributes across the top business domains.
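The pipeline stages listed above (acquisition and recording; extraction, cleaning and annotation; integration, aggregation and representation; analysis and modeling; interpretation) can be illustrated with a minimal sketch. The sample records, field names and function names below are hypothetical and chosen only for illustration; they are not taken from any system described in this paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw records, as they might arrive from point-of-sale systems.
RAW = [
    "store=A; amount=120.50",
    "store=B; amount=80",
    "store=A; amount=n/a",      # corrupt value, dropped during cleaning
    "store=B; amount=99.5",
]

def acquire():
    """Acquisition and recording: return the raw records as captured."""
    return list(RAW)

def extract_and_clean(lines):
    """Extraction, cleaning and annotation: parse fields, drop bad rows."""
    rows = []
    for line in lines:
        fields = dict(part.strip().split("=", 1) for part in line.split(";"))
        try:
            rows.append({"store": fields["store"],
                         "amount": float(fields["amount"])})
        except (KeyError, ValueError):
            continue  # an unparseable record is discarded as noise
    return rows

def integrate(rows):
    """Integration, aggregation and representation: group amounts by store."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row["store"]].append(row["amount"])
    return dict(grouped)

def analyze(grouped):
    """Analysis and modeling: here, simply the mean sale per store."""
    return {store: mean(amounts) for store, amounts in grouped.items()}

def interpret(averages):
    """Interpretation: turn the numbers into a decision-ready statement."""
    best = max(averages, key=averages.get)
    return f"Store {best} has the highest average sale ({averages[best]:.2f})."

def run_pipeline():
    """Chain the five stages in the order given in the text."""
    return interpret(analyze(integrate(extract_and_clean(acquire()))))

print(run_pipeline())
```

Each stage is a separate function so that a weakness in one stage (for example, overly aggressive cleaning) can be examined without touching the others, which mirrors the point that no later analysis can compensate for a poor understanding of the data.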
5V’s of Big Data Attributes and their Relevance and Importance across Domains
II. 5V’S OF BIG DATA
In a 2001 Meta Group publication, analyst Doug Laney (later with Gartner) introduced the 3 V’s of Data Management, defining the 3 main components of data as Volume, Velocity and Variety [1]. This can be considered the first and most important step towards the revolution called Big Data attributes for Analytics, as it paved the way to more ergonomic and reliable outcomes. The outcomes would now be more accurate and lead to better profitability, given that the inferences arrived at were in close proximity to the true picture or reality. The growth of the ocean of data soon outgrew these three attributes and there was a need to add more attributes for a more robust analysis.
IBM is credited with the addition of the 4th V to the attribute chain. It introduced Veracity and defined it as “uncertainty of the data”: the quality or trustworthiness of the data. One way to handle big data’s veracity is to discard “noise” and transform the data into trustworthy insights [2].
As the world moves ahead and the ocean of data continues to grow beyond boundaries, the hunger of Man to grow in leaps and bounds has now led to the need for two new V’s, namely Variability and Visualization. However, it is not the endeavor of this paper to explore these attributes.
Kapil, Agarwal and Khan (2016), in an independent study, discuss the characteristics of big data. The study presented that Big Data is a new idea and has received numerous definitions from researchers, organizations, and individuals. In 2001, industry analyst Doug Laney (currently with Gartner) articulated the mainstream definition of big data in terms of three V’s: Volume, Velocity, and Variety [13]. SAS (Statistical Analysis System) has added two additional dimensions, i.e. Variability and Complexity. Further, Oracle has defined big data in terms of four V’s, i.e. Volume, Velocity, Variety and Value. Furthermore, Oguntimilehin A. presented big data in terms of five V’s (Volume, Velocity, Variety, Variability, Value) and a Complexity [14]. In 2014, Kirk Borne of Data Science Central defined big data in 10 V’s, i.e. Volume, Variety, Velocity, Veracity, Validity, Value, Variability, Venue, Vocabulary, Vagueness [13].

[...] the quantity of data available. The more data that is available, the more clarity will exist in the other attributes of the data. In case the data available is “very small” [very small in the world of big data would still be in terabytes at times], it would not be possible to express other attributes of the data with a high degree of belief. We would be forced to make more assumptions to substantiate our facts or hypotheses. In short:
[...] INFERENCES

VARIETY: This is the proverbial spread of the data types that can be analyzed. It also refers to the breadth or the extent of coverage. It can be said to be the flavor of data, which exists as a structured, semi-structured or completely unstructured set serving as the input on which further analytics are to be performed. The quality or inconsistencies define the value that can be generated from using such data sets. In short:
VARIETY SPREAD WIDTH SCOPE OR EXTENT OF COVERAGE

VELOCITY: This attribute defines the rate of growth of data. In this age of internet and technological advances, where not only humans but also systems are smart enough to generate data and define personal preferences to gather data without human intervention, the rate of growth of the ‘pool’ or ‘ocean’ of data, both in terms of depth and width, is astronomical.
VELOCITY SPREAD AND WIDTH INCREASE RATE OF GROWTH OF DATA

VERACITY: With the growth of the volumes of data and the high velocities with which the data is being assimilated, there are bound to be many errors, duplications or corruptions of data. This can also be said to be the noise in the system, which is collateral damage of the volume and velocity of the big data gamut.
[...] OF DATA

VALUE: Last but most important is the Value attribute. This is the icing on the cake. Without it, all the other attributes are meaningless. High volumes of good quality data must be well analyzed for value and a meaningful proposition with direct relevance to the business requirements of wise decisions for business growth, profitability and customer satisfaction.
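Veracity is described above as the noise that arises from errors, duplication or corruption of data. As a minimal sketch, two simple noise indicators can be computed for a batch of records: the share of exact duplicates and the share of records with a missing required field. These two indicators, the function name `veracity_report` and the sample records are assumptions made for illustration; the paper itself does not prescribe a particular veracity measure.

```python
def veracity_report(records, required_fields):
    """Return simple noise indicators for a list of dict records.

    duplicate_rate:  share of records that exactly repeat an earlier record
    incomplete_rate: share of records with a required field missing or empty
    """
    total = len(records)
    seen = set()
    duplicates = 0
    incomplete = 0
    for rec in records:
        # A record's identity is its full set of (field, value) pairs.
        key = tuple(sorted(rec.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
        if any(rec.get(field) in (None, "") for field in required_fields):
            incomplete += 1
    return {
        "total": total,
        "duplicate_rate": duplicates / total if total else 0.0,
        "incomplete_rate": incomplete / total if total else 0.0,
    }

# Hypothetical customer records containing typical noise.
sample = [
    {"id": 1, "name": "Asha", "zip": "411001"},
    {"id": 1, "name": "Asha", "zip": "411001"},   # exact duplicate
    {"id": 2, "name": "", "zip": "560001"},       # empty name
    {"id": 3, "name": "Ravi", "zip": None},       # missing ZIP code
]

report = veracity_report(sample, required_fields=["name", "zip"])
print(report)
```

Indicators of this kind only flag noise; deciding which records to discard or repair still depends on the Value the business expects from the data.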
It encompasses everything. To name a few: digital data, financial transactions, scientific experiments, genomic data, logs, events, emails, social media, sensors, texts, audio data, medical records, surveillance, images, and videos.
Fremont Rider (1944), Wesleyan University Librarian, publishes The Scholar and the Future of the Research Library. He estimates that American university libraries were doubling in size every sixteen years. Given this growth rate, Rider speculates that the Yale Library in 2040 will have “approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves…” In October 2000, Peter Lyman and Hal R. Varian at UC Berkeley publish “How Much Information?” The study finds that in 1999 the world produced about 1.5 exabytes of unique information, or about 250 MB for every man, woman, and child on earth.
In June 2008, Cisco releases the “Cisco Visual Networking Index – Forecast and Methodology, 2007–2012”, part of an “ongoing initiative to track and forecast the impact of visual networking applications.” It predicts that “IP traffic will nearly double every two years through 2012” and that it will reach half a zettabyte in 2012. The forecast held well, as Cisco’s latest report (May 30, 2012) estimates IP traffic in 2012 at just over half a zettabyte and notes it “has increased eightfold over the past 5 years.”
There are 277,000 Tweets every minute, Google processes over 2 million search queries every minute, 72 hours of new video are uploaded to YouTube every minute, more than 100 million emails are sent every minute, Facebook processes 350 GB of data every minute and 571 new websites are created every minute.
The chart below provides the volumetric distribution of Big Data across various domains.

The costs to store and maintain the sanctity of such large volumes of data are astronomical. Also, processing such large data at high speeds requires state-of-the-art systems and tools.
In addition to this, the biggest concern is the privacy and protection of the data. [9]

IV. VARIETY ATTRIBUTE
Variety of data is “The type and nature of the data.” This helps people who analyze it to effectively use the resulting insight. The ‘data’ is generated as date, time, place, number or some textual content. This data is either structured, semi-structured or unstructured. From a structured data set, further unstructured data sets are generated, which results in more opportunities for analysis. To look into the variety of data and quantify the abstract, the industry needs newer algorithms and systems.
Data can now be categorized as:

STRUCTURED DATA: Structured data is data which resides within a fixed field type. It has a fixed shape, size, format, type, etc. This type of data is found in databases or Excel spreadsheets, where each record is made up of multiple attributes and the data content of each cell conforms to the record and the attribute type. Examples of structured data are names, phone numbers, ZIP codes, salutations, age, etc.
UNSTRUCTURED DATA: Unstructured data is the next new. This data has non-standard formats, or rather no standard formats exist for it. Data in its unstructured form cannot be stored in databases directly or analyzed easily by conventional tools and methods. Typical examples of unstructured data are photographs, audio clips, video clips, blog entries, forums, social media platforms like Facebook and LinkedIn, presentations, PDF files, web pages, etc. It is estimated that more than 80% data is of the unstructured
The chart below provides the velocity distribution of Big Data across various domains.
The chart below provides the veracity distribution of Big Data across various domains.

[...] be considered as the factor which differentiates human intelligence from artificial intelligence.
Figure-19: Correlation between the 5 V’s in Govt Domain

Healthcare Domain:
Wikipedia defines this domain as “Health care or healthcare is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings. Healthcare systems are organizations established to meet the health needs of target populations.” [16]
The correlation between the V parameters in this domain is shown below.

Figure-20: Correlation between the 5 V’s in Healthcare Domain

V. CONCLUSION
The above study reflects and identifies very clear inferences. This survey concludes that data generated in different domains is also distinct in terms of its 5V attributes. That being understood, the challenges within one domain must be analysed and treated as unique to the domain rather than with a “one-size-fits-all” approach. Comparative study between the domains shows us that while roughly a 20% weightage does exist between the 5V’s across domains, there is still a subtle

1. Bohannon, P., Fan W., Geerts F., Jia X., Kementsietsidis A., () Conditional Functional Dependencies for Data Cleaning, University of Edinburgh research
3. Bresina, J L, Morris P H, (2006) Explanations and Recommendations for Temporal Inconsistencies, IWPSS, https://www.stsci.edu/largefiles/iwpss/20066061912 IWPSS_draft4.pdf
5. Brisaboa, Nieves & Luaces, Miguel & Rodriguez, Andrea & Seco, Diego. (2014). An inconsistency measure of spatial data sets with respect to topological constraints. International Journal of Geographical Information Science. 28. 56-82.
6. Dr. S. Vijayarani and Ms. S. Sharmila, RESEARCH IN BIG DATA – AN OVERVIEW, Informatics Engineering, an International Journal (IEIJ), Vol.4, No.3, September
8. Du Zhang, ‘Inconsistencies in Big Data’, Proceedings of the 12th IEEE Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2013, pp. 61-67.
9. Garboden, Philip. (2020). Sources and Types of Big Data for Macroeconomic Forecasting. DOI 10.1007/978-3-030-31150-6_1.
12. Hartzband, David. (2019). “What Is Data?” DOI:
13. Jeffrey Ray, Olayinka Johnny, Marcello Trovati, Stelios Sotiriadis and Nik Bessis, The Rise of Big Data Science: A Survey of Techniques, Methods and Approaches in the Field of Natural Language Processing and Network Theory, Big Data Cogn. Comput. 2018, 2, 22; doi:10.3390/bdcc2030022, www.mdpi.com/journal/bdcc
14. Khan, Samiya & Liu, Xiufeng & Shakil, Kashish & Alam, Mansaf. (2017). A survey on scholarly data: From big data perspective. Information Processing & Management. 53. 923-944.
15. Krogh, Jesper. (2020). Data Types. DOI 10.1007/978-1-4842-5584-1_13.
16. Kumar, Praveen. (2019). BIG DATA ANALYTICS IN HR DOMAIN. DOI 10.1729/Journal.22887.
17. M. V. Martinez, A. Pugliese, G. I. Simari, V. S. Subrahmanian, and H. Prade, How dirty is your relational database? An axiomatic approach, in Proc. 9th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Hammamet, Tunisia, LNAI 4724, 2007, pp. 103-114.
