
International Journal of Innovative Technology and Exploring Engineering (IJITEE)

ISSN: 2278-3075, Volume-9 Issue-11, September 2020

5V's of Big Data Attributes and their Relevance and Importance across Domains
Vinaya Keskar, Jyoti Yadav, Ajay H. Kumar

Abstract - "Data is an ocean of Universal Facts". Big data, once an emergent technology of study, is in its prime, with immense potential for future technological advancements. A formal study of the attributes of data is essential to build robust systems of the future. Data scientists need a basic foothold when studying data systems and their applications in various domains. This paper intends to be the go-to resource for every student and professional desirous of making an entry into the field of Big Data. The paper has two focus areas. The first is a detailed treatment of the five V attributes of data, i.e. Volume, Variety, Velocity, Veracity and Value. Secondly, we endeavor to present a domain-wise independent as well as comparative view of the correlation between the 5 V's of Big Data. We have researched and collected information from various market watchdogs and concluded by carrying out comparatives, which are highlighted in this publication. The domains we cover include the Wholesale Trade, Retail, Utilities, Education, Transportation, Banking and Securities, Communication and Media, Manufacturing, Government and Healthcare domains, among others. This is invaluable information for Big Data system designers as well as future researchers.

Keywords: 5V's, Big Data, Data Attribute, Quantity, Reliability, Volume, Variety, Velocity, Veracity, Value, Worth

I. INTRODUCTION
"Data is an Ocean of Universal Facts". This statement defines that all facts of the universe can be said to be data. Decisions that need to be taken based on the facts must ensure that the inferential value of the relevant facts is appropriately computed or ascertained. This can be done by carrying out data analytics. It is only after the data is processed that one can take intelligent and wise decisions. The rift between the increasing quantity of data and the information derived from it is getting wider, deeper and more complex with every passing minute. Industry experts no longer even call it data; they are forced to call it "BIG DATA".

The objectives of business houses are always to take decisions which will lead to profitable growth in business and improved customer satisfaction. The transaction cycle will always start at data that leads to data analytics for wise decisions, and it will always end in decisions taken by analyzing the patterns and behaviors of both humans and systems. The journey from collecting data, through setting objectives and building a knowledge quotient, to taking wise decisions is the sequence of steps that leads to data analytics for wise decisions, as in figure-1.

Before we deep dive into the taxonomy of the attributes of big data, it is necessary to get a brief idea of the flow from data to information, leading to informed decision making.

Figure-1: The sequence of steps involved in data analysis which lead to taking 'wise' decisions.

The start has to be very well defined and articulated. To elaborate on the definition of Big Data Analytics: it is a pipeline of acquisition and recording; extraction, cleaning and annotation; integration, aggregation and representation; analysis and modeling; and interpretation. This leads us to conclude that a study of the attributes of data is critical and the most important aspect of the complete life cycle. It is the foundation of all things which can be classified as knowledge. Moreover, no amount of analysis can by-pass or justify any loss or lack of understanding of the data on which decisions are based. Knowing the attributes of data is a key imperative in this world of Big Data. There are various proposed attributes of Big Data based on various thought processes, and different companies have deployed various ideations for defining the key attributes of data. This survey is centered on exploring these attributes with respect to their definition in different business domains, their importance and their inferential value. In addition, it also compares the importance and challenges of the attributes across the top business domains.
Revised Manuscript Received on September 05, 2020.
Vinaya Keskar, Research Student, Department of Computer Science,
Savitribai Phule Pune University (SPPU), Pune, Maharashtra, India.
Dr. Jyoti Y. Yadav, Assistant Professor, Department of Computer
Science, Savitribai Phule Pune University, Pune, Maharashtra, India.
Dr. Ajay H. Kumar, Research Guide, Department of Computer
Science, Savitribai Phule Pune University, Pune, Maharashtra, India.


II. 5V'S OF BIG DATA

In a 2001 Meta Group publication, analyst Doug Laney (later of Gartner) introduced the 3 V's of Data Management, defining the three main components of data as Volume, Velocity and Variety [1]. This can be considered the first and most important step towards the revolution called Big Data attributes for Analytics, as it paved the way towards more ergonomic and reliable outcomes. The outcomes would now be more accurate and lead to better profitability, given that the inferences arrived at were in close proximity to the true picture or reality. The growth of the ocean of data soon outlived these three attributes, and more attributes were needed for a more robust analysis.

IBM is credited with the addition of the 4th V to the attribute chain. It introduced Veracity and defined it as "uncertainty of the data": the quality or trustworthiness of the data. One of the tools that helps to handle big data's veracity is to discard "noise" and transform the data into trustworthy insights. [2]

As the world moves ahead and the ocean of data continues to grow beyond boundaries, Man's hunger to grow in leaps and bounds has now led to the need for two new V's, namely Variability and Visualization. However, it is not our endeavor to explore these attributes here.

Kapil, Agarwal and Khan (2016), in an independent study, discuss the characteristics of big data. The study presented that Big Data is a new idea which has attracted numerous definitions from researchers, organizations and individuals. In 2001, industry analyst Doug Laney (later with Gartner) articulated the mainstream definition of big data in terms of three V's: Volume, Velocity and Variety [13]. SAS (Statistical Analysis System) added two additional dimensions, i.e. Variability and Complexity. Further, Oracle defined big data in terms of four V's, i.e. Volume, Velocity, Variety and Value. Furthermore, Oguntimilehin A. presented big data in terms of five V's, Volume, Velocity, Variety, Variability and Value, plus Complexity [14]. In 2014, at Data Science Central, Kirk Borne defined big data in terms of 10 V's, i.e. Volume, Variety, Velocity, Veracity, Validity, Value, Variability, Venue, Vocabulary and Vagueness [13].

These 5 V's of data can be summarized as in figure-2.

Figure-2: The characteristics of Big Data Attributes (5-V)

VOLUME: Volume is the first data attribute and is the foundation of all other attributes. It refers to the measure of the quantity of data available. The more data that is available, the more clarity will exist in the other attributes of the data. In case the available data is "very small" [very small in the world of big data may still be in terabytes at times], it would not be possible to express the other attributes of the data with a high degree of belief; we would be forced to make more assumptions to substantiate our facts or hypotheses. In short:

VOLUME → QUANTITY → DEPTH → RELIABILITY OF INFERENCES

VARIETY: This is the proverbial spread of the data types that can be analyzed. It also refers to the breadth or extent of coverage. It can be said to be the flavor of the data, which arrives as structured, semi-structured or completely unstructured input on which further analytics are to be performed. Its quality or inconsistency defines the value that can be generated from using such data sets. In short:

VARIETY → SPREAD → WIDTH → SCOPE OR EXTENT OF COVERAGE

VELOCITY: This attribute defines the rate of growth of data. In this age of internet and technological advances, where not only humans but also systems are smart enough to generate data and define personal preferences to gather data without human intervention, the rate of growth of the 'pool' or 'ocean' of data, in both depth and width, is astronomical. In short:

VELOCITY → SPREAD AND WIDTH INCREASE → RATE OF GROWTH OF DATA

VERACITY: With the growth in the volumes of data and the high velocities with which the data is being assimilated, there are bound to be many errors, duplications and corruptions of the data. This can also be said to be the noise in the system, which is collateral damage of the volume and velocity of the big data gamut. In short:

VERACITY → ACCURACY → MESSINESS → RELIABILITY OF DATA

VALUE: Last but most important is the Value attribute. This is the icing on the cake; without it, all the other attributes are meaningless. High volumes of good quality data must be well analyzed for value and a meaningful proposition with direct relevance to the business requirements of wise decisions for business growth, profitability and customer satisfaction. In short:

VALUE → WORTH → RELEVANCE → DIRECT VALUE-ADD TO THE BUSINESS

III. ROADMAP OF THE 5-V BIG DATA ATTRIBUTES

VOLUME ATTRIBUTE:
Volume is defined as "The quantity of generated and stored data. The size of data determines value and potential insight, and whether it can be considered big data or not." [16]
The foundation of Big Data Analytics is dependent on the quantity of data available.


It encompasses everything. To name a few: digital data, financial transactions, scientific experiments, genomic data, logs, events, emails, social media, sensors, texts, audio data, medical records, surveillance, images and videos.

Fremont Rider (1944), Wesleyan University Librarian, published The Scholar and the Future of the Research Library. He estimated that American university libraries were doubling in size every sixteen years. Given this growth rate, Rider speculated that the Yale Library in 2040 would have "approximately 200,000,000 volumes, which will occupy over 6,000 miles of shelves…". In October 2000, Peter Lyman and Hal R. Varian at UC Berkeley published "How Much Information?". The study found that in 1999 the world produced about 1.5 exabytes of unique information, or about 250 MB for every man, woman and child on earth. In June 2008, Cisco released the "Cisco Visual Networking Index – Forecast and Methodology, 2007–2012", part of an "ongoing initiative to track and forecast the impact of visual networking applications." It predicted that "IP traffic will nearly double every two years through 2012" and that it would reach half a zettabyte in 2012. The forecast held well, as Cisco's latest report (May 30, 2012) estimates IP traffic in 2012 at just over half a zettabyte and notes that it "has increased eightfold over the past 5 years."

There are 277,000 Tweets every minute, Google processes over 2 million search queries every minute, 72 hours of new video are uploaded to YouTube every minute, more than 100 million emails are sent every minute, Facebook processes 350 GB of data every minute and 571 new websites are created every minute.

The chart below provides the volumetric distribution of Big Data across various domains.

Figure-3: Volume wise distribution of Big Data across domains. [(Ratings: Very high-5, high-4, avg.-3, low-2, very low-1)]

From the above study data it is observed that the challenges with volume are enormous. This data is available in abundance, and this abundance creates the following challenges:
- Sampling analysis of the data does not allow for a smaller data set, as the sample size itself is again Big Data. Thus it cannot be said that sampling will lead to effective and 'easy' analytics for the data.
- Space: the storage of and access to such large volumes of big data requires storage space, infrastructure, time and speed for analysis. Expecting faster analysis of this big data requires further technological advancements in the field of hardware. Cloud-based solutions and shared processing are cost-effective solutions for the industry, but these are still in their infancy.
- The costs to store and maintain the sanctity of such large volumes of data are astronomical. Also, processing such large data at high speeds requires state-of-the-art systems and tools.
- In addition to this, the biggest concern is the privacy and protection of the data. [9]

IV. VARIETY ATTRIBUTE

Variety of data is "The type and nature of the data." This helps the people who analyze it to effectively use the resulting insight. The data is generated by date, time, place, number or some textual content. This data is either structured, semi-structured or unstructured. From a structured data set, more unstructured data sets are generated, which results in more opportunities for analysis. To look into the variety of data and quantify the abstract, the industry needs newer algorithms and systems.

Data can be categorized as:

STRUCTURED DATA: Structured data is data which resides within a fixed field type. It has a fixed shape, size, format, type, etc. This type of data is found in databases or Excel spreadsheets, where each record is made up of multiple attributes and the data content of each cell conforms to the record and the attribute type. Examples of structured data are names, phone numbers, ZIP codes, salutations, age, etc.

UNSTRUCTURED DATA: Unstructured data is the next new thing. This data has non-standard formats, or rather no standard format exists for it. Data in its unstructured form cannot be stored in databases directly or analyzed easily by conventional tools and methods. Typical examples of unstructured data are photographs, audio clips, video clips, blog entries, forums, social media platforms like Facebook and LinkedIn, presentations, PDF files, web pages, etc. It is estimated that more than 80% of data is of the unstructured type.

Figure-4: Structured v/s Unstructured Data [7].

In addition, there also exists a middle class called semi-structured data, which contains a combination of structured and unstructured data. A detailed study of it is outside the scope of this paper. Some examples are comma-separated value files and unformatted data dumps in databases.
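To make the structured/semi-structured distinction concrete, here is a minimal Python sketch (the Customer record and the sample values are illustrative, not from the paper): a comma-separated line is only semi-structured, and parsing it into a typed record is what turns it into structured data.

```python
# Minimal sketch: semi-structured CSV text vs. a structured, typed record.
import csv
import io
from dataclasses import dataclass

@dataclass
class Customer:          # structured: fixed fields with fixed types
    name: str
    zip_code: str
    age: int

# Semi-structured: delimited text with no enforced shape or types.
semi_structured = "name,zip_code,age\nJane Doe,411007,34\n"

records = []
for row in csv.DictReader(io.StringIO(semi_structured)):
    # Type enforcement turns the delimited text into structured data;
    # a non-numeric "age" value would fail here instead of slipping through.
    records.append(Customer(row["name"], row["zip_code"], int(row["age"])))

print(records[0])  # Customer(name='Jane Doe', zip_code='411007', age=34)
```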
The chart below provides the variety distribution of Big Data across various domains.


Figure-5: Variety wise distribution of Big Data across domains. [(Ratings: Very high-5, high-4, avg.-3, low-2, very low-1)]

In view of the variety-wise distribution across domains, the challenges with variety are enormous. This abundance of highly subjective sub-data creates the following challenges:
- A record of data for a single entry can run into hundreds of terabytes, so analyzing patterns across multiple objects of a similar category is practically unthinkable, or very time consuming and costly.
- The opportunities for error increase manifold, as even a few misplaced readings will skew the analysis, leading to a loss of inferential value.
- The vastness and vagueness of the data create a loss of reliability and greater obscurity.
- On the web, 58% of the available documents are XML, among which only one third of the XML documents with an accompanying XSD/DTD are valid. 14% of the documents lack well-formedness, a simple error of mismatched and missing tags that renders the entire XML technology useless over these documents [8]; a quick screening sketch follows this list.
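As a concrete illustration of the well-formedness problem cited above, the short sketch below (the sample documents are illustrative) screens documents with Python's standard xml.etree parser. Note that well-formedness is a weaker check than validity against an XSD/DTD, which requires a separate schema-aware validator.

```python
# Minimal sketch: screening documents for XML well-formedness.
# Mismatched or missing tags make a document unusable for XML tooling [8].
import xml.etree.ElementTree as ET

def is_well_formed(doc: str) -> bool:
    """Return True if the document parses as XML at all."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False

docs = [
    "<order><id>42</id></order>",   # well-formed
    "<order><id>42</order>",        # mismatched tags: rejected
]
for doc in docs:
    print(is_well_formed(doc), doc)
```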
V. VELOCITY ATTRIBUTE

Velocity is defined as "the speed at which the data is generated and processed to meet the demands and challenges that lie in the path of growth and development." [16] Velocity relates not only to speed but also to the sources, it being a vector quantity. For a scalar quantity, a collective quantitative analysis or understanding would suffice; velocity being a vector quantity, however, both the collective quantity and the sources or domains from which the data is generated need analysis. This paves the way for opportunity and cost-benefit analyses. In domains with lower velocities, if only initiatives of high value and high importance are considered, the expected or desired results may not be achieved. When the velocity attribute is considered, the velocity at the point of origin of the data shall be considered along with the path it takes from source to intended destination and the time consumed. Bandwidth and technical hardware and software capabilities and availability are required here so that the data maintains the feature of timeliness, i.e. being available at the right time for the right purpose. A delay in this will cause a ripple effect, and downstream systems will be exponentially impacted.

With the huge volume of generated data, the fast velocity of arriving data, and the large variety of heterogeneous data, the quality of data is far from perfect. [8]

The velocity of data can be roughly classified into the four types below:

Real Time Data: All data which is created or generated in the present moment and consumed by downstream or other systems is classified as real time data. Examples include a telephone conversation between two or more people, chat between two people, video conferencing and video calls, live telecasts, etc. When a shopper hits the ATM, the bank balance and transactional data have to be processed instantly, or so close to instantly that the customer does not even notice the delay. This data is very critical, as the chances of loss are very high. [10]

Near Real Time Data: Near real time data, as the name suggests, conforms to the real time definition; however, a delay is introduced between the transmitter and the receiver. One can consider the noise in the system which leads to delays in transmission, or transmission losses. Examples are the time delay in receiving an OTP [one time pin] on one's phone for banking and e-commerce transactions to be completed, and time delays in transmission during live broadcasts. Other examples are systems where multiple live feeds enter the same system and are processed together to provide a single output, or a specific feed output which forms the input for other downstream systems. This is also known as CEP or Complex Event Processing. It is a part of Operational Intelligence and is a very powerful tool for sales and strategic planning in large organizations where sensitivity and specificity are of utmost importance in devising smart {compare S.M.A.R.T.: Simple, Measurable, Achievable, Realistic and Time-bound} growth plans.

Batch Processing: Batch processing can be defined as the processing of archived or historical data to arrive at patterns. Such jobs are not very time-sensitive, and the constraint of time is not of paramount concern; batch processes run for hours and days on end. Examples range from activities as simple as payroll processing, which runs once monthly, or batch processes which run as scheduled tasks or activities on a server, to much more complex activities like analyzing vast non-parametric data and performing confirmatory data analysis on it using Tests of Hypothesis or Designs of Experiments.

Virtual Time Data: Smart systems and technology have made it possible to invoke communication between interested parties in virtual time. This concept can be expressed as communication between parties when the other party may or may not be available in real time: when the recipient becomes available, the message is delivered. Virtual time communication has made it possible for parties to communicate at a time of their choosing; the message is transmitted to the recipient's server space, where it waits for the recipient to retrieve it. This has led to an explosion in the data that is available over the internet. Social media platforms, email, mobile social network software, voice mails, uploaded video messages, etc. are all examples in this category.
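To make the contrast between these velocity types concrete, here is a minimal sketch (synthetic events and illustrative function names, not from the paper) in which the same event stream is consumed both in real time, event by event, and as a batch over the accumulated archive.

```python
# Minimal sketch: real-time consumption of each event on arrival vs.
# batch processing of the accumulated archive on a schedule.
from collections import deque

events = [{"user": i % 3, "amount": 10 * i} for i in range(6)]  # synthetic

def process_real_time(event):
    # Real time: act immediately, as with an ATM balance check.
    print(f"instant: user {event['user']} spent {event['amount']}")

archive = deque()

def process_batch(archived_events):
    # Batch: aggregate over history, as with monthly payroll runs.
    total = sum(e["amount"] for e in archived_events)
    print(f"batch total over {len(archived_events)} events: {total}")

for event in events:
    process_real_time(event)   # consumed instantly by downstream logic
    archive.append(event)      # also retained for later batch analysis

process_batch(archive)         # runs once, with no tight latency bound
```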


The chart below provides the velocity distribution of Big Data across various domains.

Figure-6: Velocity wise distribution of Big Data across domains. [(Ratings: Very high-5, high-4, avg.-3, low-2, very low-1)]

Given the above study, the major challenges with velocity are listed below:
- The large volume of generated data and the rate of arrival/transmission of data lead to losses attributable to transmission losses as well as noise, which is inherent in the system and a natural cause. [8]
- The privacy of data, and the hacking of data travelling at such velocities, is very critical and poses a great challenge. To maintain data integrity, redundancy checks or firewall/proxy server checks slow down the traffic of data, directly impacting processing times. Also, as the volumes are high, if velocity is impacted then the systems will in all possibility choke and crash.
- Infrastructure to maintain data velocities while ensuring minimal losses is extremely costly and not easily available. While cloud-based infrastructure is a major step forward in this direction, there is still a long way to go amidst privacy and data protection concerns.
VI. VERACITY ATTRIBUTE

Veracity is defined as "The quality of captured data can vary greatly, affecting accurate analysis." This attribute of the 5 V's paradigm is by far the most important from the perspective of data quality. A recent study has shown that an average billion-dollar company in the US is losing about $130 million per year. [11]

Veracity can be further understood as the accuracy of the data. Due to the human element, as well as the element of uncertainty that is an inherent property of any system, veracity-related issues are possible, and these are the most difficult to identify and/or repair. Veracity is most important in making the data operational. Bias and uncertainty render the data less accurate than one expects it to be for analysis. It is an established fact that no amount of analysis can replace the need for good and precise data. Reliability, trustworthiness, accuracy, quality and precision are synonyms of veracity.

For example, an Englishman would call for a taxi whereas an American would call for a cab. Algorithms trained to check only one form may report the data inaccurately. Thus veracity acts as the uncertain factor with the maximum impact. It can also be considered the factor which differentiates human intelligence from artificial intelligence.

The chart below provides the veracity distribution of Big Data across various domains.

Figure-7: Veracity wise distribution of Big Data across domains. [(Ratings: Very high-5, high-4, avg.-3, low-2, very low-1)]

Given the above study, the major challenges with veracity are listed below:
- The large volume of generated data and the rate of arrival/transmission of data lead to dilution of the original message, attributable to noise in the system or other inherent causes. [8]
- There is no replacement for human intelligence, and thus 100% accuracy can never be achieved by any form of artificial intelligence.
- Since data comes in the form of images, videos and other file types laden with large numbers of data points, the storage, retrieval and processing of the data are very time consuming and error prone, not to mention the costs involved.
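The taxi/cab example above can be made concrete. Below is a minimal sketch (the synonym table is illustrative and deliberately tiny) that normalizes surface forms before counting; this is one small repair for this kind of veracity loss.

```python
# Minimal sketch: the same real-world entity appears under different surface
# forms ("taxi" vs. "cab"), so an algorithm checking only one form under-counts.
from collections import Counter

SYNONYMS = {"cab": "taxi", "taxicab": "taxi"}  # illustrative, not exhaustive

def normalize(term: str) -> str:
    term = term.lower().strip()
    return SYNONYMS.get(term, term)

requests = ["taxi", "Cab", "taxicab", "bus", "taxi"]

naive = Counter(requests)                       # treats 'Cab' and 'taxi' as unrelated
repaired = Counter(normalize(r) for r in requests)

print(naive["taxi"])     # 2 -> misses the US phrasing entirely
print(repaired["taxi"])  # 4 -> closer to the true demand
```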


VII. VALUE ATTRIBUTE

Understanding the attributes of the data and data analytics leads to asking the million dollar question: "Are we building the right system?" The Value attribute is responsible for answering this question. Logical reasoning says that the destination must be known before embarking on the journey, rather than deciding along the way or moving ahead with a trial and error methodology. In principle, one must clearly establish the Critical Expected / Desired Outcome before embarking on the journey of data analytics. Many models, frameworks and methodologies speak about this in different tones. The ITIL framework defines the Continual Service Improvement (CSI) stage, where quantitative process improvement must follow the 7 steps below: [17]
- REQUIREMENT: What is required to be done? This also speaks to the requirements or business objectives which have been set at the top management level as a direction for the business, or the expected outcomes after the further activities are performed.
- RAW DATA REQUIRED: What data is required to conduct the analysis? This is the critical aspect at level two, and not at level one as one may be inclined to believe. Working from the data to the outcome may end up reaching an undesirable destination. Also, the data dump is vast while our objectives are generally crisp, requiring only a part of the data.
- RAW DATA AVAILABLE: What data is available with us?
- DATA MISSING: What data is missing?
- COLLECTING MISSING DATA: How can one get the missing data?
- QUALITY OF FINAL DATA: Is the data clean and good?
- TOOLS AND TECHNIQUES: Are the methods, tools and techniques for analysis known?

The chart below provides the value distribution of Big Data across various domains.

Figure-8: Value wise distribution of Big Data across domains. [(Ratings: Very high-5, high-4, avg.-3, low-2, very low-1)]

Given the above study, the major challenges with value are listed below:
- Organizations may be biased with their ideas or plans and may miss vital messages hidden in the data, as they would only be looking for something pertinent.
- Due to the large amounts of data, analysis may take either more time or more money, and organizations may be forced to compromise on the level of analysis. This may skew the results of the analysis. Action plans arising out of such half-cooked data may cause more harm than good.

VIII. 5V'S CORRELATION IN SOME BUSINESS DOMAINS

The earlier sections talk about the V attributes of data and how the domains rank according to the attributes. It is noteworthy to also write about the reverse correlation that exists between the V's within a domain. This can best be understood as a quantitative correlation of the percentage contribution of a V parameter within a domain towards the quality of its data attributes. It is expressed as the percentage importance of a V attribute in the domain, using the same data.

Figure-9: 5V distribution within domain

Below we provide the individual domain descriptions.

Wholesale Trade Domain:
Eurostat defines wholesale trade as "a form of trade in which goods are purchased and stored in large quantities and sold, in batches of a designated quantity, to resellers, professional users or groups, but not to final consumers." [12]
The correlation between the V parameters in this domain is shown below:

0.2 x VOLUME + 0.2 x VELOCITY + 0.1 x VARIETY + 0.4 x VERACITY + 0.1 x VALUE

Figure-10: Correlation between the 5 V's in Wholesale Trade Domain

Education Domain:
The education domain can be defined as that business segment where learning is imparted at various levels, ranging from schools to colleges to universities to professional / corporate Learning and Development programs.
The correlation between the V parameters in this domain is shown below:

0.1 x VOLUME + 0.1 x VELOCITY + 0.2 x VARIETY + 0.4 x VERACITY + 0.2 x VALUE
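Before turning to the remaining domains, here is a minimal sketch of how such weighted expressions can be evaluated. The weights for the two domains above are taken from the text; the attribute ratings on the paper's 1-5 scale are purely illustrative.

```python
# Minimal sketch: a domain's 5V emphasis as a weighted sum of per-attribute
# ratings (1-5 scale from the figures), using the weights given in the text.
WEIGHTS = {
    "wholesale_trade": {"volume": 0.2, "velocity": 0.2, "variety": 0.1,
                        "veracity": 0.4, "value": 0.1},
    "education":       {"volume": 0.1, "velocity": 0.1, "variety": 0.2,
                        "veracity": 0.4, "value": 0.2},
}

def weighted_5v_score(domain: str, ratings: dict) -> float:
    """Weighted sum of 1-5 attribute ratings using the domain's weights."""
    weights = WEIGHTS[domain]
    assert abs(sum(weights.values()) - 1.0) < 1e-9  # weights must sum to 1
    return sum(weights[v] * ratings[v] for v in weights)

ratings = {"volume": 4, "velocity": 3, "variety": 2, "veracity": 5, "value": 3}
print(weighted_5v_score("wholesale_trade", ratings))  # 0.8+0.6+0.2+2.0+0.3 ≈ 3.9
```

Because the weights sum to 1, the score stays on the same 1-5 scale as the ratings, making domains directly comparable.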


Figure-11: Correlation between the 5 V's in Education Domain

Utilities Domain:
The Energy & Utilities Domain Working Group defined the utilities domain as "individuals and organizations engaged in the geospatial aspects of the planning, delivery, operations, reliability and ongoing management of electric, gas, oil and water services throughout the world." [13]
The correlation between the V parameters in this domain is shown below.

Figure-12: Correlation between the 5 V's in Utilities Domain

Transportation Domain:
The transportation domain can be defined as the business group which is responsible for all travel and transport segments, for both human and cargo movement. It includes air, sea and surface transport. The correlation between the V parameters in this domain is shown below.

Figure-13: Correlation between the 5 V's in Transportation Domain

Manufacturing Domain:
The correlation between the V parameters in this domain is shown below.

Figure-14: Correlation between the 5 V's in Manufacturing Domain

Banking and Securities Domain:
The Banking and Securities domain, as the name suggests, deals with financial institutions. The correlation between the V parameters in this domain is shown below.


Figure-15: Correlation between the 5 V's in Banking and Securities Domain

Communication and Media Domain:
This domain deals with all types of businesses involving media and communication, like print, TV and movies, in addition to the telecommunication sector for mobile and wireless communication. The correlation between the V parameters in this domain is shown below.

Figure-16: Correlation between the 5 V's in Communication and Media Domain

Retail Domain:
Technofunc defines this domain as "Retail is the sale of goods and services from individuals or businesses to the end-user. The retail industry provides consumers with goods and services for their everyday needs. Retailers are part of an integrated system called the supply-chain." [14]
The correlation between the V parameters in this domain is shown below.

Figure-17: Correlation between the 5 V's in Retail Domain

Insurance Domain:
Technofunc defines this domain as "Insurance is a contract between two parties, the insurer or the insurance company and the insured or the person seeking insurance, whereby the insurer agrees to hedge the risk of the insured against some specified future events or losses, in return for a regular payment from the insured as premium." [15]
The correlation between the V parameters in this domain is shown below.

Figure-18: Correlation between the 5 V's in Insurance Domain

Government Domain:
With the growth of technology and its availability to the common masses, governments across the world have started migrating to e-Governance models involving remote access and management of government and public interaction.
The correlation between the V parameters in this domain is shown below.


Figure-19: Correlation between the 5 V's in Government Domain

Healthcare Domain:
Wikipedia defines this domain as "Health care or healthcare is the maintenance or improvement of health via the diagnosis, treatment, and prevention of disease, illness, injury, and other physical and mental impairments in human beings. Healthcare systems are organizations established to meet the health needs of target populations." [16]
The correlation between the V parameters in this domain is shown below.

Figure-20: Correlation between the 5 V's in Healthcare Domain

IX. CONCLUSION

The above study reflects and identifies very clear inferences. This survey concludes that the data generated in different domains is also distinct in terms of its 5V attributes. That being understood, the challenges within one domain must be analysed and treated as unique to the domain, rather than with a "one-size-fits-all" approach. The comparative study between the domains shows us that while roughly a 20% weightage does exist between the 5V's across domains, there is still a subtle but realistic variation. This is a very important finding, as it truly shows that the domain has a direct and very powerful influence on the 5V's and their correlation. We would love to believe that domain independence exists when it comes to the importance of the 5V's, but this paper finds otherwise. This will pave the way for further research and studies that lay importance on the 5V's when designing the more sophisticated software of the future in the field of big data analytics. Concluding, we can safely say beyond doubt that Big Data clearly deals with issues beyond volume, variety, velocity, veracity and value: their interdependence is clearly important, but the domain is the key differentiator when it comes to the actual correlation. This is a critical study and one which requires attention from every researcher, academician and person who has an interest in studying systems operating in the space of Big Data and their impacts in various domains.


REFERENCES

1. Bohannon, P., Fan, W., Geerts, F., Jia, X., Kementsietsidis, A., Conditional Functional Dependencies for Data Cleaning, University of Edinburgh research publications.
2. Bresina, J. L., Morris, P. H. (2006). Explanations and Recommendations for Temporal Inconsistencies, IWPSS. https://www.stsci.edu/largefiles/iwpss/20066061912IWPSS_draft4.pdf
3. Brisaboa, Nieves, Luaces, Miguel, Rodriguez, Andrea, Seco, Diego (2014). An inconsistency measure of spatial data sets with respect to topological constraints. International Journal of Geographical Information Science, 28, 56-82. DOI: 10.1080/13658816.2013.811243.
4. S. Vijayarani and S. Sharmila, Research in Big Data – An Overview, Informatics Engineering, an International Journal (IEIJ), Vol. 4, No. 3, September 2016.
5. Du Zhang, 'Inconsistencies in Big Data', Proceedings of the 12th IEEE International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC), 2013, pp. 61-67.
6. Garboden, Philip (2020). Sources and Types of Big Data for Macroeconomic Forecasting. DOI: 10.1007/978-3-030-31150-6_1.
7. Hartzband, David (2019). "What Is Data?" DOI: 10.4324/9780429061219-2.
8. Jeffrey Ray, Olayinka Johnny, Marcello Trovati, Stelios Sotiriadis and Nik Bessis, The Rise of Big Data Science: A Survey of Techniques, Methods and Approaches in the Field of Natural Language Processing and Network Theory, Big Data Cogn. Comput., 2018, 2, 22. DOI: 10.3390/bdcc2030022.
9. Khan, Samiya, Liu, Xiufeng, Shakil, Kashish, Alam, Mansaf (2017). A survey on scholarly data: From big data perspective. Information Processing & Management, 53, 923-944. DOI: 10.1016/j.ipm.2017.03.006.
10. Krogh, Jesper (2020). Data Types. DOI: 10.1007/978-1-4842-5584-1_13.
11. Kumar, Praveen (2019). Big Data Analytics in HR Domain. DOI: 10.1729/Journal.22887.
12. M. V. Martinez, A. Pugliese, G. I. Simari, V. S. Subrahmanian, and H. Prade, How dirty is your relational database? An axiomatic approach, in Proc. 9th European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty, Hammamet, Tunisia, LNAI 4724, 2007, pp. 103-114.
13. M.-C. de Marneffe, A. N. Rafferty and C. D. Manning, Finding Contradictions in Text, Proc. of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2008, pp. 1039-1047.
14. Nawsher Khan, Ibrar Yaqoob, Ibrahim Abaker Targio Hashem, Zakira Inayat, Waleed Kamaleldin Mahmoud Ali, Muhammad Alam, Muhammad Shiraz and Abdullah Gani, Big Data: Survey, Technologies, Opportunities, and Challenges, The Scientific World Journal, Volume 2014, Article ID 712826. https://doi.org/10.1155/2014/712826
15. Özsu, M. T. & Valduriez, Patrick (2020). Big Data Processing. DOI: 10.1007/978-3-030-26253-2_10.
16. Ptiček, Marina & Vrdoljak, Boris (2018). Semantic web technologies and big data warehousing, pp. 1214-1219. DOI: 10.23919/MIPRO.2018.8400220.
17. A. Ritter, D. Downey, S. Soderland and O. Etzioni, It's a Contradiction - No, It's Not: A Case Study Using Functional Relations, Proc. of the Conference on Empirical Methods in Natural Language Processing, 2008.
18. Samiddha Mukherjee, Ravi Shaw, Big Data – Concepts, Applications, Challenges and Future Scope, International Journal of Advanced Research in Computer and Communication Engineering, Vol. 5, Issue 2, February 2016, ISSN (Online) 2278-1021, ISSN (Print) 2319-5940.
19. Sergio Luján-Mora, Manuel Palomar, 'Reducing Inconsistency in Integrating Data from Different Sources', Proceedings 2001 International Database Engineering and Applications Symposium (IDEAS 2001), pp. 209-218, IEEE Computer Society, Grenoble (France), July 16-18, 2001. https://doi.org/10.1109/IDEAS.2001.938087
20. Smirnov, Alexander, Levashova, Tatiana, Shilov, Nikolay (2012). Ontology Alignment for IT Integration in Business Domains, pp. 153-164. DOI: 10.1007/978-3-642-34228-8_15.
21. Yaqoob, Ibrar, Hashem, Ibrahim, Gani, Abdullah, Mokhtar, Salimah, Ahmed, Ejaz, Anuar, Nor, Vasilakos, Athanasios (2016). Big Data: From Beginning to Future. International Journal of Information Management, 36. DOI: 10.1016/j.ijinfomgt.2016.07.009.
22. D. Zhang, On Temporal Properties of Knowledge Base Inconsistency, Springer Transactions on Computational Science V, LNCS 5540, 2009, pp. 20-37.
23. https://pediaa.com/what-is-the-difference-between-data-redundancy-and-data-inconsistency/
24. https://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingested-every-day/
25. https://www.bizdata.com.au/blogpost.php?p=costs-of-data-redundancy-and-data-inconsistency
26. https://www.brainkart.com/article/Relationship-Types,-Relationship-Sets,-Roles,-and-Structural-Constraints_11431/
27. https://www.dsayce.com/social-media/tweets-day/
28. https://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm
29. https://www.happiestminds.com/Insights/big-data-analytics/
30. https://www.heshmore.com/how-much-data-does-google-handle/
31. https://www.kdnuggets.com/2012/12/idc-digital-universe-2020.html
32. https://www.techopedia.com/definition/19504/functional-dependency
33. https://www.washingtonpost.com/national/health-science/google-says-one-hour-of-video-is-now-being-uploaded-to-youtube-every-second/2012/01/27/gIQAtubBdQ_story.html

AUTHORS' PROFILE

Vinaya Keskar is a Research Student in the Department of Computer Science, Savitribai Phule Pune University (SPPU). She received her Master's degree from North Maharashtra University, Jalgaon, MS, India. Her research interests are Algorithms, System Programming, Theory of Computation, Big Data and Teaching, in addition to being an ardent and experienced academician.

Dr. Jyoti Y. Yadav is an Assistant Professor at the Department of Computer Science, Savitribai Phule Pune University. She is also a research guide assisting research scholars in fulfilling their academic aspirations. Her research interests are Cloud Computing, Cloud Based Systems, Google Cloud Solutions, Computer Programming (predominantly in JAVA, SQL and C++) and Computer Algorithms.

Dr. Ajay H. Kumar is a Research Guide in the Department of Computer Science, Savitribai Phule Pune University, and Director of Jaywant Technical Campus. His research interests are Computer Networks, Data Warehousing, Cloud Computing and the performance of communication systems.

