FDS - Unit 1 Question Bank
UNIT I
PART A
Two operations are commonly used to combine data from different tables:
i. Joining- enriching an observation from one table with information from another table.
ii. Appending or stacking- adding the observations of one table to those of another table.
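These two operations can be sketched with pandas (an illustrative example; the tables and column names are invented):

```python
import pandas as pd

# Two small tables sharing a key column.
customers = pd.DataFrame({"customer_id": [1, 2, 3],
                          "name": ["Asha", "Ben", "Chitra"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3],
                       "amount": [250, 120, 90]})

# Joining: enrich each order with customer information from the other table.
joined = pd.merge(orders, customers, on="customer_id", how="left")

# Appending/stacking: add the observations of one table to those of another.
more_orders = pd.DataFrame({"customer_id": [2], "amount": [75]})
stacked = pd.concat([orders, more_orders], ignore_index=True)

print(joined)
print(stacked)
```

Joining adds columns (here, the customer's name to each order), while appending adds rows (one more order to the table).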
18. Define the term setting the research goal (in data science processing).
Understanding the business or activity that our data science project is part of is key to ensuring its success and the
first phase of any sound data analytics project. Defining the what, the why, and the how of our project in a project
charter is the foremost task. Every data science project should aim to fulfill a precise and measurable goal that is
clearly connected to the purposes, workflows, and decision-making processes of the business.
Multiple Job Options :- Because data science is in demand, it has given rise to a large number of career opportunities in its various fields, such as Data Scientist, Data Analyst, Research Analyst, Business Analyst, Analytics Manager, and Big Data Engineer.
Business benefits :- Data Science helps organizations know how and when their products sell best, so products are always delivered to the right place at the right time. This enables faster and better decisions, improving efficiency and earning higher profits.
Highly Paid jobs & career opportunities :- Data Scientist continues to be dubbed the sexiest job, and the salaries for this position are correspondingly high. According to a Dice Salary Survey, the average annual salary of a Data Scientist is $106,000.
Hiring benefits :- Data Science has made it comparatively easier to sort data and look for the best candidates for an organization. Big Data and data mining have made the processing and screening of CVs, aptitude tests, and games easier for recruitment teams.
Disadvantages of Data Science :-
Everything that comes with a number of benefits also has some drawbacks. Some of the disadvantages of Data Science are:
Data Privacy :- Data is the core component that can increase the productivity and revenue of an industry by enabling game-changing business decisions. However, the information or insights obtained from the data can be misused against an organization, a group of people, or a committee. Information extracted from structured as well as unstructured data can likewise be misused.
Cost :- The tools used for data science and analytics can cost an organization a great deal, as some of them are complex and require people to undergo training in order to use them. It is also very difficult to select the right tools for the circumstances, because their selection depends on proper knowledge of the tools as well as their accuracy in analyzing data and extracting information.
2. Briefly explain the architecture of Data Mining
Data mining is a significant method where previously unknown and potentially useful information is
extracted from the vast amount of data. The data mining process involves several components, and these components
constitute a data mining system architecture.
Data Source:
The actual sources of data are databases, data warehouses, the World Wide Web (WWW), text files, and other documents. You need a large amount of historical data for data mining to be successful. Organizations typically store data in databases or data warehouses. Data warehouses may comprise one or more databases, text files, spreadsheets, or other repositories of data. Another primary source of data is the World Wide Web, or the internet.
Different processes:
Before being passed to the database or data warehouse server, the data must be cleaned, integrated, and selected. Because the information comes from various sources and in different formats, it can't be used directly for the data mining procedure: the data may be incomplete or inaccurate. So the data first needs to be cleaned and unified. More information than needed will be collected from various data sources, and only the data of interest has to be selected and passed to the server. These procedures are not as simple as they may seem; several methods may be performed on the data as part of selection, integration, and cleaning.
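A minimal sketch of cleaning, integration, and selection, assuming two invented sources and a made-up "revenue above 1000" selection rule:

```python
# Records arriving from two hypothetical sources in different formats.
source_a = [{"id": "1", "revenue": "1200"}, {"id": "2", "revenue": ""}]
source_b = [{"id": 3, "revenue": 950.0}, {"id": 1, "revenue": 1200.0}]

def clean(record):
    """Cleaning: unify types and reject incomplete or malformed rows."""
    try:
        return {"id": int(record["id"]), "revenue": float(record["revenue"])}
    except (ValueError, TypeError):
        return None  # e.g. an empty revenue field fails conversion

# Integration: merge both sources into one unified list, dropping
# records that fail cleaning.
integrated = [r for r in (clean(x) for x in source_a + source_b) if r]

# Selection: keep only the data of interest before passing it on.
selected = [r for r in integrated if r["revenue"] > 1000]
print(selected)
```

The point is simply that heterogeneous inputs must be forced into one consistent shape before any mining can happen.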
Knowledge Base:
The knowledge base is helpful throughout the data mining process. It may be used to guide the search or to evaluate the interestingness of the resulting patterns. The knowledge base may even contain user views and data from user experience that can be helpful in the data mining process. The data mining engine may receive inputs from the knowledge base to make the results more accurate and reliable. The pattern evaluation module regularly interacts with the knowledge base to get inputs and also to update it.
Operational System
An operational system is a term used in data warehousing to refer to a system that processes the day-to-day transactions of an organization.
Flat Files
A Flat file system is a system of files in which transactional data is stored, and every file in the system must
have a different name.
Meta Data
A set of data that defines and gives information about other data. Metadata is used in a data warehouse for a variety of purposes, including:
• Metadata summarizes necessary information about the data, which can make finding and working with particular instances of data easier. For example, author, date built, date changed, and file size are examples of very basic document metadata.
• Metadata is used to direct a query to the most appropriate data source.
Single-Tier Architecture
Single-tier architecture is rarely used in practice. Its purpose is to minimize the amount of data stored; to reach this goal, it removes data redundancies. The figure shows that the only layer physically available is the source layer. In this method, data warehouses are virtual: the data warehouse is implemented as a multidimensional view of operational data created by specific middleware, or an intermediate processing layer.
The weakness of this architecture lies in its failure to meet the requirement for separation between analytical and transactional processing. Analysis queries are issued against operational data after the middleware interprets them. In this way, queries affect transactional workloads.
Two-Tier Architecture
The requirement for separation plays an essential role in defining the two-tier architecture for a data warehouse
system, as shown in fig:
Although it is typically called two-layer architecture to highlight the separation between physically available sources and data warehouses, it in fact consists of four subsequent data flow stages:
1. Source layer: A data warehouse system uses heterogeneous sources of data. That data is initially stored in corporate relational databases or legacy databases, or it may come from information systems outside the corporate walls.
2. Data Staging: The data stored in the sources should be extracted, cleansed to remove inconsistencies and fill gaps, and integrated to merge heterogeneous sources into one standard schema. The so-called Extraction, Transformation, and Loading (ETL) tools can combine heterogeneous schemata and extract, transform, cleanse, validate, filter, and load source data into a data warehouse.
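The staging stage above can be sketched as a toy ETL flow (the source rows, schema, and cleansing rules are all assumed for illustration):

```python
# Extract: rows from two heterogeneous sources with different shapes.
crm_rows = [("alice", "2024-01-05", "120"), ("bob", "2024-01-06", "")]
web_rows = [{"user": "Carol", "date": "2024-01-07", "spend": 80}]

def transform():
    """Map both sources onto one standard schema, cleansing as we go."""
    unified = []
    for name, date, spend in crm_rows:
        if spend:  # filter: drop rows with gaps that cannot be filled
            unified.append({"user": name.title(), "date": date,
                            "spend": float(spend)})
    for row in web_rows:
        unified.append({"user": row["user"].title(), "date": row["date"],
                        "spend": float(row["spend"])})
    return unified

# Load: append the validated, unified rows into the (toy) warehouse table.
warehouse = []
warehouse.extend(transform())
print(warehouse)
```

A real ETL tool does the same extract / transform / load steps at scale, with scheduling, validation, and error handling around them.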
3. Data Warehouse layer: Information is stored in one logically centralized repository: the data warehouse. The data warehouse can be accessed directly, but it can also be used as a source for creating data marts, which partially replicate data warehouse contents and are designed for specific enterprise departments. Metadata repositories store information on sources, access procedures, data staging, users, data mart schemas, and so on.
4. Analysis: In this layer, integrated data is efficiently and flexibly accessed to issue reports, dynamically analyze information, and simulate hypothetical business scenarios. It should feature aggregate information navigators, complex query optimizers, and customer-friendly GUIs.
Three-Tier Architecture
The three-tier architecture consists of the source layer (containing multiple source systems), the reconciled layer, and the data warehouse layer (containing both data warehouses and data marts). The reconciled layer sits between the source data and the data warehouse.
The main advantage of the reconciled layer is that it creates a standard reference data model for the whole enterprise. At the same time, it separates the problems of source data extraction and integration from those of data warehouse population. In some cases, the reconciled layer is also used directly to better accomplish some operational tasks, such as producing daily reports that cannot be satisfactorily prepared using the corporate applications, or generating data flows that periodically feed external processes so as to benefit from cleaning and integration.
This architecture is especially useful for extensive, enterprise-wide systems. A disadvantage of this structure is the extra storage space used by the redundant reconciled layer. It also moves the analytical tools a little further away from being real-time.
5. Explain briefly the need for Data Science with real time application
Data Science is the deep study of large quantities of data, which involves extracting meaningful insights from raw, structured, and unstructured data. Extracting meaningful insights from large amounts of data requires processing, which can be done using statistical techniques and algorithms, scientific methods, different technologies, and so on. Data Science uses various tools and techniques to extract meaningful information from raw data. It is also known as the Future of Artificial Intelligence.
For example, Jagroop loves to read books, but every time he wants to buy some he is confused about which book to choose, as there are plenty of options in front of him. Data Science techniques are useful here: when he opens Amazon, he gets product recommendations based on his previous data, and when he chooses one of them he also gets a recommendation to buy other books along with it, because that set is often bought together. Product recommendations and showing sets of books purchased together are examples of Data Science.
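The "books bought together" idea can be sketched as a simple co-occurrence count (the order data and book titles are invented; real recommenders are far more sophisticated):

```python
from collections import Counter
from itertools import combinations

# Each order is the set of books bought together in one purchase.
orders = [
    {"Python 101", "Statistics Basics"},
    {"Python 101", "Statistics Basics", "Linear Algebra"},
    {"Python 101", "Linear Algebra"},
    {"Python 101", "Statistics Basics"},
]

# Count how often each pair of books appears in the same order.
pair_counts = Counter()
for order in orders:
    for pair in combinations(sorted(order), 2):
        pair_counts[pair] += 1

def recommend(book):
    """Suggest the book most often bought together with the given one."""
    partners = Counter()
    for (a, b), n in pair_counts.items():
        if a == book:
            partners[b] += n
        elif b == book:
            partners[a] += n
    return partners.most_common(1)[0][0] if partners else None

print(recommend("Python 101"))  # the most frequent co-purchase
```

Counting co-purchases like this is the simplest form of the "frequently bought together" recommendation described above.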
1. In Search Engines
The most visible application of Data Science is in search engines. When we want to search for something on the internet, we mostly use search engines like Google, Yahoo, and Bing, and Data Science is used to make searches faster. For example, when we search for "Data Structure and algorithm courses", the first link returned is GeeksforGeeks Courses. This happens because the GeeksforGeeks website is visited most often for information on Data Structure courses and computer-related subjects. This analysis is done using Data Science, which surfaces the most-visited web links.
2. In Transport
Data Science has also entered the transport field with driverless cars, which help reduce the number of accidents. For example, in driverless cars, training data is fed into the algorithm, and with the help of Data Science techniques the data is analyzed: what the speed limit is on a highway, a busy street, or a narrow road, and how to handle different situations while driving.
3. In Finance
Data Science plays a key role in the financial industry, which constantly faces problems of fraud and risk of losses. Financial firms need to automate risk-of-loss analysis in order to make strategic decisions for the company, and they use Data Science analytics tools to predict the future, for example customer lifetime value and stock market moves. In the stock market, Data Science is used to examine past behavior with past data in order to predict future outcomes; the data is analyzed in such a way that it becomes possible to predict future stock prices over a set timeframe.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to create a better user experience with personalized recommendations. For example, when we search for something on an e-commerce website, we get suggestions similar to our past choices, and also recommendations based on the most bought, most rated, and most searched products. All of this is done with the help of Data Science.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
Detecting Tumor.
Drug discoveries.
Medical Image Analysis.
Virtual Medical Bots.
Genetics and Genomics.
Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we upload a photo with a friend on Facebook, Facebook suggests tagging the people in the picture. This is done with the help of machine learning and Data Science: when an image is recognized, data analysis is performed on one's Facebook friends, and if a face in the picture matches someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeting recommendation is one of the most important applications of Data Science. Whatever the user searches for on the internet, he or she will then see related posts everywhere. This can be explained with an example: suppose I want a mobile phone, so I search for it on Google, and afterwards I decide to buy it offline. Data Science helps the companies that pay for advertisements for that phone: everywhere on the internet, in social media, on websites, and in apps, I will see recommendations for the phone I searched for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing: it becomes easier to predict flight delays. Data Science also helps decide whether to fly directly to the destination or take a halt in between; for example, a flight can take a direct route from Delhi to the U.S.A., or it can halt in between before reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science concepts are used together with machine learning, so that with the help of past data the computer improves its performance. Games like Chess and EA Sports titles use Data Science concepts.
10. Medicine and Drug Development
The process of creating medicine is very difficult and time-consuming and has to be done with full discipline, because someone's life is at stake. Without Data Science, developing a new medicine or drug takes a lot of time, resources, and money; with Data Science it becomes easier, because the success rate can be predicted based on biological data and factors. Data-science-based algorithms can forecast how a drug will react in the human body without lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science. Data Science helps these companies find the best route for shipping their products, the best time for delivery, the best mode of transport to reach the destination, and so on.
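Finding the best route can be illustrated with Dijkstra's shortest-path algorithm on a toy road network (the hub names and travel times are invented):

```python
from heapq import heappush, heappop

# Toy road network: travel times between delivery hubs.
graph = {
    "Depot": {"HubA": 4, "HubB": 2},
    "HubA": {"City": 5},
    "HubB": {"HubA": 1, "City": 8},
    "City": {},
}

def best_route(start, goal):
    """Dijkstra's algorithm: cheapest delivery route by total travel time."""
    queue = [(0, start, [start])]  # (cost so far, node, path taken)
    seen = set()
    while queue:
        cost, node, path = heappop(queue)
        if node == goal:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, w in graph[node].items():
            heappush(queue, (cost + w, nxt, path + [nxt]))
    return None

print(best_route("Depot", "City"))
```

Here the direct-looking Depot-HubA-City route costs 9, while routing through HubB first costs only 8, which is exactly the kind of trade-off route-planning systems evaluate.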
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user types just a few letters or words and is offered a completion of the whole line. In Google Mail, when we are writing a formal mail, the autocomplete feature suggests an efficient way to complete the sentence. Autocomplete is also widely used in search engines, in social media, and in various apps.
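A toy sketch of prefix-based autocomplete (the phrase list and frequencies are invented; real systems rank completions with language models trained on past text):

```python
# Candidate completions with frequencies learned from past text.
phrases = {
    "thank you for your time": 120,
    "thank you for your help": 95,
    "thanks for reaching out": 60,
}

def autocomplete(prefix, k=2):
    """Return the k most frequent phrases starting with the prefix."""
    matches = [(p, n) for p, n in phrases.items() if p.startswith(prefix)]
    matches.sort(key=lambda item: item[1], reverse=True)
    return [p for p, _ in matches[:k]]

print(autocomplete("thank you"))
```

The typed letters narrow the candidate set, and past frequency decides which completions are offered first.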
6. How do you set the research goal, retrieve data, and prepare data in the Data Science process?
The data science process is a systematic approach to solving a data problem. It provides a structured framework for articulating your problem as a question, deciding how to solve it, and then presenting the solution to stakeholders. The typical data science process consists of six steps.
Step 1 : Setting a research goal
The outcome should be a clear research goal, a good understanding of the context, well-defined deliverables, and a
plan of action with a timetable. This information is then best placed in a project charter. The length and formality can,
of course, differ between projects and companies. In this early phase of the project, people skills and business acumen
are more important than great technical prowess, which is why this part will often be guided by more senior personnel.
7. Explain the data exploration, data modeling, and presentation processes in Data Science