HRA Merged


Chapter 1

Introduction to HR
Analytics

"Information is the oil of the 21st century, and analytics is the combustion engine"
- Peter Sondergaard
Senior Vice President, Gartner
Motivation for studying analytics
1. Rising popularity: The global HR analytics market is expected to exhibit a
10.4% growth rate during the forecast period of 2019-2025 (DUBLIN, Jan. 9,
2020 /PRNewswire/).

2. It is an "in-thing" now: Both academics and practitioners agree that analytics is the most important technological priority for top management at present. According to Harvard Business Review, data scientist is emerging as the "sexiest job of the 21st century".

3. Assures a bright career: It was predicted that there would be a shortage of more than 1.5 million managers with adequate training in analytics in the United States alone by 2018.
Answer the following questions
1. How does the bank assess the riskiness of the loan it might make to you?

2. How does Amazon.com know which books and other products to recommend to you when you log in to their website?

3. How do airlines determine what price to quote to you when you are shopping for a plane ticket?

4. How does an HR manager know that a particular employee is intending to leave?
The Beginning
• In 2007, a group of IBM computer scientists initiated a project to develop a new decision technology to help answer these types of questions.
• It was named Watson.
• Their objective was to show how the vast amounts of data now available on the Internet can be used to make more data-driven, smarter decisions.
• Their initial clients included Citibank and the WellPoint health insurance company.

Introduction
• Three developments spurred the recent explosive growth in the use of analytical methods in business applications:
• First development:
• Technological advances, Internet social networks, and data generated from personal electronic devices produce incredible amounts of data for businesses.
• Businesses want to use these data to improve the efficiency and profitability of their operations, better understand their customers, price their products more effectively, and gain a competitive advantage.
• Second development:
• Advances in computational approaches to effectively handle and explore massive amounts of data,
• Faster algorithms for optimization and simulation, and
• More effective approaches for visualizing data.
• Third development:
• Computing power and storage capability.
• Better computing hardware, parallel computing, and cloud computing have enabled businesses to solve big problems faster and more accurately than ever before.
Figure 1 - Google Trends Graph of Searches on
the term Analytics

Decision making
1. Tradition

2. Intuition

3. Rule of thumb

4. Data-driven or fact-based


Researchers at MIT's Sloan School of Management and the University of Pennsylvania concluded that firms guided by data-driven decision making have higher productivity and market value and increased output and profitability (Brynjolfsson, Hitt and Kim, 2013).

A Categorization of Analytical Methods and
Models
• Descriptive analytics: It encompasses the set of techniques that describe what has happened in the past.

Examples: data queries, reports, descriptive statistics, data visualization (data dashboards), data-mining techniques, and basic what-if spreadsheet models.

• Data query: a request for information with certain characteristics from a database.
A Categorization of Analytical Methods and
Models
• Data dashboards - Collections of tables, charts, maps, and
summary statistics that are updated as new data become
available.
• Uses of dashboards
• To help management monitor specific aspects of the company’s
performance related to their decision-making responsibilities.
• For corporate-level managers, daily data dashboards might summarize
sales by region, current inventory levels, and other company-wide
metrics.
• Front-line managers may view dashboards that contain metrics related
to staffing levels, local inventory levels, and short-term sales forecasts.

Data Visualization Techniques and Tools
•Interactive dashboards are easily constructed with Excel 2010 and higher versions

•With Excel 2007, creating dashboards requires the use of certain advanced features

•Some essential knowledge, regardless of Excel version:
- Application of named ranges, the IF function, lookup functions, the Insert Function dialog, and form control functions

• For more visual appeal: speedometer gauge construction
Proprietary Data Visualization Techniques and
Tools
•Tableau

•Qlik

•Spotfire (Tibco)

•Workday

Descriptive Analytics - An HR Dashboard

A Categorization of Analytical Methods and
Models
• Predictive analytics: It consists of techniques that use models
constructed from past data to predict the future or ascertain the
impact of one variable on another.

• Eg: Survey data and past purchase behavior may be used to help predict the market share of a new product.

• Techniques: linear regression, time series analysis, data mining techniques, and simulation

Basic Predictive Analytics Tools
Cross-tabulations, frequency distributions, means and SDs

Correlation

Regression

ANOVA/ t test

Chi square test

Advanced Predictive Analytics Tools
• Classification
- Supervised learning: Decision Tree
- Unsupervised learning: Clustering
- Neural Network
• Prediction
- Logistic Regression
- Exponential smoothing
- Neural Network
- Decision Tree
Predictive Analytics Software
•Excel, XLMiner

•SPSS

•SAS

•R

•Python

A Categorization of Analytical Methods and
Models
• Prescriptive Analytics: It indicates a best course of action to take
• Models used in prescriptive analytics:
Optimization models
• Models that give the best decision subject to constraints of the situation.

Simulation optimization
• Combines the use of probability and statistics to model uncertainty with
optimization techniques to find good decisions in highly complex and highly
uncertain settings.

Decision analysis
• Used to develop an optimal strategy when a decision maker is faced with
several decision alternatives and an uncertain set of future events.
• It also employs utility theory, which assigns values to outcomes based on
the decision maker’s attitude toward risk, loss, and other factors.

Predictive Analytics- A Decision Tree

Prescriptive Analytics -Tools
Job allocation

Resource optimization algorithms

Software requirement: Excel Solver

A Categorization of Analytical Models for
Prescriptive analytics
• Optimization models
• Portfolio models (Finance): Use historical investment return data to determine the mix of investments that yields the highest expected return while controlling or limiting exposure to risk.
• Supply network design models (Operations): Provide the cost-minimizing plant and distribution center locations subject to meeting the customer service requirements.
• Price markdown models (Retailing): Use historical data to yield revenue-maximizing discount levels and the timing of discount offers when goods have not sold as planned.
• Zeffanne's HR Multimatrix model (HR): Incorporates age distribution, mobility rate, productivity ratio, absenteeism, attrition rate, etc.
Figure 2: The Spectrum of Business Analytics

"No one in finance, supply chain, marketing, etc. would ever propose a solution in their area without a plethora of charts, graphs, and data to support it, but HR is known all too frequently to rely instead on trust and relationships. People costs often approach 60 percent of corporate variable costs, so it makes sense to manage such a large cost item analytically." - Google co-founder and CEO Larry Page

Figure 3- Google Trends for Marketing, Financial,
and Human Resource Analytics, 2004–2012

Definition of HR Analytics
• According to Laurie Bassi (McBassi & Company), HR analytics is defined as the application of a methodology and an integrated process for improving the quality of people-related decisions, for the purpose of improving individual and/or organizational performance.

• Three elements of HRA
• Evidence-based analysis
• Strategic in nature
• Impact on organizational outcomes
What HR Analytics is not?
1. HR Analytics is not just about efficiency metrics or scorecards.

2. Analysing the performance gaps between two departments or two branches in an organization is not HRA.

3. Yearly analysis of departmental performance is not HRA if its impact on the business is not understood.

4. Correlation and benchmarking alone are not HRA.

Evolution of HRA
HR Analytics in Practice
• Example for Human Resource (HR) Analytics:

• Sears Holding Corporation (SHC), owners of retailers Kmart and Sears, Roebuck and
Company, has created an HR analytics team inside its corporate HR function. The
team uses descriptive and predictive analytics to support employee hiring and to
track and influence retention.

HR Metrics
“Numbers have an important story to tell. They rely on
you to give them a clear and convincing voice.”
“Without data we have only opinion”

HUMAN RESOURCE ACCOUNTING IN INDIA

• Though Human Resources Accounting was introduced way back in the 1980s, it started gaining popularity in India after it was adopted and popularized by NLC.

• Nevertheless, a growing trend towards the measurement and reporting of human resources, particularly in the public sector, has been noticeable during the past few years.

• CCI, ONGC, Engineers India Ltd., National Thermal Power Corporation, Minerals and Metals Trading Corporation, Madras Refineries, Oil India Ltd., Associated Cement Companies, SPIC, Metallurgical and Engineering Consultants India Limited, Cochin Refineries Ltd., etc. are some of the organizations which have started disclosing valuable information regarding human resources in their financial statements.

• It is needless to mention here that the importance of human resources as productive resources in business organizations was by and large ignored by accountants until two decades ago.

• But there are still companies which do not follow uniform policies in reporting human resource information, as no internationally accepted accounting standard has evolved and no guidelines are available either.
HR Measures

• There are different benchmarks to assess these ratios, and the one by Saratoga is considered the best. It was developed jointly by Saratoga and SHRM (earlier ASPA).

• Saratoga holds a database of people performance metrics (HR Index) of over 10,000 organizations.

• Over 40% of Fortune 500 companies are its clients. Saratoga is now the human capital measurement and benchmarking business of PricewaterhouseCoopers (PwC).

Classification of HR Measures

• By Saratoga
• Organization and operations
• HR staff and structure
• Compensation and benefits
• Staffing and hiring
• Retention and separations

• By the Center for Talent Reporting, as the Talent Development Reporting Principles (TDRP)
• Capability management measures
• Leadership development
• Learning and development
• Performance management
• Talent acquisition measures
• Total rewards measures
Important HR metrics
1. Recruitment metrics
1. Cost per hire
2. Internal cost per hire
3. External cost per hire

Recruiting efficiency

• Interview time
• Hire rate
• Offer acceptance rate
• Quality of Hire

Recruiter effectiveness (RE) = (RT + TTF + CPH + OAR + QH) / N

Where RT = Response time
TTF = Time to fill jobs
CPH = Cost per hire
OAR = Offer acceptance rate
QH = Quality of hires
N = Number of measures combined
Training and development metrics
• Training cost factor= [CC+TR+S+RC+TL+TS+PS+OH]/PT
Where CC= Consultant cost
TR= Training facility rent
S = Supplies/ Stationery
RC = Refreshment cost
TL = Travel and lodging
TS= Trainer’s salary/ fees
PS = Participant’s salary
OH = Overhead of training dept./ agency
PT = No. of people trained
•Training cost per hour = TC/TH, where TC = total training cost and TH = total training hours
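As an illustration, the cost formulas above translate directly into Python; all the cost figures below are hypothetical, since the slide defines only the symbols:

```python
def training_cost_factor(cc, tr, s, rc, tl, ts, ps, oh, pt):
    """[CC+TR+S+RC+TL+TS+PS+OH] / PT: total training cost per person trained."""
    total_cost = cc + tr + s + rc + tl + ts + ps + oh
    return total_cost / pt

def training_cost_per_hour(total_cost, total_hours):
    """TC / TH: total training cost spread over total training hours."""
    return total_cost / total_hours

# Hypothetical figures for a two-day programme with 25 participants
tcf = training_cost_factor(cc=50000, tr=20000, s=5000, rc=10000,
                           tl=30000, ts=40000, ps=75000, oh=20000, pt=25)
print(tcf)                                 # 10000.0 per person trained
print(training_cost_per_hour(250000, 16))  # 15625.0 per training hour
```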
Calculating Training ROI
• % Training ROI = (Net Benefits / Costs) × 100, where Net Benefits = Benefits − Costs. (Benefits / Costs × 100 gives the benefit-cost ratio, a related but distinct measure.)

• Payback Period = Costs / Monthly Benefits


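A minimal Python sketch of these calculations; the net-benefit (Phillips) ROI and the plain benefit-cost ratio are shown separately, since the two are easy to conflate. All figures are invented:

```python
def benefit_cost_ratio_pct(benefits, costs):
    """(Benefits / Costs) x 100: money returned per 100 spent on training."""
    return benefits / costs * 100

def training_roi_pct(benefits, costs):
    """Phillips-style ROI: net benefits over costs, as a percentage."""
    return (benefits - costs) / costs * 100

def payback_period(costs, monthly_benefits):
    """Months needed for cumulative benefits to cover the training cost."""
    return costs / monthly_benefits

# Hypothetical: a programme costing 100,000 that yields 250,000 in benefits
print(benefit_cost_ratio_pct(250000, 100000))  # 250.0
print(training_roi_pct(250000, 100000))        # 150.0
print(payback_period(120000, 10000))           # 12.0 months
```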
An exercise
Employee Attrition

Employee Retention Rate (ERR)

ERR = (Employees at the beginning − Employees leaving) / Employees at the beginning
Table 1: Manpower Details for the year 2018

Month     Opening Balance   New Join   Resigned   Closing Balance
Jan-18    1449              11         9          1451
Feb-18    1451              4          9          1446
Mar-18    1446              2          10         1438
Apr-18    1438              4          3          1439
May-18    1439              2          13         1428
Jun-18    1428              2          11         1419
Jul-18    1419              5          13         1411
Aug-18    1411              0          7          1404
Sep-18    1404              3          16         1391
Oct-18    1391              1          9          1383
Nov-18    1383              3          18         1368
Dec-18    1368              3          7          1364
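The retention-rate formula can be applied directly to Table 1; the Python sketch below also sanity-checks that each row of the table balances:

```python
# Monthly manpower data from Table 1: (month, opening, joined, resigned, closing)
rows = [
    ("Jan-18", 1449, 11, 9, 1451), ("Feb-18", 1451, 4, 9, 1446),
    ("Mar-18", 1446, 2, 10, 1438), ("Apr-18", 1438, 4, 3, 1439),
    ("May-18", 1439, 2, 13, 1428), ("Jun-18", 1428, 2, 11, 1419),
    ("Jul-18", 1419, 5, 13, 1411), ("Aug-18", 1411, 0, 7, 1404),
    ("Sep-18", 1404, 3, 16, 1391), ("Oct-18", 1391, 1, 9, 1383),
    ("Nov-18", 1383, 3, 18, 1368), ("Dec-18", 1368, 3, 7, 1364),
]

# Sanity-check each month: opening + joined - resigned should equal closing
for month, opening, joined, resigned, closing in rows:
    assert opening + joined - resigned == closing, month

employees_at_beginning = rows[0][1]       # 1449 on 1 Jan
total_leavers = sum(r[3] for r in rows)   # 125 resignations over the year
err = (employees_at_beginning - total_leavers) / employees_at_beginning
print(f"Annual employee retention rate: {err:.1%}")  # about 91.4%
```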
Career Progression Metrics

• Career transfer ratio = Transfers / (Promotions + Transfers)

• Career path ratio = Promotions / (Promotions + Transfers)

• For eg: Let us assume that in an organization, 24 employees have received promotions while 59 others have been transferred to some other job location. What, then, is the career path ratio numerically equal to?
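A quick Python check of the worked example above (24 promotions, 59 transfers):

```python
def career_path_ratio(promotions, transfers):
    """Share of internal moves that are promotions."""
    return promotions / (promotions + transfers)

def career_transfer_ratio(promotions, transfers):
    """Share of internal moves that are transfers."""
    return transfers / (promotions + transfers)

ratio = career_path_ratio(24, 59)
print(round(ratio, 2))  # 0.29, i.e. roughly 29% of all moves were promotions
```

Note the two ratios always sum to 1, since every internal move is counted as either a promotion or a transfer.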
Cost and benefit analysis
• Employee Cost Factor = Total Compensation / FTE
• Revenue Factor = Total Organizational Revenue / FTE
• Income Factor = (Total Organizational Revenue − Total Operating Expenses) / FTE

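These per-FTE factors translate directly into code; the revenue, expense and headcount figures below are invented for illustration:

```python
def employee_cost_factor(total_compensation, fte):
    """Compensation cost carried by each full-time equivalent."""
    return total_compensation / fte

def revenue_factor(total_revenue, fte):
    """Revenue generated per full-time equivalent."""
    return total_revenue / fte

def income_factor(total_revenue, total_operating_expenses, fte):
    # Note the parentheses: (revenue - expenses) / FTE
    return (total_revenue - total_operating_expenses) / fte

fte = 200  # hypothetical headcount
print(revenue_factor(50_000_000, fte))             # 250000.0 revenue per FTE
print(income_factor(50_000_000, 30_000_000, fte))  # 100000.0 income per FTE
print(employee_cost_factor(24_000_000, fte))       # 120000.0 cost per FTE
```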
The Indian Context
• No set guidelines for measuring HR's effectiveness

• Some work has been done to identify key employee competencies

HR Compass

• XLRI Jamshedpur, the Confederation of Indian Industry (CII), and the National Human Resource Development Network (NHRD) have developed HR Compass.
Road ahead..

• Need for mainstreaming of HR Analytics

• Need for dedicated professionals to manage the huge data generated by Indian firms and to make predictive decisions

• Currently, mainly descriptive analytics is being practiced

Basics of MS-Excel
Types of data
Data cleaning
• Data validation
• Remove unnecessary spaces
• Convert text to column
• Remove duplicate
• Treat blanks
• Apply and remove formatting
• Spell check
• Identify errors in sheet
Introduction to Microsoft Excel
• Originally released in 1985, Microsoft Excel has become the most-used spreadsheet program in the world.

• Microsoft Excel is a spreadsheet developed by Microsoft for Windows, macOS, Android and iOS.

• It features calculation, graphing tools, pivot tables, and a macro programming language called Visual Basic for Applications (VBA), which also makes Excel a programming platform. Because of its utility, Excel has become a staple in many enterprises.
Functions
• Excel 2016 has 484 functions.
• Of these, 360 existed prior to Excel 2010.
• Microsoft classifies these functions in 14 categories.

Add-ins
• Additional features are available using add-ins. Several are provided with
Excel, including:
• Analysis ToolPak: Provides data analysis tools for statistical and engineering analysis (includes analysis of variance and regression analysis)
• Euro Currency Tools: Conversion and formatting for euro currency
• Solver Add-In: Tools for optimization and equation solving
Few shortcut keys
• Ctrl+End (move to the last used cell in the worksheet)
• Ctrl+Home (move to the first cell)
• Ctrl+Right Arrow (go to the right edge of the active range)
• Ctrl+Left Arrow (go to the left edge of the active range)
• Ctrl+Up Arrow (go to the top edge of the active range)
• Ctrl+S (save)
• Ctrl+C (copy)
• Ctrl+X (cut)
• Ctrl+Z (undo)
Using data types to populate a worksheet
• Three types of data can be entered into Excel: text, numbers and formulas
• Entering dates
• Dates are often used in worksheets to track data over a specified period of time.
• Excel interprets two-digit years from 00 to 29 as the years 2000 to 2029; two-digit years from 30 to 99 are interpreted as 1930 to 1999.
• When you enter a date into a cell in a particular format, the cell is automatically formatted, even if you later delete the entry.
• Regardless of the date format displayed in the cell, the formula bar displays the date in month/day/four-digit-year format, because that is the format required for calculations and analyses.
Enter a date or time in a cell

1. On the worksheet, click a cell.

2. Type a date or time as follows:

• To enter a date, use a slash or a hyphen to separate the parts of the date; for example, type 9/5/2021 or 5-Sep-2021.

• To enter a time based on the 12-hour clock, enter the time followed by a space, and then type a or p after the time; for example, 9:00 p. Otherwise, Excel enters the time as AM.

• To enter the current date, press Ctrl+; (semicolon); to enter the current time, press Ctrl+Shift+; (semicolon).
Filling a series with auto-fill
• Excel provides auto-fill options that automatically fill cells with data.
• The fill handle is a small green square at the lower-right corner of the selected cell or range of cells.
• A range is a group of adjacent cells that you select in order to perform an operation on all of the selected cells at once.
Sorting Data
• As we add more content to a worksheet, organizing this information becomes especially important. We can quickly reorganize a worksheet by sorting the data. For example, we could organize a list of contact information by last name. Content can be sorted alphabetically, numerically, and in many other ways.

• Sort Range: It sorts the data in a range of cells, which can be helpful when working
with a sheet that contains several tables. Sorting a range will not affect other content
on the worksheet.

Custom sorting

• Sometimes we may find that the default sorting options can't sort data in the order we
need. Excel allows us to create a custom list to define our own sorting order.
Filtering Data
• Filters can be used to narrow down the data in the worksheet and hide
parts of it from view.

• Filtering is different from grouping because it allows us to qualify and


display only the data that interests us.

• For example, we could filter a list of survey participants to view only


those who are between the ages of 25 and 34.

• We could also filter an inventory of paint colors to view anything that


contains the word blue, such as bluebell or robin's egg blue.

• Filters are additive, meaning we can use as many as we need to


narrow our results.
Basics of Conditional Formatting of Data
• Conditional formatting takes the layout and design options for the Excel sheet to the
next level. It enables users to make sense of the data and spot important cues

• When a conditional format is applied on a cell it means the formatting of the cell is
based on a condition.

• As the name suggests, you can setup conditions based on which Excel will know
what formatting to apply for different values.

• Let's try a very simple conditional format. Click on a cell and go to the Home Tab > Conditional Formatting. Then select the first option, Highlight Cell Rules > Greater Than.

• In the next dialogue box, put a number, like 100, and press OK.
Now, if you put a number below 100, nothing will happen. But as soon as you put a number that is greater than 100, the cell will turn red. Again, if you change the number to one which is less than or equal to 100, the color disappears.

Conditional formatting thus allows you to visually analyze your data based on a large number of condition types:

• Greater than, Less than, Between

• Above / Below Average

• Top / Bottom 10

• Top / Bottom 10%

• Duplicates / Uniques

• Dates – Dynamic or a fix date range

• Text containing

Apart from these built-in ones, you can create any number of custom conditions
Using Basic Formulas
UNDERSTANDING AND DISPLAYING FORMULAS
Types of referencing
1. A relative cell reference is one that adjusts the cell identifier
automatically if you insert or delete columns or rows, or if you copy the
formula to another cell.
2. An absolute cell reference refers to a specific cell or range of cells
regardless of where the formula is located in the worksheet. Absolute
cell references include two dollar signs in the formula, preceding the
column letter and row number. The absolute cell reference $B$3, for
example, always refers to column (B) and row (3).
3. A mixed cell reference is a cell reference that uses an absolute column or
row reference, but not both.
4. An external reference refers to a cell or range in a worksheet in another
Excel workbook, or to a defined name in another workbook.
Naming a Range
• After selecting a range of cells, you can name the range using three
different methods:
• By typing a name in the Name Box next to the formula bar
• By using the New Name dialog box
• By using the Create Names from Selection dialog box
Rules and guidelines for naming ranges
include the following:
• Range names can be up to 255 characters in length.
• Range names may begin with a letter, the underscore character (_), or
a backslash (\). The rest of the name may include letters, numbers,
periods, and underscore characters, but not a backslash.
• Range names may not consist solely of the letters “C”, “c”, “R”, or “r”,
which are used as shortcuts for selecting columns and rows.
• Range names may not include spaces. Microsoft recommends you use
the underscore character (_) or period (.) to separate words, such as
Fruit_List and Personal.Budget.
• Range names cannot be the same as a cell reference, such as A7 or
$B$3.
Using Functions
Using the COUNT, COUNTA, and
COUNTBLANK Functions
• Use the COUNT function when you want to determine how many
cells in a range contain a number.
• There are other variations of the COUNT function.
• The COUNTA function counts all nonblank entries in a range, whether
they include text or numbers.
• The COUNTBLANK function counts the number of blank cells in a
range.
• Using the MIN Function: The MIN function allows you to determine
the minimum value in a range of cells.
• Using the MAX Function: The MAX function returns the largest value
in a set of values.
• AutoSum makes that task even easier by calculating (by default) the
total from the adjacent cell up to the first nonnumeric cell, using the
SUM function in its formula.
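The counting functions above have simple Python analogues. In the sketch below a list stands in for a cell range, and (as an assumption about what "blank" means) None and the empty string represent blank cells:

```python
cells = [10, "abc", None, 7.5, "", 3]  # a stand-in for a worksheet range

# COUNT: cells containing a number (booleans excluded, as in Excel's COUNT)
count_numbers = sum(isinstance(c, (int, float)) and not isinstance(c, bool)
                    for c in cells)
# COUNTA: all nonblank entries, whether text or numbers
counta = sum(c not in (None, "") for c in cells)
# COUNTBLANK: blank cells
countblank = sum(c in (None, "") for c in cells)

numbers = [c for c in cells if isinstance(c, (int, float)) and not isinstance(c, bool)]
print(count_numbers, counta, countblank)         # 3 4 2
print(min(numbers), max(numbers), sum(numbers))  # MIN, MAX, SUM: 3 10 20.5
```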
Using the IF Logical Function
• The result of a conditional formula is determined by the state of a
specific condition or the answer to a logical question.
• An IF function sets up a conditional statement to test data. An IF
formula returns one value if the condition you specify is true and
another value if it is false.
• The IF function requires the following syntax:
• =IF(Logical_test,Value_if_true,Value_if_false)
Using the SUMIFS, COUNTIFS, and AVERAGEIFS
Functions
• The SUMIFS function adds cells in a range that meet multiple criteria. The syntax
for the SUMIFS function is:
• =SUMIFS(Sum_range,Criteria_range1,Criteria1,Criteria_range2,Criteria2,…)

• The COUNTIFS function counts the number of cells within a range that meet
multiple criteria. You can create up to 127 ranges and criteria. The syntax for the
COUNTIFS function is:
• =COUNTIFS(Criteria_range1,Criteria1,Criteria_range2,Criteria2,…)

• The AVERAGEIFS function returns the average (arithmetic mean) of all cells that
meet multiple criteria. You learn to apply the AVERAGEIFS formula in the
following exercise to find the average of a set of numbers where two criteria are
met. The syntax for the AVERAGEIFS function is:
• =AVERAGEIFS(Average_range,Criteria_range1,Criteria1,Criteria_range2,Criteria2,
…).
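A pure-Python sketch of the multi-criteria logic behind SUMIFS, COUNTIFS and AVERAGEIFS; the HR records below are invented for illustration:

```python
# Hypothetical HR records: (department, grade, monthly_salary)
records = [
    ("Sales", "A", 50000), ("Sales", "B", 42000),
    ("HR",    "A", 48000), ("HR",    "B", 40000),
    ("Sales", "A", 55000),
]

# Rows meeting both criteria: department == "Sales" AND grade == "A"
matches = [sal for dept, grade, sal in records if dept == "Sales" and grade == "A"]

sumifs_result = sum(matches)      # like =SUMIFS(salaries, depts,"Sales", grades,"A")
countifs_result = len(matches)    # like =COUNTIFS(depts,"Sales", grades,"A")
averageifs_result = sumifs_result / countifs_result  # like =AVERAGEIFS(...)

print(sumifs_result, countifs_result, averageifs_result)  # 105000 2 52500.0
```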
Looking Up Data Using the VLOOKUP Function
• When worksheets contain long and sometimes cumbersome lists of
data, you need a way to quickly find specific information within these
lists. This is where Excel’s lookup functions come in handy.
• The “V” in VLOOKUP stands for vertical. This formula is used when
the comparison value is in the first column of a table. Excel goes down
the first column until a match is found and then looks in one of the
columns to the right to find the value in the same row. The VLOOKUP
function syntax is:
• =VLOOKUP(Lookup_value,Table_array,Col_index_num,Range_lookup)
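A rough Python analogue of VLOOKUP in its exact-match mode (Range_lookup = FALSE); the employee table is invented for illustration:

```python
# The first column holds the comparison value, as VLOOKUP requires
employees = [
    ("E101", "Asha",  "HR"),
    ("E102", "Ravi",  "Sales"),
    ("E103", "Meena", "Finance"),
]

def vlookup(lookup_value, table_array, col_index_num):
    """Scan down the first column; on a match, return the value
    col_index_num columns in (1-based, as in Excel)."""
    for row in table_array:
        if row[0] == lookup_value:
            return row[col_index_num - 1]
    return "#N/A"  # what Excel shows when no exact match is found

print(vlookup("E102", employees, 3))  # Sales
print(vlookup("E999", employees, 2))  # #N/A
```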
Looking Up Data Using the HLOOKUP
Function
• The “H” in HLOOKUP stands for horizontal. HLOOKUP searches
horizontally for a value in the top row of a table or an array and then
returns a value in the same column from a row you specify in the
table or array.
• The HLOOKUP function syntax is similar to that of the VLOOKUP
function, with the exception of the Row_index_num argument that
replaces the Col_index_num argument from VLOOKUP. The HLOOKUP
syntax is:
• =HLOOKUP(Lookup_value,Table_array,Row_index_num,Range_lookup)
Using the MATCH and INDEX Functions
• The MATCH and INDEX functions are lookup functions that can help
you locate a specific item or the position of an item in a specified
range.
• Use the MATCH function when you need to know the relative position
of an item in a range you specify, rather than the item itself.
• Use the INDEX function to return a value or the reference to a value in
a specified range.
Application of Tableau in HR Data Visualization
Introduction
• Tableau Software is an interactive data visualization software company founded in January 2003 by Christian Chabot, Pat Hanrahan and Chris Stolte in Mountain View, California.

• Tableau is a powerful and fast-growing data visualization tool used in the Business Intelligence industry. It helps simplify raw data into an easily understandable format. Data analysis is very fast with Tableau, and the visualizations created take the form of dashboards and worksheets.
Tableau Software Versions
• Tableau desktop (Individually licensed)
• Tableau for academicians
• Tableau server
• Tableau Reader
• Tableau public
• Tableau public premium
Benefits of Tableau
• Speed
• User friendly
• Interactive dashboards
• Direct connections
• Easy publication
• Growing market
How to download Tableau
• https://www.tableau.com/
• Click "Try now"
• Provide a valid email id
• Download the 14-day free trial version
• Check the system requirements for Windows, Mac, etc.
• Open the Tableau Desktop folder, check the "I agree with terms and conditions" box, and install the software.
• Finally, register.
Connect to database
• Connect to file
• Connect to server
• Sample files in Tableau: Documents > My Tableau Repository > Datasources
• Live vs Extract
Data types in Tableau
1. # number data type
2. abc- String
3. Calendar- Date type
4. Date and time
5. Globe- geographic data type
6. Boolean values- true/False
How to view database
• Drag and drop
• View selected data
• Hide fields
• Rename the Column
• Split the cell
Basic activities in Tableau
• Sorting
– Quick sort
– Sorting with mark card
– Swapping the fields (ctrl + w)
• Drill down and hierarchies
• Groupings
– Create group using header
– New group will reflect as dimension
Auto generated field (Italics format)
• Number of records
• Measure name
• Measure value
Word maps
Steps to create a word map
• Eg: which state has the highest sales?
1. State → Text
2. Sales → Size
3. Marks dropdown → Text
4. Sales → Color
5. Color → Edit colors → palette
6. Text → dropdown → select → bold
Creating HR Dashboards Using Microsoft
Excel
By: Sharda Singh
Introduction
• Data visualization is a vital aspect of data analytics.
• Data visualization begins with very basic Excel charts and graphs and extends to more sophisticated options such as CG plots, bubble plots, maps and box-plots, which can be devised using proprietary data visualization packages.
• HR visualization software packages:
• Tableau
• Spotfire
• Qlik
• Workday, and the like
Few Key Excel add-ins
• Create name range
• The developer tab:
• Enable the developer tab
1. Click the File tab and then click Options.
2. In the Excel Options dialog box, click Customize Ribbon.
3. In the Main Tabs list on the right, check the Developer box if it is not already checked.
This adds the Developer tab to the Excel ribbon. Click OK.
• Developer tab has three key features:
1. Code: meant for creating macros
2. XML: to import XML maps and XML expansion packs, and to use XML commands to create customizable operations
3. Controls: form control functions
Form controls
• Two important tabs for dashboard development purposes are:
1. LIST BOX
2. COMBO BOX
Employee Ledger
Creating customized graph
Creating Interactive speedometers
• Speedometers add more visual appeal to dashboards.
• This is an optional exercise, but in certain decision making context, it
can be very effective.
• The name speedometer here is symbolic of the speedometer dial of a
motorcycle or a car.
Slicing and Dicing of HR data
• In business analytics, the master database is called Online Analytical
Processing (OLAP) data cube that has multi-dimensional data, which
can be mined through different data mining techniques.
• Few popular data mining techniques are
• Slicing
• Dicing
• Drill down
• Roll up
• Drill through
• Slicing: refers to slicing off a part of the data from a data cube.
• Dicing: also refers to separating a part of the data, but a dice exhibits all the dimensions of the data cube. In other words, the diced portion is a miniature of the data cube.
• Roll up: aggregates data by climbing up a hierarchy, analogous to zooming out with a camera. For eg: "street > city > province > country".
• Drill down: moves from less detailed to more detailed data to gain insights. For eg: "year > quarter > month > day".
• Drill through: the act of selecting a part of a model and accessing the original data from which that part was derived.
Preparing data for slicing and dicing
1. Unfreeze panes
2. Unhide all the rows or columns
3. Be aware about the formula
4. No blank columns
5. No blank rows
6. No merged cells
Using PIVOT tables
• Word “PIVOT” means: Spin around, Spin, Revolve, Rotate, Turn, etc
• A Pivot Table is an interactive Excel report used to summarize, analyze and explore data.
• Pivot Tables are great tools for comparing data and are used for cross-tabulations.
• A Pivot Chart visualizes the summary data of a Pivot Table report, making it easy to see comparisons, patterns and trends.
• Both Pivot Tables and Pivot Charts enable us to make informed decisions about any critical data set.
Few of the important functions in PIVOT tables
1. SORTING PIVOT TABLE
2. FILTER PIVOT TABLE
3. CHANGE SUMMARY CALCULATION
4. FORMAT VALUES
Essentials of predictive analytics
(parametric tests)
Topics to be covered
• Correlation
• Regression
• T test/ ANOVA
Correlation
• The analysis to find the extent to which two or more variables are related to each other
• That means if one variable increases, the other variable(s) increase (+ve correlation) or decrease (−ve correlation)
• If only 2 variables are involved: bivariate correlation
• Pearson's correlation coefficient (r) indicates the strength of the association/relation between two variables when the variables are continuous/parametric
• Value of r ranges as: −1 ≤ r ≤ +1
• There is no IV or DV here
• If r = Zero, this means no association or
correlation between the two variables.

• If 0 < r < 0.25 = weak correlation.

• If 0.25 ≤ r < 0.75 = intermediate correlation.

• If 0.75 ≤ r < 1 = strong correlation.

• If r = 1 = perfect correlation.
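Pearson's r can be computed directly from its definition, as in this Python sketch (the height/diameter figures are invented toy data):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient for two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Toy data: two variables that move together strongly
height = [60, 65, 70, 72, 75]
diameter = [8, 9, 11, 12, 13]
print(round(pearson_r(height, diameter), 2))  # 0.99, a strong +ve correlation
```

On the scale given above, a value this close to +1 falls in the "strong correlation" band.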
MS Excel Data Analysis ToolPak
• An Excel Add-In which includes several
statistical analysis techniques

Installing the Analysis ToolPak
1. Open a blank Excel spreadsheet.
2. Click on the Windows icon (pre-2010) or the File tab (2010+).
3. Choose Excel Options (pre-2010) or just Options (2010+).
4. Choose Add-ins.
5. In Manage (bottom of window), choose Excel Add-ins and click Go.
6. Check the box that says Analysis ToolPak and click OK.
7. After loading the Analysis ToolPak, the Data Analysis command is available under the Data tab. It should be the far-right option.
Analysis ToolPak input interface
• The Input Range refers to the location of the data set.
• Check the option button Columns or Rows to indicate how
your data is grouped.
• If there are labels in the first row of each column of data,
then check the Labels in First Row box.
• The Output Range refers to where the results of the analysis
will be displayed in the current worksheet.
• Check the Summary Statistics box to calculate the most
commonly used statistics.
Excel Correlation Output
Tools / Data Analysis / Correlation…

                 Tree Height   Trunk Diameter
Tree Height        1
Trunk Diameter     0.886231         1

Correlation between Tree Height and Trunk Diameter = 0.886
SPSS – STATISTICAL PACKAGE FOR
SOCIAL SCIENCES
• Versions: 16, 20, 21
• Essential data formats: Excel file, SPSS file, R file, text file
• Used for: multivariate predictive data analysis
Data entry in SPSS
Data transformation
Saving output into text/doc file
Correlation analysis in SPSS
SPSS Analysis
SPSS Output
Interpretation
• In a sample of 66 children with CP, there is no
significant relationship between age of the children
and systolic BP, r = 0.02, p = 0.90.
• Assuming non-normal distribution of either one of
the variables, a non-parametric test was used
(Spearman rank correlation), r = 0.025, p = 0.84.
• In either test, there is no linear relationship between
age at surgery and the SBP of these patients.

*However, the absence of a linear association does not rule out a
non-linear relationship between the age of these patients and
their SBP.
Regression
• The process of finding the impact of independent
variable(s) on dependent variable
• If DV is continuous, linear regression
• If DV is categorical, logistic regression
• Coefficient of determination (R2/R Squared) =
total variance explained/ total effect created by
IV(s) on DV
• Value ranges from 0 (no effect) to 1 (100% effect)
Multiple Regression Equation

The coefficients of the multiple regression model are estimated using sample data.

Multiple regression equation with k independent variables:

Ŷᵢ = b₀ + b₁X₁ᵢ + b₂X₂ᵢ + … + b_k X_kᵢ

where Ŷᵢ is the estimated (or predicted) value of Y, b₀ is the
estimated intercept, and b₁ … b_k are the estimated slope coefficients.
Example:
2 Independent Variables
• A distributor of frozen dessert pies wants to
evaluate factors thought to influence demand
– Dependent variable: Pie sales (units per week)
– Independent variables: Price (in $); Advertising ($100s)
• Data are collected for 15 weeks
Pie Sales Example

Week   Pie Sales   Price ($)   Advertising ($100s)
  1       350        5.50            3.3
  2       460        7.50            3.3
  3       350        8.00            3.0
  4       430        8.00            4.5
  5       350        6.80            3.0
  6       380        7.50            4.0
  7       430        4.50            3.0
  8       470        6.40            3.7
  9       450        7.00            3.5
 10       490        5.00            4.0
 11       340        7.20            3.5
 12       300        7.90            3.2
 13       440        5.90            4.0
 14       450        5.00            3.5
 15       300        7.00            2.7

Multiple regression equation: Sales = b0 + b1 (Price) + b2 (Advertising)
REGRESSION IN EXCEL
• Step 1: Go to Data Analysis
• Select Regression from the drop-down menu
• Select the input range – the cells containing the DV
values in the Input Y Range and the cells containing the
IV values in the Input X Range
• Select the output range – select any cell where the
result will be shown
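Excel's Regression tool fits these coefficients by ordinary least squares. As a minimal sketch of the underlying calculation, the closed-form slope and intercept for a single predictor (Price from the pie-sales data above) can be computed by hand; with two predictors Excel solves the analogous normal equations:

```python
# Minimal OLS sketch with one predictor (Price -> Sales), using the
# pie-sales data above. Excel's Regression tool solves the same
# least-squares problem (with both predictors included).
price = [5.5, 7.5, 8.0, 8.0, 6.8, 7.5, 4.5, 6.4, 7.0, 5.0, 7.2, 7.9, 5.9, 5.0, 7.0]
sales = [350, 460, 350, 430, 350, 380, 430, 470, 450, 490, 340, 300, 440, 450, 300]

n = len(price)
mean_x = sum(price) / n
mean_y = sum(sales) / n

# b1 = covariance(x, y) / variance(x); b0 = ybar - b1 * xbar
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(price, sales)) / \
     sum((x - mean_x) ** 2 for x in price)
b0 = mean_y - b1 * mean_x

# Higher price is associated with lower weekly sales (negative slope)
print(round(b1, 2), round(b0, 2))
```

The two-predictor coefficients reported in the output on the next slide differ somewhat from this one-predictor sketch because Advertising is held constant there.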
Multiple Regression Output

Regression Statistics
Multiple R           0.72213
R Square             0.52148
Adjusted R Square    0.44172
Standard Error      47.46341
Observations        15

Sales = 306.526 − 24.975 (Price) + 74.131 (Advertising)

ANOVA         df       SS           MS           F         Significance F
Regression     2    29460.027    14730.013    6.53861       0.01201
Residual      12    27033.306     2252.776
Total         14    56493.333

              Coefficients   Standard Error    t Stat     P-value   Lower 95%    Upper 95%
Intercept       306.52619      114.25389       2.68285    0.01993    57.58835    555.46404
Price           -24.97509       10.83213      -2.30565    0.03979   -48.57626     -1.37392
Advertising      74.13096       25.96732       2.85478    0.01449    17.55303    130.70888
Regression analysis in SPSS
Linear Regression: SPSS
SPSS REGRESSION OUTPUT
Model Summary Interpretation
• R Squared: the squared multiple correlation coefficient –
the proportion of variance in the DV explained by the IVs;
indicates whether the hypothesized effect is actually observed
• Adjusted R Squared: indicates whether the sample size is
adequate after adjusting for the addition of extraneous
variables. Its value should be as close to R Squared as possible.
Interpretation of the ANOVA Table
• df = Degrees of freedom – how many values are free to vary
• DF1 = total degrees of freedom, due to total variance = N − 1
• E.g.: if the sample size is 200, DF1 = 199
• DF2 = regression degrees of freedom, due to the number of
variables in the study minus the one dependent variable
(i.e. the number of predictors)
• E.g.: if a study has 7 variables (1 DV, 6 IVs), DF2 = 7 − 1 = 6
Things to look out for in parameter
estimates
• Standardized Beta values
• Significance levels (<.05)
• Beta coefficients are measures of individual
impact of each IV on DV
Things to remember for regression
• Only one DV can be used in the study; there may be
as many IVs as you want
• Only a continuous DV can be used; IVs may be
both continuous and categorical
• If the DV is categorical, logistic regression is used
• If multiple DVs need to be analyzed
simultaneously, path-analysis techniques such
as PLS and SEM should be used
Tests for comparing two or more
groups of data
• t test – to compare the means of two groups (e.g.
comparing the mean performance of two teams)
• ANOVA – to compare the means of more than two
groups (e.g. to compare the output of three
different production plants)
• Both are used to see whether the groups are
significantly different from each other in terms of
their mean values/outputs
• ANOVA = multiple t tests
• Then why do it?
Type I Error
• Type I error – the error committed by
researchers by rejecting a null hypothesis
when it is actually true
• In simple terms this means – to think there is
an effect of the IV on the DV when actually there is
none; or to think there is a correlation when
actually the variables are uncorrelated
• In simpler terms – treating a non-event as an event
(anhoni ko honi)

Type II Error
• It is the reverse of Type I error – to accept a null
hypothesis when it is actually false
• That is, failing to detect an effect when there is
one
• Again, to simplify – treating an event as a non-event
(honi ko anhoni)
• Why do these errors occur?
Why do Type I and Type II errors occur?
• These errors occur because conclusions are drawn from
samples, so chance variation can mislead the researcher
• To allow for such mistakes, statisticians fix a Type I
error rate, set on the basis of a confidence level
• The confidence level signifies the extent to which a
researcher is sure that his results are correctly
interpreted
• Generally set at 95%, which means that out of 100
times, the researcher may be wrong in his
interpretation 5 times
• Therefore, the Type I error rate for one test = (1 − .95) = .05
Why ANOVA is better than the t test
• One t test carries a Type I error rate of (1 − .95) = .05
• To compare three groups with t tests, three
separate two-group comparisons have to be
done
• Therefore the total (familywise) error rate =
1 − (.95)³ = 1 − .857 = .143 = 14.3%
• Thus the error rate has increased from 5% to
14.3%
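The inflation of the error rate above can be checked with a two-line calculation — the chance of at least one false positive across m independent tests at α = .05:

```python
# Familywise Type I error rate for m independent tests at alpha = .05:
# probability of at least one false positive = 1 - (1 - alpha)^m.
alpha = 0.05

def familywise_error(m):
    return 1 - (1 - alpha) ** m

print(round(familywise_error(1), 3))  # 0.05  (a single t test)
print(round(familywise_error(3), 3))  # 0.143 (three pairwise t tests)
```

This is why one ANOVA is preferred over three pairwise t tests when comparing three groups.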
Independent t-test
Features:
• One independent variable
• Two groups, or levels, of the independent variable
• Independent samples (between-groups): the two
groups are not related in any way
• Interval/ratio measurement of the dependent variable
Basic concepts:
• The t stat is determined based on the level of
significance and the sample size (degrees of freedom)
• As degrees of freedom increase, the t-distribution
becomes increasingly normal
• Degrees of freedom = no. of values allowed to vary
• Df1 =
• Df2 =
Computing the Independent t-test Using
Excel
• Enter the data for group 1 in column A; enter the data for
group 2 in column B.
• Go to Tools → Data Analysis.
• Highlight t-Test: Two-Sample Assuming Equal Variances. Click
OK.
• Click in the Variable 1 Range window and highlight the data
for group 1. Do the same for group 2.
• Click in the Hypothesized Mean Difference box and enter 0.
• Under output options, click the radio button to the left of
Output Range. Click in the Output Range box and highlight an
area of your spreadsheet to the side or below your data
(about 15 rows by 3 columns).
• Click OK. Adjust the width of the columns so you can see all of
the information.
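The t statistic this Excel tool reports can be sketched by hand; a minimal pooled-variance version on two toy groups:

```python
import statistics

def pooled_t(group1, group2):
    """Two-sample t statistic assuming equal variances, as in Excel's
    't-Test: Two-Sample Assuming Equal Variances'."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    # Pooled variance weights each sample variance by its df
    sp2 = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se = (sp2 * (1 / n1 + 1 / n2)) ** 0.5
    return (m1 - m2) / se

# Toy data: group means 2 vs 3, equal variances
print(round(pooled_t([1, 2, 3], [2, 3, 4]), 3))  # -1.225
```

Excel then compares this statistic against the t distribution with n1 + n2 − 2 degrees of freedom to obtain the p-value.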
Computing the Independent t-test Using
SPSS
• Enter data in the data editor or download the file.
• Click Analyze → Compare Means → Independent-Samples
T Test. The independent-samples t-test dialog
box should open.
• Click on the independent variable, and click the
arrow to move it to the Grouping Variable box.
• Highlight the variable (? ?) in the Grouping Variable
box. Click Define Groups. Enter the value 1 for group
1 and 2 for group 2. Click Continue.
• Click on the dependent variable, and click the arrow
to place it in the Test Variable(s) box.
Interpreting the Output

Group Statistics
              anx_cat    N     Mean    Std. Deviation   Std. Error Mean
Overall GPA   Low       10    3.3800       .45959           .14533
              High      10    2.9900       .58205           .18406

The Group Statistics box provides the mean, standard deviation, and
number of participants for each group (level of the IV).
Interpreting the Output

Independent Samples Test
                              Levene's Test for
                              Equality of Variances          t-test for Equality of Means
                              F        Sig.      t        df       Sig. (2-tailed)   Mean Diff.   Std. Error Diff.
Overall GPA  Equal variances
             assumed         1.167     .294    1.663      18            .114           .39000         .23452
             Equal variances
             not assumed                       1.663    17.081          .115           .39000         .23452

Levene’s test is designed to compare the equality of the error variance of the dependent
variable between groups. We do not want this result to be significant. If it is significant,
we must refer to the bottom t-test value.
Analysis of variance (ANOVA)
• Technique to compare the difference in mean
output/attitude/performance (a continuous
variable) between three or more groups of
employees/product lines/sales teams/students
• When there is only one dependent variable
• Not meant for a categorical dependent variable
• Assumption: group variances should be homogeneous
(indicated by Levene’s test, which should be
insignificant)
Types of ANOVA
• One way – the groups are different in terms of
only one level/factor
• Eg: comparing marks of MBA, BBA, & LAW
students
• Two way - the groups are different in terms of
two levels/ factors
• Eg: comparing marks of MBA(Boys & Girls),
BBA (Boys & Girls), LAW (Boys & Girls)
One-Way Analysis of Variance
(ANOVA)
• Evaluate the difference among the means of three or
more groups
  Examples: accident rates for 1st, 2nd, and 3rd shift;
  expected mileage for five brands of tires
• Assumptions
  – Populations are normally distributed
  – Populations have equal variances
  – Samples are randomly and independently drawn
Conducting ANOVA in Excel
• Step 1: On the Data tab, click Data Analysis.
• Step 2: Select ANOVA: Single Factor and click
OK.
• Step 3: Click in the Input Range box and select
the range of the variables for comparison.
• Step 4: Click in the Output Range box and
select any suitable cell.
Excel ANOVA Output

ANOVA
Source of Variation     SS          df    MS          F           P-value     F crit
Between Groups          5.127333     2    2.563667    10.21575    0.008394    4.737416
Within Groups           1.756667     7    0.250952
Total                   6.884        9

df notes: Between Groups df = 1 less than the number of groups;
Within Groups df = number of data values − number of groups
(equals the df for each group added together);
Total df = 1 less than the number of individuals/data values.
Steps to conduct one way ANOVA in
SPSS
• Step 1: Create a separate variable with separate codes
for the different groups (say 1=undergrad, 2=grad,
3=post-grad, 4=PhD….) and insert data for the
dependent variable
• Step 2: Analyze-> Compare means-> One-way ANOVA
• Step 3: put the grouping variable into the Factor box
• Step 4: Insert the dependent variable in the dependent
list
• Step 5: Click on option and select ‘descriptives’, and
‘homogeneity of variance’ test
Option notes:
• ‘Descriptives’ displays the overall mean, the means for each
level of duration, the means for each level of modality, and
the means for each combination of duration by modality
(= the interaction means).
• ‘Homogeneity of variance’ produces Levene’s test for
homogeneity of variance (one of the assumptions of ANOVA –
i.e. that the variances within each cell of the design are not
significantly different).

Homogeneity Test
A significant Levene’s test result means the assumption of equal
group variances has not been met.
SPSS One-way ANOVA Output

ANOVA (VAR00001)
                 Sum of Squares    df    Mean Square     F      Sig.
Between Groups        .152          2       .076       .147     .865
Within Groups        4.635          9       .515
Total                4.787         11
What Is SAS Enterprise Guide?
SAS Enterprise Guide is an easy-to-use Windows
client application that provides these features:
– access to much of the functionality of SAS
– an intuitive, visual, customizable interface
– transparent access to data
– ready-to-use tasks for analysis and reporting
– easy ways to export data and results to other
applications
– scripting and automation
– a program editor with syntax completion and
built-in function help
Import from an External File
• The Import Data wizard enables you to create
SAS data sets from text, HTML, or PC-based
database files (including Microsoft Excel,
Microsoft Access, and other popular formats).
When you use the Import Data wizard, you
can specify import options for each file that
you import.
Import Data
• File >> Import Data
The Task List
You can use tasks to do
everything from
manipulating data, to
running specific analytical
procedures, to creating
reports.
Correlation in SAS Enterprise Guide
• Tasks >> Multivariate >> Correlations
Correlation Analysis
• Since p-values are less than 0.05, there are
significant (positive) relationships between Q6
(Overall satisfaction on Advisor) and Q1, Q2,
Q3, Q4, Q5.
Linear Regression
• Tasks >> Regression >> Linear Regression
Regression Analysis
• These are the F Value and p-value, respectively,
testing the null hypothesis that the model does not
explain the variance of the response variable.
• R-Square defines the proportion of the total
variance explained by the model.
Regression Analysis
• These are the t Value and p-value, respectively,
testing the null hypothesis that each coefficient is
equal to 0.
One-Sample t-Test
• Tasks >> ANOVA >> t Test
T-Test Output
• Since the p-value is less than 0.05, it can be
concluded that, on average, female students
consider themselves well prepared for the
advising appointment.
• Since the p-value is less than 0.05, it can be
concluded that, on average, male students also
consider themselves well prepared for the
advising appointment.
One-Way ANOVA
• Tasks >> ANOVA >> One-Way ANOVA
One-Way ANOVA results
Since the p-value is greater than 0.05, it can
be concluded that there is no significant
difference in average Advisor
Satisfaction among year(s) of study.
Therefore, there is no need to check the
Post Hoc tests.
Predictive analytics –
Logistic Regression
What is Logistic Regression?
• A supervised learning algorithm to predict the odds of
an event occurring
• It is useful for situations in which you want to be
able to predict the presence or absence of a
characteristic or outcome based on the values of a set
of predictor variables.
• It is similar to a linear regression model but is
suited to models where the dependent variable is
dichotomous.
What is Logistic Regression?
 Logistic regression is often used because the
relationship between the DV (a discrete
variable) and a predictor is non-linear
Example : the probability of occurrence of heart disease
changes very little with a ten-point difference among
people with low-blood pressure, but a ten point change can
mean a drastic change in the probability of heart disease in
people with high blood-pressure.
Discrete/Categorical/Binary
Dependent Variables
In many regression settings, the Y variable is (0,1).
A few examples:
• Consumer chooses brand (1) or not (0)
• A quality defect occurs (1) or not (0)
• A person is hired (1) or not (0)
• Evacuate home during hurricane (1) or not (0)
• A person quits (1) or stays (0)
The linear vs. logistic regression
equations
The Logistic Regression Model

The logistic regression equation:

ln[p/(1−p)] = α + βX + e

• p is the probability that the event Y occurs, p(Y=1)
• p/(1−p) is the "odds ratio"
• ln[p/(1−p)] is the log odds ratio, or "logit"
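The odds and logit transformations in the equation above are easy to compute directly; a minimal sketch with a hypothetical probability:

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Log odds -- the quantity the logistic model is linear in."""
    return math.log(odds(p))

# A hypothetical employee with a 0.8 probability of quitting
# has odds of 4 to 1 of quitting.
print(round(odds(0.8), 2))   # 4.0
print(round(logit(0.8), 3))  # 1.386
```

The model fits α and β so that this logit is a linear function of the predictors; exponentiating a fitted β (Exp(B) in SPSS) recovers the change in odds per unit change in X.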
Similarities and dissimilarities
with Linear Regression
Linear Regression:
• DV is continuous
• The relationship between the DV and a predictor is linear
• Based on least-squares estimation
• Interpretation of beta: keeping all other independent
variables constant, how much the dependent variable is
expected to increase/decrease with a unit increase in the
independent variable.
Logistic Regression:
• DV is categorical/discrete
• The relationship between the DV (a discrete variable)
and a predictor can be non-linear
• Based on maximum likelihood estimation
• Interpretation of odds ratio: the effect of a one-unit
change in X on the predicted odds ratio, with the other
variables in the model held constant.
An Example with SPSS
• Independent variables (covariates)
  – E.g.: Age (continuous)
  – Categorical: Gender (1 = female, 2 = male)
• Dependent variable
  – Dichotomous (1 = Pass, 0 = Fail)
Next
Binomial Logistic Regression:

Starting the Procedure in SPSS

•In the menu, click on


Analyze

•Point to Regression

•Point to Binary Logistic…

…and click.
Selecting Variables
• Choose the variables for analysis from the list in the
variable box.
• Scroll down the list, point to the dependent variable,
and click.
Entering Categorical Variables
• To designate gender as a categorical variable, click
the button labeled Categorical.
• Click in the box labeled Covariates, then click the
arrow to move it to the box labeled Categorical
Covariates.
Binomial Logistic Regression:
Selecting a Method of Entry
The covariates can be entered into the analysis using
six different methods. This tutorial includes the Enter
method.
Binomial Logistic Regression Output
Case Processing Summary
Binomial Logistic Regression Output
Variable Coding
Binomial Logistic Regression Output (Enter Method)
Beginning Block: Classification Table
Binomial Logistic Regression Output (Enter Method)
Beginning Block: Variables in the Equation
Logistic regression output
interpretations
• As the F test is the omnibus test showing whether the
model has some effect, here we have the Chi-Square
test of model coefficients.
• For overall effect – Nagelkerke R Square (similar to
R Square in linear regression)
• For individual effect – Exp(B)/exponentiated beta
(similar to the beta coefficient in linear regression) – this
is the change in odds
• For t tests we have the Wald statistic
Binomial Logistic Regression Output (Enter Method)
Model Summary
Binomial Logistic Regression Output (Enter Method)
Classification Table
Binomial Logistic Regression Output (Enter Method)
Variables in the Equation
Comparison with Neural
Networks
• Both work well with non-linear data
• NN has the advantage of using validation and test
data
• Logistic regression is less prone to the issues
of overfitting and underfitting
• NNs can be more useful for future data prediction
Factor Analysis

© 2007 Prentice Hall

Factor Analysis
• Factor analysis is a class of procedures used for data
reduction and summarization.
• It is an interdependence technique: no distinction
between dependent and independent variables.
• Factor analysis is used:
  – To identify underlying dimensions, or factors, that
    explain the correlations among a set of variables.
  – To identify a new, smaller, set of uncorrelated variables
    to replace the original set of correlated variables.
Correlation Matrix
Q1 Q2 Q3 Q4 Q5 Q6
Q1 1
Q2 .987 1
Q3 .801 .765 1
Q4 -.003 -.088 0 1
Q5 -.051 .044 .213 .968 1
Q6 -.190 -.111 0.102 .789 .864 1
• Q1–3: palpitation, dry mouth, sweating
• Q4–6: worry, apprehension, nervousness
Statistics Associated with Factor Analysis
• Bartlett's test of sphericity. Bartlett's test of
sphericity is used to test the hypothesis that the
variables are uncorrelated in the population.
• Correlation matrix. A correlation matrix is a
lower-triangle matrix showing the simple
correlations, r, between all possible pairs of
variables included in the analysis. The diagonal
elements are all 1.
Statistics Associated with Factor Analysis
• Communality. Amount of variance a variable
shares with all the other variables. This is the
proportion of variance explained by the common
factors.
• Eigenvalue. Represents the total variance
explained by each factor.
• Factor loadings. Correlations between the
variables and the factors.
• Factor matrix. A factor matrix contains the factor
loadings of all the variables on all the factors
Statistics Associated with Factor Analysis
• Factor scores. Factor scores are composite scores
estimated for each respondent on the derived factors.
• Kaiser-Meyer-Olkin (KMO) measure of sampling
adequacy. Used to examine the appropriateness of factor
analysis. High values (between 0.5 and 1.0) indicate
appropriateness; values below 0.5 imply it is not appropriate.
• Percentage of variance. The percentage of the total
variance attributed to each factor.
Conducting Factor Analysis
Fig. 19.2
Problem formulation → Construction of the correlation matrix →
Method of factor analysis → Determination of number of factors →
Rotation of factors → Interpretation of factors →
Calculation of factor scores → Determination of model fit
Formulate the Problem
• The objectives of factor analysis should be
identified.
• The variables to be included in the factor
analysis should be specified. The variables
should be measured on an interval or ratio
scale.
• An appropriate sample size should be used. As
a rough guideline, there should be at least four
or five times as many observations (sample size)
as there are variables.
Construct the Correlation Matrix
• The analytical process is based on a matrix of correlations
between the variables.
• If the null hypothesis of Bartlett's test of sphericity (that the
variables are uncorrelated) is not rejected, then factor
analysis is not appropriate.
• If the Kaiser-Meyer-Olkin (KMO) measure of sampling
adequacy is small, then the correlations between pairs of
variables cannot be explained by the other variables and factor
analysis may not be appropriate.
Determine the Method of Factor Analysis
• In Principal components analysis, the total variance in
the data is considered.
-Used to determine the min number of factors that will
account for max variance in the data.

• In Common factor analysis, the factors are estimated


based only on the common variance.
-Communalities are inserted in the diagonal of the
correlation matrix.
-Used to identify the underlying dimensions and when the
common variance is of interest.
Determine the Number of Factors
• A Priori Determination. Use prior knowledge.
• Determination Based on Eigenvalues. Only factors with
eigenvalues greater than 1.0 are retained.
• Determination Based on Scree Plot. A scree plot is a plot
of the eigenvalues against the number of factors in order of
extraction. The point at which the scree begins denotes the
true number of factors.
• Determination Based on Percentage of Variance.
Rotation of Factors
• Through rotation the factor matrix is transformed into a
simpler one that is easier to interpret.
• After rotation each factor should have nonzero, or
significant, loadings for only some of the variables. Each
variable should have nonzero or significant loadings with
only a few factors, if possible with only one.
• The rotation is called orthogonal rotation if the axes are
maintained at right angles.

Rotation of Factors
• Varimax procedure. Axes maintained at right angles.
  – The most common method of rotation.
  – An orthogonal method of rotation that minimizes the
    number of variables with high loadings on a factor.
  – Orthogonal rotation results in uncorrelated factors.
• Oblique rotation. Axes not maintained at right angles.
  – Factors are correlated.
  – Oblique rotation should be used when factors in the
    population are likely to be strongly correlated.
Example of EFA
• Fisher (2014) suggested that a comprehensive measure of
well-being at work would consist of hedonic well-being,
eudaimonic well-being and social well-being. To date, Fisher’s
higher-order conceptualization of well-being at work has not
been tested empirically.
Fisher’s model of well-being: The dimensions of
well-being included in this model are as follows:
• Hedonic/Subjective well-being at work: It is important to note that
subjective well-being and happiness are often used interchangeably in
organizational research. Subjective well-being at work includes the
experience of positive and negative affect as well as job satisfaction.
• Eudaimonic well-being at work: Eudaimonic well-being involves living a
virtuous life, not only a pleasurable one.
• Social well-being at work: Positive short-term interactions and long-term
relationships with others while working improve the well-being of
employees. Conceptualization and measurement of social well-being at
work is at present at an initial stage.
Cluster Analysis
Introduction
• Cluster Analysis is a collection of techniques for
aggregating objects into groups based on
similarity measures or distances (dissimilarity
measures).
• Cluster Analysis is a set of methods for
constructing a sensible and informative
classification of an initially unclassified set of
data, using the variable values observed on
each individual. (Everitt, 1998)
What is Cluster Analysis?
• Finding groups of objects such that the objects in a
group will be similar (or related) to one another and
different from (or unrelated to) the objects in other
groups
• Intra-cluster distances are minimized; inter-cluster
distances are maximized

Notion of a Cluster can be Ambiguous
[Figure: the same set of points can plausibly be grouped into
two, four, or six clusters]
Types of Clustering
• A clustering is a set of clusters
• Important distinction between hierarchical
and non-hierarchical sets of clusters
• Non-hierarchical clustering
  – A division of data objects into non-overlapping subsets
    (clusters) such that each data object is in exactly one
    subset
• Hierarchical clustering
  – A set of nested clusters organized as a hierarchical tree
Hierarchical Clustering
[Figure: traditional and non-traditional hierarchical clusterings of
points p1–p4, each shown with its corresponding dendrogram]
K-means Clustering
• Partitional clustering approach
• Each cluster is associated with a centroid (center
point)
• Each point is assigned to the cluster with the closest
centroid
• Number of clusters, K, must be specified
• The basic algorithm is very simple
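The basic algorithm alternates two steps — assign each point to its nearest centroid, then recompute each centroid as the mean of its points. A minimal one-dimensional sketch on toy data (k = 2, hand-picked initial centroids):

```python
# Minimal 1-D k-means sketch (toy data, k = 2):
# repeat {assign points to nearest centroid; recompute centroids}.
points = [1.0, 2.0, 10.0, 11.0]
centroids = [1.0, 10.0]  # initial guesses

for _ in range(10):  # a few iterations suffice for this toy set
    # Assignment step: nearest centroid for each point
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        clusters[nearest].append(p)
    # Update step: centroid = mean of its assigned points
    centroids = [sum(c) / len(c) for c in clusters]

print(sorted(centroids))  # [1.5, 10.5]
```

Real implementations work in many dimensions, guard against empty clusters, and stop when assignments no longer change; this sketch only shows the assign/update loop.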
Hierarchical Clustering
• Produces a set of nested clusters organized
as a hierarchical tree
• Can be visualized as a dendrogram
  – A tree-like diagram that records the sequences of
    merges or splits
[Figure: example dendrogram of six points, with merge
heights marked on the vertical axis]
Strengths of Hierarchical Clustering
• Do not have to assume any particular number
of clusters
  – Any desired number of clusters can be obtained
    by ‘cutting’ the dendrogram at the proper level
• They may correspond to meaningful
taxonomies
  – Example in biological sciences (e.g., animal
    kingdom, phylogeny reconstruction, …)
Cluster Similarity
• Similarity of two clusters is based on the two
most similar (closest) points in the different
clusters
  – Determined by one pair of points, i.e., by one link
    in the proximity graph.

      I1     I2     I3     I4     I5
I1   1.00   0.90   0.10   0.65   0.20
I2   0.90   1.00   0.70   0.60   0.50
I3   0.10   0.70   1.00   0.40   0.30
I4   0.65   0.60   0.40   1.00   0.80
I5   0.20   0.50   0.30   0.80   1.00
Final Comment on Cluster Validity
“The validation of clustering structures is the
most difficult and frustrating part of cluster
analysis.
Without a strong effort in this direction, cluster
analysis will remain a black art accessible only
to those true believers who have experience
and great courage.”
– Algorithms for Clustering Data, Jain and Dubes
Thank You
Decision tree analysis in SPSS
Maths and Statistics Help Centre
Introduction
Decision tree analysis helps identify the characteristics of groups, examines relationships between the
independent variables and the dependent variable, and displays this information in a non-technical way. The
process can also be used to identify classification rules for future events, e.g. identifying people who are
likely to belong to a particular group.

Basic model
The following example uses records from the Titanic on passengers. The tree will look at what factors
affected chances of survival.

Dependent variable: Binary indicator of survival (1 = survived)

Independent variables:
• Gender
• Class (1st, 2nd, 3rd)
• Child under 13 (Under 13, adult)
• Travelling alone / travelling with others
Growing method: The most commonly used growing methods are CHAID (Chi-squared Automatic
Interaction Detection) and CRT (Classification and Regression Trees).
Summary of differences:
• Treatment of missing values: CRT uses surrogates (classification via other independent variables
with a high association with the independent variable with a missing value), whereas CHAID treats
all missing values within an independent variable as one category.
• CHAID uses Pearson’s Chi-squared to decide on variable splits; CRT uses the Gini measure.
• CRT only produces binary splits. If all independent variables are binary, the resulting tree from CRT
and from the Pearson’s Chi-squared option within CHAID will be the same.
• CRT has a pruning ability, so extra nodes which do not increase the risk (wrong classification) by
much can be automatically removed to leave a simpler tree.
Basic output using CHAID

For each node, the number of people and the % who died/survived is given. The splits occur in order of
importance. Here, gender was the most significant factor regarding survival, so the ‘parent’ node
containing all 1309 passengers splits into two ‘child’ nodes, one containing males and the other females.

TERMINAL NODE: This is the end of a branch and a point of classification. The classification
is highlighted in grey. For example, those in node 7 are classified as dying. These people
are female and were in 3rd class.
Terminal node   Path                           Classification   Number correct   Number wrong
4               Male, under 13                 Survived               27               23
5               Female, 1st class              Survived              139                5
6               Female, 2nd class              Survived               94               12
7               Female, 3rd class              Died                  110              106
8               Male, adult, 1st class         Died                  118               57
9               Male, adult, 2nd or 3rd class  Died                  541               77

The risk represents the proportion of cases misclassified by the proposed classification. The
classification table summarises the percentages classified correctly. The model classified 95.1% of
those dying correctly, but only 52% of those who survived.
/* Node 1 */.
IF (((Gender = "male") OR (Gender != "female") AND (Number of accompanying siblings
or spouses != "1")))
THEN
Node = 1
Prediction = 0
Probability = 0.809015

/* Node 5 */.
IF (((Gender = "female") OR (Gender != "male") AND (Number of accompanying siblings
or spouses = "1"))) AND (((Class = "1st" OR Class = "2nd") OR (Class != "3rd") AND
((Age NOT MISSING AND (Age > 23.5)) OR Age IS MISSING AND (Number of accompanying
siblings or spouses != "3 or more")))) AND (((Class = "1st") OR (Class != "2nd") AND
(Age IS MISSING OR (Age > 34.5))))
THEN
Node = 5
Prediction = 1
Probability = 0.965278

/* Node 6 */.
IF (((Gender = "female") OR (Gender != "male") AND (Number of accompanying siblings
or spouses = "1"))) AND (((Class = "1st" OR Class = "2nd") OR (Class != "3rd") AND
((Age NOT MISSING AND (Age > 23.5)) OR Age IS MISSING AND (Number of accompanying
siblings or spouses != "3 or more")))) AND (((Class = "2nd") OR (Class != "1st") AND
(Age NOT MISSING AND (Age <= 34.5))))
THEN
Node = 6
Prediction = 1
Probability = 0.886792

/* Node 4 */.
IF (((Gender = "female") OR (Gender != "male") AND (Number of accompanying siblings
or spouses = "1"))) AND (((Class = "3rd") OR (Class != "1st" AND Class != "2nd")
AND ((Age NOT MISSING AND (Age <= 23.5)) OR Age IS MISSING AND (Number of
accompanying siblings or spouses = "3 or more"))))
THEN
Node = 4
Prediction = 0
Probability = 0.509259
Forecasting
Introduction
• All operational systems operate in an environment of uncertainty
• Information is an asset - managing it is a necessity
• “Reliable” information about the future is a source of competitive edge
• Forecasting supplements the intuitive feelings of managers and
decision makers.
Forecasting "Laws"
• Forecasts are always wrong
• Forecasts always change
• The further into the future, the less reliable the forecast
• Forecasts for group statistics tend to be more accurate than forecasts for
individuals (the risk/uncertainty pooling concept)
• Learning, not predicting
Forecasting techniques
Types of Forecasting Approaches & Methods
• Quantitative
  • Time-Series Methods
    • Moving Averages
    • Exponential Smoothing
      • Simple (No Trend)
      • Double (Linear Trend)
      • Triple (Curvilinear Trend)
  • Causal / Explanatory Methods
    • Simple Regression
      • Linear Trend
      • Quadratic Trend
      • Exponential Trend
    • Multiple Regression
    • Econometric Modeling
    • Leading Indicator Analysis
    • Diffusion Indexes
• Qualitative
  • Delphi Technique
  • Expert Opinion
  • Factor Listing Method
Time Series Patterns
[Figure: typical time-series patterns (© Wiley 2007)]
Overview of forecasting models
• Time-series models
  • Moving average
  • Weighted moving average (weighting past values)
  • The decomposition method
• Associative models
  • Correlation analysis
  • Regression analysis
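Two of the time-series methods above can be sketched in a few lines of Python (the demand series, window, and smoothing constant are illustrative assumptions):

```python
def moving_average(series, window=3):
    """Forecast each period as the mean of the preceding `window` observations."""
    return [sum(series[i - window:i]) / window
            for i in range(window, len(series) + 1)]

def exponential_smoothing(series, alpha=0.3):
    """Simple exponential smoothing:
    new forecast = alpha * latest actual + (1 - alpha) * previous forecast."""
    forecasts = [series[0]]                  # seed with the first observation
    for actual in series[1:]:
        forecasts.append(alpha * actual + (1 - alpha) * forecasts[-1])
    return forecasts

demand = [20, 22, 21, 25, 24, 27]            # illustrative monthly staffing demand
print(moving_average(demand))
print(round(exponential_smoothing(demand)[-1], 2))   # forecast for the next period
```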
Forecasting Software
• Spreadsheets
• Microsoft Excel, Quattro Pro, Lotus 1-2-3
• Limited statistical analysis of forecast data
• Statistical packages
• SPSS, SAS, NCSS, Minitab
• Forecasting plus statistical and graphics
• Specialty forecasting packages
• Forecast Master, Forecast Pro, Autobox, SCA
Neural Network
Biological inspirations
• Some numbers…
  • The human brain contains about 10 billion nerve cells (neurons)
  • Each neuron is connected to others through about 10,000 synapses
• Properties of the brain
  • It can learn and reorganize itself from experience
  • It adapts to the environment
  • It is robust and fault tolerant
Biological neuron
[Figure: a biological neuron, showing the dendrites, the cell body with its nucleus, the axon, and synapses]
• A neuron has
• A branching input (dendrites)
• A branching output (the axon)
• The information circulates from the dendrites to the axon via the
cell body
• Axon connects to dendrites via synapses
• Synapses vary in strength
Neural Networks
• A mathematical model used to solve engineering problems
• ANNs are black-box learning algorithms designed to elucidate latent
relationships and patterns from a set of input data (typically non-linear and
not assumed to follow a normal distribution) and map them onto a set of
outputs for the end user.
• Components of an ANN:
  • The input layer: receives information from the external environment
  • The hidden layer (processing elements): the black box containing the
artificial neurons that process the input data and extract hidden patterns
from the data
  • The output layer: presents the actual outputs of the analysis
  • Weighting factor: each input is assigned a weight signifying the impact
that input will have on the hidden layer during data processing
  • Summation function: the weighted sum of all inputs (each input multiplied
by its weight, then summed)
Processing Information in an Artificial Neuron (Figure 15.3)
[Figure: neuron j receives inputs x_i with weights w_ij; the summation of
w_ij * x_i is passed through a transfer function to produce the output Y_j]
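The processing shown in Figure 15.3 can be reproduced in a few lines: multiply each input by its weight, sum, and apply a transfer function. A minimal sketch (the inputs, weights, and the choice of a sigmoid transfer function are illustrative assumptions):

```python
import math

def neuron_output(inputs, weights):
    """One artificial neuron: weighted sum of inputs, then a sigmoid transfer."""
    summation = sum(x * w for x, w in zip(inputs, weights))   # sum of w_ij * x_i
    return 1 / (1 + math.exp(-summation))                     # squashes into (0, 1)

x = [0.5, 1.0, -0.8]      # illustrative inputs x_i
w = [0.4, -0.2, 0.6]      # illustrative weights w_ij
print(round(neuron_output(x, w), 3))
```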
Popular Neural Network Architectures
1. Single-layer feed-forward architecture
2. Multiple-layer feed-forward architecture
3. Recurrent neural networks
Backpropagation
• Backpropagation (back-error propagation): once one round of analysis is
complete, the network compares the expected outcome with the actual outcome
and calculates the difference, which is the network error.
• This error is fed back through the processing elements toward the input
layer, and the weights are adjusted accordingly.
Objective of a Neural Network
• Compute outputs
• Compare outputs with desired targets
• Adjust weights and repeat the process
• Set the initial weights either by rules or randomly
• Set the delta error (actual output - desired output)
• The objective is to minimize the delta (error)
• Change the weights to reduce the delta
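The cycle above (compute outputs, compare with targets, adjust weights to shrink the delta) is the delta rule for a single neuron. A toy sketch under assumed data and learning rate, learning the linear mapping y = 2*x1 - x2:

```python
def train(samples, targets, lr=0.1, epochs=200):
    """Delta-rule training for one linear neuron: after each sample, nudge every
    weight against the delta error (actual output - desired output)."""
    weights = [0.0] * len(samples[0])         # arbitrary starting weights
    for _ in range(epochs):
        for x, desired in zip(samples, targets):
            actual = sum(xi * wi for xi, wi in zip(x, weights))   # compute output
            delta = actual - desired                              # compare with target
            weights = [wi - lr * delta * xi                       # adjust weights
                       for wi, xi in zip(weights, x)]
    return weights

# Illustrative data generated from y = 2*x1 - x2
X = [[1, 0], [0, 1], [1, 1], [2, 1]]
y = [2, -1, 1, 3]
w = train(X, y)
print([round(wi, 2) for wi in w])             # converges toward [2.0, -1.0]
```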
Learning
• The procedure of estimating the parameters of the neurons so that the whole
network can perform a specific task

• Two types of learning
  • Supervised learning: the analyst decides the adjustments in weights
  • Unsupervised learning: the ANN organizes the outputs on its own and
develops learning insights using internal criteria
Stages of Data Analysis with ANNs
• Training data: used to train the neural network
• Holdout data: used to check the prediction accuracy of the network
• Testing data: used to re-check the accuracy of the network (optional)
• New data: used to predict future outcomes based on the learning acquired
through training
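The partitions above are typically produced by a random split of the available cases. A hedged sketch (the 60/20/20 proportions and the function name are illustrative conventions, not a fixed rule):

```python
import random

def split(rows, train_frac=0.6, holdout_frac=0.2, seed=42):
    """Shuffle the cases, then cut them into training / holdout / testing sets."""
    rows = rows[:]                            # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_frac)
    n_hold = int(len(rows) * holdout_frac)
    return (rows[:n_train],                   # train the network
            rows[n_train:n_train + n_hold],   # check prediction accuracy
            rows[n_train + n_hold:])          # optional final re-check

train_set, holdout_set, test_set = split(list(range(100)))
print(len(train_set), len(holdout_set), len(test_set))   # 60 20 20
```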
Advantages and Disadvantages of ANNs
• Advantages:
  • ANNs are capable of identifying and estimating highly non-linear
relationships.
  • ANNs are not bound by distributional assumptions such as normality.
  • They can handle a wide variety of data and can manage big data too.
  • They usually provide better results for classification and prediction
problems than statistical counterparts such as decision trees and
logistic regression.
  • ANNs can compute both numerical and categorical variables, though the
categorical variables need to undergo some transformation.
Advantages and Disadvantages of ANNs
• Disadvantages:
  • They are known as black-box solutions and, as a result, are hard to
explain and understand.
  • Finding optimal values for the large number of network parameters is not
an easy task.
  • Optimal design of ANNs is an art; it requires expertise and extensive
experimentation.
  • Including too many variables in the analysis can result in overfitting
problems.
Practical Implications
1. Attracting talent: https://www.phenompeople.com/ (a provider of a machine-learning-based
talent attraction suite that helps identify potentially fit candidates for vacant roles by
browsing millions of social media records and job-search browsing data from job boards.)
2. Future attrition detection
3. Individual skills management/performance development: Workday suggests personalized training
recommendations for trainees.
4. Enterprise management: Google has pioneered people-focused management. They have
shown the world how data can help address pressing HR concerns, such as whether employee
engagement policies were fruitful in retaining employees.
5. Post-hire outcome algorithms: for example, a candidate's eligibility for promotion after a year.
6. Succession planning algorithms: Saba, a global leader in next-generation cloud solutions for talent
management, developed a prediction system called The Intelligent Mentor (TIM), which
would revolutionize traditional job promotion practices into a more enriching and fulfilling
experience.
7. Behavior tracking: text analytics models such as the 'Prototypicality' algorithm can identify
emotional activity in comments posted on Twitter or Facebook and draw out specific personality
and dispositional traits of employees.
