
Final SSD

This document outlines a project focused on analyzing New York City's 311 Customer Service Requests dataset to enhance public service efficiency through data analytics. It details the stages of data understanding, preparation, analysis, exploration, and statistical testing, ultimately aiming to identify trends and improve resource allocation for service requests. The findings highlight significant insights into complaint types, response times, and regional service issues, providing actionable recommendations for city officials.


1. Introduction
Public service agencies increasingly rely on data analytics to improve operational efficiency and service delivery. This coursework focuses on New York City's 311 Customer Service Requests dataset, which contains millions of resident complaints and service requests. The project's main goal is to understand, prepare, explore, and analyze the dataset for significant insights using Python and standard data analysis techniques. The process begins with understanding the data and proceeds through data preparation, exploration, statistical analysis, and visualization. We investigate several features of the dataset, including complaint categories, service response times, and regional trends in service requests. A large portion of the study involves cleaning the dataset, eliminating unnecessary columns and handling missing values, to prepare it for further mining and statistical testing. By calculating summary statistics, examining correlations, and visualizing complaint trends, we seek to learn where and how services are requested and how well those requests are handled. Statistical tests are also conducted to determine whether response times vary by complaint type and whether complaint types are linked to particular locations. The findings can help local authorities allocate resources more effectively and improve response effectiveness.

2. Data Understanding
The NYC 311 Customer Service Requests dataset tracks non-emergency complaints and service requests from New York City residents. It contains numerous columns, including "Created Date," "Closed Date," "Complaint Type," "Descriptor," "Agency," and "Borough," and each row represents a distinct service request. Meaningful analysis requires understanding the contents and structure of this collection. The initial investigation covers the number of rows and columns, the data type of each column, and the meaning of the important variables. For instance, "Complaint Type" describes the nature of the issue reported, while "Created Date" and "Closed Date" mark the timeline of the service request. Additional fields provide geographical and administrative context, such as the borough or the agency responsible. A crucial part of this phase is identifying missing or inconsistent values and determining which columns are relevant or redundant. Several columns, such as those related to school details, taxi information, and bridge or highway names, do not contribute meaningfully to the analysis and can be dropped in the data preparation stage. This understanding lays the foundation for effective data cleaning, transformation, and deeper exploration in subsequent steps.
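
A minimal sketch of this first inspection, assuming the CSV filename used in the Appendix:

import pandas as pd

# Load the raw 311 dataset; low_memory=False avoids mixed-dtype warnings
df = pd.read_csv("311_Service_Requests.csv", low_memory=False)

print(df.shape)   # number of service requests (rows) and fields (columns)
print(df.dtypes)  # data type of each column
print(df[["Created Date", "Closed Date", "Complaint Type", "Borough"]].head())
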
3. Data Preparation

Data preparation is the crucial step of cleaning and converting the raw data into a format suitable for analysis. For this project, the NYC 311 Customer Service Requests dataset is first imported using Python packages such as pandas. After loading, an initial review identifies the kinds of information the dataset contains, such as the different complaint types, timestamps, locations, and agency details. Several transformations then prepare the data properly. First, the "Created Date" and "Closed Date" columns are converted to the datetime format so that a new column named "Request_Closing_Time" can be computed, recording the amount of time that passed between the creation and closure of a complaint. This new feature is crucial for analyzing service efficiency. The next stage removes a list of superfluous columns that do not contribute to the study, such as detailed fields about schools, vehicles, and bridges; a short Python routine drops these columns to reduce noise. All missing (NaN) values are then removed to keep the dataset consistent. Finally, the unique values in each column are examined to gauge the variability of the data. These preparatory steps ensure that the dataset is clean, relevant, and ready for insightful analysis and statistical testing.
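
A sketch of these preparation steps, continuing from the loaded df above (the column selection is abbreviated here; the Appendix gives the full list of dropped columns):

# Parse timestamps; unparseable entries become NaT and are removed later
for col in ["Created Date", "Closed Date"]:
    df[col] = pd.to_datetime(df[col], errors="coerce")

# Elapsed time between creation and closure, in hours
df["Request_Closing_Time"] = (df["Closed Date"] - df["Created Date"]).dt.total_seconds() / 3600

# Abbreviated selection of unneeded columns; the Appendix lists them all
unused = ["School Name", "Vehicle Type", "Bridge Highway Name"]
df.drop(columns=[c for c in unused if c in df.columns], inplace=True)

df.dropna(inplace=True)   # remove rows with any missing values
print(df.nunique())       # unique values per column, to gauge variability
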
4. Data Analysis

The data analysis phase focuses on extracting significant statistical insights from the cleaned dataset. This entails generating summary statistics and investigating correlations between various factors. Key metrics, including the sum, mean, standard deviation, skewness, and kurtosis, are computed for the dataset's numerical columns using Python packages such as pandas, numpy, and scipy. These descriptive statistics clarify the distribution and variability of service request durations and other numerical features. The mean and standard deviation of "Request_Closing_Time", for instance, show how quickly complaints are typically handled and how much response times vary. Skewness and kurtosis describe the symmetry and peakedness of the distribution, which helps in spotting outliers or anomalous patterns. In addition to this univariate analysis, a correlation matrix is built to examine relationships between numerical variables, such as the potential relationship between location-related attributes and complaint duration. These findings form the foundation for more in-depth data exploration, visualization, and hypothesis testing, ensuring that the research is both data-driven and statistically valid.
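
A sketch of these computations, continuing from the prepared df:

import numpy as np
import pandas as pd

num = df.select_dtypes(include=np.number)   # numerical columns only
summary = pd.DataFrame({
    "sum": num.sum(),
    "mean": num.mean(),
    "std": num.std(),
    "skew": num.skew(),          # symmetry of each distribution
    "kurtosis": num.kurtosis(),  # peakedness / tail weight
})
print(summary)
print(num.corr())                # pairwise correlation matrix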

5. Data Exploration and Visual Insights

Data exploration involves visually examining the dataset to find patterns, trends, and insights that are not readily apparent from raw numbers. This project applies visualization techniques from Python libraries such as matplotlib, seaborn, and plotly to better understand the behavior of 311 service requests in New York City. The visual analysis yields four important findings. First, the distribution of complaints across the boroughs reveals which ones have the most service problems. Second, identifying the most common complaint categories highlights the issues most prevalent among the public, such as illegal parking, noise, or heating problems. Third, the relationship between time and complaint frequency is plotted to reveal daily or seasonal patterns in service requests. Fourth, differences in average request closing times by borough and complaint category point to regions with higher rates of service delays. Additionally, complaint types are ranked by their average response times, categorized by location, and compared in bar or box plots. These visualizations help city planners and public service departments prioritize issues and optimize response strategies. Overall, data exploration enhances the interpretability of the dataset and provides a foundation for more advanced statistical testing.
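
As one example, the first finding (complaint volume by borough) can be sketched as follows, continuing from the prepared df:

import matplotlib.pyplot as plt
import seaborn as sns

# Count requests per borough and plot them as a bar chart
borough_counts = df["Borough"].value_counts()
sns.barplot(x=borough_counts.index, y=borough_counts.values)
plt.title("311 Requests by Borough")
plt.ylabel("Number of requests")
plt.show()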

6. Statistical Testing

Statistical testing applies formal hypothesis tests to verify assumptions and find important relationships in the dataset. This study runs two tests. The first examines whether the average response time (Request_Closing_Time) differs notably across complaint types. A one-way ANOVA test is used, in which the null hypothesis (H₀) posits that all complaint types have the same average response time, while the alternative hypothesis (H₁) proposes that at least one complaint type has a different mean. The test's p-value determines whether to reject the null hypothesis: a low p-value, typically less than 0.05, indicates a statistically significant difference in service response times across complaint types. The second test asks whether the type of complaint is associated with the borough in which it was filed. For this, a Chi-square test of independence is employed: the null hypothesis (H₀) asserts that complaint type and location are independent, while the alternative hypothesis (H₁) proposes a link between them, and the resulting p-value decides between the two. These tests validate the patterns found during data exploration and offer statistical support for data-driven decision-making.
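
Minimal sketches of both tests, continuing from the prepared df (the 500-request cutoff matches the Appendix script):

from scipy import stats
import pandas as pd

# One-way ANOVA on Request_Closing_Time across frequent complaint types
counts = df["Complaint Type"].value_counts()
frequent = counts[counts >= 500].index
groups = [df.loc[df["Complaint Type"] == ct, "Request_Closing_Time"] for ct in frequent]
f_stat, p_anova = stats.f_oneway(*groups)
print("ANOVA p-value:", p_anova)     # p < 0.05 -> reject H0 of equal means

# Chi-square test of independence between complaint type and borough
sub = df[df["Complaint Type"].isin(frequent)]
table = pd.crosstab(sub["Complaint Type"], sub["Borough"])
chi2, p_chi, dof, _ = stats.chi2_contingency(table)
print("Chi-square p-value:", p_chi)  # p < 0.05 -> type and borough are associated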

7. Conclusions & Recommendations


Through the application of numerous data analytics techniques and statistical tools, this research offered a thorough approach to examining the NYC 311 Customer Service Requests dataset. The main goals were to prepare the data for further mining and to derive valuable insights into the trends of public service requests across New York City.

Data understanding was the initial stage, during which the dataset's structure and properties were carefully investigated. Key columns including "Complaint Type," "Created Date," "Closed Date," and "Borough" were identified as crucial for the analysis, giving the subsequent procedures a solid base. The dataset was then cleaned and converted during the data preparation stage: missing values were handled, extraneous columns were removed, and the "Request_Closing_Time" feature was created to gauge service responsiveness. These actions greatly enhanced the quality of the data and prepared it for analysis.

The data analysis step then used summary statistics, skewness, and correlations to investigate the statistical nature of the data, highlighting its distribution and the relationships between variables. Following that, data exploration used visualizations to reveal trends such as the most common complaint types, the boroughs with the highest request volumes, and the average resolution times across different areas. These insights were essential for finding service bottlenecks and efficiency gaps.

In the statistical testing phase, hypothesis tests assessed whether average response times differed significantly across complaint types and whether complaint types were associated with particular locations. Both tests yielded valid, fact-based conclusions that supported the exploratory findings.

All things considered, this investigation showed how Python programming can be applied to real-world problems through data wrangling, visualization, and statistical analysis. The project's insights can help city officials improve response strategies and service delivery for New Yorkers.

8. Appendix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

plt.rcParams["figure.figsize"] = (10, 6)
sns.set_theme(style="whitegrid")

# 1 Data Understanding -------------------------------------------------


df = pd.read_csv("311_Service_Requests.csv", low_memory=False)
print("SHAPE:", df.shape); df.info()
print(df.head()); print(df.describe().T)

# 2 Data Preparation ---------------------------------------------------


# Parse timestamps; invalid entries become NaT
for c in ["Created Date", "Closed Date"]:
    df[c] = pd.to_datetime(df[c], errors="coerce")

# Elapsed time between creation and closure, in hours
df["Request_Closing_Time"] = (df["Closed Date"] - df["Created Date"]).dt.total_seconds() / 3600

# Columns that do not contribute to the analysis
cols_to_drop = ['Agency Name', 'Incident Address', 'Street Name', 'Cross Street 1', 'Cross Street 2',
                'Intersection Street 1', 'Intersection Street 2', 'Address Type', 'Park Facility Name',
                'Park Borough', 'School Name', 'School Number', 'School Region', 'School Code',
                'School Phone Number', 'School Address', 'School City', 'School State', 'School Zip',
                'School Not Found', 'School or Citywide Complaint', 'Vehicle Type', 'Taxi Company Borough',
                'Taxi Pick Up location', 'Bridge Highway Name', 'Bridge Highway Direction', 'Road Ramp',
                'Bridge Highway Segment', 'Garage Lot Name', 'Ferry Direction', 'Ferry Terminal Name',
                'Landmark', 'X Coordinate (State Plane)', 'Y Coordinate (State Plane)', 'Due Date',
                'Resolution Action Updated Date', 'Community Board', 'Facility Type', 'Location']
df.drop(columns=[c for c in cols_to_drop if c in df.columns], inplace=True)

print("Rows before dropna:", len(df)); df.dropna(inplace=True)


print("Rows after dropna :", len(df))

for c in df.columns: print(f"{c:<25} {df[c].nunique()}")

# 3 Data Analysis ------------------------------------------------------


num = df.select_dtypes(np.number)
summary = pd.DataFrame({"sum": num.sum(), "mean": num.mean(), "std": num.std(),
                        "skew": num.skew(), "kurtosis": num.kurtosis()})
print(summary)

corr = num.corr()          # pairwise correlations between numeric columns
sns.heatmap(corr)
plt.show()

# 4 Data Exploration ---------------------------------------------------


top10 = df["Complaint Type"].value_counts().nlargest(10)
sns.barplot(y=top10.index, x=top10.values).set(title="Top 10 Complaint Types")
plt.savefig("fig1_top10.png"); plt.clf()

borough = df["Borough"].value_counts()
sns.barplot(x=borough.index, y=borough.values).set(title="Requests by Borough")
plt.savefig("fig2_borough.png"); plt.clf()

sns.histplot(df["Request_Closing_Time"], bins=100, kde=True)
plt.xlim(0, df["Request_Closing_Time"].quantile(0.99))   # trim extreme outliers from view
plt.title("Distribution of Closing Time")
plt.savefig("fig3_closing_time.png"); plt.clf()

df["MonthCreated"] = df["Created Date"].dt.to_period("M").dt.to_timestamp()
df.groupby("MonthCreated").size().plot()
plt.title("Requests Over Time")
plt.savefig("fig4_trend.png"); plt.clf()

pivot = (df.groupby(["Borough", "Complaint Type"])["Request_Closing_Time"]
           .mean().reset_index())
sns.barplot(data=pivot.sort_values("Request_Closing_Time", ascending=False).head(30),
            x="Request_Closing_Time", y="Complaint Type", hue="Borough")
plt.title("Slowest Complaint-Type/Borough Combos")
plt.savefig("fig5_slowest_combos.png"); plt.clf()

# 5 Statistical testing ------------------------------------------------


# One-way ANOVA across complaint types with at least 500 requests
keep = df["Complaint Type"].value_counts()[lambda s: s >= 500].index
anova = [df.loc[df["Complaint Type"] == ct, "Request_Closing_Time"] for ct in keep]
f, p = stats.f_oneway(*anova)
print("ANOVA F,p:", f, p)
if p < 0.05:
    # Post-hoc Tukey HSD shows which complaint types differ
    subset = df[df["Complaint Type"].isin(keep)]
    tukey = pairwise_tukeyhsd(endog=subset["Request_Closing_Time"],
                              groups=subset["Complaint Type"], alpha=0.05)
    print(tukey.summary())

# Chi-square test of independence: complaint type vs. borough
top = df["Complaint Type"].value_counts().nlargest(10).index
table = pd.crosstab(df.loc[df["Complaint Type"].isin(top), "Complaint Type"],
                    df["Borough"])
chi, chi_p, dof, exp = stats.chi2_contingency(table)
print("Chi²,p,dof:", chi, chi_p, dof)
# ----------------------------------------------------------------------
print(" Finished – figures saved, console shows stats.")
