Analyzing Crime in Chicago Through Machine Learning: Nathan Holt
Abstract
In this paper, we apply techniques from machine learning to a dataset on crime in the city of Chicago published by the city government. We apply
K-means clustering as a way of compartmentalizing the records of crime. We then
propose methods of "predicting" hate crimes based upon the clustered data.
I. Introduction
The city of Chicago publishes an up-to-date list of all reported crimes. The
records span from 2002 to the present, allowing a few days of delay to catalog the crimes and publish them [2]. According to official FBI data, Chicago is one of the leading cities in homicides, with a homicide rate more than quadruple that of New York City and more than double that of Los Angeles [3]. Being able to parse these huge data sets
is a problem central to understanding the crimes in the city of Chicago.
By analyzing the data in a mathematically rigorous way, researchers may
be able to glean insight into the underlying causes of crime, and may also be able to identify indicators of crimes yet to occur. All of this categorization and analysis falls under the umbrella of Data Science, a field which analyzes large sets of data using probability and statistics and draws useful conclusions from that analysis.
In this paper, we will parse the city of Chicago's up-to-date dataset and attempt some crime "prediction." We will do this by utilizing
techniques from machine learning: specifically, K-means clustering.
II. Methods
In this section, I will discuss the methods used to actually obtain the results.
K-means clustering partitions a set of data points into $k$ clusters $S = \{S_1, \dots, S_k\}$, each with centroid $\mu_i$, by minimizing the within-cluster sum of squared distances:

\[
\underset{S}{\arg\min} \; \sum_{i=1}^{k} \sum_{x \in S_i} \| x - \mu_i \|^2
\]
The first step in the algorithm is to assign initial clusters. There is much in
the literature about various ways to assign initial clusters. However, I just chose
the naive approach, and took the first k points in the data set. I consider these
data points as the k initial centroids of the clusters. Once we have the initial centroids, we assign each data point to the cluster whose centroid is nearest in Euclidean distance. This corresponds
to the equation:
\[
S_i = \left\{\, x : \| x - \mu_i \|^2 \le \| x - \mu_j \|^2 \ \ \forall\, j,\ 1 \le j \le k \,\right\}
\]
As before, $\mu_i$ is the centroid of the $i$th cluster. Thus, each initial cluster consists of exactly those points for which that cluster's centroid is the nearest one in Euclidean distance.
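To make the naive initialization and the assignment step concrete, the following is a minimal NumPy sketch; the array name points and the function names are illustrative assumptions, not the exact code used in this project.

    import numpy as np

    def initial_centroids(points, k):
        # points: assumed (n, 3) NumPy array of serialized crime records.
        # Naive initialization: take the first k data points as the centroids.
        return points[:k].copy()

    def assign_clusters(points, centroids):
        # Assign each point to the cluster whose centroid is nearest in
        # (squared) Euclidean distance.
        diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
        sq_dists = np.sum(diffs ** 2, axis=2)   # shape (n_points, k)
        return np.argmin(sq_dists, axis=1)      # cluster index for each point

For example, assign_clusters(points, initial_centroids(points, 4)) returns one cluster index per record.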
Then, the iteration begins. In each step of the iteration, we calculate new centroids based on the assignment of points in the current iteration. For each cluster, we sum the points assigned to it and divide by the number of points in the cluster. That is, we take the average of the points, and declare this to be the new centroid of the cluster. This corresponds to the equation:
\[
\mu_i^{t+1} = \frac{1}{\lvert S_i^t \rvert} \sum_{x_j \in S_i^t} x_j
\]
Three things are important to note here. First, the summation of the points
is done component-wise, since they’re three dimensional data points. Secondly,
since we’re taking the average of all the components to create a new centroid,
the new centroid we create may not be a point in our original data set. This is
okay, since we don’t care if the centroid of a given cluster is in our data set. We
just care that the distance between the points and the centroid is minimized.
We only naively set the original centroids to be points in our data set to make
initialization of the algorithm easy. Thirdly, it’s important to note that now
that we’re iterating, I introduce the superscript notation to denote the given
iteration. Thus, $\mu_i^{t+1}$ is the centroid of the $i$th cluster on the $(t+1)$-th iteration of the algorithm.
The iteration continues until the centroids don’t change when updated. At
this point, the centroids are said to "converge," and we take them as our final centroids.
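Putting the assignment and update steps together with the convergence test, a minimal sketch of the full Lloyd's-algorithm loop might look as follows; again, the names are illustrative assumptions rather than the project's actual code.

    import numpy as np

    def kmeans(points, k, max_iter=100):
        # points: assumed (n, 3) NumPy array of serialized crime records.
        # Lloyd's algorithm: alternate assignment and centroid updates
        # until the centroids stop changing.
        centroids = points[:k].copy()   # naive initialization: first k points
        labels = np.zeros(len(points), dtype=int)
        for _ in range(max_iter):
            # Assignment step: nearest centroid in Euclidean distance
            diffs = points[:, np.newaxis, :] - centroids[np.newaxis, :, :]
            labels = np.argmin(np.sum(diffs ** 2, axis=2), axis=1)

            # Update step: each new centroid is the component-wise mean
            # of the points currently assigned to that cluster
            new_centroids = np.array([
                points[labels == i].mean(axis=0) if np.any(labels == i)
                else centroids[i]
                for i in range(k)
            ])

            # Convergence: stop when the centroids no longer change
            if np.allclose(new_centroids, centroids):
                break
            centroids = new_centroids
        return centroids, labels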
It’s also important to note that we must normalize the values along each axis so that they lie between 0 and 1. Otherwise, the weighting contributed by each dimension to the Euclidean norm will not be comparable. As a visual illustration of this, I have included the following figure in
which I removed normalization of all the values in all dimensions. As visually
apparent, it simply partitions the set based upon the time dimension. This is
because there is a much larger span of possible values in the time dimension
(orders of magnitude larger), so the algorithm would minimize the distance
between points based solely on this dimension; the other dimensions would
have little impact.
Figure 1: A plot showing the dominance of the time dimension over others without normalization.
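A minimal sketch of this min-max normalization, assuming the serialized records are stored as the rows of a NumPy array (the names are illustrative):

    import numpy as np

    def normalize(points):
        # points: assumed (n, 3) NumPy array of serialized crime records.
        # Rescale each dimension (column) independently to the range [0, 1],
        # so that no single dimension dominates the Euclidean distance.
        mins = points.min(axis=0)
        maxs = points.max(axis=0)
        spans = np.where(maxs > mins, maxs - mins, 1.0)  # guard against zero span
        return (points - mins) / spans

Omitting this step reproduces the behavior shown in Figure 1, in which the time dimension dominates the clustering.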
III. Results
First, we plot visualizations of the points in three-dimensional space. Since the serialized data set is over 6.5 million entries long, Python will crash if we attempt to plot all the points in a single display. Instead, we sample points.
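A sketch of how such a sampled plot might be produced with Matplotlib; the sample size, axis labels, and placeholder data below are illustrative assumptions, not the exact parameters used here.

    import numpy as np
    import matplotlib.pyplot as plt

    # Placeholder: stands in for the (n, 3) array of serialized crime records
    points = np.random.default_rng(0).random((100_000, 3))

    # Sample a manageable number of points rather than plotting all ~6.5 million
    rng = np.random.default_rng(1)
    sample = points[rng.choice(len(points), size=10_000, replace=False)]

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(sample[:, 0], sample[:, 1], sample[:, 2], s=1)
    ax.set_xlabel("dimension 1")
    ax.set_ylabel("dimension 2")
    ax.set_zlabel("dimension 3")
    plt.show()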
Immediately, by plotting these points we can garner insight into the un-
derlying structure of the data that would not otherwise be visible in just a
spreadsheet. We notice that crimes corresponding to IUCR codes between 200 and 250 are very sparse in contrast to other crimes. When cross-referencing this with the database of IUCR codes provided by the state of Illinois [1], we find that these IUCR codes correspond to sexual crimes. Immediately, we
see that sexual crimes such as rape or sexual assault are far less frequent than
other crimes.
We also gather that gambling is far less frequent than other crimes by the
same method. It’s worth noting that the database includes all reported crimes,
but unreported crimes are not listed in the database (of course, that’s because
no one has reported them!), so the data may not give an entirely accurate picture of what actually happens in the city of Chicago.
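The cross-referencing described above can also be done programmatically. Below is a minimal pandas sketch; the file name and the column names (IUCR, PRIMARY DESCRIPTION) are assumptions about the export format of the code table in [1], not guaranteed.

    import pandas as pd

    # Assumed file exported from the IUCR code table [1]; column names are assumptions
    iucr = pd.read_csv("IUCR_codes.csv", dtype=str)

    # Numeric view of the codes (codes containing letters are skipped here)
    numeric = pd.to_numeric(iucr["IUCR"], errors="coerce")

    # Look up the sparse band of codes (200-250) observed in the plots
    print(iucr.loc[numeric.between(200, 250), ["IUCR", "PRIMARY DESCRIPTION"]])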
Now that the data is serialized, we can run our Python implementation of Lloyd’s algorithm for K-means clustering. We must choose a value of K, that is, how many clusters to partition the data into; we partition into 4 clusters first. Running the algorithm produces four centroids and an assignment of every record to one of the four clusters.
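As an illustrative sketch (not the project's exact code), such a run might be invoked as follows, reusing the normalize and kmeans helpers sketched in Section II:

    # Assumes points is the (n, 3) array of serialized records, and that the
    # normalize and kmeans functions sketched in Section II are available.
    normalized = normalize(points)
    centroids, labels = kmeans(normalized, k=4)

    print("Cluster centroids (normalized coordinates):")
    print(centroids)
    print("Points per cluster:", [int((labels == i).sum()) for i in range(4)])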
IV. Discussion
This project has largely been a proof of concept to show that clustering data
sets about crime is entirely possible, and allows us to gain insight into the
underlying trends of crime.
i. Future Work
Clustering data and analyzing it is an important step toward many other machine-learning methods. In particular, prediction of crime points can be achieved by multiple methods. For example, suppose we wish to predict the type of crime most likely to occur in Chicago on January 1 at midnight;
the corresponding crime point would take the form (1, 1, z), where z corresponds to an
IUCR code. By minimizing the distance to centroids, we could calculate the
most likely cluster it’s in. We could also fit a regression line, via least squares or similar methods, to the serialized points that match the given coordinates. We could also implement the K-nearest neighbors algorithm (which finds the points most similar to a given point), with the condition that they lie in the same cluster that we predict our data point will be in. This would provide a list
of k possible crime types that are most likely to occur in Chicago on January 1
at midnight.
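A minimal sketch of the nearest-centroid and within-cluster K-nearest-neighbors idea described above. It assumes the first two coordinates of a serialized point are the known ones (e.g., date and time) and the third is the IUCR code; all names are illustrative assumptions rather than the project's actual code.

    import numpy as np

    def predict_crime_types(points, centroids, labels, query_xy, n_neighbors=5):
        # Predict likely crime types (the third, IUCR coordinate) for a query
        # point such as (1, 1, z) whose first two coordinates are known.
        query_xy = np.asarray(query_xy, dtype=float)

        # Step 1: pick the cluster whose centroid is nearest to the query,
        # measuring distance only over the known coordinates.
        cluster = int(np.argmin(np.sum((centroids[:, :2] - query_xy) ** 2, axis=1)))

        # Step 2: K nearest neighbors restricted to that cluster, again using
        # only the known coordinates.
        members = points[labels == cluster]
        dists = np.sum((members[:, :2] - query_xy) ** 2, axis=1)
        nearest = members[np.argsort(dists)[:n_neighbors]]

        # Return the IUCR coordinate of the most similar records.
        return nearest[:, 2]

Called with the centroids and labels produced by the clustering, this returns n_neighbors candidate crime types for the query point.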
References
[1] Chicago Police Department - Illinois Uniform Crime Reporting (IUCR) Codes | City of Chicago | Data Portal. URL: https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e/data.
[2] Crimes - 2001 to present | City of Chicago | Data Portal. URL: https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2.
[3] Eliott C. McLaughlin. With Chicago, it's all murder, murder, murder ... but why? Mar. 2017. URL: http://www.cnn.com/2017/03/06/us/chicago-murder-rate-not-highest/.