Exploratory Data Analysis
Import/Preview Data
Read the CSV file.
Code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df_cities = pd.read_csv('../input/cities.csv')
df_cities.head()
Output
We made a scatter plot of the locations of all the cities. There is a surprise waiting in the
locations of the cities :)
Code
fig = plt.figure(figsize=(20, 20))
# cmap, norm = from_levels_and_colors([0.0, 0.5, 1.5], ['red', 'black'])
# CityId 0 (the starting city) gets its own colour and a much larger marker
plt.scatter(df_cities['X'], df_cities['Y'], marker='.',
            c=(df_cities.CityId != 0).astype(int), cmap='Set1', alpha=0.6,
            s=500 * (df_cities.CityId == 0).astype(int) + 1)
plt.show()
Output
Code
# using sieve of eratosthenes
def sieve_of_eratosthenes(n):
    primes = [True for i in range(n+1)]  # Start assuming all numbers are primes
    primes[0] = False  # 0 is not a prime
    primes[1] = False  # 1 is not a prime
    for i in range(2, int(np.sqrt(n)) + 1):
        if primes[i]:
            # mark every multiple of i (2i, 3i, ...) as not prime
            k = 2
            while i*k <= n:
                primes[i*k] = False
                k += 1
    return primes

prime_cities = sieve_of_eratosthenes(max(df_cities.CityId))
df_cities['IsPrime'] = prime_cities
def total_distance(dfcity, path):
    prev_city = path[0]
    total_distance = 0
    step_num = 1
    for city_num in path[1:]:
        next_city = city_num
        # Euclidean length of this step; every 10th step is 10% longer
        # unless it starts from a prime city
        total_distance = total_distance + \
            np.sqrt(pow((dfcity.X[city_num] - dfcity.X[prev_city]), 2) +
                    pow((dfcity.Y[city_num] - dfcity.Y[prev_city]), 2)) * \
            (1 + 0.1 * ((step_num % 10 == 0) * int(not(prime_cities[prev_city]))))
        prev_city = next_city
        step_num = step_num + 1
    return total_distance

# visit the cities in CityId order, then return to city 0
dumbest_path = list(df_cities.CityId) + [0]
print('Total distance with the dumbest path is ' +
      "{:,}".format(total_distance(df_cities, dumbest_path)))
Output
Code
df_cities.plot.scatter(x='X', y='Y', s=0.07, figsize=(15, 10))
north_pole = df_cities[df_cities.CityId==0]
plt.show()
Output
We can see that the city coordinates create a picture of Santa's reindeer and some trees. The prime
cities seem to be relatively evenly distributed throughout the "map".
How many prime cities are there compared to non-prime cities?
Code
print(df_cities['IsPrime'].value_counts())
Output
There are about one tenth as many prime cities as non-prime cities. That is convenient, because
every 10th step should start from a prime city to avoid the 10% distance penalty.
Sieve of Eratosthenes
The Sieve of Eratosthenes is an algorithm for finding all prime numbers up to n. It works well
when n is smaller than 10 million or so.
The steps involved are:
1. Create a list of all integers from 2 to n (i.e., the upper limit).
2. At first, let p = 2, the smallest prime number.
3. Starting from 2p, count up to n in increments of p and mark each of these numbers in the list
(these will be 2p, 3p, 4p, ...; p itself should remain unmarked).
4. Find the first unmarked number in the list that is greater than p. If there is no such
number, stop. Otherwise, let p equal this new number (the next prime) and repeat from
step 3.
5. When the algorithm terminates, the remaining unmarked numbers in the list are all the
prime numbers less than or equal to n.
The main idea here is that every single unmarked value up to p will be prime, because if it were
composite it would be marked as a multiple of some other, smaller prime. This algorithm can
also be used for n = 1000, 10000, 100000, etc.
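The steps above can be sketched compactly with NumPy. This is only an illustrative variant, not the code used elsewhere in this write-up: the function name `sieve_from_square` is made up for the example, and it starts marking at p*p instead of 2p, a common shortcut that is safe because every smaller multiple of p has already been marked by a smaller prime.

```python
import numpy as np

def sieve_from_square(n):
    """Return a boolean array where entry i is True when i is prime (0 <= i <= n)."""
    is_prime = np.ones(n + 1, dtype=bool)
    is_prime[:2] = False                      # 0 and 1 are not prime
    for p in range(2, int(n ** 0.5) + 1):
        if is_prime[p]:
            is_prime[p * p::p] = False        # mark p*p, p*p + p, ... as composite
    return is_prime

print(np.flatnonzero(sieve_from_square(30)).tolist())
# [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```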
Tools Used
● Python 3.8.0
● pandas 0.25.3
Output
The dumbest path seems pretty bad. We are sending Santa all over the map, without any
consideration for him whatsoever :)
Approach 2: Slightly better path
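A simple way to improve on the CityId order is to visit the cities in sorted coordinate order. The sort key is not shown in this write-up, so the sketch below assumes sorting by X and then by Y, and it uses a tiny made-up frame (`toy_cities`) in place of `df_cities`.

```python
import pandas as pd

# Tiny stand-in for df_cities (assumption: same CityId, X, Y columns).
toy_cities = pd.DataFrame({
    'CityId': [0, 1, 2, 3],
    'X': [5.0, 9.0, 1.0, 1.0],
    'Y': [5.0, 2.0, 7.0, 3.0],
})

# Sort every city except the start (CityId 0) by X, then by Y,
# and close the tour by returning to city 0.
middle = toy_cities.iloc[1:].sort_values(['X', 'Y'])['CityId'].tolist()
sorted_path = [0] + middle + [0]
print(sorted_path)
# [0, 3, 2, 1, 0]
```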
We already reduced our total distance by more than half using a simple sort! The first 100
steps of the sorted path look as follows:
If we divide the whole network into a grid of X's and Y's, then Santa can cover each square in the
grid before moving on to the next square.
We can now see that Santa is going to a point, trying to cover many of the cities around that
point, before moving up. This feels like a better way to cover all the cities and gives better
results as expected.
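The grid idea can be sketched with `pd.cut`, which assigns each coordinate to an equal-width bin. This is a minimal sketch under assumptions: the helper name `grid_sorted_path` and the default of 300 bins per axis are illustrative choices, not values taken from this write-up.

```python
import pandas as pd

def grid_sorted_path(cities, n_bins=300):
    """Visit cities square by square: sort by grid column, grid row, then X and Y."""
    cells = cities.iloc[1:].copy()            # leave out the start city (CityId 0)
    cells['Xcut'] = pd.cut(cells['X'], n_bins, labels=False)
    cells['Ycut'] = pd.cut(cells['Y'], n_bins, labels=False)
    middle = cells.sort_values(['Xcut', 'Ycut', 'X', 'Y'])['CityId'].tolist()
    return [0] + middle + [0]

# Made-up example with a 2 x 2 grid.
toy_cities = pd.DataFrame({
    'CityId': [0, 1, 2, 3, 4],
    'X': [50.0, 0.0, 100.0, 1.0, 99.0],
    'Y': [50.0, 90.0, 10.0, 10.0, 90.0],
})
print(grid_sorted_path(toy_cities, n_bins=2))
# [0, 3, 1, 2, 4, 0]
```

Smaller squares keep consecutive cities closer together, but with very small squares most cells hold a single city and the path degenerates back into a plain sort.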
The nearest neighbor algorithm offered a significant improvement. What's even more mesmerising is
the visualization!
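The greedy rule is: from the current city, always move to the closest city that has not been visited yet. The sketch below is one possible implementation, not necessarily the one used to produce the results here. The function name `nearest_neighbor_path` is hypothetical, and it assumes that row position equals CityId.

```python
import numpy as np

def nearest_neighbor_path(coords, start=0):
    """Greedy tour: repeatedly move to the closest unvisited city, then return to start."""
    coords = np.asarray(coords, dtype=float)
    unvisited = np.ones(len(coords), dtype=bool)
    unvisited[start] = False
    path = [start]
    current = start
    for _ in range(len(coords) - 1):
        candidates = np.flatnonzero(unvisited)
        diffs = coords[candidates] - coords[current]
        sq_dists = np.einsum('ij,ij->i', diffs, diffs)   # squared distances are enough to rank
        current = int(candidates[np.argmin(sq_dists)])
        unvisited[current] = False
        path.append(current)
    path.append(start)
    return path

print(nearest_neighbor_path([[0, 0], [10, 0], [1, 0], [5, 0]]))
# [0, 2, 3, 1, 0]
```

This version scans all unvisited cities at every step, so the running time grows quadratically with the number of cities.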
Code
df_path = pd.DataFrame({'CityId':nnpath}).merge(df_cities,how = 'left')
fig, ax = plt.subplots(figsize=(20,20))
ax.plot(df_path['X'], df_path['Y'])
Output
Approach 5: Nearest Neighbor / Greedy Algorithm With Prime Swaps
So far we haven't used the prime-number constraint on the path in the optimization. It says
"every 10th step is 10% more lengthy unless coming from a prime CityId". So it is in our interest to
make sure that prime cities sit at the start of every 10th step.
Looking at the distribution of prime cities from the EDA, they appear to be evenly spread over the map,
so a prime city is usually available close to any 10th step and the swaps should not be costly.
Approach:
● Loop through the path. Whenever we encounter a city that has a prime CityId, we
try to swap it with a nearby city whose index ends in 9 (i.e. the start of a 10th step), if
the swap makes the total path shorter.
● We try to swap the prime city with the two cities that occur before the current city and
have an index ending in 9, and the two cities that occur after the current city and have an
index ending in 9.
○ For example, if we have a prime city in the 76th place, we check
whether swapping it with any of the cities in the 59th, 69th,
79th and 89th places results in a shorter path.
● When checking whether the path becomes shorter after a swap, we
only check the length of the sub-path affected by the swap. There is no need to
check the length of the entire path. This makes the search a lot more efficient.
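The swap procedure described above can be sketched as follows. This is an illustration under assumptions, not the exact code behind the results: the names `segment_cost` and `prime_swaps` are hypothetical, `xy` is assumed to be indexable by CityId, and `is_prime` is assumed to be a boolean list like the one produced by the sieve. Step k goes from `path[k-1]` to `path[k]`, so the 10th steps start at indices ending in 9.

```python
import numpy as np

def segment_cost(path, lo, hi, xy, is_prime):
    """Penalized length of steps lo+1..hi, where step k goes path[k-1] -> path[k]."""
    cost = 0.0
    for k in range(lo + 1, hi + 1):
        a, b = path[k - 1], path[k]
        d = np.hypot(xy[b][0] - xy[a][0], xy[b][1] - xy[a][1])
        if k % 10 == 0 and not is_prime[a]:
            d *= 1.1                          # 10th step not starting from a prime city
        cost += d
    return cost

def prime_swaps(path, xy, is_prime):
    path = list(path)
    last = len(path) - 2                      # the first and last entries (start city) stay put
    for i in range(1, last + 1):
        if not is_prime[path[i]] or i % 10 == 9:
            continue                          # not prime, or already at the start of a 10th step
        base = (i // 10) * 10 + 9             # nearest index ending in 9 at or after i
        for j in (base - 20, base - 10, base, base + 10):
            if j < 1 or j > last or is_prime[path[j]]:
                continue
            lo, hi = min(i, j), max(i, j)
            before = segment_cost(path, lo - 1, hi + 1, xy, is_prime)
            path[i], path[j] = path[j], path[i]
            after = segment_cost(path, lo - 1, hi + 1, xy, is_prime)
            if after < before:
                break                         # keep the swap
            path[i], path[j] = path[j], path[i]   # undo the swap
    return path
```

Only the steps between the two swapped positions (plus one step on either side) can change length or penalty, so comparing that sub-path before and after the swap is enough to decide whether the whole path got shorter.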
Results
A summary of our results in each step can be seen below:
Method                                                   Distance
1. Dumbest path                                          446884407.5212135
2. Sorted city path                                      196478811.25956938
3. Sorted cities within a grid path                      3226331.4903367283
4. Greedy Nearest Neighbor Algorithm                     1812602.1861388374
5. Greedy Nearest Neighbor Algorithm with Prime Swaps    1811953.6824856116
The Greedy Nearest Neighbor Algorithm with Prime Swaps covers the least distance, so we
conclude that it is the best of the five approaches we tried.