Python Exercises: Results

Exercise 1: Introduction to Python
The main purpose of script programming is to automate tedious work in processing our data, and to use logic to
direct that process. I think those two words are key: automate and logic. They distinguish this activity from the
more common interaction we use computers for most of the time. To communicate by email, to compose a
document, or to design a map, we need to interact; to process a lot of data, we need to automate and use logic
to guide the automation.
In geoprocessing script logic, we’ll make decisions that allow us to, for example, handle rasters differently from
vector data, or only set map projections for unprojected data, or process datasets collected only at certain times.
For any serious GIS work, scripting and other forms of programming become a necessity, not an option.
In this exercise, we’ll explore the use of Python to create scripts that allow us to use the vast suite of
geoprocessing tools in ArcGIS Pro. All of the tools you can use from the toolbox or in a model can also be used
in a Python script. And these scripts can be made into script tools that we can use like any other geoprocessing
tools. We’ll be doing this in a later section of the exercise.
Here's a guide to going through the exercise. Obviously all of the text is important or I wouldn't have written it. It
will guide you through what you need to learn. But there will be certain parts where you need to respond,
and I've used icons to point these out:
➦ This directs you to do something specific, maybe in the operating system or answer something conceptual.
⌨ To run this code cell, press the Run button at the top or type Ctrl-Enter in the code cell.
In [ ]: x = 5
print(x)
➦ Go through Help menu to learn your way around. Jupyter Notebook also has a tour. You'll find keyboard
shortcuts, a more extensive reference, and Markdown formatting. And you'll see that there are links to references
for a lot of things we'll be using in the class: Python, NumPy, Matplotlib, pandas.
Now create a character-string scalar named msg and assign it "Hello World". It should look like this:
Reminder: Look for the keyboard icon for when you're supposed to write code in the following code cell:
In [ ]: #
Hello World
In [ ]: #
25
➦ The next code cell should just be run; you don't need to try to figure it out. It provides multiple outputs from a
code cell. You can save this with any notebook you create, using it as "boilerplate".
Think of a variable as a box where you can store anything, even other boxes (variables). When programming, we
represent variables with one or more letters of our choosing (x, i, slope_raster). We then assign a value to a
variable with the equal sign “=” (name = “Anne”). This value can be in the format of a number, a string of text, a
list, or a complex set of things. We can re-assign a new value to our variable as many times as we want – which
is why we call it a variable: its value can vary. Type out the following in the console to get a better sense of
variables. Text after the "#" is a comment:
numeric variables
There are two types of numeric variables in Python: integer and floating point. Integer variables hold integer
numbers only. Floating point variables hold decimal point numbers.
⌨ Run some similar code below to explore how variables and printing work:
In [ ]: x = -17
print(type(x))
x = -23.568
print(type(x))
<class 'int'>
<class 'float'>
msg = "Hello" creates string variable msg and assigns “Hello” as its value
path = "c:/prog/mydata" string variable path with value "D:/prog/mydata"
⌨ Run some similar code below to explore how string variables work. Make sure to create msg and path
variables you'll use below.
In [ ]: #
c:/prog/mydata
tf = 5 > 6
print(tf)
print(type(tf))
print(type(True))
In [ ]: #
False
<class 'bool'>
<class 'bool'>
It's often useful to realize that True can also be interpreted or entered as the integer 1 , and False as 0 , as
you can see by including the Boolean variable in a mathematical expression like False * 2 or True * 2 .
In [ ]: False * 2
Out[ ]: 0
In [ ]: True * 2
Out[ ]: 2
In fact, any non-zero value will be interpreted as True , but False is always zero.
In [ ]:
Functions are written as functionName(input) . We'll see a lot of these in the math module, but you can
see what the built-in functions are at https://docs.python.org/3/library/functions.html
In [ ]: import math
sin = math.sin
radians = math.radians
pi = math.pi
abs(-5)
sin(30/180*pi)
sin(radians(30))
Out[ ]: 5
Out[ ]: 0.49999999999999994
Out[ ]: 0.49999999999999994
Methods are written as obj.method(parameters) and apply to that object. To see what methods apply
to an object, type the dot and press tab. There aren't many for the simple numeric variables. Try it out, but
here's one example.
y = 0.5
y.
y.as_integer_ratio()
In [ ]: #
Out[ ]: (1, 2)
In [ ]: mystr = "hello"
mystr.capitalize()
Out[ ]: 'Hello'
Properties are similar to methods in being applied to an object, but there are no parameters, thus no () ;
they are simply a property of some sort. Once again, to see what properties are available to an object, press
tab after the dot. Try it out:
x = 2
x.
x.denominator
In [ ]: #
Out[ ]: 1
list
A list is simply an ordered collection of objects, which might be entered as numeric, Boolean or string constants.
Syntax: To create a list, you must use brackets [] :
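⌨ For example (the original cell isn't shown; these values are hypothetical, chosen to match the printed output a bit further below):

aList = [2, 5, r"c:\data\exer"]
aList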
In [ ]: #
You can also populate a list with previously assigned variables or other objects like lists.
⌨ Enter some new variables and combine these and some constants into a list:
x = 5
msg = "Hello"
path = "c:/py/data"
anotherList = [x, msg, path, 2.7, aList]
anotherList
In [ ]: #
In [ ]: #
2
5
c:\data\exer
Subsetting lists:
You can pull out a subset of a list (sometimes called slicing) by using list indices defining the position or positions
of list items that you want to extract. This takes a bit of getting used to, but the key is to know that the index
refers to a position between each element in a list (the original exercise illustrates this with a diagram).
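The examples that follow use a list named lyr defined in a cell that isn't shown; a plausible reconstruction consistent with the outputs below (the "streams" entry is a guess) is:

lyr = ["geology", "landuse", "streams", "cities"]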
You can use a single index value to refer to a single item, starting at that position; here lyr[1] returns:
Out[ ]: 'landuse'
In [ ]: lyr[1:2]
Out[ ]: ['landuse']
... which means the same thing: it starts at index 1 and stops at the start of index 2.
Look for the ❔ question mark to answer a question, by editing the markdown cell it's in and entering a new line.
Once you're done editing, you can "run" it to format it, just like a code cell.
In [ ]: #
⌨ You can also identify positions relative to the end of the list: lyr[-2:] starts two indices before the end of
the list:
In [ ]: #
In [ ]: #
Out[ ]: ['cities']
⌨ How would you get only the first item, as a list of that one item? lyr[:1]
In [ ]: #
Out[ ]: ['geology']
If you have a list of one item, to get what that one item contains, you can follow it with another list request:
In [ ]: print(lyr[1:2])
print(lyr[1:2][0])
print(lyr[1]==lyr[1:2][0])
['landuse']
landuse
True
If you create an empty list (i.e. myList = [] ) or you just want to add more items to your list, you can use
the append method.
⌨ Recall that a method is something you do to a Python object. The syntax for a method is
object.method(), as in Addresses.append(a1) below:
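The variables a1 , a2 and a3 were assigned in a cell that isn't shown; judging from the tuple version of this example later in the exercise, they presumably held address lists like:

a1 = [1212, "First St", "SF", "CA"]
a2 = [2323, "Second St", "Seattle", "WA"]
a3 = [3434, "Third St", "Denver", "CO"]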
Addresses = []
Addresses.append(a1)
Addresses.append(a2)
Addresses.append(a3)
print(Addresses[0])
print(Addresses[1][1])
print(Addresses[2][3])
In [ ]: #
⌨ Then how would you print the house number from the third record of Addresses?
Addresses[2][0]
In [ ]: #
Out[ ]: 3434
⌨ Create and print a list built from your name and semester schedule, with the semester schedule built of lists
with prefix (e.g. "GEOG") and course number (e.g. "625").
In [ ]:
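One possible answer (a sketch; the name and courses are hypothetical):

name = "Anne"
schedule = [["GEOG", "625"], ["GEOG", "610"]]
myList = [name, schedule]
print(myList)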
append vs extend
When we used .append above, we appended one item each time, which we saw was often itself a list. What if we
wanted to simply combine two lists into one longer one? That's what .extend does, or its equivalent, adding
lists together with a + operator, which is really simpler, so we'll use that:
Pacific = ['AK','CA','OR','WA']
Desert = ['AZ','NV','UT']
Mountain = ['ID','MT','WY','CO','NM']
WestStates = Pacific + Desert + Mountain
WestStates
In [ ]: #
Out[ ]: ['AK', 'CA', 'OR', 'WA', 'AZ', 'NV', 'UT', 'ID', 'MT', 'WY', 'CO', 'NM']
In [ ]: States = Pacific       # this doesn't copy the list: States and Pacific now name the same object
States.append(Desert)  # appends Desert as one nested item (and changes Pacific too)
States
Sorting lists
.sort sorts a list. If we have a random list...
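The cell that created the random list isn't shown; a sketch that would produce a list like the one below (the random module is covered in the next exercise):

import random
mylist = [random.randint(0, 9) for i in range(20)]
mylist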
Out[ ]: [1, 1, 7, 3, 7, 8, 2, 8, 3, 0, 9, 6, 4, 1, 3, 1, 1, 8, 8, 4]
... the sort() method sorts it. Note that it changes the list it's called on, since it's a method of the list. Then we can
display it:
mylist.sort()
mylist
In [ ]: #
Out[ ]: [0, 1, 1, 1, 1, 1, 2, 3, 3, 3, 4, 4, 6, 7, 7, 8, 8, 8, 8, 9]
Counting lists
.count(x) returns the number of occurrences of the value x specified.
for i in range(10):
    print("frequency of {}: {}".format(i, mylist.count(i)))
Out[ ]: [2, 2, 1, 9, 6, 1, 9, 2, 1, 5, 0, 4, 8, 3, 6, 4, 3, 9, 5, 6]
frequency of 0: 1
frequency of 1: 3
frequency of 2: 3
frequency of 3: 2
frequency of 4: 2
frequency of 5: 2
frequency of 6: 3
frequency of 7: 0
frequency of 8: 1
frequency of 9: 3
⛬ We haven't used loops before, but see if you can interpret what the code above did and how it worked.
Tuples
We've just been using lists, which allow you to append new members to the list. Sometimes if you know you're
not going to want to change anything in a list, you may want to use an immutable collection called a tuple. It
otherwise works the same as a list, but to create one you use parentheses instead of brackets. You can also
create them with commas, so the following tuples are identical:
In [ ]: mytuple1 = 5, 7, "name", 8
mytuple2 = (5, 7, "name", 8)
print(mytuple1)
print(mytuple2)
(5, 7, 'name', 8)
(5, 7, 'name', 8)
⌨ So we might use tuples in the above example. While we may want to append to the set of addresses, each
individual address could be a tuple. So create a comparable example this way:
Addresses = []
Addresses.append(a1)
Addresses.append(a2)
Addresses.append(a3)
print(Addresses)
In [ ]: #
[(1212, 'First St', 'SF', 'CA'), (2323, 'Second St', 'Seattle', 'WA'), (3434,
'Third St', 'Denver', 'CO')]
⌨ You may want to note that accessing a part of a tuple requires using square brackets (since using
parentheses might be interpreted as a method or function call), like this:
Addresses[1][1]
In [ ]: #
Dictionaries
Dictionaries are used when you want to store named data as key:data pairs, and are created with braces. We'll find
dictionaries useful when we start working with NumPy arrays and pandas.DataFrames, where we'll want to start
organizing our data with variable names or individual object names. A dictionary is a collection that is ordered (as
of Python 3.7) and mutable, but does not allow duplicate keys.
Two common purposes of dictionaries for GIS data processing are to create rows (records or observations) and to
create columns (fields or variables) in our data. We'll start with creating a row.
CA = {
"name":"California",
"capital":"Sacramento",
"areakm2":423970,
"population":39538223
}
print(len(CA))
CA
In [ ]: #
GROVELAND = {"ELEVATION":853,
"LATITUDE":37.8444,
"LONGITUDE":-120.2258,
"PRECIPITATION":176.02}
GROVELAND
In [ ]: #
⌨ Now we'll create a (short) variable PRECIPITATION with keys as indices, and see how the key works this
way:
PRECIPITATION = {"GROVELAND":176.02,
"LEE VINING":71.88,
"PLACERVILLE":170.69}
PRECIPITATION["PLACERVILLE"]
In [ ]: #
Out[ ]: 170.69
PRECIPITATION["BRIDGEPORT"] = 41.4
PRECIPITATION
In [ ]: #
⌨ There are several dictionary methods, and a very useful one is to access the keys themselves as a list, which
then can be used for various purposes, such as looping through the data.
print(PRECIPITATION.keys())
for station in PRECIPITATION.keys():
    print(PRECIPITATION[station])
In [ ]: PRECIPITATION.keys()
for station in PRECIPITATION.keys():
    print(PRECIPITATION[station])
176.02
71.88
170.69
41.4
In [ ]: GROVELAND.keys()
We'll be looking at dictionaries more when we get to NumPy arrays and pandas, where some database objects
expect to be built with dictionaries.
Arithmetic operators
Check the following arithmetic operators as to what they return, integer or floating point:
Addition or subtraction:
x = 2 + 3
y = 2. + 3
z = 2 - 3
In [ ]: #
<class 'int'>
<class 'float'>
<class 'int'>
You don't really need to assign a variable, since you can just check an expression like the following.
In [ ]: print(2.+3)
5.0
But by assigning a variable and then checking that same variable, you avoid accidentally testing two different
expressions due to typos, and this is good practice for coding.
In [ ]: y = 2. + 3
print(y)
type(y)
5.0
Out[ ]: float
z = 2 - 3
print(z)
print(type(z))
In [ ]: #
-1
<class 'int'>
Multiply
2 * 3
2. * 3
⌨ For each of these, assign variables, print the result and type, like the following:
m = 2 * 3
print(m)
print(type(m))
In [ ]: #
6.0
<class 'float'>
⛬ Interpretation (Multiplication):
Divide
1 / 2
1. / 2
4 / 2
In [ ]: #
Out[ ]: 0.5
In [ ]: #
Out[ ]: 0.5
In [ ]: #
Out[ ]: 2.0
⛬ Interpretation (Divide):
Power
5 ** 2
5 ** 2.0
In [ ]: #
Out[ ]: 25
In [ ]: #
Out[ ]: 25.0
Square root
25 ** (1/2)
25 ** (0.5)
In [ ]: #
Out[ ]: 5.0
Modulo
Occasionally very useful is the remainder from division, called the "modulo". Knowing the remainder from division
is great when you have wrapping values, like those of a clock or a compass. You can also use modulo to figure
out if one number is divisible by another (modulo will equal 0). In Python, the operator used for modulo -- % -- is
unfortunately confusing, but its use is simple, and best seen by example.
print("modulus = 10")
for n in range(2, 14, 2):
print(n, n % 10) # 10 might be a common repeated value
print("modulus = 360, for compass azimuth (°)")
for n in range(90, 720, 90):
print(n, n % 360) # Compass azimuth (°) is a good application of modulos in a
cycle.15
In [ ]: #
modulus = 10
2 2
4 4
6 6
8 8
10 0
12 2
modulus = 360, for compass azimuth (°)
90 90
180 180
270 270
360 0
450 90
540 180
630 270
Conversion functions
We'll be looking at many additional functions imported from various modules, but there are some functions that
are so commonly needed that they are built in to the core language. Conversion functions of various types are
good examples. For example, you often need to convert numbers to other formats, or convert strings to numbers
and vice-versa.
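Sketches matching the outputs below (the cells aren't shown; the specific values are inferred from the outputs):

str(5.278)       # number to string
int(4.17)        # float to int truncates the decimal part
float("4.17")    # string to float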
In [ ]: #
Out[ ]: '5.278'
In [ ]: #
Out[ ]: 4
int("4.17")
In [ ]:
In [ ]: #
Out[ ]: 4.17
x = 9.53
y = int(x)
print(y)
z = float(y)
In [ ]: #
x = "7.25"
int(float(x))
In [ ]: #
Out[ ]: 7
❔ Why did this work? (in relation to int(x) , which raises an error):
To review: Functions, Methods, and Properties: With numerical objects we've been working with functions
which apply to the numerical object in the form function(objects) . A method is like a function but it applies
to an object that is in a class that includes that method as a capability, and is run as object.method() . We'll
also see some properties which look a bit like methods, with the form object.property , and thus have no
parameters; they are just properties of the object.
A string object might be a string literal (like "a string" or 'c:/625/pr/hmb/landuse.shp' ) or a variable
that has been assigned a string. Methods are applied by writing them in after a dot following the string object.
The best way to understand strings is to try them in Python, which we’ll do next. Key takeaway: You should
clearly understand (1) how a string can be assigned to a variable; (2) how you can use and extract parts of a
string variable; and (3) how strings can be manipulated.
In [ ]: #
⛬ Interpret the above. What did you learn from the above?
String indices
Strings can be dealt with as a list of characters, with the first one having a zero index. A set of string characters
can be defined with a format like 1:3, meaning "from the beginning of 1 to the beginning of 3" -- so it doesn't
display the character that is at index 3.
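The string s was assigned in a cell that isn't shown; from the outputs below it was evidently:

s = "the science of where"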
print(s[0])
s1 = s[1]
print(s1)
print(s[2])
print(s[-5:])
print(s[4:11])
print(s.split(" ")[1])
In [ ]: #
t
h
e
where
science
science
In [ ]: #
where
5
Concatenating strings
In Python, strings can be concatenated using + . No spaces are inserted between them however, so you often
need to also concatenate spaces:
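The words list presumably came from splitting s (a plausible reconstruction of the hidden cell):

words = s.split(" ")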
print(words[3] + words[1])
print(words[3] + " " + words[1])
In [ ]: #
wherescience
where science
print('Jerry\'s Kids')
print('\nJerry\'s\nKids')
In [ ]: #
But what if you want to include a backslash itself, such as in a Windows file path which uses backslashes?
You can create a single backslash by using a double backslash \\ , so a path might look like the following.
print('d:\\work\\soil.shp')
In [ ]: #
In [ ]: #
In [ ]:
p = 'd:/work/lu.shp'
print(p.find('.'))
print(p[p.find('.'):])
print(p.split("/"))
In [ ]: #
10
.shp
['d:', 'work', 'lu.shp']
⌨ So one solution is to create a raw string by prefacing it with an r like the following:
print(r'd:\work\soil.shp')
Note that a backslash doesn't always produce a special character result; it depends on what follows it. Best
practice for filepaths is to always use r"..." to create a raw string.
In [ ]: #
Jerry's Kids
Jerry's
Kids
Then there are various comparison and Boolean operators for producing and combining Boolean values, like
x = 5
y = 2
print(x > y)
print(not (x > y))
print(x == y)
In [ ]: #
True
False
False
Out[ ]: False
print(2*(x>y))
j = (x>y)+(x==y)+(y>x)
print(j)
In [ ]: #
2
1
And if you think this is just esoteric, you may be surprised at how useful this is for spatial analysis.
print((x>y)|(y>x))
print((x>y)&(y>x))
print((x>y)|(x==y))
print(not (x==y))
print((x>y)|(-1*(x==y))==True)
In [ ]: #
True
False
True
True
True
Exercise 2: Logic, Flow, and Input/Output
There are also many modules you can download -- for example for numeric processing, such as numpy -- but to find
the latest, do searches at www.python.org or Google.
To use a given module, it must be imported first. You do a lot of importing of modules in Python. Normally you would
put an import line at the top of your program. For instance:
import sys
would occur at the top of your program before you use any sys methods. It doesn't have to be the first line, just
before you use it. If you are entering commands through the console, then just do the import before you need to
use it.
Multiple modules, separated by commas, can be entered at one time, such as:
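For example (an illustrative line; the original's example isn't shown):

import sys, os, math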
➦ First, just run the following boilerplate code to allow multiple outputs from a cell:
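This is the same boilerplate that appears in the later exercises here:

from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"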
import math
print(math.log10(100))
Note that all module functions are prefaced with the module name. This allows for re-use of function names in
different modules.
In [ ]: #
2.0
⌨ print(math.log(100))
In [ ]: #
4.605170185988092
⌨ print(math.pi)
In [ ]: #
3.141592653589793
❔ Other than what it returns, how is math.pi different from math.log() ? What do the parentheses indicate?:
⌨ pi = math.pi
print(pi)
pi
In [ ]: #
3.141592653589793
Out[ ]: 3.141592653589793
❔ What does the above tell you about the print statement for variables? :
Note: the following cells relate to trigonometry, and they expect you to remember the types of units used for
measuring angles and how basic trigonometric functions work. You may want to review these.
In [ ]: #
Out[ ]: 0.0
Out[ ]: 1.0
Out[ ]: 1.2246467991473532e-16
Out[ ]: -1.0
Look carefully at the output of sin(pi) . It's in computerese scientific notation, where the e-16 gives us
the power of 10, so this one should be read as 1.2246467991473532 × 10⁻¹⁶. Knowing what
value to expect from sin(π), what value does this really represent?
In [ ]: #
Out[ ]: 0.7071067811865476
Out[ ]: 0.7071067811865476
Out[ ]: 30.000000000000004
Out[ ]: 30.000000000000004
⛬ Interpret the results of the above code in terms of angle unit conversions:
For a module like math, the built-in help system may be useful for finding some quick information. Entering
help(math) may help.
In [ ]:
⛬ Use the above to find help on something useful you weren't previously aware of in the math module.
Consider the following and envision how it might be useful in working with spatial problems:
x0 = 14; y0 = 8
x1 = 17; y1 = 12
dx = x1-x0; dy = y1-y0
h = (dx**2 + dy**2)**0.5
h
math.hypot(dx,dy)
In [ ]: #
Out[ ]: 5.0
Out[ ]: 5.0
import random
print(random.random())
rnd = random.random
for i in range(5):
    print(rnd())
In [ ]: #
0.5078208875118928
0.33065927005499385
0.33767419682200284
0.976511861465647
0.4266331853676444
0.007580517526020292
rndi = random.randint
for i in range(5):
    print(rndi(0,100))
for i in range(5):
    print(int(rnd()*100))
In [ ]: #
57
79
19
22
60
11
6
24
44
82
mu = 50
s = 10
for i in range(10):
    print(random.gauss(mu, s))
In [ ]: #
40.0817300002313
58.860089693839356
36.97461795602848
65.74460501389359
42.35850155701855
61.27908385468605
37.605217033380484
38.59688917701296
25.039607376019987
74.712744232311
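The code for this next cell isn't shown; a sketch that would produce output of this form (reusing rndi and the .count loop from the previous exercise):

mylist = [rndi(0, 8) for i in range(20)]
mylist
mylist.sort()
mylist
for i in range(10):
    print("frequency of {}: {}".format(i, mylist.count(i)))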
In [ ]: #
Out[ ]: [6, 3, 4, 2, 0, 5, 1, 7, 5, 2, 4, 6, 8, 6, 8, 7, 3, 6, 5, 2]
Out[ ]: [0, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6, 7, 7, 8, 8]
frequency of 0: 1
frequency of 1: 1
frequency of 2: 3
frequency of 3: 2
frequency of 4: 2
frequency of 5: 3
frequency of 6: 4
frequency of 7: 2
frequency of 8: 2
frequency of 9: 0
In [ ]:
while : keep executing (loop through) the set of statements multiple times while a condition is true
for : loop through the set once for each value in a range of values

The if, while, and for statements all end in a colon, followed by an indented block of statements to be used under
the conditions defined by the if, while or for. In the code cell, you'll note that after you enter the line with a colon,
the next line will be indented. On the line following the last statement you want in the block, backspace to get back
to an unindented section.
if
Scenario: You would like to create a series of hillshade rasters to represent summer, winter and equinox
conditions. The hillshade tool requires inputs of sun angle (the sun's maximum altitude in the sky on a given
day) and azimuth. The sun angle depends on the solar declination. You could look up values in a table, but why
not have the computer derive these from what you know? We'll start with a somewhat informed situation -- we
know the values for solar declination for four significant dates during the year:
⌨ Enter the following code that derives sun angle and azimuth from solar declination (the latitude where the
sun's rays are vertical at noon) and latitude. It starts with information to populate two variables, lat (latitude for
the area of study, negative if south of the equator) and decl (solar declination), and from these derives
sunangle and azimuth :
lat = 30
decl = 20
sunangle = 90 - lat + decl
azimuth = 180
if sunangle > 90:
    sunangle = 180 - sunangle
    azimuth = 0
print(f"Noon sun angle {sunangle}, at azimuth {azimuth}")
(Note that we used an alternative formatted printing method, f"..." , which works much the same as
"...".format() but is perhaps a bit easier to read in code, since the variable names appear where you use
them.)
In [ ]: #
For now, we've hard-coded the inputs of lat and decl . In this case sunangle would be assigned the value
80 since 90 - 30 + 20 is evaluated as 80. The next line assigns 180 to the variable azimuth . There's then a
section of statements that assigns new values to sunangle and azimuth if sunangle ends up with value
greater than 90 after the first two lines are processed. Sun azimuth is either from the south (180) or from the
north (0) at solar noon.
The structure:
- starts with if followed by a Boolean expression ( sunangle > 90 ) as a condition, followed by a colon :
- the code that will run if the condition is true follows as an indented series of statements
- the next unindented code continues the program: it runs whether or not the if block runs (as long as
there's no error raised)
⌨ Here's a simple if structure that just prints out what it's doing:
print('\n\nStart of script...')
x = 5
if x > 0:
    print('In the "if" block, since x > 0 ...')
    print('Still in the indented "if" block ...')
print("Not indented -- we're at the next step in the script.")
In [ ]: #
Start of script...
In the "if" block, since x > 0 ...
Still in the indented "if" block ...
Not indented -- we're at the next step in the script.
In [ ]: #
Start of script...
Not indented -- we're at the next step in the script.
In GISci, we often need to work with data, so we'll explore some simple methods of accessing data using
relevant file paths, which is one application of flow control structures like if .
We just looked at using if to demonstrate running blocks of indented code using a condition. In the following, we'll
also explore another handy module, os.path .
Create a data folder in the folder where your jupyter files are being saved.
Then in the data folder create a text file "test.txt". You might want to make sure your folder is not hiding file
extensions (tools/folder options/view tab) so you don't create "test.txt.txt".
⌨ Try the following code. Make sure to indent the print statement.
import os.path
if os.path.exists("data"):
print("data folder exists")
In [ ]: #
❔ Did the data folder exist? If not, did you remember to create it first?
In our code, we've made use of relative paths which is a good practice since it makes your code more portable.
However, it's common to need to access absolute paths, sometimes on a connected server, external drive, or
even on the same computer but in a different location. So we need to know how to work with absolute paths.
➦ Use your OS to find the path to the data folder we just created. If you're not sure how to do this one way in
Windows at least is to click in the path area of the file explorer window until it changes to display the path,
looking something like mine: C:\py\ex01\data .
⌨ Alter your code to read as follows, replacing my path with yours. Note the use of backslashes in the path:
import os.path
if os.path.exists("C:\py\ex01\data"):
print("data folder exists")
else:
print("data folder doesn't exist")
In [ ]: #
Interestingly, the code works fine, but this may be because the interpreter in the IDE is fixing things. You may find
that backslashes don't work in some IDEs since a backslash is interpreted as an "escape" character, with "\t"
representing a tab, etc., so I'm used to prefacing the path string with an r , so the path above would be
r"C:\py\ex01\data" .
⌨ You can also have other conditions to test if the first condition isn't met, using the elif statement. Note the use
of indentation.
import os.path
if os.path.exists(r"C:\py\ex01\data\test.txt"):
print("Test folder exists.")
print("Text file exists.")
elif os.path.exists(r"C:\py\ex01\data"):
print("Test folder exists")
print("... but text file doesn't.")
else:
print("Neither exist.")
In [ ]: #
Neither exist.
while
⌨ Try the following code, which illustrates a while loop:
i = 1
while i < 10:
    print(i)
    i = i + 1
In [ ]: #
1
2
3
4
5
6
7
8
9
The last bit of code employs a variable as a "counter" to keep track of how many times we've gone through the
loop. Looking at the last statement, we can clearly see that it represents an assignment, not a statement of
equality: i can never be equal to i + 1 , right? But we can assign a new value of i to be one greater than its
previous value. This is a very basic but extremely important concept in programming; if you don't feel comfortable
with this, ask for help before you continue.
Note: The expression i < 10 can be evaluated as True or False , so this allows us to process a set of code
after evaluating a condition. As we'll see, there are many situations where we will want to use conditional code --
a type of low-level decision that illustrates why we use computers.
⌨ A common shorthand for incrementing a counter is the += operator, which does the same thing:

i = 1
while i < 10:
    print(i)
    i += 1
Note: an interesting experiment would be to un-indent the last line so it's not in the loop. But if you run it, you'll
want to interrupt the kernel using the black square, since you've created an infinite loop.
One advantage of the while loop is it lets us skip the whole section if the condition isn't met to begin with. It's
even tempting to use it instead of an if statement, but this is an easy way to get into an endless loop: the while
loop will keep repeating until the condition is false. When looping through datasets, a common task in GIS work,
the while structure is often useful. We'll see some examples when we get to using the geoprocessor.
In [ ]: #
1
2
3
4
5
6
7
8
9
for
⌨ Try the following code, which illustrates a for loop:
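The cell contents aren't shown; a sketch that matches the output below:

for i in range(1, 5):
    print(i, "Hello World")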
In [ ]: #
1 Hello World
2 Hello World
3 Hello World
4 Hello World
❔ This will also run it four times, but how does it differ?
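A sketch matching this output (note that range(4) starts the counter at 0):

for i in range(4):
    print(i, "Hello World")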
In [ ]: #
0 Hello World
1 Hello World
2 Hello World
3 Hello World
⌨ Note the use of lists and the len function in the following.
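The hidden cell presumably looked something like this (the layer and field names are taken from the output below):

lyrs = ["geology", "landuse", "publands"]
flds = ["TYPE-ID", "LU-CODE", "PUBCODE"]
for lyr in lyrs:
    print(lyr)
for i in range(len(lyrs)):
    print(i, lyrs[i], flds[i])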
In [ ]: #
geology
landuse
publands
0 geology TYPE-ID
1 landuse LU-CODE
2 publands PUBCODE
A very useful task for programming is to do something to every file (of a given type) in a folder. We might want to
reproject or clip every shapefile in a folder, for instance. We can use the os package to look at file names and
identify files by their extension, like .shp for instance.
➦ Go to your data folder and create some more text files, all ending with .txt . It doesn't matter what's in
them; we're just going to use them to provide a list.
⌨ Then use this code to list the names of the text files. You can imagine that we could then do something with
each file, but for now we're going to list them.
import os
print(os.getcwd())
ws = "data"
ilist=os.listdir(ws)
txtfiles = [] # Start with an empty list
for i in ilist:
    if i.endswith(".txt"):
        txtfiles.append(i)
txtfiles
for f in txtfiles:
    print(f)
In [ ]: #
c:\Users\900008452\Box\course\625\exer\Ex02_LogicFlowIO
Out[ ]: ['untitled.txt']
untitled.txt
Optional: If you have some shapefiles (including the various files that go with them), copy them into the data
folder and modify the code to list them. There's just one simple change to make in the code.
lat = 37
decls = [-23.44,-20,-12,0,12,20,23.44,20,12,0,-12,-20]
.... code to complete, ending with the following list outputs ...
decls
sunangles
azimuths
In [ ]: #
Out[ ]: [-23.44, -20, -12, 0, 12, 20, 23.44, 20, 12, 0, -12, -20]
Out[ ]: [29.56, 33, 41, 53, 65, 73, 76.44, 73, 65, 53, 41, 33]
Out[ ]: [180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180, 180]
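One possible completion (a sketch reusing the sun-angle and azimuth logic from the if example above):

lat = 37
decls = [-23.44,-20,-12,0,12,20,23.44,20,12,0,-12,-20]
sunangles = []
azimuths = []
for decl in decls:
    sunangle = 90 - lat + decl
    azimuth = 180
    if sunangle > 90:
        sunangle = 180 - sunangle
        azimuth = 0
    sunangles.append(sunangle)
    azimuths.append(azimuth)
decls
sunangles
azimuths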
Debug trick: toggling comments... A useful method to try in the code editor is toggling on and off
commenting on a section of code: just select anywhere in a single or multiple lines of code and
press Ctrl-/ . That toggles it on or off. This is very handy for checking variables before running
something that uses them. For the above code, try this out by commenting out the two .append
lines and inserting lines that just print the sunangle and azimuth. You can then toggle on and off
those print statements as well. So far, our code isn't very complicated, but if you practice this,
you'll find it really helpful as things get more challenging...
In [ ]: # boilerplate
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
⌨ The following code defines a distance() function that uses the Pythagorean theorem to calculate the
straight line distance between any two UTM coordinates, which we'll provide as points a and b .
The function, named distance , takes in two parameters, each of which is assumed to be a list or tuple
representing a coordinate pair of [x,y] or (x,y) .
Note the importance of the return command: in the R language, the value returned is simply the last expression in
the defined function, so the last line would simply read dist , but Python requires you to explicitly identify it
with a return statement.
Note that the def structure is similar to the flow control structures we just looked at: with a colon on the def
line and the lines of code indented.
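A sketch consistent with that description (the original cell isn't shown, so the internal names are assumptions):

def distance(pt0, pt1):
    import math                      # Pythagorean distance between two (x, y) pairs
    dx = pt1[0] - pt0[0]
    dy = pt1[1] - pt0[1]
    dist = math.sqrt(dx**2 + dy**2)
    return dist                      # Python needs an explicit return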
In [ ]: #
⌨ Now create the inputs as two points a and b entered as hard-coded tuples, and process them with the new
function:
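A sketch using the coordinates that appear in the distangle call later in this exercise:

a = (520382, 4152373)
b = (520782, 4152673)
distance(a, b)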
In [ ]: #
Out[ ]: 500.0
⌨ Then try the same thing, creating c and d , but create them as lists instead of tuples. Or make one a tuple and
the other a list; it shouldn't matter.
In [ ]: #
Out[ ]: 500.0
⌨ A graph may help, and we'll jump ahead and use a little numpy and matplotlib. Feel free to copy and paste
this into the next code cell, since we haven't learned about numpy and matplotlib yet, so this isn't the time to
figure it out yet (unless you're really anxious...)
import matplotlib
import numpy as np
from matplotlib import pyplot as plt
x = np.array([a[0],a[0],b[0],a[0]])
y = np.array([a[1],b[1],b[1],a[1]])
fig, ax = plt.subplots()
plt.plot(x,y)
ax.axis('equal')
plt.text(*a,"a",size=24)
plt.text(*b,"b",size=24)
dXlab = (a[0],(b[1]-a[1])/2+a[1])
dYlab = ((b[0]-a[0])/2+a[0],b[1])
plt.text(*dXlab,"dX",size=24)
plt.text(*dYlab,"dY",size=24)
In [ ]: #
In [ ]: #
Out[ ]: 500.0
⌨ Modify the script to also return the angle in degrees between the two points. Rename the function to be
distangle and return both values as a tuple with return dist, angle . (Hint: use math.degrees and
math.atan2(dx,dy) .)
Note that we are using azimuth-type angles, used for mapping, with the 0° to the north and going around
clockwise, in contrast to the way you would have learned it in math classes with 0° to the right and going around
counter-clockwise. That's why we're specifying (dx,dy) instead of (dy,dx) as we would in standard use in
math classes. The azimuth in the triangle above is the angle at a .
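A sketch following the hint (the parameter names pt0 and pt1 are chosen to match the keyword calls below):

def distangle(pt0, pt1):
    import math
    dx = pt1[0] - pt0[0]
    dy = pt1[1] - pt0[1]
    dist = math.sqrt(dx**2 + dy**2)
    angle = math.degrees(math.atan2(dx, dy))   # atan2(dx, dy) gives azimuth-style angles
    return dist, angle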
In [ ]: #
distangle(a,b)
distangle(pt1=(520782, 4152673), pt0=(520382, 4152373))
In [ ]: #
In [ ]: #
⌨ Change the parameters for the distangle() call to look instead like:
distangle(pt0=a, pt1=b)
distangle(pt1=b, pt0=a)
distangle(pt1=a, pt0=b)
In [ ]: #
Note the difference in order! Function parameters have a particular order, but if we name them we can provide
them in a different order.
plt.text(*a,"a",size=24)
plt.text(*b,"b",size=24)
We're seeing unpacking, which is something we need to use with the pyplot .text method, as we can see by
requesting help on the method with something like help(plt.text) , which works because we've imported
matplotlib.pyplot as plt . The required first three parameters of the .text method are x, y, s for the
coordinates followed by the text string to plot. To provide a and b as x , y we need to unpack each tuple, and
that's what * does.
Here's another example of the same thing where we can apply our distangle() function to add text showing
angles in degrees to a plot, and unpack the pt tuple.
⌨ Again, feel free to just copy and paste this into the cell; we'll learn this stuff next week.
import matplotlib
import numpy as np
from matplotlib import pyplot as plt
origin = (0,0)
for pt in [(1,0),(1,1),(0,1),(-1,1),(-1,0),(-1,-1),(0,-1),(1,-1)]:
    dist, angle = distangle(origin, pt)
    x = np.array([origin[0], pt[0]])
    y = np.array([origin[1], pt[1]])
    plt.plot(x, y)
    plt.text(*pt, str(angle % 360), size=18)
In [ ]: #
⛬ In the last line of code above, there are two parameters provided after *pt . Describe what's going on with
the next parameter and how that relates to what you see on the graph. Maybe experiment with changing it.
⌨ Let's change our functions to use four inputs to demonstrate the need for unpacking.
def dist(x1, y1, x2, y2):
    import math
    dx = x2 - x1
    dy = y2 - y1
    return math.sqrt(dx**2 + dy**2)
In [ ]: #
dist(*a, *b)
In [ ]: #
Out[ ]: 500.0
Unpacking a dictionary
Unpacking a dictionary is a little more complicated but has an interesting application for functions. It does
require that the dictionary use the parameter names expected by the function. One advantage of this is in
providing inputs in a different order, just as named keyword arguments such as
dist(x1=0, x2=1, y1=5, y2=5) can be given in any order for a function defined with them in a different order:
def dist(x1,y1,x2,y2):
⌨ The following code illustrates this and shows the input provided by unpacking and the equivalent standard
input:
Xs = {"x1": 0, "x2": 1}
Ys = {"y1": 5, "y2": 5}
print("dictionary input: {}".format(dist(**Xs, **Ys)))
print("equivalent standard input: {}".format(dist(x1=0, x2=1, y1=5, y2=5)))
In [ ]: #
2.4 Input/Output
In this section we'll look at input and output methods, starting with user input and output displays, then input and
output of files.
Output display
We'll start with output, since we've already been using a variety of output methods, including expressions in code
cells, the print statement, and formatted output using the .format() and f-string methods. We'll look at these in
more detail. For now, don't run the boilerplate we've been using, and if you have, then restart the kernel.
⌨ Let's start with the snippet of the sunangle code, and then make it more useful. We'll start with our example
where we "hard code" data directly into it by assigning variables. We'll print the results using the expression
method.
lat = 40
decl = 23.44
sunangle = 90 - lat + decl
azimuth = 180
if sunangle > 90:
    sunangle = 180 - sunangle
    azimuth = 0
sunangle
azimuth
In [ ]: #
Out[ ]: 73.44
Out[ ]: 180
Note that without the InteractiveShell setting in our boilerplate, you only get one output from a code cell, and that
will be the last expression in the code cell, so the azimuth .
⌨ One solution is to change the multiple expressions to one, but converted to a tuple with: sunangle,
azimuth
In [ ]: #
⌨ Now change it back to two separate expressions, run the boilerplate, and then run the above code again to
see the difference.
In [ ]: #
Out[ ]: 73.44
Out[ ]: 180
This is still pretty minimal in the way of an output, since we have to figure out what expression produced each
output, and sometimes an expression output is missing some useful formatting.
⌨ For instance try this code that concatenates two strings and inserts a new line with \n , as an expression:
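A sketch matching the output below (the wording is taken from that output):

"To be or not to be?" + "\nThat is the question."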
In [ ]: #
⌨ Now put that expression in a print() statement, and we'll see that line formatting:
In [ ]: #
To be or not to be?
That is the question.
Formatted output
⌨ Going back to our sunangle code, change the expressions to print statements.
In [ ]: #
73.44
180
Well, in this case that's really no different, since we're just printing a number, so you might think we could do
without that InteractiveShell boilerplate and just insert multiple print statements, but we'll find when we get
to pandas that printing a dataframe isn't as nice looking as just putting the dataframe name as an expression,
allowing Jupyter to format it in a different way. But that's later; we'll continue now with formatted output using
either the .format() or f-string method, which we've used a bit before but need to explore further.
For noon sun angle we'll specify a number occupying 5 spaces, and 2 decimal places with :5.2f
For azimuth we'll specify 3 spaces with 0 decimal places with :3.0f
⌨ Try both of these methods that should produce the same result:
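A sketch of the .format() version (the message wording is assumed from the earlier f-string example):

print("Noon sun angle {:5.2f}, at azimuth {:3.0f}".format(sunangle, azimuth))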
or
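the f-string equivalent:

print(f"Noon sun angle {sunangle:5.2f}, at azimuth {azimuth:3.0f}")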
In [ ]: #
Input
Now let's get away from the hard-coded input. We probably don't want to have to alter our program every time
we want to run it. There are a variety of ways of providing input. One way is to use the input method.
⌨ Modify the code for getting the two inputs this way:
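A minimal sketch (the exact prompt strings are assumptions):

lat = float(input("Latitude: "))
decl = float(input("Solar Declination: "))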
Look for a prompt when you run this for entering Latitude and Solar Declination.
In [ ]:
Try this with different values (Don't type values into the code, but respond to the prompts). Valid
values of latitude are -90 to 90, while valid solar declination values are -23.44 to 23.44. Try -70
for latitude and 23.44 for declination to see what the sun angle would be at 70°S on the June
solstice. Interpret what the above is showing us.
Note: the first coding problem assumes you have previously created a data folder. It doesn't matter what's in it,
but we'll need it to be there to create an output.
⌨ We'll start by simply displaying a set of three points each stored as tuples of values (id, name, x, y) , in
a container list, then just print these out:
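The hidden cell presumably built something like this (the values are taken from the dictionary output near the end of this section):

ptData = [(1, "Trail Jct Cave", 483986, 4600852),
          (2, "Upper Meadow", 483473, 4601523),
          (3, "Sky High Camp", 485339, 4600001)]
for pt in ptData:
    print(pt)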
In [ ]: #
Writing out a CSV text file is pretty easy using the csv module and its methods: .writer creates the
writer object on the file opened by with open() , .writerow writes the header row of field names, and
.writerows writes an iterable list of row tuples. The newline='' option is needed to avoid creating
extra line ends as carriage returns.
⌨ The complete script will need the creation of the ptData above followed by:
import csv
with open("data/marblePts.csv",'w', newline='') as out:
csv_out=csv.writer(out)
csv_out.writerow(['ID','Name','Easting','Northing'])
csv_out.writerows(ptData)
In [ ]: #
Out[ ]: 13
➦ Check what you get by opening it in Excel. Once you've confirmed you got what you expected, close Excel.
You may have already discovered this, but a common problem with reading and writing data happens when two
programs are trying to access the same data, for instance if you have the CSV open in Excel and then try to write
it again from Python -- you'll get a message that it's locked, and it can be difficult to fix the problem, often
requiring closing out of both Jupyter and Excel, if not something more extreme.
Note the use of the with ...: structure. It's handy because it sets up an environment that
applies only within the indented code of the structure. We'll see it again when working in arcpy to
apply environment settings that we only want to use within the structure.
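⌨ A basic way to read the file back is with plain Python file handling: open the file, loop over its lines, split each line on commas, and skip the header row: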
infile = "data/marblePts.csv"
f = open(infile, "r")
firstline = True
for line in f:
    if firstline: firstline = False
    else:
        values = line.split(",")
        id = int(values[0])
        name = values[1]
        x = float(values[2])
        y = float(values[3])
        print("{},{},{},{}".format(id,name,x,y))
f.close()
In [ ]: #
Out[ ]: 'ID,Name,x,y\n'
Using csv
⌨ The csv module also has a reader method. A simple display of the data could work this way:
import csv
with open("data/marblePts.csv", newline="") as csvfile:
dta = csv.reader(csvfile, delimiter=",")
for row in dta:
print(row)
In [ ]: #
❔ What did the csv.reader return: what does each row represent?
⌨ Modify the last line of code to instead display each line as a single string separated by commas with
", ".join(row)
In [ ]: #
import csv
with open("data/marblePts.csv", newline="") as csvfile:
dta = csv.reader(csvfile, delimiter=",")
firstline = True
for row in dta:
if firstline:
row
firstline=False
else:
id = int(row[0])
name = row[1]
x = float(row[2])
y = float(row[3])
print("{},{},{},{}".format(id,name,x,y))
In [ ]: #
When we get to pandas we'll be working with dataframes, which are a better way of working with tabular data like
this, and we'll learn about methods of reading and writing CSVs, converting them into dataframes and vice versa.
So we don't need to look at these methods much longer. But before we leave, you'll note that the code above
already knew the variable names and how many there are. We might want to detect what those variables are, if
they're stored in the first row of the data.
⌨ This is a good use for a dictionary, which makes it easy to get variable names from strings we read from
somewhere, like the first line of the data/marblePts.csv file. There's probably an easier way of doing this,
but this works:
import csv
with open("data/marblePts.csv", newline="") as csvfile:
dta = csv.reader(csvfile, delimiter=",")
firstline = True
for row in dta:
if firstline:
fields = row
print(f"fields: {fields}")
firstline=False
marblePts=[]
valueTuples=[]
else:
for i in range(len(fields)):
valueTuples.append((fields[i],row[i]))
marblePts.append(dict(valueTuples))
marblePts
In [ ]: #
Out[ ]: [{'ID': '1', 'Name': 'Trail Jct Cave', 'x': '483986', 'y': '4600852'},
{'ID': '2', 'Name': 'Upper Meadow', 'x': '483473', 'y': '4601523'},
{'ID': '3', 'Name': 'Sky High Camp', 'x': '485339', 'y': '4600001'}]
❔ Why did we use for i in range(len(fields)): instead of something like for fld in fields: , which
would have also looped through the fields?
⌨ Provide some code that uses the marblePts data created. For instance, print out the Names, one per row.
In [ ]: #
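One possible answer (a short sketch using the marblePts list of dictionaries):

for pt in marblePts:
    print(pt["Name"])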
Exercise 3: NumPy and Matplotlib
In [ ]: # boilerplate
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Learning matplotlib
The matplotlib package does a lot. You will find that it's pretty much the only graphics system in Python, yet there
is an enormous amount of graphical work done with it. Different applications will use customized backends
developed in matplotlib, and each of these includes specialized routines and ways of working, but all within
matplotlib. We will be focusing on just what we need to productively use the package, but you should refer to
http://matplotlib.org for a lot more information. Much of the documentation is fairly cryptic,
but one quick way of getting a sense of what you can do is to explore the examples at
https://matplotlib.org/stable/gallery/index.html , where you can also
see the code that generates them.
For our first Matplotlib plot, we'll create a simple line plot from three pairs of coordinates, with x coming from one
list and y from the other. We'll import the matplotlib module and from it import the pyplot interface (similar
to MATLAB), which includes the plot function that plots a line plot by default.
import matplotlib
from matplotlib import pyplot
pyplot.plot([1,2,3],[4,5,1])
In [ ]: #
A slightly more efficient way to start the plot is to import pyplot as plt
⌨ Change your code to read as above, and then add a second line feature to the plt object with:
plt.plot([2,3,4],[3,4,0])
Note that the axes expanded a bit to include the new feature.
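Putting it together, a sketch of the revised cell (reusing the coordinates from above):

from matplotlib import pyplot as plt
plt.plot([1,2,3],[4,5,1])
plt.plot([2,3,4],[3,4,0])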
In [ ]: #
In [ ]: #
⌨ Now change both methods back to .plot , but after the y coordinates list add a third parameter 'bo' to
the first to make blue solid circles and 'ro' to the second to make red solid circles.
In [ ]: #
Matplotlib expects numpy arrays as input, or objects that can be converted to such with numpy.asarray(), so in
the first plot this was happening behind the scenes -- the simple lists were converted to numpy arrays. This is
what it looks like more explicitly:
import numpy
from matplotlib import pyplot as plt
plt.plot(numpy.asarray([1,2,3]),numpy.asarray([4,5,1]))
In [ ]: #
In [ ]: #
NumPy
Ok, we jumped ahead a bit there, so we should formally introduce NumPy. There's a fair amount you can learn
about NumPy, "the fundamental package for scientific computing in Python"
(https://numpy.org/doc/stable/user/whatisnumpy.html), but
we're only going to need to explore a relatively small part of it.
The key object in NumPy is the ndarray, an n-dimensional array, with some key properties:
- Their size is fixed when you create them; if you change their size, a new array will actually be created.
- The elements must all be of the same type. You might imagine these to be numerical, but the elements can
actually be objects of a complex structure; each object just has the same structure as every other one in the
array.
- They facilitate numerical operations, allowing them to execute efficiently, so they have been adopted by many
applications that use Python (such as ArcGIS) to crunch large data sets.
- Mathematical operations use vectorization methods (similar to R), with element-by-element operations
coded simply. Say you have two ndarrays a and b (or one could be a constant or scalar variable) -- to
multiply them simply requires c = a * b
- They can have more than one dimension, and the dimensions of ndarrays are called axes
The analogous data structure in R is a vector, which can also have multiple dimensions, where
they're called matrices or arrays, and similarly all elements must be the same type.
⌨ While not required, it's good practice for code readability to import numpy under its conventional alias np , so we'll start our code with import numpy as np . We'll begin by creating a mixed-type list, then try to build an array out of it.
import numpy as np
mixedList = [1, "x", 5]
myArray = np.asarray(mixedList)
In [ ]: #
myArray
In [ ]: #
Note that the dtype (data type) is <U11 , for which "U" refers to the unicode character data type, with the 11 referring to the maximum number of characters the strings in the array can hold -- here, enough to represent any of our 32-bit integers converted to text. When I first saw this, I thought the 11 referred to the number of bits, but it's characters: if you put a string longer than 11 characters in there, you'll find it uses a larger number. A while back, the main character
coding system used was ASCII, which stands for "American Standard Code for Information
Interchange" and it was a 7-bit code, capable of handling all of the Latin characters ('a':'z','A':'Z')
used in the U.S. (thus American standard code), Arabic numerals ('0':'9'), and other things
typically on your keyboard, and some special needs like line ends and tabs. That was fine for
normal computing where computer languages used English anyway, but hampered the use of
quite a few other languages with different characters. In the first extension of this to 8 bits, Greek
was added first, though mostly for mathematical symbols, and also accented letters like é (which
I typed using Alt 130 on the numeric keypad). Unicode expands this to character sets in all kinds
of other languages.
import numpy as np
a = np.arange(12)
a
In [ ]: #
a = a.reshape(3,4)
a
a.shape
a.ndim
a.size
a.dtype
In [ ]: #
Out[ ]: (3, 4)
Out[ ]: 2
Out[ ]: 12
Out[ ]: dtype('int32')
As we just saw, not only is the array numerical, but it's of a given type, int32 , so an integer occupying 32 bits
of memory each. Remember that each element in an ndarray has the same size.
⌨ Let's use a little vectorization to see what happens if we convert to a different data type:
b = a * 1.5
b
b.dtype
In [ ]: #
Out[ ]: dtype('float64')
b = np.array([1,2,3])
b
b.dtype
In [ ]: #
Out[ ]: dtype('int32')
c = np.array([1,2.0,3])
c
c.dtype
In [ ]: #
Out[ ]: dtype('float64')
⌨ Tuple:
t = np.array((1,2.0,3))
t
t.dtype
In [ ]: #
Out[ ]: dtype('float64')
So there's no difference between a numpy array created from a list and one created from a tuple. Dictionaries are a little different, and end up stored as generic Python objects.
⌨ Dictionary:
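A minimal sketch of what you might try, passing a dictionary directly to np.array :
d = np.array({"a": 1, "b": 2})
d.dtype # the whole dictionary is wrapped as one Python object, hence dtype('O')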
In [ ]: #
Out[ ]: dtype('O')
In [ ]: #
The array method transforms collections of collections into 2D arrays, collections of collections of collections
for 3D, etc.
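For example, a quick sketch:
np.array([[1, 2, 3], [4, 5, 6]]) # a list of lists becomes a 2D array
np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]]) # three levels of nesting become a 3D array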
np.zeros([3,4])
np.ones((3,4))
In [ ]: #
np.sin(a)
a**2
In [ ]: #
a.sum()
a.min()
a.max()
a.mean()
a.var()**0.5 # sd
In [ ]: #
Out[ ]: 66
Out[ ]: 0
Out[ ]: 11
Out[ ]: 5.5
Out[ ]: 3.452052529534663
Random numbers
There's a lot to the world of probability and statistics, and students are referred to those courses and that
literature for learning more. Generating random numbers (or rather pseudo-random numbers) is an important
part of that. We'll just use a couple of common methods -- creating uniform and normally distributed random
numbers -- but you should refer to https://numpy.org/doc/stable/reference/random/index.html
(https://numpy.org/doc/stable/reference/random/index.html) for more NumPy methods for this.
np.random.rand(2,3)
In [ ]: #
If you want to create the same sequence of random numbers for repeatability, you can set a random number
seed. The advantage of a seed is reproducibility; you'll always get the same sequence of random numbers for a
given seed. The source https://numpy.org/doc/stable/reference/random/index.html
(https://numpy.org/doc/stable/reference/random/index.html) recommends using the default_rng (random
number generator) which has one parameter: the random number generator "seed". We'll use this for the
following random number problems and use 42 , the answer to everything according to The Hitchhiker's Guide
to the Galaxy (Douglas Adams (1978)).
To initiate the random number generator as r with a seed, we use the .default_rng method of np.random :
⌨
r = np.random.default_rng(42)
In [ ]: #
From there any call to r to access random numbers will follow the sequence initiated. (To start an identical
sequence again, just run the r assignment above again.)
r = np.random.default_rng(42)
r.random(5)
r = np.random.default_rng(42) # to restart the same sequence
print("This should be the same as above:")
r.random(5)
print("But this continues the generation:")
r.random(5)
In [ ]: #
From here on, we'll either use r object to create random numbers from this seed or use
np.random if we don't care about the seed, so for example we might do one or the other of
the following to generate an ndarray of 5 uniformly distributed random numbers:
r.random(5)
np.random.random(5)
In [ ]: #
⌨ Let's use reshape to create a random set of XY values from a random 1D:
xy_data = r.random(12).reshape(6,2)
xy_data
xdata = xy_data[:,0]
ydata = xy_data[:,1]
xdata, ydata
In [ ]: #
⛬ After you've experimented with the above code, interpret the use of accessors in xy_data[:,0] :
⌨ Create a scatterplot of 30 random points (just use a 1D ndarray), with x values ranging from 20 to 30 and y values ranging from 40 to 50. The basic algorithm for this is r * (max-min) + min , where r is a random number (or an array of random numbers) between 0 and 1, min is the minimum value, and max is the maximum value.
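One possible solution sketch, assuming the r generator created above:
x = r.random(30) * (30 - 20) + 20 # x between 20 and 30
y = r.random(30) * (50 - 40) + 40 # y between 40 and 50
plt.scatter(x, y)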
In [ ]: #
⌨ We can also create normally distributed random numbers. Here are two ways of creating one using the
standard method of z scores (with a mean of 0 and a standard deviation of 1) within a 5x5 ndarray:
the first using the normal method which uses the parameters loc for mean, scale for standard deviation,
and size for sample size (including structure as a tuple if desired) with the usage normal(loc=0.0,
scale=1.0, size=None) , so if we want to default to the mean and standard deviation, we need to use a
specific size reference:
r.normal(size=(5,5))
the second using standard_normal which uses those same defaults and just needs the size, so
parameter specificity isn't required:
r.standard_normal((5,5))
In [ ]: #
In [ ]: #
⌨ Specifying a different mean and standard deviation is pretty easy with np.random.normal :
In [ ]: #
The mean (μ 'mu') is 100 and the standard deviation (σ 'sigma') is 10.
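A sketch of one way to write this:
mu, sigma = 100, 10
r.normal(loc=mu, scale=sigma, size=(5,5))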
⌨ Create a scatter plot of 100 normally distributed random x and y values (again just using 1D arrays), using
the above mu and sigma values for both spatial dimensions.
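A possible solution sketch:
x = r.normal(mu, sigma, 100)
y = r.normal(mu, sigma, 100)
plt.scatter(x, y)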
In [ ]: #
Histogram.
⌨ Here's some code that creates a histogram. It's straightforward.
import numpy as np
mu, sigma = 100, 10 #
s = np.random.normal(mu, sigma, 1000)
plt.hist(s, 30, color="blue")
plt.show()
In [ ]: #
Out[ ]: (array([ 1., 2., 1., 1., 6., 15., 18., 27., 39., 57., 57., 76., 85.,
94., 81., 77., 76., 71., 63., 45., 40., 24., 11., 10., 10., 6.,
4., 2., 0., 1.]),
array([ 67.97687354, 70.21905859, 72.46124363, 74.70342868,
76.94561373, 79.18779878, 81.42998383, 83.67216888,
85.91435393, 88.15653898, 90.39872402, 92.64090907,
94.88309412, 97.12527917, 99.36746422, 101.60964927,
103.85183432, 106.09401936, 108.33620441, 110.57838946,
112.82057451, 115.06275956, 117.30494461, 119.54712966,
121.78931471, 124.03149975, 126.2736848 , 128.51586985,
130.7580549 , 133.00023995, 135.242425 ]),
<BarContainer object of 30 artists>)
Density plot
⌨ Creating a density plot is more complicated, so here's that completed code that you can puzzle at.
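The completed code isn't reproduced here, but one way to build a density plot is with SciPy's kernel density estimator (a sketch; the use of scipy.stats.gaussian_kde is an assumption):
from scipy.stats import gaussian_kde
s = np.random.normal(100, 10, 1000) # simulated data like the histogram's
density = gaussian_kde(s) # estimate the probability density function
xs = np.linspace(s.min(), s.max(), 200)
plt.plot(xs, density(xs))
plt.show()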
In [ ]: #
xy = np.array([(5,8,10),(12,16,6)])
print(xy)
plt.plot(xy[0],xy[1])
In [ ]: #
⌨ Let's convert the axes of our 2D array to create 1D arrays, then add a bit to our Matplotlib skills by creating a
title and label axes names:
x=xy[0]
type(x)
x.ndim
y=xy[1]
plt.plot(x,y)
plt.title('My Data')
plt.ylabel('Y axis')
plt.xlabel('X axis')
plt.show()
In [ ]: #
Out[ ]: numpy.ndarray
Out[ ]: 1
[X,Y] = np.meshgrid(np.arange(5),np.arange(5))
X
Y
In [ ]: #
❔ If we're looking at these 2D ndarrays as a raster, where's the origin? [What's the meaning of origin as x and y
values in any 2-dimensional graph?]
One application of this is to create simulated rasters or to support mathematical transformation of real DEM
rasters, as we used in a waveform landform study applying Fourier transforms:
Davis & Chojnacki (2017). Two-dimensional discrete Fourier transform analysis of karst and coral reef
morphologies. Transactions in GIS 21(3). DOI 10.1111/tgis.12277 .
https://www.researchgate.net/publication/316702784_Two-
dimensional_discrete_Fourier_transform_analysis_of_karst_and_coral_reef_morphologies_DAVIS_and_CHOJNAC
(https://www.researchgate.net/publication/316702784_Two-
dimensional_discrete_Fourier_transform_analysis_of_karst_and_coral_reef_morphologies_DAVIS_and_CHOJNAC
In [ ]: #
In [ ]: #
⌨ Change the code to display the cosine instead, and increase the range width from -10 to 10, same steps.
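A sketch of the changed cell (assuming the original stepped through the range with np.arange ; the 0.1 step size is an assumption):
x = np.arange(-10, 10, 0.1)
plt.plot(x, np.cos(x))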
In [ ]: #
In [ ]:
⌨ We'll start with a fairly simple way to read the first line to get the field names as a list. The following code
opens the data, removes any spaces (hopefully we don't have any in our field names), splits the line of text at
comma delimiters to create a list, then displays and closes the file.
f = open("npdata/exported_ptdata.csv", "r")
fields = f.readline().replace(" ","").split(",")
fields
f.close()
In [ ]: #
Out[ ]: ['X',
'Y',
'FID',
'AREA',
'PERIMETER',
'SAMPLES_',
'SAMPLES_ID',
'CATOT',
'MGTOT',
'ALK',
'SIO2',
'PH',
'TEMP',
'PCO2',
'IONSTR',
'TDS',
'CMOL',
'IONERR',
'CATXS',
'SATCALC',
'SATDOLO',
'SATQU',
'SATCX_5',
'SATCBLAC',
'NEGERR',
'CO2PERC',
'NEGPCO2\n']
➦ But for a better view of the data, open the exported_ptdata.csv from the npData folder into Excel.
❔ Knowing how Python counts things, what column numbers are X , Y , and CATOT in?
➦ Close the file in Excel so you don't create a schema lock error (where two programs are trying to access the
same data at the same time.)
⌨ We'll skip the header line with variable names, and read into an ndarray using the comma delimiter.
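A sketch of one way to do this -- np.genfromtxt is one option ( np.loadtxt with skiprows would also work):
my_data = np.genfromtxt("npdata/exported_ptdata.csv", delimiter=",", skip_header=1)
my_data.ndim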
In [ ]: #
Out[ ]: 2
From this, we can see that the data is an ndarray with 2 axes.
⌨ To populate X, Y, and Ca (we'll use the element symbol instead of CATOT), we need to replace the underscore in the following with the position of the field in the list of fields. I'll give you the first one.
X = my_data[:,0]
Y = my_data[:, _ ]
Ca = my_data[:, _ ]
In [ ]: #
⌨ Now we'll build the plot, starting with importing the module and selecting a style.
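A sketch of what this might look like (the particular style name is just an example):
from matplotlib import pyplot as plt
plt.style.use('ggplot')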
In [ ]: #
Some of the settings are specific to the plot itself and some are for the figure as a whole. The plot itself uses "axes"; typically a subplot is defined called ax , and if two separate y axes are used, they might be created as ax1 and ax2 . To see what is created by default, we can look at what is returned by:
plt.subplots()
In [ ]: #
... which should show an empty default plot, but as you can see returns a tuple: (<Figure size 432x288 with
1 Axes>, <AxesSubplot:>)
... so we can assign the figure and axis to fig, ax using the method we've used before:
fig, ax = plt.subplots()
In [ ]: #
Then we can create a scatter plot and legend by referencing the ax subplot. We'll set the color ( c= ) to the
variable Ca and color map ( cmap= ) to a yellow-orange-brown sequence "YlOrBr" .
scatter = ax.scatter(X,Y,c=Ca,cmap=("YlOrBr"))
legend1 = ax.legend(*scatter.legend_elements(),
loc="lower right", title="Ca mg/L")
In [ ]: #
To make the scatterplot look like a map, we can just set the UTM coordinates (in metres) to have equally scaled axes, and then add a title to the overall figure ( fig.suptitle ) and the plot ( ax.set_title ).
ax.axis('equal')
ax.set_title("Calcium concentrations")
fig.suptitle("Marble Mountains, CA")
In [ ]: #
⌨ Now copy all of the code, from reading the data down to here, and put it all in one code cell, below. To get this all to display in Jupyter, we need to have the fig, ax setting in the same code cell (I don't completely understand why), so we'll include all of the code that reads the data and creates the plot. End it by writing out the figure as a (default) .png with:
plt.savefig("MarbleCA")
... which will save it in the same folder where this .ipynb file resides.
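Assembled from the pieces above, the combined cell might look something like this (a sketch: the positions of Y and CATOT follow from the field list printed earlier, and np.genfromtxt is an assumption about how the data were read):
import numpy as np
from matplotlib import pyplot as plt
my_data = np.genfromtxt("npdata/exported_ptdata.csv", delimiter=",", skip_header=1)
X = my_data[:,0]
Y = my_data[:,1]
Ca = my_data[:,7] # CATOT is the 8th field, so position 7
fig, ax = plt.subplots()
scatter = ax.scatter(X, Y, c=Ca, cmap="YlOrBr")
legend1 = ax.legend(*scatter.legend_elements(), loc="lower right", title="Ca mg/L")
ax.axis('equal')
ax.set_title("Calcium concentrations")
fig.suptitle("Marble Mountains, CA")
plt.savefig("MarbleCA")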
In [ ]: #
➦ Again, there's a lot more to Matplotlib (as well as NumPy), so if you have time I'd recommend going through
the Basic, Pyplot and Image tutorials at https://matplotlib.org/stable/tutorials (https://matplotlib.org/stable/tutorials)
key
➦ This directs you to do something specific, maybe in the operating system or answer something conceptual.
Introduction to Pandas
In this section we'll be taking a look at the powerful data analysis Python library called pandas. We'll start by
creating some dataframes. Then we'll explore various methods and properties available to us to access data
within our dataframes.
⌨ Enter the code below in your first cell, and since we'll also be using numpy, include that as well as pandas:
import pandas as pd
import numpy as np
In [ ]: #
import pandas as pd
elevation = pd.Series([52, 394, 510, 564, 725, 848, 1042, 1225, 1486, 1775, 1899, 2551])
latitude = pd.Series([39.52, 38.91, 37.97, 38.70, 39.09, 39.25, 39.94, 37.75, 40.35, 39.33, 39.17, 38.21])
temperature = pd.Series((10.7, 9.7, 7.7, 9.2, 7.3, 6.7, 4.0, 5.0, 0.9, -1.1, -0.8, -4.4)) # tuple also works, effect is same
In [ ]: #
Out[ ]: 0 10.7
1 9.7
2 7.7
3 9.2
4 7.3
5 6.7
6 4.0
7 5.0
8 0.9
9 -1.1
10 -0.8
11 -4.4
dtype: float64
Note that when we instantiate the series, the numeric indices are automatically created.
pd.Series(np.arange(5))
In [ ]: #
Out[ ]: 0 0
1 1
2 2
3 3
4 4
dtype: int32
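A series can also be built by broadcasting a scalar across an index; a sketch of what likely produced the output below:
pd.Series(2, index=np.arange(5))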
In [ ]: #
Out[ ]: 0 2
1 2
2 2
3 2
4 2
dtype: int64
❔ What would you have gotten without the index setting, just pd.Series(2) ? (First think of a likely answer,
then try it.)
⌨ ... or some random numbers. The following creates r as a random number generator (rng) with a seed of
42.
r = np.random.default_rng(seed=42)
pd.Series(r.random(10))
In [ ]: #
Out[ ]: 0 0.773956
1 0.438878
2 0.858598
3 0.697368
4 0.094177
dtype: float64
Out[ ]: 0 0.773956
1 0.438878
2 0.858598
3 0.697368
4 0.094177
dtype: float64
⌨ We'll build an elev series this way, using the station name as the key for the index. Note that we'll create a
new series of elevation, but name it elev to retain both versions since we'll use each type below when we're
building dataframes. We'll do the same for latitude ( lat ) and temperature ( temp ).
elevDict = {"Oroville":52,
"Auburn":394,
"Sonora":510,
"Placerville":564,
"Colfax":725,
"Nevada City":848,
"Quincy":1042,
"Yosemite":1225,
"Sierraville":1516,
"Truckee":1775,
"Tahoe City":1899,
"Bodie":2551}
elev = pd.Series(elevDict)
elev
In [ ]: #
Out[ ]: Oroville 52
Auburn 394
Sonora 510
Placerville 564
Colfax 725
Nevada City 848
Quincy 1042
Yosemite 1225
Sierraville 1516
Truckee 1775
Tahoe City 1899
Bodie 2551
dtype: int64
⌨ A series is like a NumPy ndarray , so we can do similar operations, such as: elev[0]
In [ ]: #
Out[ ]: 52
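The next two cells presumably index by label and pull out the underlying array -- sketches:
elev["Placerville"] # label-based access
elev.values # the underlying NumPy array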
In [ ]: #
Out[ ]: 564
In [ ]: #
Out[ ]: array([ 52, 394, 510, 564, 725, 848, 1042, 1225, 1516, 1775, 1899,
2551], dtype=int64)
In [ ]:
Vectorization of a series
Just as we saw with NumPy arrays, we can vectorize series, so for instance apply a mathematical operation or
function.
⌨ Given this, how would we derive elevft from elev (which is in metres), assuming we knew there are
0.3048 m in 1 foot.
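A sketch of one answer, using vectorized division:
elevft = elev / 0.3048
elevft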
In [ ]: #
We're going to build a DataFrame from the three series and have the named indices of station names consistent
for each.
lat = latitude.copy()
temp = temperature.copy()
In [ ]: #
Note that if we had instead tried to copy them with lat = latitude etc., that would have created a reference to the same series, so anything we modified in one would affect the other; using the copy method creates a separate object.
⌨ ... then assign the indices from elev to the other two (since we know they're in the same order):
lat.index = elev.index
temp.index = elev.index
... and then display each to confirm that they have these named indices.
In [ ]: #
DataFrame
⌨ Now we'll create a dataframe providing each series in a dictionary key:value pair, where the key is the
variable or field name, and the value is the corresponding series. We made the field name the same as the
series name. We'll start with the numerically indexed series.
Note that we've made sure that they're all in the right order using the process we followed.
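A sketch of the construction (the dataframe name sierraNum is just for illustration):
sierraNum = pd.DataFrame({"temperature": temperature,
                          "elevation": elevation,
                          "latitude": latitude})
sierraNum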
In [ ]: #
Out[ ]:
temperature elevation latitude
0 10.7 52 39.52
⌨ Now we'll create a dataframe using the series with named indices. We had used abbreviated series names,
but we'll make the field names longer.
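Since sierra is used below, the cell presumably looked something like this (a sketch):
sierra = pd.DataFrame({"temperature": temp,
                       "elevation": elev,
                       "latitude": lat})
sierra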
In [ ]: #
Out[ ]:
temperature elevation latitude
⌨ We can also access individual series, by using object.property format such as ...
sierra.elevation
In [ ]: #
Out[ ]: Oroville 52
Auburn 394
Sonora 510
Placerville 564
Colfax 725
Nevada City 848
Quincy 1042
Yosemite 1225
Sierraville 1516
Truckee 1775
Tahoe City 1899
Bodie 2551
Name: elevation, dtype: int64
sierra.elevation["Truckee"]
In [ ]: #
Out[ ]: 1775
⌨ What do you get when you apply the method .index to sierra ?
In [ ]: #
⌨ What do you get when you apply the method .columns to sierra ?
In [ ]: #
⛬ Interpret any similarities or differences for the .index and .columns results:
⌨ You may remember the len() function from base Python, which returns the length of strings and lists. With a one-dimensional numpy array it returns the number of elements, the same as what you get with .size (for a multidimensional array, len() returns only the length of the first axis, while .size is the total element count). Contrast what you get with len() and .size on the sierra dataframe, and interpret it below.
In [ ]: sierra.size
len(sierra)
Out[ ]: 36
Out[ ]: 12
We'll take a look at loading data from external files into a dataframe. If you recall from the lecture there are
various file types we can load into a dataframe with the Pandas library. Below we'll examine reading in a csv file.
⌨ We'll start with a bit more of the Sierra climate data, and read in a bit longer data frame from a CSV file
sierraFeb = pd.read_csv('pdData/sierraFeb.csv')
sierraFeb
In [ ]: #
Out[ ]:
STATION_NAME COUNTY ELEVATION LATITUDE LONGITUDE PRECIPITATION TEMPERATU
GROVELAND 2,
0 Tuolumne 853.4 37.8444 -120.2258 176.02
CA US
CANYON DAM,
1 Plumas 1389.9 40.1705 -121.0886 164.08
CA US
KERN RIVER
2 Kern 823.9 35.7831 -118.4389 67.06
PH 3, CA US
DONNER
3 MEMORIAL ST Nevada 1809.6 39.3239 -120.2331 167.39
PARK, CA US
BOWMAN DAM,
4 Nevada 1641.3 39.4539 -120.6556 276.61
CA US
PACIFIC El
77 1051.6 38.7583 -120.5030 220.22 N
HOUSE, CA US Dorado
MAMMOTH
LAKES
78 Mono 2378.7 37.6478 -118.9617 NaN
RANGER
STATION, CA US
COLGATE
80 POWERHOUSE, Yuba 181.4 39.3308 -121.1922 168.91 N
CA US
BODIE
CALIFORNIA
81 STATE Mono 2551.2 38.2119 -119.0142 39.62
HISTORIC
PARK, CA US
82 rows × 7 columns
sierraFeb.set_index("STATION_NAME")
In [ ]: #
Out[ ]:
                                              COUNTY     ELEVATION  LATITUDE  LONGITUDE  PRECIPITATION  TEMPERATURE
STATION_NAME
GROVELAND 2, CA US                            Tuolumne       853.4   37.8444  -120.2258         176.02          6.1
CANYON DAM, CA US                             Plumas        1389.9   40.1705  -121.0886         164.08          1.4
KERN RIVER PH 3, CA US                        Kern           823.9   35.7831  -118.4389          67.06          8.9
DONNER MEMORIAL ST PARK, CA US                Nevada        1809.6   39.3239  -120.2331         167.39         -0.9
BOWMAN DAM, CA US                             Nevada        1641.3   39.4539  -120.6556         276.61          2.9
...
PACIFIC HOUSE, CA US                          El Dorado     1051.6   38.7583  -120.5030         220.22          NaN
MAMMOTH LAKES RANGER STATION, CA US           Mono          2378.7   37.6478  -118.9617            NaN         -2.3
COLGATE POWERHOUSE, CA US                     Yuba           181.4   39.3308  -121.1922         168.91          NaN
BODIE CALIFORNIA STATE HISTORIC PARK, CA US   Mono          2551.2   38.2119  -119.0142          39.62         -4.4
82 rows × 6 columns
⌨ ... and from that we can reference a single column as a series sierraFeb.PRECIPITATION
In [ ]: #
Out[ ]: 0 176.02
1 164.08
2 67.06
3 167.39
4 276.61
...
77 220.22
78 NaN
79 207.26
80 168.91
81 39.62
Name: PRECIPITATION, Length: 82, dtype: float64
⌨ We can either use it from the DataFrame, or pull the series out as an individual series:
elevm = sierraFeb.ELEVATION
elevft = elevm / 0.3048
elevft
In [ ]: #
Out[ ]: 0 2799.868766
1 4560.039370
2 2703.083990
3 5937.007874
4 5384.842520
...
77 3450.131234
78 7804.133858
79 2379.921260
80 595.144357
81 8370.078740
Name: ELEVATION, Length: 82, dtype: float64
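The next display has 8 columns, so this cell evidently added the feet version as a new column; a sketch, with the column name a guess:
sierraFeb['ELEV_FT'] = elevft # hypothetical column name
sierraFeb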
In [ ]: #
Out[ ]: [truncated DataFrame display: 82 rows × 8 columns -- the same columns as before plus the new elevation-in-feet column]
⌨ We'll look at this for JSON files by first creating a JSON file (just so we have something to read, but also to
show how we do this)...
sierraFeb.to_json("pdData/sierraFeb.json")
In [ ]: #
sierraFebNew = pd.read_json("pdData/sierraFeb.json")
sierraFebNew
In [ ]: #
Out[ ]: [truncated DataFrame display: the same 82 rows × 8 columns, read back from the JSON file]
⌨ Just to see what the JSON file looks like, we could open it in a text editor (like Notepad) separately, or we can
just use the standard Python read method:
infile = "pdData/sierraFeb.json"
f = open(infile, "r")
f.readline()
f.close()
In [ ]: #
Out[ ]: [one very long line of JSON text: a dictionary per column, each mapping row numbers "0" through "81" to values; the visible tail was the elevation-in-feet column]
sierraFebNew = sierraFebNew.set_index("STATION_NAME")
In [ ]: #
Out[ ]: [truncated DataFrame display: 82 rows × 7 columns, now indexed by STATION_NAME -- COUNTY, ELEVATION, LATITUDE, LONGITUDE, PRECIPITATION, TEMPERATURE, plus the elevation-in-feet column]
Transpose
See what happens when you transpose sierraFebNew with .transpose()
In [ ]: #
Out[ ]: [truncated DataFrame display: the transposed frame, 7 rows × 82 columns, with station names (GROVELAND 2, CA US; CANYON DAM, CA US; ...) as columns]
sf_weather_daily
Make sure you have the "sf_weather_daily.csv" file downloaded from iLearn and in the pdData folder which is in the folder where this Jupyter notebook resides. (Note you can still provide a full path to it as the argument you pass to the function, e.g. "C:\Geog625\projects\py...")
⌨ We'll create an object called sf_weather and load the sf weather csv into a dataframe.
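A sketch, assuming the file sits in the pdData folder as described:
sf_weather = pd.read_csv('pdData/sf_weather_daily.csv')
sf_weather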
In [ ]: #
Out[ ]: [truncated DataFrame display: 365 rows of sf_weather, 1/1/2019 12:00 through 12/31/2019 12:00 -- columns maxtempC, mintempC, maxtempF, mintempF, tempC, tempF, moonrise, moonset, sunrise, sunset, FeelsLikeC, WindGustKmph, winddirDegree, windspeedKmph, location]
Notice how in the dataframe you see kind of a truncated version of all your data: somewhere around row 4 you'll see ... . There are Pandas options you can set to remedy this if you'd like to see more of your data at once -- the set_option function can expose additional columns and widen the display if you have a lot. Sample code is below, provided for reference.
pd.set_option('display.max_columns', 100)
pd.set_option('display.width', 1000)
⌨ Make some of these changes and call up your sf_weather dataframe to see the difference.
In [ ]: #
Out[ ]: [the same 365-row display, with more columns exposed by the display options]
⌨ We can set the row label to the 'date_time' field with .set_index() . Note that you'll still have integer
indices, but this provides another way to reference a row, and that row label will stick with the data even after you
subset it.
sf_weather = sf_weather.set_index('date_time')
sf_weather
In [ ]: #
Out[ ]: pandas.core.frame.DataFrame
Answer: pandas.core.frame.DataFrame
⌨ If we wanted to examine the first five lines of sf_weather, we could use the ".head()" method.
sf_weather.head()
In [ ]: #
Out[ ]: [truncated DataFrame display: the first five rows, 1/1/2019 12:00 through 1/5/2019 12:00]
⌨ How would we display the first 10 lines of sf_weather? Figure this out by typing
help(sf_weather.head)
Note that even though sf_weather is an object we created, its object type has associated methods.
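The help text shows that head takes an n parameter, so:
sf_weather.head(n=10)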
In [ ]: #
Out[ ]: [truncated DataFrame display: the first ten rows, 1/1/2019 12:00 through 1/10/2019 12:00]
⌨ If we wanted to examine the end of the sf_weather data, then we could use the .tail() method. Try this
with sf_weather using otherwise the same syntax as .head and get the last 10 lines.
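A sketch:
sf_weather.tail(n=10)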
In [ ]: #
Out[ ]: [truncated DataFrame display: the last ten rows, 12/22/2019 12:00 through 12/31/2019 12:00]
⌨ You can call a series (column) of your dataframe as a property to see the values. Let's say we wanted to see the values of the maxtempC column; you would use this syntax to access maxtempC as a variable max_temp :
max_temp = sf_weather.maxtempC
max_temp
In [ ]: #
Out[ ]: date_time
1/1/2019 12:00 11
1/2/2019 12:00 10
1/3/2019 12:00 11
1/4/2019 12:00 12
1/5/2019 12:00 13
..
12/27/2019 12:00 15
12/28/2019 12:00 12
12/29/2019 12:00 11
12/30/2019 12:00 17
12/31/2019 12:00 15
Name: maxtempC, Length: 365, dtype: int64
In [ ]: #
Out[ ]: dtype('int64')
Answer: pandas.core.series.Series
⌨ In the following examples, we'll take a closer look at how we can explore the properties of dataframe. For
example, if you want to see the number of rows or columns in your dataframe, use the ".shape" property.
sf_weather.shape
In [ ]: #
⌨ What if we wanted to see the data types of each of the fields? Sometimes this is very useful if you're trying to pass a field to a function that only accepts a certain datatype, for example integer, but you keep encountering errors. Examining the datatypes of your fields could be useful.
sf_weather.dtypes
In [ ]: #
In this section we'll explore how to create new dataframes by selecting only certain parts of your larger dataframe. One thing that's always important to remember is that if you want to create a new version of your dataframe, you'll need to assign your operation to a new object.
Series selection
⌨ This example shows how we can create a new dataframe with only four columns: maxtempC , mintempC , location , and sunset . Obviously we'd still have the date_time field because it is our index.
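A sketch (the new dataframe name is just for illustration):
sf_temp_sun = sf_weather[['maxtempC', 'mintempC', 'location', 'sunset']]
sf_temp_sun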
In [ ]: #
Out[ ]:
maxtempC mintempC location sunset
date_time
⌨ Create a new dataframe called sf_mintemp_moon that only contains the columns: date_time ,
mintempC , tempC , moonrise , and location . However, don't actually include date_time in your request
since it's the named index and will already be there, and in fact create an error if you include it.
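A possible solution sketch:
sf_mintemp_moon = sf_weather[['mintempC', 'tempC', 'moonrise', 'location']]
sf_mintemp_moon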
In [ ]: #
Out[ ]:
mintempC tempC moonrise location
date_time
Using loc to use named indices and iloc for numeric indices
⌨ loc uses the label of the column or the label of the row and iloc uses the index of the row or the index of the
column. In this example below we're accessing the row with index date January 1st, 2019.
sf_weather.loc['1/1/2019 12:00']
In [ ]: #
Out[ ]: maxtempC 11
mintempC 5
maxtempF 51.8
mintempF 41.0
tempC 11
tempF 51.8
moonrise 3:17 AM
moonset 2:10 PM
sunrise 7:25 AM
sunset 5:02 PM
FeelsLikeC 11
WindGustKmph 22
winddirDegree 49
windspeedKmph 12
location san_francisco
Name: 1/1/2019 12:00, dtype: object
print(sf_weather.iloc[1:4])
In [ ]: #
Out[ ]: [truncated DataFrame display: the three rows 1/2/2019 12:00 through 1/4/2019 12:00]
⌨ You can also specify a range with named indices using loc.
Then reference part of this subset with iloc , and we'll see that we have new integer indices, while the row
labels stick with what they were, illustrating why creating row labels with .set_index was useful.
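A sketch of the two steps (the date labels follow the format of the index):
feb = sf_weather.loc['2/1/2019 12:00':'2/28/2019 12:00']
feb.iloc[:3] # new integer positions, but the row labels stay attached
feb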
In [ ]: #
Out[ ]: [truncated DataFrame display: the 28 February rows, 2/1/2019 12:00 through 2/28/2019 12:00]
sierra
In [ ]: #
Out[ ]:
temperature elevation latitude
⌨ How would you get the sierra data from 'Sonora' to 'Colfax'?
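A sketch:
sierra.loc['Sonora':'Colfax']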
In [ ]: #
Out[ ]:
temperature elevation latitude
⌨ If you wanted to select multiple rows specifically (such as the first day of the month) then you can pass those
row indices. Note that the list is a single input to .loc .
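A sketch (note the inner list passed as a single argument):
sf_weather.loc[['1/1/2019 12:00', '2/1/2019 12:00', '3/1/2019 12:00']]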
In [ ]:
Out[ ]: [truncated DataFrame display: the three rows 1/1/2019 12:00, 2/1/2019 12:00, and 3/1/2019 12:00]
⌨ Using the same method, get the sierra data for Sonora and Sierraville.
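A sketch:
sierra.loc[['Sonora', 'Sierraville']]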
In [ ]: #
Out[ ]:
temperature elevation latitude
⌨ Back to sf_weather: the first set of labels you pass will only return the rows you're interested in, but if you wanted to refine your data, you can also pass a set of column labels as the second argument:
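A sketch of the rows-then-columns form (the particular rows are illustrative):
sf_weather.loc[['1/1/2019 12:00', '2/1/2019 12:00'], ['maxtempC', 'mintempC', 'location']]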
In [ ]: #
Out[ ]:
maxtempC mintempC location
date_time
⌨ Create a dataframe that contains data for July 1-5. Show only the columns mintempF , location ,
moonrise , and WindGustKmph .
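A possible solution sketch:
sf_weather.loc['7/1/2019 12:00':'7/5/2019 12:00', ['mintempF', 'location', 'moonrise', 'WindGustKmph']]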
In [ ]: #
Out[ ]:
mintempF location moonrise WindGustKmph
date_time
⌨ Next, we can explore the .iloc[] method to select data. To differentiate these two methods I like to think of the "i" in the iloc method as "index", meaning we need to pass an index number to return the row or column. The first set of indices selects rows and the second set selects columns:
sf_weather.iloc[0]
In [ ]: #
Out[ ]: maxtempC 11
mintempC 5
maxtempF 51.8
mintempF 41.0
tempC 11
tempF 51.8
moonrise 3:17 AM
moonset 2:10 PM
sunrise 7:25 AM
sunset 5:02 PM
FeelsLikeC 11
WindGustKmph 22
winddirDegree 49
windspeedKmph 12
location san_francisco
Name: 1/1/2019 12:00, dtype: object
⌨ Let's see how we can get the first 5 rows of our data from, but using the .iloc[] method:
sf_weather.iloc[:5]
In [ ]: #
Out[ ]: [truncated DataFrame display: the first five rows again, selected with iloc]
⌨ Now, let's see how we can return the first 3 rows and the first 10 columns in the dataframe:
sf_weather.iloc[[0,1,2],:10]
In [ ]: #
Out[ ]: [truncated DataFrame display: the first three rows and first ten columns]
In [ ]: #
Out[ ]:
tempF moonrise moonset sunrise sunset FeelsLikeC WindGustKmph winddirDegr
date_time
As you can see above, the iloc method can make it a little easier to select multiple rows/columns, since you can slice on index numbers rather than passing each column or row name by label.
⌨ Next, let's create a smaller dataframe by assigning the first twenty rows of sf_weather with .head(n=20)
to begin_sf_weather ...
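A sketch:
begin_sf_weather = sf_weather.head(n=20)
begin_sf_weather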
In [ ]: #
Out[ ]: [truncated DataFrame display: the first twenty rows, 1/1/2019 12:00 through 1/20/2019 12:00]
begin_sf_weather.to_csv("pdData/begin_sf_weather.csv")
In [ ]: #
⌨ Using either the .loc[] method or the .iloc[] method create a CSV (using the .to_csv method) from the
sf_weather dataframe that contains rows 15-20 ( 15:21 ) and the first 10 columns. Name the output
ten1520.csv .
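A possible solution sketch using iloc :
sf_weather.iloc[15:21, :10].to_csv("ten1520.csv")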
In [ ]: #
⌨ The following examples illustrate this. First we'll create a simple DataFrame with x, y, z columns and a, b, c
rows:
df = pd.DataFrame({"x":pd.Series({'a':1,'b':2,'c':3}),
"y":pd.Series({'a':4,'b':5,'c':6}),
"z":pd.Series({'a':7,'b':8,'c':9})})
df
In [ ]: #
Out[ ]:
x y z
a 1 4 7
b 2 5 8
c 3 6 9
In [ ]: #
Out[ ]:
x y
a 1 4
b 2 5
c 3 6
In [ ]: #
Out[ ]:
x y z
b 2 5 8
c 3 6 9
As you can see above, the methods do look a lot alike, but working with rows uses some form of loc .
df[['x','y']].loc[['b','c']]
In [ ]: #
Out[ ]:
x y
b 2 5
c 3 6
Descriptive statistics
We can use various functions exposed to us through pandas to get descriptive statistics for our series. A list of some of these statistics, most of which do what their names suggest:
count
sum
mean
median
min
max
mode : returns a series of modes
prod : product
mad : mean absolute deviation
sem : standard error of the mean
std : sample standard deviation
var : variance
skew : skewness (3rd moment)
kurt : kurtosis (4th moment)
cumsum : cumulative sum
cumprod
cummax
cummin
⌨ First, let's create a series with modes and get its mean with:
aSeries = pd.Series([1,3,3,5,7,7,9,11,11,13])
aSeries.mean()
... and also get the mode (which may return multiple values).
In [ ]: #
Out[ ]: 7.0
Out[ ]: 0 3
1 7
2 11
dtype: int64
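The next two outputs are consistent with the standard error of the mean and the sample standard deviation -- sketches:
aSeries.sem() # std / sqrt(n)
aSeries.std()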
In [ ]: #
Out[ ]: 1.2649110640673518
In [ ]: #
Out[ ]: 4.0
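The output below matches the cumulative sum, so the cell was presumably:
aSeries.cumsum()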
In [ ]: #
Out[ ]: 0 1
1 4
2 7
3 12
4 19
5 26
6 35
7 46
8 57
9 70
dtype: int64
⌨ If we use these methods with a DataFrame, all series are described using the specified statistic:
sf_weather.mean()
In [ ]: #
C:\Users\900008452\AppData\Local\Temp\ipykernel_24480\513922073.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
  sf_weather.mean()
In [ ]: #
C:\Users\900008452\AppData\Local\Temp\ipykernel_24480\194938136.py:2: FutureWarning: Dropping of nuisance columns in DataFrame reductions (with 'numeric_only=None') is deprecated; in a future version this will raise TypeError. Select only valid columns before calling the reduction.
  sf_weather.std()
⌨ If we pass just the column to the function we can get the values returned on the column. For example, if I
wanted to see the max time for sunset I would pass this function:
sf_weather['sunset'].max()
In [ ]: #
⌨ Then to see the earliest sunset all year we'd use min instead
In [ ]: #
⌨ What if we wanted to see the mean and standard deviation of lowest temperature all year:
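A sketch:
sf_weather['mintempF'].mean()
sf_weather['mintempF'].std()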
In [ ]: #
⌨ Now use the .describe() method on the same DataFrame series to show all these statistics for
mintempF . (Hint: the command looks the same as deriving the .mean() or .std() , just uses
.describe() .
In [ ]: #
There are some advanced operations with generating graphs using some other libraries, but Pandas provides
some basic graphing capabilities right out of the box without needing to import additional libraries.
⌨ Below is a simple example of generating a new dataframe that contains only the January data from our sf_weather dataframe and then plots it:
jan_sf_weather = sf_weather.head(n=31)
jan_sf_weather
In [ ]: #
Out[ ]: [truncated DataFrame display: the 31 January rows, 1/1/2019 12:00 through 1/31/2019 12:00]
⌨ Plot the data setting the y axis to the maxtempF field from our data frame:
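A sketch using the DataFrame .plot method:
jan_sf_weather.plot(y='maxtempF')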
In [ ]: #
In [ ]: #
Out[ ]: [truncated DataFrame display: the February rows, 2/1/2019 12:00 through 2/28/2019 12:00]
Pandas Transformations
We'll also look at the various methods for combining and merging multiple dataframes.
From that data source, we'll look at Bay Area unemployment data and work on manipulating that data to be
able to analyze questions, though we'll focus less on the questions themselves (those are for you to consider)
than on the transformation methods. We'll loosely look at transformations that help us understand the effect
of COVID on unemployment (one of the variables is pandemic-specific), but there's a lot more you can do
with this and other data at the site. In the geopandas lab, we'll look at 311 data from that same source to
examine parking incidents by neighborhood, relying on latitude and longitude fields to map them.
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: #
Parsing Dates
We'll start by simply reading unemployment data, including COVID-related claims, without any special settings or
transformations. Explore information about this dataset at
https://data.sfgov.org/Economy-and-Community/Unemployment-Insurance-Weekly-Claims-for-Bay-Area-/d98w-yij4 where you'll see that:
UI_Claims are the "Number of new weekly Unemployment Insurance (UI) claims filed with EDD (Includes
new, additional, transitional, and PUA claims)"
EDD: California Economic Development Department
PUA_Claims : "Breakout of the number of new weekly Pandemic Unemployment Assistance (PUA) claims
filled [sic] with EDD."
bayArea_unemployment_noparse = pd.read_csv('pdData/Unemployment_BA_Counties_2020_202105.csv')
bayArea_unemployment_noparse
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
⌨ Next, we'll create a dataframe that contains the unemployment claims filed during Covid, for each county in
the Bay Area, but parse the dates as dates.
bayArea_unemployment = pd.read_csv('pdData/Unemployment_BA_Counties_2020_202105.csv', parse_dates=["Week_Ending"])
bayArea_unemployment
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
Answer: We want to use Week_Ending as a date. Without this, it's just a string.
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims
⌨ Notice that if we re-examine the original dataframe it still contains the columns we dropped; remember that
it's important to assign the result of an operation on a dataframe to a new dataframe:
bayArea_unemployment
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
bayArea_unemployment.drop
(To get more detailed information, put the above into help() )
⌨ ... then, noting that axis=0 (rows) is the default, just provide the numeric index 0 as the sole input to
bayArea_unemployment.drop( ... ) . It may be confusing to say that the index 0 is a label, but remember
that if the parameter isn't specified, the first thing provided is assigned to the first parameter, which in
this case is labels .
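A minimal sketch of what that call might look like (the hidden cell may differ):
bayArea_unemployment.drop(0)   # drops the row whose index label is 0, returning a new dataframe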
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
⌨ Checking the usage again with shift-tab, and noting the index parameter which applies to rows, we can see
that we can use the .drop() method to drop a list of rows, so use this to drop index = [0,1,2,3] :
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
⌨ Then we can drop the first 50 rows by using a slicing method on the index. Assign the result to
bayArea_unemployment_drop50 with our same input, but instead of listing the indices, use a slice: (index =
bayArea_unemployment.index[:50])
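One way this could look, a sketch using the slice described above:
bayArea_unemployment_drop50 = bayArea_unemployment.drop(index=bayArea_unemployment.index[:50])
bayArea_unemployment_drop50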
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
Insert a column
This is a common need when doing analysis: creating a new variable (column) from existing data. There are
multiple ways of doing this.
⌨ One simple way of inserting a new column and populating it with data is to essentially assign something to a
new variable (column) that we name in the [] accessor of the data frame. It inserts that new column at the end
of the dataframe. In this example every row within our new column will have the same value (which we've
provided hard-coded, but this method would also work if you were to derive that from another source/input), and
we might be later merging rows with data from other regions.
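Based on the Region Name column that appears in the output below, the hidden cell probably looks something like this sketch:
bayArea_unemployment["Region Name"] = "Bay Area"   # every row gets the same hard-coded value
bayArea_unemployment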
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims Region Name
⌨ In addition we can use the .insert() method to insert a field into a dataframe at a specific location, in this
case as the second column (at the 1 position):
state = "California"
bayArea_unemployment.insert(1, "State", state)
bayArea_unemployment
In [ ]: #
Out[ ]:
Week_Ending State County UI_Claims PUA_Claims Region Name
⌨ Insert another column called Country and set it to "United States" . Put Country at position 1, just before State (as the output below shows).
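A sketch of one way to do this:
bayArea_unemployment.insert(1, "Country", "United States")
bayArea_unemployment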
In [ ]: #
Out[ ]:
Week_Ending Country State County UI_Claims PUA_Claims Region Name
652 2021-05-29 United States California San Francisco 3878 233 Bay Area
653 2021-05-29 United States California San Mateo 2346 140 Bay Area
654 2021-05-29 United States California Santa Clara 5307 419 Bay Area
655 2021-05-29 United States California Solano 1580 140 Bay Area
656 2021-05-29 United States California Sonoma 1410 126 Bay Area
Adding rows
⌨ Add a new row using the .append() method. The new row of data should be formatted with key:value
pairs representing each column heading and row value, respectively:
In [ ]: #
➦ Then check the usage of .append using either shift-tab or help. Note what it says about the usage and the first
parameter other :
... and what it says about the second parameter ignore_index which by default is False but we want to set
to True.
⌨ Note that we'll assign the output to the same object (while the documentation says it will be a new object), so
it replaces it:
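Based on the appended row visible at the bottom of the output (index 657), the hidden cell likely resembles this sketch; note that newer pandas versions remove .append() in favor of pd.concat() , as the FutureWarning below says:
new_row = {"Week_Ending": "2020-11-28", "Country": "United States", "State": "California",
           "County": "Marin", "UI_Claims": 450, "PUA_Claims": 50, "Region Name": "Bay Area"}
bayArea_unemployment = bayArea_unemployment.append(new_row, ignore_index=True)
bayArea_unemployment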
In [ ]: #
C:\Users\900008452\AppData\Local\Temp\ipykernel_10588\4119025374.py:2: Future
Warning: The frame.append method is deprecated and will be removed from panda
s in a future version. Use pandas.concat instead.
bay_area_unemployment = bayArea_unemployment.append(new_row, ignore_index=T
rue)
Out[ ]:
     Week_Ending          Country        State       County       UI_Claims  PUA_Claims  Region Name
0    2020-01-11 00:00:00  United States  California  Alameda      1487       0           Bay Area
2    2020-01-11 00:00:00  United States  California  Marin        144        0           Bay Area
3    2020-01-11 00:00:00  United States  California  Napa         162        0           Bay Area
...
653  2021-05-29 00:00:00  United States  California  San Mateo    2346       140         Bay Area
654  2021-05-29 00:00:00  United States  California  Santa Clara  5307       419         Bay Area
655  2021-05-29 00:00:00  United States  California  Solano       1580       140         Bay Area
656  2021-05-29 00:00:00  United States  California  Sonoma       1410       126         Bay Area
657  2020-11-28           United States  California  Marin        450        50          Bay Area
Answer: dict
Filtering
Filtering lets you select rows, based either on indices or values in fields.
⌨ Moving forward, recreate the dataframe bayArea_unemployment from the source CSV. Use the code we
created earlier that reads the CSV and parses the dates.
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
⌨ Define a filter and assign it to the alameda variable; this is called creating a mask. We can then apply the
mask to the dataframe to conduct the desired filter:
alameda = bayArea_unemployment["County"]=="Alameda"
alameda
In [ ]: #
Out[ ]: 0 True
1 False
2 False
3 False
4 False
...
652 False
653 False
654 False
655 False
656 False
Name: County, Length: 657, dtype: bool
⌨ Pass the filter to the larger dataframe to get only Alameda data returned:
alameda_unemployment = bayArea_unemployment[alameda]
alameda_unemployment
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
73 rows × 4 columns
alameda_unemployment = bayArea_unemployment[bayArea_unemployment.County.eq("Alameda")]
alameda_unemployment
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
73 rows × 4 columns
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
73 rows × 4 columns
The .isin method is nice when you have a list of values for a given field you want.
Let's try this with county names. We'll create a list of two county names, marin_sf = ["Marin", "San Francisco"] , and then build a mask with
bayArea_unemployment.County.isin(marin_sf)
In [ ]:
Out[ ]: 0 False
1 False
2 True
3 False
4 True
...
652 True
653 False
654 False
655 False
656 False
Name: County, Length: 657, dtype: bool
So this is a mask of rows, which can be used to filter the rows by applying it to bayArea_unemployment[ ... ] .
We'll then assign the output to marin_sf_unemployment .
It may seem a little weird to access the same data twice with
bayArea_unemployment[bayArea_unemployment.County.isin(marin_sf)] , but this is commonly done;
it's just a way of nesting operations.
⌨ To maybe help visualize this, the following methods are equivalent, so use either one:
theMask = bayArea_unemployment.County.isin(marin_sf)
marin_sf_unemployment = bayArea_unemployment[theMask]
marin_sf_unemployment = bayArea_unemployment[bayArea_unemployment.County.isin(marin_sf)]
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
⌨ Filtering on a not ( != ) condition, meaning returning everything that does not match a value or values:
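For example, a sketch that keeps every county except Alameda (the hidden cell may use a different value):
not_alameda = bayArea_unemployment[bayArea_unemployment["County"] != "Alameda"]
not_alameda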
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
Start by rebuilding the sierra data from the previous exercise. Use the version built with dictionaries so you'll
have named row indices for the stations, and end up with columns in the order elevation, temperature, latitude.
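If you need to rebuild it, here's a sketch using dictionaries, with the station values taken from the iterrows output later in this section:
sierra = pd.DataFrame({
    "elevation":   {"Oroville": 52, "Auburn": 394, "Sonora": 510, "Placerville": 564,
                    "Colfax": 725, "Nevada City": 848, "Quincy": 1042, "Yosemite": 1225,
                    "Sierraville": 1516, "Truckee": 1775, "Tahoe City": 1899, "Bodie": 2551},
    "temperature": {"Oroville": 10.7, "Auburn": 9.7, "Sonora": 7.7, "Placerville": 9.2,
                    "Colfax": 7.3, "Nevada City": 6.7, "Quincy": 4.0, "Yosemite": 5.0,
                    "Sierraville": 0.9, "Truckee": -1.1, "Tahoe City": -0.8, "Bodie": -4.4},
    "latitude":    {"Oroville": 39.52, "Auburn": 38.91, "Sonora": 37.97, "Placerville": 38.7,
                    "Colfax": 39.09, "Nevada City": 39.25, "Quincy": 39.94, "Yosemite": 37.75,
                    "Sierraville": 40.35, "Truckee": 39.33, "Tahoe City": 39.17, "Bodie": 38.21}})
sierra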
In [ ]: #
Out[ ]:
elevation temperature latitude
We don't have a categorical variable in our data, so we'll create one, "highElev" , that has True for elevations >
1000 and False otherwise. (In R, we'd call this a factor.)
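A one-line sketch of that:
sierra["highElev"] = sierra["elevation"] > 1000
sierra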
In [ ]: #
Out[ ]:
elevation temperature latitude highElev
Now we'll create a small dataframe where a summary statistic is derived for each variable by group. We'll just
derive the mean values, but we could instead derive other statistics.
sierraElevGroup = sierra.groupby("highElev").mean()
sierraElevGroup
In [ ]: #
Out[ ]:
elevation temperature latitude
highElev
# sorting the dataframe
sorted_unemployment = bayArea_unemployment.sort_values(by='UI_Claims', ascending=True)
sorted_unemployment
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims
Iterating a dataframe
⌨ One other way to scroll through a dataframe by row, probably to do something with each step, is to iterate it
with a for loop and the .iterrows() method, which you should learn about with either Shift-tab or help() to
understand what the following is doing. Each iteration provides the row index and the row of data as a series with
each of field names as indices. For example, to display the rows of sierra:
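A sketch matching the output below:
for station, row in sierra.iterrows():
    print(station)   # the row index (station name)
    print(row)       # the row as a Series, indexed by field name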
In [ ]:
Oroville
elevation 52
temperature 10.7
latitude 39.52
highElev False
Name: Oroville, dtype: object
Auburn
elevation 394
temperature 9.7
latitude 38.91
highElev False
Name: Auburn, dtype: object
Sonora
elevation 510
temperature 7.7
latitude 37.97
highElev False
Name: Sonora, dtype: object
Placerville
elevation 564
temperature 9.2
latitude 38.7
highElev False
Name: Placerville, dtype: object
Colfax
elevation 725
temperature 7.3
latitude 39.09
highElev False
Name: Colfax, dtype: object
Nevada City
elevation 848
temperature 6.7
latitude 39.25
highElev False
Name: Nevada City, dtype: object
Quincy
elevation 1042
temperature 4.0
latitude 39.94
highElev True
Name: Quincy, dtype: object
Yosemite
elevation 1225
temperature 5.0
latitude 37.75
highElev True
Name: Yosemite, dtype: object
Sierraville
elevation 1516
temperature 0.9
latitude 40.35
highElev True
Name: Sierraville, dtype: object
Truckee
elevation 1775
temperature -1.1
latitude 39.33
highElev True
Name: Truckee, dtype: object
Tahoe City
elevation 1899
temperature -0.8
latitude 39.17
highElev True
Name: Tahoe City, dtype: object
Bodie
elevation 2551
temperature -4.4
latitude 38.21
highElev True
Name: Bodie, dtype: object
Now we'll look at our bayArea_unemployment data, which is much larger, so we might want to print (or do
something else with) only the rows meeting a condition. Note in the code below how you're doing something only if the
condition is true. In this case, we're just printing something out, but you might imagine a situation where you do
something else with each selected record, which is where iteration becomes most useful. We'll see examples of
doing this in arcpy when we're using cursors to manipulate individual geometries.
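A sketch of the pattern, assuming a hypothetical threshold of 20,000 claims (the actual cutoff in the hidden cell may differ):
for index, row in bayArea_unemployment.iterrows():
    if row["UI_Claims"] > 20000:          # hypothetical threshold
        print(row["County"], row["UI_Claims"])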
In [ ]: #
Alameda 46056
Contra Costa 32426
San Francisco 26889
Santa Clara 49472
Alameda 38645
Contra Costa 26648
San Francisco 24040
Santa Clara 38511
Alameda 27574
Santa Clara 27906
Alameda 22842
Santa Clara 22920
Alameda 26991
Santa Clara 25577
Alameda 26274
Alameda 28748
#groupby in a dataframe
unemployment_by_county = bayArea_unemployment.groupby("County")
unemployment_by_county.groups
In [ ]: #
Out[ ]: {'Alameda': [0, 9, 18, 27, 36, 45, 54, 63, 72, ..., 630, 639, 648],
         'Contra Costa': [1, 10, 19, 28, 37, 46, 55, 64, 73, ..., 631, 640, 649],
         'Marin': [2, 11, 20, 29, 38, 47, 56, 65, 74, ..., 632, 641, 650],
         'Napa': [3, 12, 21, 30, 39, 48, 57, 66, 75, ..., 633, 642, 651],
         'San Francisco': [4, 13, 22, 31, 40, 49, 58, 67, 76, ..., 634, 643, 652],
         'San Mateo': [5, 14, 23, 32, 41, 50, 59, 68, 77, ..., 635, 644, 653],
         'Santa Clara': [6, 15, 24, 33, 42, 51, 60, 69, 78, ..., 636, 645, 654],
         'Solano': [7, 16, 25, 34, 43, 52, 61, 70, 79, ..., 637, 646, 655],
         'Sonoma': [8, 17, 26, 35, 44, 53, 62, 71, 80, ..., 638, 647, 656]}
❔ Using type() , what type of object did we create for the .groupby result and the .groups result? :
⛬ Interpret the readout above, and consider whether it's any different from a regular dictionary. Remember that a
list can be a value. Review our earlier discussion about the two uses of dictionaries in GIS data analysis. What
does this one represent? :
In [ ]: #
Out[ ]: pandas.io.formats.printing.PrettyDict
Various summary statistics can be derived, such as mean, median, std, sum, max, min, by applying them to the
groupby object just created. For instance, the mean is unemployment_by_county.mean() .
⌨ Derive the sum of the claims, probably the most useful, since it'll be the total claims per county, and assign it
to unemployment_by_county .
In [ ]: #
Out[ ]:
UI_Claims PUA_Claims
County
⌨ We can also combine the operations to go directly from the unemployment data to the sum:
bayArea_unemployment.groupby("County").sum()
In [ ]: #
Out[ ]:
UI_Claims PUA_Claims
County
⌨ ... then we can sort to see the county with the most unemployment claims in order:
sorted_unemployment_by_county = unemployment_by_county.sort_values(by='UI_Claims', ascending=False)
sorted_unemployment_by_county
In [ ]: #
Out[ ]:
UI_Claims PUA_Claims
County
Calculating fields:
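The output below shows a new PUA_UI_DIFF column; a sketch of one way to calculate such a field (the exact formula in the hidden cell is an assumption):
bayArea_unemployment["PUA_UI_DIFF"] = bayArea_unemployment["UI_Claims"] - bayArea_unemployment["PUA_Claims"]
bayArea_unemployment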
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims PUA_UI_DIFF
Assigning a Boolean
⌨ With the assign method, we can assign True or False by testing a condition, in this case whether there
are any PUA_claims:
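A sketch of that assign call (the hidden cell may differ):
bayArea_unemployment.assign(has_pua_claims = bayArea_unemployment["PUA_Claims"] > 0)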
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims has_pua_claims
⌨ Write an assign statement that calculates a True value in a column called "high_pua_claims" where the
PUA_Claims value is greater than 500.
In [ ]: #
Out[ ]:
Week_Ending County UI_Claims PUA_Claims high_pua_claims
In this section we'll take a look at conducting concats, merges, and joins on various tables. When you work with
data, sometimes the data might exist in multiple tables and being able to conduct joins on your data is powerful.
In this section we'll be looking at population data from two different time periods that include multiple geographic
regions. The tables that begin with "us_pop" contain population data for all 50 states of the United States,
and the tables that begin with "americas_pop" contain population data for various countries throughout North,
Central, and South America.
⌨ First, we'll create four separate dataframes from our source data. Make sure you've downloaded these csv
files from iLearn and have placed them in your py\Ex10 folder:
In [ ]: #
In [ ]: #
Out[ ]:
             Region           2000       2001       2002       2003       2004       2005       ...
Name
Mexico       North America    98899845   100298153  101684758  103081020  104514932  106005203  ...
Canada       North America    30588383   30880073   31178263   31488048   31815494   32164309   ...
Colombia     South America    39629968   40255967   40875360   41483869   42075955   42647723   ...
Venezuela    South America    24192446   24646472   25100408   25551624   25996594   26432447   ...
Costa Rica   Central America  3962372    4034074    4100925    4164053    4225155    4285502    ...
Guatemala    Central America  11650743   11924946   12208848   12500478   12796925   13096028   ...
Brazil       South America    174790340  177196054  179537520  181809246  184006481  186127103  ...
Argentina    South America    36870787   37275652   37681749   38087868   38491972   38892931   ...
Chile        South America    15342353   15516113   15684409   15849652   16014971   16182721   ...
Belize       Central America  247315     255063     262378     269425     276504     283800     ...
El Salvador  Central America  5887936    5927006    5962136    5994077    6023797    6052123    ...
Honduras     Central America  6574509    6751912    6929265    7106319    7282953    7458985    ...
Nicaragua    Central America  5069302    5145366    5219328    5292118    5364935    5438690    ...
Panama       Central America  3030328    3089648    3149188    3209048    3269356    3330217    ...
Bolivia      South America    8418264    8580235    8742814    8905823    9069039    9232306    ...
Ecuador      South America    12681123   12914667   13143465   13369678   13596388   13825847   ...
Guyana       South America    746715     745206     744789     745143     745737     746163     ...
Paraguay     South America    5323201    5428444    5531962    5632983    5730549    5824096    ...
Peru         South America    26459944   26799285   27100968   27372226   27624213   27866145   ...
Suriname     South America    470949     476579     482235     487942     493679     499464     ...
Uruguay      South America    3319736    3325473    3326040    3323668    3321476    3321803    ...
Venezuela    South America    24192446   24646472   25100408   25551624   25996594   26432447   ...
(columns 2006-2009 truncated in the page capture)
americas_pop_00_09.dtypes
In [ ]: #
Then we can examine all the fields within our dataframes. In the earlier Pandas lab, we looked at indices and
columns as a pandas.core.indexes.base.Index object.
⌨ Use .columns to get the column names of americas_pop_00_09 . We don't need it now, but as a
refresher, use .index to see the row index too.
In [ ]: #
In [ ]: #
⌨ Let's make a dataframe that just contains the south american populations from 2000 - 2009
In [ ]: #
Region object
2000 int64
2001 int64
2002 int64
2003 int64
2004 int64
2005 int64
2006 int64
2007 int64
2008 int64
2009 int64
dtype: object
Out[ ]:
           Region         2000       2001       2002       2003       2004       2005       ...
Name
Colombia   South America  39629968   40255967   40875360   41483869   42075955   42647723   ...
Venezuela  South America  24192446   24646472   25100408   25551624   25996594   26432447   ...
Brazil     South America  174790340  177196054  179537520  181809246  184006481  186127103  ...
Argentina  South America  36870787   37275652   37681749   38087868   38491972   38892931   ...
Chile      South America  15342353   15516113   15684409   15849652   16014971   16182721   ...
Bolivia    South America  8418264    8580235    8742814    8905823    9069039    9232306    ...
Ecuador    South America  12681123   12914667   13143465   13369678   13596388   13825847   ...
Guyana     South America  746715     745206     744789     745143     745737     746163     ...
Paraguay   South America  5323201    5428444    5531962    5632983    5730549    5824096    ...
Peru       South America  26459944   26799285   27100968   27372226   27624213   27866145   ...
Suriname   South America  470949     476579     482235     487942     493679     499464     ...
Uruguay    South America  3319736    3325473    3326040    3323668    3321476    3321803    ...
Venezuela  South America  24192446   24646472   25100408   25551624   25996594   26432447   ...
(columns 2006-2009 truncated in the page capture)
⌨ Next we're going to create a new dataframe for the purpose of plotting our data:
south_america_forplot = south_america_00_09.drop(columns=["Region"])
south_america_forplot
In [ ]: #
Out[ ]:
2000 2001 2002 2003 2004 2005 2006
Name
south_america_forplot.plot()
In [ ]: #
In [ ]: #
south_america_forplot.transpose().plot(title='South America Population 2000 - 2009', figsize=(20,15))
Create plot
⌨ Now create the same plot that we did above, but this time create it with the data from Central America.
In [ ]: #
In [ ]: #
Using concat and groupby to derive the total population by year for North America, 2000-2009
The concat method lets you combine rows from datasets that are structured similarly.
For our task, we'll use data from two different tables (Americas and US) for the first decade of the 21st century.
We'll use:
americas_pop_00_09 , which contains population for countries in North, South and Central America
(excluding the United States)
us_pop_00_09 , which has United States population by state.
⌨ And for that decade, start with pulling out North America from the Americas data:
In [ ]: #
Out[ ]:
        Region         2000      2001       2002       2003       2004       2005       2006       ...
Name
Mexico  North America  98899845  100298153  101684758  103081020  104514932  106005203  107560153  ...
Canada  North America  30588383  30880073   31178263   31488048   31815494   32164309   32536987   ...
us_pop_00_09
In [ ]: #
Out[ ]:
                      Region         2000      2001      2002      2003      2004      2005      2006      ...
Name
Alabama               North America  4452173   4467634   4480089   4503491   4530729   4569805   4628981   ...
Alaska                North America  627963    633714    642337    648414    659286    666946    675302    ...
Arizona               North America  5160586   5273477   5396255   5510364   5652404   5839077   6029141   ...
Arkansas              North America  2678588   2691571   2705927   2724816   2749686   2781097   2821761   ...
California            North America  33987977  34479458  34871843  35253159  35574576  35827943  36021202  ...
Colorado              North America  4326921   4425687   4490406   4528732   4575013   4631888   4720423   ...
Connecticut           North America  3411777   3432835   3458749   3484336   3496094   3506956   3517460   ...
Delaware              North America  786373    795699    806169    818003    830803    845150    859268    ...
District of Columbia  North America  572046    574504    573158    568502    567754    567136    570681    ...
Florida               North America  16047515  16356966  16689370  17004085  17415318  17842038  18166990  ...
Georgia               North America  8227303   8377038   8508256   8622793   8769252   8925922   9155813   ...
Hawaii                North America  1213519   1225948   1239613   1251154   1273569   1292729   1309731   ...
Idaho                 North America  1299430   1319962   1340372   1363380   1391802   1428241   1468669   ...
Illinois              North America  12434161  12488445  12525556  12556006  12589773  12609903  12643955  ...
Indiana               North America  6091866   6127760   6155967   6196638   6233007   6278616   6332669   ...
Iowa                  North America  2929067   2931997   2934234   2941999   2953635   2964454   2982644   ...
Kansas                North America  2693681   2702162   2713535   2723004   2734373   2745299   2762931   ...
Kentucky              North America  4049021   4068132   4089875   4117170   4146101   4182742   4219239   ...
Louisiana             North America  4471885   4477875   4497267   4521042   4552238   4576628   4302665   ...
Maine                 North America  1277072   1285692   1295960   1306513   1313688   1318787   1323619   ...
Maryland              North America  5311034   5374691   5440389   5496269   5546935   5592379   5627367   ...
Massachusetts         North America  6361104   6397634   6417206   6422565   6412281   6403290   6410084   ...
Michigan              North America  9952450   9991120   10015710  10041152  10055315  10051137  10036081  ...
Minnesota             North America  4933692   4982796   5018935   5053572   5087713   5119598   5163555   ...
Mississippi           North America  2848353   2852994   2858681   2868312   2889010   2905943   2904978   ...
Missouri              North America  5607285   5641142   5674825   5709403   5747741   5790300   5842704   ...
Montana               North America  903773    906961    911667    919630    930009    940102    952692    ...
Nebraska              North America  1713820   1719836   1728292   1738643   1749370   1761497   1772693   ...
Nevada                North America  2018741   2098399   2173791   2248850   2346222   2432143   2522658   ...
New Hampshire         North America  1239882   1255517   1269089   1279840   1290121   1298492   1308389   ...
New Jersey            North America  8430621   8492671   8552643   8601402   8634561   8651974   8661679   ...
New Mexico            North America  1821204   1831690   1855309   1877574   1903808   1932274   1962137   ...
New York              North America  19001780  19082838  19137800  19175939  19171567  19132610  19104631  ...
North Carolina        North America  8081614   8210122   8326201   8422501   8553152   8705407   8917270   ...
North Dakota          North America  642023    639062    638168    638817    644705    646089    649422    ...
Ohio                  North America  11363543  11387404  11407889  11434788  11452251  11463320  11481213  ...
Oklahoma              North America  3454365   3467100   3489080   3504892   3525233   3548597   3594090   ...
Oregon                North America  3429708   3467937   3513424   3547376   3569463   3613202   3670883   ...
Pennsylvania          North America  12284173  12298970  12331031  12374658  12410722  12449990  12510809  ...
Rhode Island          North America  1050268   1057142   1065995   1071342   1074579   1067916   1063096   ...
South Carolina        North America  4024223   4064995   4107795   4150297   4210921   4270150   4357847   ...
South Dakota          North America  755844    757972    760020    763729    770396    775493    783033    ...
Tennessee             North America  5703719   5750789   5795918   5847812   5910809   5991057   6088766   ...
Texas                 North America  20944499  21319622  21690325  22030931  22394023  22778123  23359580  ...
Utah                  North America  2244502   2283715   2324815   2360137   2401580   2457719   2525507   ...
Vermont               North America  609618    612223    615442    617858    619920    621215    622892    ...
Virginia              North America  7105817   7198362   7286873   7366977   7475575   7577105   7673725   ...
Washington            North America  5910512   5985722   6052349   6104115   6178645   6257305   6370753   ...
West Virginia         North America  1807021   1801481   1805414   1812295   1816438   1820492   1827912   ...
Wisconsin             North America  5373999   5406835   5445162   5479203   5514026   5546166   5577655   ...
Wyoming               North America  494300    494657    500017    503453    509106    514157    522667    ...
(columns 2007-2009 truncated in the page capture)
In [ ]: #
Out[ ]:
   Region         2000       2001       2002       2003       2004       2005       2006       ...
0  North America  282162411  284968955  287625193  290107933  292805298  295516599  298379912  ...
⌨ Notice that when we did the groupby and sum, it set the index to the Region column and we lost "Name" , the
previous index. So in order to concat the two tables, let's set the index in each table to "Region" .
all_us_00_09.set_index("Region")
In [ ]: #
Out[ ]:
               2000       2001       2002       2003       2004       2005       2006       ...
Region
North America  282162411  284968955  287625193  290107933  292805298  295516599  298379912  ...
north_america_00_09.set_index("Region")
In [ ]: #
Out[ ]:
               2000      2001       2002       2003       2004       2005       2006       ...
Region
North America  98899845  100298153  101684758  103081020  104514932  106005203  107560153  ...
North America  30588383  30880073   31178263   31488048   31815494   32164309   32536987   ...
⌨ Now we can concat the two datasets to get Canada, Mexico and the US:
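A sketch, assuming we use the two set_index results from above (the name north_america_pop is just a suggestion):
north_america_pop = pd.concat([north_america_00_09.set_index("Region"),
                               all_us_00_09.set_index("Region")])
north_america_pop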
In [ ]: #
Out[ ]:
               2000       2001       2002       2003       2004       2005       2006       ...
Region
North America  98899845   100298153  101684758  103081020  104514932  106005203  107560153  ...
North America  30588383   30880073   31178263   31488048   31815494   32164309   32536987   ...
North America  282162411  284968955  287625193  290107933  292805298  295516599  298379912  ...
In [ ]:
Out[ ]:
   Region         2000       2001       2002       2003       2004       2005       2006       ...
0  North America  411650639  416147181  420488214  424677001  429135724  433686111  438477052  ...
In [ ]: #
Out[ ]:
               Region           2000      2001       2002       2003       2004       2005       ...
Name
Mexico         North America    98899845  100298153  101684758  103081020  104514932  106005203  ...
Canada         North America    30588383  30880073   31178263   31488048   31815494   32164309   ...
Colombia       South America    39629968  40255967   40875360   41483869   42075955   42647723   ...
Venezuela      South America    24192446  24646472   25100408   25551624   25996594   26432447   ...
Costa Rica     Central America  3962372   4034074    4100925    4164053    4225155    4285502    ...
...
Virginia       North America    7105817   7198362    7286873    7366977    7475575    7577105    ...
Washington     North America    5910512   5985722    6052349    6104115    6178645    6257305    ...
West Virginia  North America    1807021   1801481    1805414    1812295    1816438    1820492    ...
Wisconsin      North America    5373999   5406835    5445162    5479203    5514026    5546166    ...
Wyoming        North America    494300    494657     500017     503453     509106     514157     ...
73 rows × 11 columns
⌨ Then we'll concat data tables for the same regions from 2010 to 2019:
In [ ]: #
Out[ ]:
               Region           2010       2011       2012       2013       2014       2015       ...
Name
Mexico         North America    114092963  115695473  117274155  118827161  120355128  121858258  ...
Canada         North America    34147564   34539159   34922030   35296528   35664337   36026676   ...
Colombia       South America    45222700   45662748   46075718   46495493   46967696   47520667   ...
Venezuela      South America    28439940   28887874   29360837   29781040   30042968   30081829   ...
Costa Rica     Central America  4577378    4633086    4688000    4742107    4795396    4847804    ...
...
Virginia       North America    8023699    8101155    8185080    8252427    8310993    8361808    ...
Washington     North America    6742830    6826627    6897058    6963985    7054655    7163657    ...
West Virginia  North America    1854239    1856301    1856872    1853914    1849489    1842050    ...
Wisconsin      North America    5690475    5705288    5719960    5736754    5751525    5760940    ...
Wyoming        North America    564487     567299     576305     582122     582531     585613     ...
73 rows × 11 columns
⌨ Now that we have these two new datasets, we can run a pd.merge() operation.
With the merge function, the first two parameters will always be the left and right dataframes. For our purpose,
we want to set left_index=True and right_index=True to specify that the indices will be our key values, so we
can retain the countries as the index. Finally, we pass in how='right' to indicate a right join.
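A sketch, using hypothetical names pop_00_09 and pop_10_19 for the two concatenated dataframes:
merged_df = pd.merge(pop_00_09, pop_10_19, left_index=True, right_index=True, how='right')
merged_df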
In [ ]: #
Out[ ]:
               Region_x       2000      2001      2002      2003      2004      2005      2006      ...
Name
Alabama        North America  4452173   4467634   4480089   4503491   4530729   4569805   4628981   ...
Alaska         North America  627963    633714    642337    648414    659286    666946    675302    ...
Argentina      South America  36870787  37275652  37681749  38087868  38491972  38892931  39289878  ...
Arizona        North America  5160586   5273477   5396255   5510364   5652404   5839077   6029141   ...
Arkansas       North America  2678588   2691571   2705927   2724816   2749686   2781097   2821761   ...
...
Virginia       North America  7105817   7198362   7286873   7366977   7475575   7577105   7673725   ...
Washington     North America  5910512   5985722   6052349   6104115   6178645   6257305   6370753   ...
West Virginia  North America  1807021   1801481   1805414   1812295   1816438   1820492   1827912   ...
Wisconsin      North America  5373999   5406835   5445162   5479203   5514026   5546166   5577655   ...
Wyoming        North America  494300    494657    500017    503453    509106    514157    522667    ...
75 rows × 22 columns
.join()
⌨ Above was the merge method, which gives you a lot of ability to do different types of joins and to join on
different fields if you want. Next, we'll look at the more straightforward join operation.
The .join() method is useful if you know you want to join on the index field and your data is relatively
clean and straightforward. DataFrame.join() lets us use dot notation on our left table, then pass in the right table and how as an
argument; this eliminates the need to specify the right and left index arguments like we did in the previous
function. If on=None , the join key will be the row index. Let's observe how the nulls are affecting our analysis by
taking a look at the DataFrame head.
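A sketch of the equivalent join; the Region_left column in the output suggests suffixes like these were used (an assumption), and pop_00_09 / pop_10_19 are hypothetical names:
joined_df = pop_00_09.join(pop_10_19, how='right', lsuffix='_left', rsuffix='_right')
joined_df.head()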
In [ ]: #
Out[ ]:
               Region_left    2000      2001      2002      2003      2004      2005      2006      ...
Name
Alabama        North America  4452173   4467634   4480089   4503491   4530729   4569805   4628981   ...
Alaska         North America  627963    633714    642337    648414    659286    666946    675302    ...
Argentina      South America  36870787  37275652  37681749  38087868  38491972  38892931  39289878  ...
Arizona        North America  5160586   5273477   5396255   5510364   5652404   5839077   6029141   ...
Arkansas       North America  2678588   2691571   2705927   2724816   2749686   2781097   2821761   ...
...
Virginia       North America  7105817   7198362   7286873   7366977   7475575   7577105   7673725   ...
Washington     North America  5910512   5985722   6052349   6104115   6178645   6257305   6370753   ...
West Virginia  North America  1807021   1801481   1805414   1812295   1816438   1820492   1827912   ...
Wisconsin      North America  5373999   5406835   5445162   5479203   5514026   5546166   5577655   ...
Wyoming        North America  494300    494657    500017    503453    509106    514157    522667    ...
75 rows × 22 columns
In [ ]: #
joined_plot = joined_df.groupby("Region_left").sum().transpose().rename_axis("year")
joined_plot.plot()
Introduction to GeoPandas
Before you begin this exercise make sure you have all of the installation working; this should have happened the
first week. Also, make sure you have downloaded the data from iLearn and placed it in the "geodata" folder
inside the project folder that holds the python files. To complete this exercise make sure you have this data in
that geodata folder:
owid-covid-data.csv
World_Map.shp
BA_Counties.shp
sf_neighborhoods.shp
SF_Nov20_311.csv
In this exercise we will be working with the GeoPandas library. GeoPandas is an effective open source Python
library for analyzing large amounts of tabular and, in particular, spatial data. GeoPandas adds a spatial geometry
data type to Pandas and enables spatial operations on these types, using shapely
(https://github.com/Toblerity/Shapely). GeoPandas leverages Pandas together with several core open source
geospatial packages and practices to provide a uniquely simple and convenient framework for handling
geospatial feature data, operating on both geometries and attributes jointly, and as with Pandas, largely
eliminating the need to iterate over features (rows).
GeoPandas builds on mature, stable and widely used packages (Pandas, shapely, etc). It is being supported
more and more as a preferred Python data structure for geospatial vector data, and a useful tool for exploratory
spatial data analysis.
import pandas as pd
import matplotlib.pyplot as plt
In [ ]: #
In this first section we will be using global COVID data to make a map of cases and deaths by country. The data
were downloaded from Our World in Data (https://ourworldindata.org/coronavirus-source-data) on 5 May 2022
as owid-covid-data.csv . You should have downloaded
this file from iLearn and placed it in your working project directory for this exercise, in the geodata folder.
In [ ]: #
Out[ ]:
        iso_code  continent  location     date        total_cases  new_cases  new_cases_smoothed  ...
0       AFG       Asia       Afghanistan  2020-02-24  5.0          5.0        NaN
1       AFG       Asia       Afghanistan  2020-02-25  5.0          0.0        NaN
2       AFG       Asia       Afghanistan  2020-02-26  5.0          0.0        NaN
3       AFG       Asia       Afghanistan  2020-02-27  5.0          0.0        NaN
4       AFG       Asia       Afghanistan  2020-02-28  5.0          0.0        NaN
...
166093  ZWE       Africa     Zimbabwe     2022-02-28  236380.0     577.0      401.286
166094  ZWE       Africa     Zimbabwe     2022-03-01  236871.0     491.0      413.000
166095  ZWE       Africa     Zimbabwe     2022-03-02  237503.0     632.0      416.286
166096  ZWE       Africa     Zimbabwe     2022-03-03  237503.0     0.0        362.286
166097  ZWE       Africa     Zimbabwe     2022-03-04  238739.0     1236.0     467.429
(remaining columns truncated in the page capture)
all_data['date'] = pd.to_datetime(all_data['date'])
all_data = all_data.filter(['location', "date", "new_cases", "new_deaths"], axis=1)
year_grpd = all_data.groupby(['location',all_data['date'].dt.year]).sum()
year_grpd.columns = ['TotalCases', 'TotalDeaths']
year_grpd
In [ ]: #
Out[ ]:
TotalCases TotalDeaths
location date
⌨ Use another .groupby() and .sum() to end up with total cases and deaths by location.
grouped_data = year_grpd.groupby('location').sum()
grouped_data
In [ ]: #
Out[ ]:
TotalCases TotalDeaths
location
⌨ If we want to rename our dataframe's index then we can use the .index.names property to assign the desired
value to our index column (note that this is different from .set_index which makes a field the index; we're just
renaming the existing index). In this example below, we're going to use the word 'NAME' for the location field.
world_covid_cases.index.names = ['NAME']
world_covid_cases
In [ ]: #
Out[ ]:
TotalCases TotalDeaths
NAME
⌨ Enter and execute the code that would show all the datatypes ( .dtypes ) in the world_covid_cases object
above.
In [ ]: #
Load spatial data for geopandas and merge with our COVID data
⌨ Before we move forward in our data manipulation, let's load the shapefile of all countries in the world and
then plot this shape file. We'll now start using geopandas so we'll need to import it.
In [ ]: #
In [ ]: #
Out[ ]:
NAME geometry
243 South Georgia South Sandwich Islands MULTIPOLYGON (((-27.32584 -59.42722, -27.29806...
⌨ Next, we'd like to compare the values that exist within our World shape country names with the names in our
World COVID list. First, we'll loop over each name within the world_covid_cases dataframe, by passing the
index field to the .tolist() method. Then we'll generate a list from the World shape file and only print out the
values that do not exist within both lists.
world_data_list = world_data['NAME'].tolist()
for item in world_covid_cases.index.tolist():
if not(item in world_data_list):
print(item + ' is not in the world data list')
In [ ]: #
Just to see the types of names in our World shape file, let's loop over and print each item in our world_data_list .
⌨ While some of the missing records are continental and other regional summaries, we have identified some
major countries whose names we need to replace in world_data in order to make sure we can conduct a merge
with the index field of the world_covid_cases dataframe.
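One way to do the replacements is with .replace() on the NAME field; the mapping below is purely hypothetical, so substitute the mismatches you actually found:
world_data['NAME'] = world_data['NAME'].replace(
    {'OLD NAME IN SHAPEFILE': 'NAME USED IN THE COVID DATA'})   # hypothetical mapping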
In [ ]: #
⌨ Now we can use the .merge() method to join our tabular World covid cases to our World shapefile.
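A sketch, joining world_data's NAME column to the NAME index of world_covid_cases (the name world_covid is just a suggestion):
world_covid = world_data.merge(world_covid_cases, left_on='NAME', right_index=True)
world_covid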
In [ ]: #
Out[ ]:
NAME geometry TotalCases TotalDeaths
⌨ Then we can plot our total COVID cases worldwide. You might notice some parameters below that we have
not discussed in class. There are a lot of variables and information about how to plot using the matplotlib library;
see the GeoPandas documentation on plotting: https://geopandas.org/docs/user_guide/mapping.html
In [ ]: #
❔ Can you describe what the "f" and "ax1" represent in the above block of code?
⛬
❔ Renaming some countries resulted in a pretty convincing map, yet there remain some issues resulting
from geopolitical differences. For instance, what's happening with Western Sahara?
In [ ]: #
Out[ ]:
   OBJECTID  Acres         County         Shape_Leng     Shape_Area    geometry
0  1         4.763008e+05  Alameda        351719.536494  1.927521e+09  MULTIPOLYGON (((6033505.488 2112049.179, 60331...
1  2         4.811194e+05  Contra Costa   292676.346086  1.947021e+09  MULTIPOLYGON (((6027405.300 2153185.401, 60272...
2  3         3.360223e+05  Marin          419122.293142  1.359834e+09  MULTIPOLYGON (((5988028.191 2211673.341, 59879...
3  4         5.054966e+05  Napa           278766.105739  2.045672e+09  POLYGON ((6021779.628 2506831.778, 6021843.051...
4  5         3.021589e+04  San Francisco  106843.644006  1.222794e+08  MULTIPOLYGON (((6006624.799 2128525.324, 60061...
5  6         2.902726e+05  San Mateo      306693.656488  1.174692e+09  MULTIPOLYGON (((6068478.537 2017484.502, 60691...
6  7         8.312778e+05  Santa Clara    380752.936529  3.364062e+09  POLYGON ((6136976.940 1994399.503, 6137343.878...
7  8         5.437972e+05  Solano         330514.170440  2.200669e+09  MULTIPOLYGON (((6121744.455 2214795.253, 61216...
8  9         1.015939e+06  Sonoma         400983.408337  4.111360e+09  POLYGON ((5987868.159 2233054.492, 5987850.662...
We may not have used len() before with dataframes, but it's a base Python function that can
return the length of strings and lists (or, for numpy arrays, the length of the first dimension); with
dataframes it returns the number of records (rows), which for spatial data is the number of features. What do you get if you
use .size with a dataframe?
In [ ]: #
Out[ ]: 9
ba_counties.type.unique()
In [ ]: #
ba_counties.iloc[0]
In [ ]: #
Out[ ]: OBJECTID 1
Acres 476300.784182
County Alameda
Shape_Leng 351719.536494
Shape_Area 1927520887.16
geometry (POLYGON ((6033505.487613098 2112049.179188581...
Name: 0, dtype: object
⌨ Use this with world_data and ba_counties , and check http://epsg.io to look up others
(there are many thousands of these, since projections also have parameters that produce many variants, and
these are needed for optimal display of geospatial data).
In [ ]: #
⌨ We can reproject our data in GeoPandas. Here we'll reproject our NAD83 data to UTM Zone 10 N, which has
an EPSG code of 32610.
ba_counties_UTM = ba_counties.to_crs('EPSG:32610')
In [ ]: #
ba_counties_UTM.crs
In [ ]: #
⌨ Let's explore this field. The following code will show the first 5 values in the geometry field: this is actually a
GeoSeries...
ba_counties_UTM['geometry'][0:5]
Note: if you open a shapefile in ArcGIS, you probably won't see a geometry field. Instead you'll
see a Shape field, which represents the same thing, but it doesn't display its specific contents in
the attribute table.
In [ ]: #
⌨ We can show just the first value, which will appear as a shape.
ba_counties_UTM['geometry'][0]
In [ ]: #
Out[ ]:
In [ ]: #
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
Out[ ]:
⌨ Extract one feature geometry to a variable and show its type (not the same as dtype).
thePoly = ba_counties_UTM['geometry'][0]
type(thePoly)
In [ ]: #
Out[ ]: shapely.geometry.multipolygon.MultiPolygon
❔ So what is shapely ? Search online about shapely and geopandas to research your answer.
⌨ Show thePoly and you should see the individual polygon (not surprising based on what we saw before).
thePoly
In [ ]: #
Out[ ]:
⌨ Look at other geometric properties for thePoly using the properties .area and .length .
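For example (units are meters and square meters, since the data are now in UTM):
thePoly.area     # area in m²
thePoly.length   # perimeter length in m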
In [ ]: #
⌨ Assign the result of the property .boundary to theBoundary , display it and list its geometry type .
In [ ]: #
Out[ ]:
Out[ ]: shapely.geometry.multilinestring.MultiLineString
Centroid
⌨ Create the centroid with the .centroid property, print it to see its value, and show its type.
In [ ]: #
Out[ ]: shapely.geometry.point.Point
print(theCentroid)
Buffer
⌨ We can also do something you've certainly done before -- create a buffer polygon. Use the .buffer()
method and provide 1000 (1 km) as the sole distance parameter.
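A sketch (the variable name is just a suggestion):
theBuffer = thePoly.buffer(1000)   # 1 km buffer polygon
theBuffer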
In [ ]: #
Out[ ]:
Distance Analysis
We can also compute distances fairly easily with GeoPandas objects.
⌨ Here we'll compute the distance (in km) of each feature to the center point of all ( unary_union ) features.
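One plausible implementation, measuring from each county centroid to the centroid of the unary_union of all features and converting meters to km (an assumption about the hidden cell):
theCenter = ba_counties_UTM.unary_union.centroid
theDistances = ba_counties_UTM.centroid.distance(theCenter) / 1000   # km
theDistances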
In [ ]: #
⌨ Then get the mean distance using the mean() method -- note that it's a method not a property.
theDistances.mean()
In [ ]: #
Out[ ]: 21.463153699118223
theDistances.hist(figsize=(15,3))
In [ ]: #
Here we will buffer the centroid of a feature and then intersect that with the feature.
You might have expected to see a map, but what we've just created is a record of the spatial
dataframe, so it shows up as a row. Earlier we just pulled out the geometry.
In [ ]: #
Out[ ]:
OBJECTID Acres County Shape_Leng Shape_Area geometry
MULTIPOLYGON
0 1 476300.784182 Alameda 351719.536494 1.927521e+09 (((559203.322 4181745.002,
559093...
feature_geometry = Alameda['geometry']
type(feature_geometry)
features_geometry = ba_counties_UTM['geometry']
type(features_geometry)
⛬ Interpretation:
In [ ]: #
Out[ ]: geopandas.geoseries.GeoSeries
Out[ ]: geopandas.geoseries.GeoSeries
Replacing geometries
We can modify the geometry of our data in various ways.
⌨ Here's a simple example that will replace the shape with a 5000-m buffered centroid.
ba_counties_copy = ba_counties_UTM.copy()
ba_counties_copy['geometry'] = ba_counties_copy['geometry'].centroid.buffer(5000)
ba_counties_copy.plot()
In [ ]: #
⌨ Use pd.read_csv() to read this file and assign it to a new dataframe sf_311 .
In [ ]: #
Out[ ]:
         CaseID            Opened            Closed           Updated  Status                                       Status_Notes      Responsible_Ag...
0      13138553   11/13/2020 7:21               NaN  11/13/2020 14:09    Open                                                NaN        Port Author...
2      13091973    11/1/2020 0:08    11/1/2020 1:29    11/1/2020 1:29  Closed  Case Resolved - POLICE MATTER. PLEASE NOTIFY ...   311 Supervisor Q...
3      13155970   11/18/2020 6:27   11/18/2020 7:27   11/18/2020 7:27  Closed  Case Resolved - Officer responded to request u...  Parking Enforc...
4      13180413  11/24/2020 14:19  11/24/2020 17:06  11/24/2020 17:06  Closed  Case Resolved - Officer responded to request u...  Parking Enforc...
...         ...               ...               ...               ...     ...                                                ...                 ...
47991  13096891   11/2/2020 11:07   11/4/2020 10:08   11/4/2020 10:08  Closed  Comment Noted - To view the status, see\n13097...                 ...
47992  13096288    11/2/2020 9:58   11/3/2020 13:28   11/3/2020 13:28  Closed  Case Transferred - DPH Environmental Health\n1...                 ...
47993  13094573   11/1/2020 19:52    11/2/2020 8:00    11/2/2020 8:00  Closed  Case is a Duplicate - \n13094578,11/01/2020 0...                  ...
47994  13093656   11/1/2020 14:13   11/17/2020 9:20   11/17/2020 9:20  Closed  Cancelled - \n13041136,10/19/2020 11:25:00 AM...                  ...
47995  13092881   11/1/2020 10:22    11/6/2020 7:37    11/6/2020 7:37  Closed  Comment Noted - To view the status, see\n11470...                 ...
⌨ Display all the field names within our dataframe using the .columns property.
In [ ]: #
⌨ Next, because we might not be too familiar with our input data, we can do a .groupby() operation on
"Request_Type" and sort the results to see the 10 most common request types in our 311 data.
sf_311.groupby('Request_Type').Request_Type.count().sort_values(ascending=False).head(10)
❔ Interpret the various methods we've used in this statement and how they provide the result we're seeking.
⛬
In [ ]: #
Out[ ]: Request_Type
Bulky Items 10085
General Cleaning 7666
request_for_service 2180
Encampment Reports 1886
Human or Animal Waste 1855
Other_Illegal_Parking 1725
Graffiti on Other_enter_additional_details_below 1536
City_garbage_can_overflowing 1479
Illegal Postings - Affixed_Improperly 1399
Parking_on_Sidewalk 1188
Name: Request_Type, dtype: int64
⌨ Detect missing coordinates if any of the records are missing either Latitude or Longitude .
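A minimal sketch of one way to do this, assuming the column names Latitude and Longitude from the data:

missing_coordinates = sf_311.Latitude.isna() | sf_311.Longitude.isna()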
In [ ]: #
missing_coordinates.sum()
In [ ]: #
Out[ ]: 46
⌨ Here we can reassign the sf_311 dataframe with just those records that have coordinates, by selecting with .notna()
for either Latitude, Longitude, or Point (which holds a tuple of latitude and longitude, probably redundant with the
other two fields).
In [ ]: #
⌨ Use the missing_coordinates check from above to confirm that there are no missing coordinates.
In [ ]: #
Out[ ]: 0
⌨ Next, use gpd.read_file() and .plot() to assign a new GeoDataFrame sf_neighborhoods from
geodata/sf_neighborhoods.shp and plot it.
In [ ]: #
sf_neighborhoods
In [ ]: #
Out[ ]:
                                                  link              name                                           geometry
0    http://en.wikipedia.org/wiki/Sea_Cliff,_San_Fr...          Seacliff  POLYGON ((-122.49346 37.78352, -122.49373 37.7...
1                                                 None       Lake Street  POLYGON ((-122.48715 37.78379, -122.48729 37.7...
3                                                 None  Presidio Terrace  POLYGON ((-122.47241 37.78735, -122.47100 37.7...
4    http://www.sfgate.com/neighborhoods/sf/innerri...    Inner Richmond  POLYGON ((-122.47263 37.78631, -122.46683 37.7...
112  http://en.wikipedia.org/wiki/Corona_Heights,_S...    Corona Heights  POLYGON ((-122.43519 37.76267, -122.43532 37.7...
113        http://en.wikipedia.org/wiki/Haight-Ashbury   Ashbury Heights  POLYGON ((-122.45196 37.76148, -122.45210 37.7...
114  http://en.wikipedia.org/wiki/Eureka_Valley,_Sa...     Eureka Valley  POLYGON ((-122.43734 37.76235, -122.43704 37.7...
In [ ]: #
In [ ]: #
⌨ Realizing that the four extent values are ordered this way -- [x_min, y_min, x_max, y_max] -- assign those four
values as variables x_min , etc.
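A one-line sketch using .total_bounds, which returns the four values in that order:

x_min, y_min, x_max, y_max = sf_neighborhoods.total_bounds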
In [ ]: #
⌨ In the code above we created 4 variables that contain the x/y max and min. Next, we can flag any records that fall
too far outside our sf_neighborhoods bounding extent. We'll do this to exclude any 311 calls that are beyond
the sf_neighborhoods extent.
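A hedged sketch of the flag; the exact tolerance the original used isn't shown, so this version simply marks anything outside the bounding box itself:

outside = ((sf_311.Longitude < x_min) | (sf_311.Longitude > x_max) |
           (sf_311.Latitude < y_min) | (sf_311.Latitude > y_max))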
In [ ]: #
⌨ Now, we can see how many records are beyond our extent.
outside.sum()
In [ ]: #
Out[ ]: 8
⌨ Let's filter our data. Below is a new operation that you might not have seen in class yet: the ~ in the code
below is a boolean operator in Pandas that means "not". So in this code we are saying: return the values from the
sf_311 dataframe that are not outside our extent.
sf_311 = sf_311[~outside]
In [ ]: #
Once we've filtered out our data, we can now convert our Pandas dataframe into a GeoPandas Geodataframe.
Here notice how we pass the .points_from_xy() method into the geometry option for loading into a Geopandas
Geodataframe.
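A minimal sketch of that conversion, assuming the Longitude/Latitude column names and GCS WGS84 coordinates:

sf_311 = gpd.GeoDataFrame(
    sf_311,
    geometry=gpd.points_from_xy(sf_311.Longitude, sf_311.Latitude),
    crs="EPSG:4326")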
In [ ]: #
Before we plot our points, we'll want to add a basemap, and for that we need to use the contextily library. This
needs some introduction...
One common source of error with basemaps is when a service provider goes offline, or starts charging
for their tiles; I've had this happen with contextily when its default basemap provider Stamen started
charging. So we'll explicitly tell it to access OpenStreetMap .
⌨ Before we plot our points we'll want to import the contextily python library as "cx", so we can add a basemap
to our plot.
import contextily as cx
In [ ]: #
⌨ Since the default projection is 4326 (GCS WGS84), we don't need to use .set_crs to set that, but we'd like to
put it in 3857 ("Web Mercator") using .to_crs() to make sure the data will draw on top of our base map tiles from
contextily.
sf_311 = sf_311.to_crs("EPSG:3857")
Note that we reassigned sf_311 to be in this new crs. If we needed to work with it further in GCS,
we'd need to run all of our code again to build it. We might have instead assigned it to something
like sf_311_webm to separate it, but there's no particular advantage.
In [ ]: #
sf_311.crs
In [ ]: #
⌨ Set the basemap and basemap extent from contextily. Note that we'll specify the basemap tile provider with
source=cx.providers.OpenStreetMap.Mapnik . To see other services that may be available, see
https://contextily.readthedocs.io/en/latest/providers_deepdive.html
Note that we needed to use .total_bounds again here since these need to be in web mercator.
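A sketch of the plot; the figure size, marker size, and alpha values are illustrative guesses, and the title matches the output below:

ax = sf_311.plot(figsize=(10, 10), markersize=2, alpha=0.2)
x_min, y_min, x_max, y_max = sf_311.total_bounds   # now in web mercator
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
cx.add_basemap(ax, source=cx.providers.OpenStreetMap.Mapnik)
ax.set_title('All 311 Calls November 2020 San Francisco')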
In [ ]:
In [ ]: #
Out[ ]: Text(0.5, 1.0, 'All 311 Calls November 2020 San Francisco')
Out[ ]: <AxesSubplot: title={'center': 'All 311 Calls November 2020 San Francisco'}>
❔ In the above block of code, what does the "alpha" parameter control?
⛬
⌨ Before we can decide how to best filter our data down, let's take a look at all the unique values in the
"Category" field.
sf_311.Category.unique()
In [ ]: #
⌨ We can create a new geodataframe that only contains the Category 'Parking Enforcement' and call it
"parking_issues".
In [ ]: #
                                            Status_Notes  \
1      Case Resolved - Police Officer responded to re...
3      Case Resolved - Officer responded to request u...
4      Case Resolved - Officer responded to request u...
5      Case Resolved - Officer responded to request u...
6      Case Resolved - Police Officer responded to re...
... ...
47687 Case is a Duplicate - This issue has already b...
47688 Case Resolved - Officer responded to request u...
47689 Case Resolved - Officer responded to request u...
47737 Case Resolved - Police Officer responded to re...
47924 Case Resolved - Police Officer responded to re...
Responsible_Agency Category \
1 Parking Enforcement Dispatch Queue Parking Enforcement
3 Parking Enforcement Dispatch Queue Parking Enforcement
4 Parking Enforcement Dispatch Queue Parking Enforcement
5 Parking Enforcement Dispatch Queue Parking Enforcement
6 Parking Enforcement Dispatch Queue Parking Enforcement
... ... ...
47687 Parking Enforcement Dispatch Queue Parking Enforcement
47688 Parking Enforcement Dispatch Queue Parking Enforcement
47689 Parking Enforcement Dispatch Queue Parking Enforcement
47737 Parking Enforcement Dispatch Queue Parking Enforcement
47924 Parking Enforcement Dispatch Queue Parking Enforcement
Request_Type Request_Details \
1 Other_Illegal_Parking Parking Enforcement
3 Other_Illegal_Parking Blue - Chrysler - 5lrx746
4 Other_Illegal_Parking Blue - Honda - 8duj830
5 Other_Illegal_Parking Blue - Honda - Aw70c09
6 Other_Illegal_Parking Blue - Honda - Aw70c09
... ... ...
47687 Parking_on_Sidewalk silver - cadillac escalade - 7wxn339
47688 Parking_on_Sidewalk silver - cadillac escalade - 7wxn339
47689 Parking_on_Sidewalk silver - cadillac escalade - 7wxn339
47737 Other_Illegal_Parking White - Chevy van - None
47924 Other_Illegal_Parking Blue - Ford suburban - Aksj
Address Street \
1 57 INNES CT, SAN FRANCISCO, CA, 94124 INNES CT
3 51 INNES CT, SAN FRANCISCO, CA, 94124 INNES CT
4 51 INNES CT, SAN FRANCISCO, CA, 94124 INNES CT
5 51 INNES CT, SAN FRANCISCO, CA, 94124 INNES CT
6 51 INNES CT, SAN FRANCISCO, CA, 94124 INNES CT
... ... ...
47687 1380 LA PLAYA, SAN FRANCISCO, CA, 94122 LA PLAYA
47688 1380 LA PLAYA, SAN FRANCISCO, CA, 94122 LA PLAYA
47689 1380 LA PLAYA, SAN FRANCISCO, CA, 94122 LA PLAYA
47737 Intersection of 48TH AVE and FULTON ST 48TH AVE
47924 640 GREAT HWY, SAN FRANCISCO, CA, 94121 GREAT HWY
geometry
1 POINT (-13621842.872 4540942.599)
3 POINT (-13621935.635 4540982.659)
4 POINT (-13621936.748 4540981.687)
5 POINT (-13621942.125 4540967.506)
6 POINT (-13621942.537 4540979.113)
... ...
47687 POINT (-13637639.542 4545677.756)
47688 POINT (-13637639.542 4545677.756)
47689 POINT (-13637639.542 4545677.756)
47737 POINT (-13637675.843 4547178.350)
47924 POINT (-13637895.109 4547770.075)
⌨ Let's plot the parking issues only and then set the extent of our map to the total extent of the parking issues
geodataframe.
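A sketch mirroring the earlier plot, but bounded by the parking_issues extent (settings again illustrative; the title matches the output below):

ax = parking_issues.plot(figsize=(10, 10), markersize=2, alpha=0.2)
x_min, y_min, x_max, y_max = parking_issues.total_bounds
ax.set_xlim(x_min, x_max)
ax.set_ylim(y_min, y_max)
cx.add_basemap(ax, source=cx.providers.OpenStreetMap.Mapnik)
ax.set_title('Parking Related 311 Calls November 2020 San Francisco')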
In [ ]: #
Out[ ]: Text(0.5, 1.0, 'Parking Related 311 Calls November 2020 San Francisco')
Next, we'll do a spatial join between our 311 parking issues and San Francisco neighborhoods. But before we
can run that operation, let's set the projection of our sf_neighborhoods to match our 311 data.
sf_neighborhoods = sf_neighborhoods.to_crs('EPSG:3857')
In [ ]: #
Here we can calculate the count of parking issues by neighborhood and then add a field to our neighborhoods
data and call it "parking_incidents". Then examine the sf_neighborhoods geodataframe.
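One hedged way to get there, using a spatial join and mapping the counts back by neighborhood name (predicate= is the newer GeoPandas spelling; older versions use op=):

joined = gpd.sjoin(parking_issues, sf_neighborhoods, how='inner', predicate='within')
counts = joined.groupby('name').size()
sf_neighborhoods['parking_incidents'] = (
    sf_neighborhoods['name'].map(counts).fillna(0).astype(int))
sf_neighborhoods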
In [ ]: #
Out[ ]:
                                                  link                    name                                           geometry  parking_incidents
0    http://en.wikipedia.org/wiki/Sea_Cliff,_San_Fr...                Seacliff  POLYGON ((-13635909.066 4548889.167, -13635939...                 20
1                                                 None             Lake Street  POLYGON ((-13635207.246 4548926.811, -13635222...                 14
2                    http://www.nps.gov/prsf/index.htm  Presidio National Park  POLYGON ((-13634141.858 4552759.779, -13634091...                 15
3                                                 None        Presidio Terrace  POLYGON ((-13633566.376 4549428.413, -13633409...                  1
4    http://www.sfgate.com/neighborhoods/sf/innerri...          Inner Richmond  POLYGON ((-13633590.339 4549283.086, -13632945...                 13
112  http://en.wikipedia.org/wiki/Corona_Heights,_S...          Corona Heights  POLYGON ((-13629422.779 4545953.181, -13629437...                  5
113        http://en.wikipedia.org/wiki/Haight-Ashbury         Ashbury Heights  POLYGON ((-13631289.585 4545785.263, -13631305...                 91
114  http://en.wikipedia.org/wiki/Eureka_Valley,_Sa...           Eureka Valley  POLYGON ((-13629662.606 4545908.690, -13629628...                  0
115  http://en.wikipedia.org/wiki/St._Francis_Wood,...        St. Francis Wood  POLYGON ((-13633472.736 4542016.274, -13633109...                  9
116  http://en.wikipedia.org/wiki/Neighborhoods_in_...         Sherwood Forest  POLYGON ((-13632062.910 4542836.706, -13632047...                  0
Here we can plot all our 311 parking issues by neighborhood.
In [ ]: #
Out[ ]: (-13637895.109407913,
-13621842.872231368,
4538284.832547026,
4552426.083463726)
Turn in your iPython notebook once you have completed this exercise.
Introduction to arcpy
In this notebook, we'll start to explore using ArcGIS via the arcpy module, which will provide us access to all of
the geoprocessing capabilities of ArcGIS. We'll look at:
This exercise can either be run within ArcGIS Pro in a notebook window (where outputs will go to maps) or from an
IDE running Jupyter notebooks outside it (I recommend VS Code), as long as you're continuing to use a Python
kernel cloned from the ArcGIS Pro installation.
⌨ Importing arcpy is always going to be required, and is such an essential requirement that this step has
already been done for you in the notebook environment provided by ArcGIS Pro. If you're working in ArcGIS Pro,
it doesn't hurt to include it anyway just so your notebook will work in either place.
import arcpy
In [ ]:
⌨ First we'll create a shortcut to the environment settings object arcpy.env by simply shortening that to env ,
by either:
from arcpy import env
or
env = arcpy.env
In [ ]: #
⌨ As you know by now with working with ArcGIS, there are many environment settings, and we'll explore others
along the way. But one essential environment setting provides access to your data, which needs the location of
the workspace.
print(env.workspace)
In [ ]: #
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\arcpy.gdb
If you're running this from the Notebook interface in ArcGIS Pro, you're going to see the workspace associated
with the ArcGIS Pro project, probably a geodatabase. However, if you are running this in an IDE like VS-code,
you probably see None displayed. We will want to be able to work from either location, so we're going to need
to pay attention to how we store our data, which should also use relative paths, and work with the
env.workspace setting. The os module will help for that...
project folder
    hmb.gdb
        city
        elev
        geology
        ...
    pen.gdb
        cities
        faultcov_arc
        geol
        landusePen
        water
    other geodatabases / other data folders
    various notebook .ipynb files, like this one
    ...
If you've set things up right, the .ipynb files should be in the main project folder for an ArcGIS project. In my
case, this folder is Ex07_08_arcpy and the default workspace is arcpy.gdb , but yours may differ. It doesn't
really matter what the project folder is named or what its path is if we work with relative paths. But you should
have copied various geodatabases such as hmb.gdb , pen.gdb and marbles.gdb , and since they should be
at the same level, we can navigate to them by going up to the project folder and then down into a parallel
geodatabase.
We're going to need to have the path to the project folder to get our code working without a lot of manual editing.
One approach that would work if you are working in ArcGIS is to use os.path.dirname with the current
workspace to provide the folder that holds the geodatabase, and so that would be the project folder:
import os
proj = os.path.dirname(env.workspace)
However this isn't going to work for code run from Notebooks outside ArcGIS, so we need a method that works in
either location.
⌨ So instead we'll use another os method: os.getcwd() , which returns the path to the folder we started in,
either in the IDE or where we opened the .aprx (with a caveat; see the note below).
proj = os.getcwd()
proj
This mostly doesn't work if you're using ArcGIS Pro from the start menu, so I've learned to
always start Pro by opening an existing project by opening it from the .aprx in a Windows folder.
The exception is if you're creating a new project in Pro; this will generally provide the path you'd
expect.
In [ ]: #
Out[ ]: 'C:\\Users\\900008452\\Box\\course\\625\\exer\\Ex07_08_arcpy'
Before continuing, make sure that this shows the folder where this notebook is stored, which
should be where your geodatabases and other data are located, as an absolute path. Also, make
sure you understand where your files and data are stored (including their absolute paths) and
clearly understand how relative paths help you with this.
You'll also want to understand the various path delimiters like / , \ and \\ : remember that \
is an escape character, thus the need for using \\ to provide the \ that Windows tends to use.
We're going to want to deal with multiple workspaces, so we will need to learn some methods. Basically, the two
options are:
1. Set the workspace with env.workspace ; for instance env.workspace = proj + "\pen.gdb"
2. Use with to access the workspace just within a code block
I'm increasingly using the with method, since once you get used to it, it can simplify your code and help you
avoid making mistakes when it's set wrong. In general, environment settings can create problems (it's almost one
of the "Big Three" but that would make it the "Big Four"), so we should look at the with method right away.
We'll start with a common need: to access lists of data in folders, one of the most important methods in GIS
programming, since GIS is so data-centric.
ListFeatureClasses
ListRasters
ListFiles
ListWorkspaces
ListFields
ListTables
ListDatasets
Navigating through lists like these is the first type of operation where we can see Python coding helping us do
our work, in this case for managing our data, letting a script perform tedious repeated operations.
arcpy.ListFeatureClasses()
You should know by now to pay attention to capitalization. Everything but Windows filenames is case-sensitive.
Typically arcpy methods including geoprocessing functions will use "Camel case" where words in the middle of
a method name (like Feature and Classes ) are capitalized so the entire method name has humps like a
camel: ListFeatureClasses . (Sometimes the camel-case analogy is taken even further so you might have a
name looking like myCamelCase with no hump at the start, but that's not the case with arcpy methods.)
with arcpy.EnvManager(workspace='pen.gdb'):
arcpy.ListFeatureClasses()
In [ ]: #
Assuming you've set up your folders right, and the "pen.gdb" folder is in the folder where your
notebook .ipynb file is located, the above should display the list of feature classes. Now that we
have a list, we can loop through it to perform some operation on it, probably with the same with
structure, but that depends on what you're doing.
⌨ But we haven't looked at geoprocessing tools yet, so we'll just list the name and length (of the name) in a
for loop structure; later we'll use a loop structure to run geoprocessing tools on each feature class (or a selection)
in a given workspace:
with arcpy.EnvManager(workspace='pen.gdb'):
for f in arcpy.ListFeatureClasses():
print(f"name: {f} length: {len(f)}")
In [ ]: #
⌨ But we can use ListFeatureClasses to select those that are a particular type of geometry, like point, line
or polygon:
with arcpy.EnvManager(workspace='pen.gdb'):
for dtype in ["point", "line", "polygon"]:
print(dtype)
for f in arcpy.ListFeatureClasses(feature_type=dtype):
print(f"name: {f}")
In [ ]: #
point
line
name: faultcov_arc
polygon
name: cities
name: geol
name: water
name: landusePen
name: urban
⌨ Now let's look at the other method -- actually changing the workspace -- to "hmb.gdb" with env.workspace
= "hmb.gdb" and then try the same code as above to see those feature classes. We'll reset the workspace
back to the original at the end.
env.workspace = "hmb.gdb"
for dtype in ["point", "line", "polygon"]:
print(dtype)
for f in arcpy.ListFeatureClasses(feature_type=dtype):
print(f"name: {f}")
env.workspace = proj
In [ ]: #
point
name: pourpoint
name: landing
line
name: streams
name: faults
name: roads
polygon
name: areaclip
name: publands
name: geolstr200
name: geol
name: stbuff200
Sometimes you'll find that you want to actually change the workspace, for instance when you're
doing a lot of steps in that workspace. It all depends on code readability, which is generally
helped by having less code to look at. You'll need to decide which works best for your situation,
but we'll mostly make use of the with structure.
⌨ Now use ListRasters to see the list of rasters. In this case, we'll do this all on one line, but we'll start by
creating a shortcut ENV for the arcpy.EnvManager class; we can keep using that ENV shortcut
later to shorten our code.
ENV = arcpy.EnvManager
with ENV(workspace="hmb.gdb"): arcpy.ListRasters()
In [ ]: #
From here on, you'll need to remember how to handle the workspace setting, either by changing
and resetting it, or using with structures, which I recommend. I'll just provide you with any new
code you'll need.
Selecting a subset
⌨ We can select a subset based on the name, to display all of those starting with an e .
arcpy.ListRasters("e*")
Later we'll look at other properties of datasets with the Describe function, which might allow us to select just
features with a given property.
In [ ]: #
In [ ]: #
Workspace:
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\hmb.gdb
Rasters:
['watergrd', 'geology', 'landuse', 'pub', 'newdev', 'city', 'elev', 'elev30',
'numpyarraytoraster_f258ed9f_eec4_429b_8182_dbd92283d1c0_284017208', 'numpyar
raytoraster_dbbe317f_ddb6_440b_8a5e_36af55bdad7b_284017208', 'numpyarraytoras
ter_13b48e1f_9a2f_4694_b3b6_e22866798da1_2453623504', 'numpyarraytoraster_95f
a8d4c_0930_4ab2_a851_e38fbd3359cf_2453623504', 'numpyarraytoraster_445f664f_3
8d8_4a15_be0b_f770a0be3df1_401611712', 'numpyarraytoraster_fc5bc198_377a_4502
_ae67_77a6c4662b91_401611712', 'numpyarraytoraster_249bae79_4b16_4f0b_b510_93
f6befada75_2888410860', 'numpyarraytoraster_2d8b9707_cc96_47d1_92ad_c7a458a74
fee_2888410860', 'numpyarraytoraster_83003b86_b865_4c3b_a796_98bd8e4c52b3_288
8410860', 'numpyarraytoraster_0270248f_2469_4a08_b20a_0d192898984f_247041092
4', 'numpyarraytoraster_8cb26adc_892d_437c_a498_b9f3ef151698_2470410924', 'nu
mpyarraytoraster_d5ff6afb_d750_428f_bb6e_a6ad6c302c54_2470410924', 'steepurba
n', 'numpyarraytoraster_29fe2ea7_0b00_4e3f_a528_bcae4b549928_2385214608', 'hi
ghElev_np', 'numpyarraytoraster_3513d5c7_61af_488f_ba7d_c431c1bf698b_23852146
08', 'trimmed_elev']
Elevation rasters:
['elev', 'elev30']
FeatureClasses:
['streams', 'areaclip', 'faults', 'pourpoint', 'publands', 'roads', 'landin
g', 'geolstr200', 'geol', 'stbuff200']
List of fields
⌨ The following code creates a list of fields from a data table. Look up ListFields in the help system so you can
understand why, while it creates no error, printing the field object might not be that useful. Consider what the list
is composed of, in contrast to what we found for feature classes and rasters.
In [ ]: #
Fields are complex objects, not simple strings. Since ListFields creates a list of objects rather
than the easy-to-print text strings returned by all of the other List functions, printing one doesn't
create an error, but it also can't be displayed usefully like the lists above. Also note that this
complex object is assigned to the variable fld.
⌨ Fix the last line by adding the property .name to the fld object and try it again. Then replace the print
statement with: print("{} is type {} with length {}".format(fld.name, fld.type, fld.length))
In [ ]: #
Alternative method to get a list of fields, using list comprehension, with the syntax: [expression for item in list if condition]
So for our example, we don't need an if condition but we can do it this way:
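A sketch of that comprehension, using the streams feature class:

[fld.name for fld in arcpy.ListFields("streams")]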
In [ ]: #
List of workspaces
⌨ Our project folder has multiple workspaces. Let's use .ListWorkspaces to display a list, but let's do more than
that and navigate through all of them and list each feature class of a given data type and each raster. We'll limit
the type of workspace to "FileGDB".
for ws in arcpy.ListWorkspaces("*","FileGDB"):
    print(ws)
    with ENV(workspace = ws):
        if arcpy.ListFeatureClasses():
            print("There are {} FeatureClasses:".format(len(arcpy.ListFeatureClasses())))
            for dtype in ["point", "line", "polygon"]:
                fs = arcpy.ListFeatureClasses(feature_type=dtype)
                if fs:
                    print("\t{}".format(dtype))
                    for f in fs:
                        print("\t\t{}".format(f))
        rass = arcpy.ListRasters()
        if rass:
            print("There are {} Rasters:".format(len(rass)))
            for ras in rass:
                print("\t{}".format(ras))
In [ ]: #
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\arcpy.gdb
There are 1 FeatureClasses:
point
samplePts_MeanCenter
There are 5 Rasters:
Extract_newd1
Extract_pub1
numpyarraytoraster_c6525966_f1c5_431b_b716_12ff456bdbc6_1806419640
numpyarraytoraster_254a82f0_c9f2_4f87_ab4b_2f25d21acad4_1787618684
numpyarraytoraster_e0f844ec_c5ee_44b1_af76_37c43839e3a8_1787618684
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\bozo.gdb
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\hmb.gdb
There are 10 FeatureClasses:
point
pourpoint
landing
line
streams
faults
roads
polygon
areaclip
publands
geolstr200
geol
stbuff200
There are 25 Rasters:
watergrd
geology
landuse
pub
newdev
city
elev
elev30
numpyarraytoraster_f258ed9f_eec4_429b_8182_dbd92283d1c0_284017208
numpyarraytoraster_dbbe317f_ddb6_440b_8a5e_36af55bdad7b_284017208
numpyarraytoraster_13b48e1f_9a2f_4694_b3b6_e22866798da1_2453623504
numpyarraytoraster_95fa8d4c_0930_4ab2_a851_e38fbd3359cf_2453623504
numpyarraytoraster_445f664f_38d8_4a15_be0b_f770a0be3df1_401611712
numpyarraytoraster_fc5bc198_377a_4502_ae67_77a6c4662b91_401611712
numpyarraytoraster_249bae79_4b16_4f0b_b510_93f6befada75_2888410860
numpyarraytoraster_2d8b9707_cc96_47d1_92ad_c7a458a74fee_2888410860
numpyarraytoraster_83003b86_b865_4c3b_a796_98bd8e4c52b3_2888410860
numpyarraytoraster_0270248f_2469_4a08_b20a_0d192898984f_2470410924
numpyarraytoraster_8cb26adc_892d_437c_a498_b9f3ef151698_2470410924
numpyarraytoraster_d5ff6afb_d750_428f_bb6e_a6ad6c302c54_2470410924
steepurban
numpyarraytoraster_29fe2ea7_0b00_4e3f_a528_bcae4b549928_2385214608
highElev_np
numpyarraytoraster_3513d5c7_61af_488f_ba7d_c431c1bf698b_2385214608
trimmed_elev
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\HMBcity.gdb
There are 2 Rasters:
elev
elev30
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\marbles.gdb
There are 11 FeatureClasses:
point
co2july95
samples
marblePts
line
streams
trails
contours10m
cont
polygon
geology
veg
water
watrshed
There are 3 Rasters:
elev
elev10
geolgrd
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\pen.gdb
There are 6 FeatureClasses:
line
faultcov_arc
polygon
cities
geol
water
landusePen
urban
There are 1 Rasters:
landusePenras
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\SF.gdb
There are 3 FeatureClasses:
point
SF_Schools
BA_TransitStops
line
BA_BikeRoutes
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\testPath.gdb
➦ Go to the help for the Buffer and Clip tools of the Analysis Toolbox, and then look at their scripting syntax.
Note that the syntax gives each an alias that identifies which toolbox it comes from, in this case analysis.
🌎 Make sure you have a map open and available to display results along the way.
Before we use these tools, let's learn about some shortcuts and other ways to get help.
Toolbox shortcuts
Each of the geoprocessing toolboxes has an alias, which is useful since they become part of the name of the
tool. This has been needed since tools in different toolboxes may have the same name (though ArcGIS is moving
away from that). So the full name of a tool includes its alias; for example, the Clip tool in the Analysis toolbox is
called with Clip_analysis or more fully arcpy.Clip_analysis , and Slope in the Spatial Analyst toolbox
(alias sa ) is called as arcpy.Slope_sa . We can shorten the alias even further by creating variable shortcuts
with unrequired but standard codes, defining them as, for instance, from arcpy import analysis as AN .
Here are some of the toolbox aliases and codes:
Toolbox       Alias        Code
Analysis      analysis     AN
Conversion    conversion   CV
AN.Clip()
For Spatial Analyst, we can make the nice Map Algebra syntax work by importing everything, so we don't have to
use a shortcut:
from arcpy.sa import *
⌨ Write some boilerplate that creates all of these toolbox shortcuts, and also imports arcpy and os , creates
the shortcut to arcpy.env as env , and sets env.overwriteOutput = True . Then assign a new variable
proj as os.getcwd() -- this will be useful in specifying the path to the project folder. And finally set the
workspace to hmb.gdb .
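A sketch of such boilerplate, using the alias codes above plus DM for data management (a code this exercise uses later):

import arcpy, os
from arcpy import env
from arcpy import analysis as AN
from arcpy import conversion as CV
from arcpy import management as DM
from arcpy.sa import *          # enables Map Algebra syntax
env.overwriteOutput = True
proj = os.getcwd()              # path to the project folder
env.workspace = "hmb.gdb"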
In [ ]: #
In [ ]: env.workspace
Out[ ]: 'hmb.gdb'
Note that in the above code, we just set the workspace, since we are only working in that one
place, so that makes for less indented code. It would also work by wrapping everything in a
with block. Your choice.
In [ ]:
INPUTS:
 in_features (Feature Layer / Scene Layer / Building Scene Layer / File):
     The features that will be clipped.
 clip_features (Feature Layer):
     The features that will be used to clip the input features.
 cluster_tolerance {Linear Unit}:
     The minimum distance separating all feature coordinates as well as the
     distance a coordinate can move in x or y (or both). Set the value
     higher for data with less coordinate accuracy and lower for data with
     extremely high accuracy. Changing this parameter's value may cause
     failure or unexpected results. It is recommended that you do not
     modify this parameter. It has been removed from view on the tool
     dialog box. By default, the input feature class's spatial reference
     x,y tolerance property is used.
OUTPUTS:
 out_feature_class (Feature Class / File):
     The dataset that will be created.
⌨ Note the parameter names, and call the Clip function with in_features set to proj + "/pen.gdb/geol" ,
clip_features to "areaclip" , and out_feature_class to "geol" .
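A sketch of the call; msg just captures the tool's messages, as explained below:

msg = AN.Clip(in_features=proj + "/pen.gdb/geol",
              clip_features="areaclip",
              out_feature_class="geol")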
In [ ]: #
What's the msg = part for? It's simply to avoid the code chunk output for displaying the word
"Messages". All geoprocessing tools, like Clip , are objects that have a value, and normally that
value is a set of messages of how the tool ran. Sometimes you want to see these, to help debug
a problem that doesn't cause the tool to fail, what's called a "logic error". But when everything is
working fine, we can avoid seeing the "Messages" header displayed by simply assigning the tool
result to a variable we'll just call msg . Don't confuse these with tool outputs.
It's fine to not use the msg = if you don't mind seeing "Messages" (and I won't include that trick
in code suggestions later in this notebook), but later on we'll be running code that runs a lot of
tools, so we'll see a lot of "Messages" printed out unless we use this trick.
🌎 Now see what you got on the map (this is where it's handy to be in ArcGIS Pro), and make sure you
understand what was done, what was named and where inputs were accessed and outputs were stored.
⌨ Now do the same thing for AN.Buffer -- use help to check the parameter names etc. -- and run it to create an
output feature class "stbuff200" with "streams" (in hmb.gdb) as input features and a buffer distance of 200. Check
the naming of parameters carefully. 🌎 Then check the output on the map.
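A sketch with explicit parameter names (buffer_distance_or_field is the tool's third parameter):

msg = AN.Buffer(in_features="streams",
                out_feature_class="stbuff200",
                buffer_distance_or_field=200)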
In [ ]: #
⌨ In ArcGIS Pro, add 'elev', 'landuse', and 'geology' rasters from hmb to a new map. We'll be accessing these
rasters by name and start by assigning them to new raster objects.
elev = Raster('elev')
landuse = Raster('landuse')
geology = Raster('geology')
elev
Note that only the last object provided in the code cell is displayed. This is similar to what we've seen before with
dataframes. This is telling us that code cells work somewhat like functions that return one item.
In [ ]: #
Out[ ]:
⌨ Use a series of map algebra statements to end with a steepurban raster object that represents slopes > 10
and landuse < 20 (urban):
Notes:
The landuse raster required assigning to a raster object before we could query it essentially with landuse
< 20 , so this required Raster("landuse") < 20 . This may be surprising, since we didn't have to make a
raster object out of elev before deriving the slope of it; that's because the Slope tool is expecting an
elevation raster, so it converts it for you.
Raster objects are retained in memory, but not retained in your workspace, requiring you to save them to
have them available later without having to run the code again.
You should see a result displayed only after you've saved the result as shown above. The temporary rasters
steep and urban won't similarly display, since we haven't saved them.
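A sketch of one such sequence, reflecting the notes above (the intermediate names steep and urban come from the notes):

steep = Slope(elev) > 10        # Slope converts the input for us
urban = landuse < 20            # landuse is already a Raster object (assigned above)
steepurban = steep & urban      # combine the two conditions
steepurban.save("steepurban")   # save so the result persists and displays
steepurban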
In [ ]: #
Out[ ]:
We'll try a couple of operations: one not using the crs but creating a graph from the data; the other going both ways --
from ArcGIS raster object to ndarray and back again -- and seeing how we maintain the crs.
⌨ Import numpy as np, and then create a 2D ndarray from the elevation raster object with:
elev2D = arcpy.RasterToNumPyArray(elev)
Ok, now we have a 2D ndarray. We can make a histogram out of it, but this will require making it a 1D ndarray,
which I only discovered by trying to create a histogram from the original 2D ndarray. (You might try this to see
what you get.) The conversion to a 1D array uses the numpy .reshape method, but we don't want to have to figure
out the size, so the size parameter is set to -1 to use the total number of cells. (You could also use .flatten() .)
elev1D = np.reshape(elev2D,-1)
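A sketch of the histogram step; it assumes matplotlib, and the bin count is an arbitrary choice:

import matplotlib.pyplot as plt
plt.hist(elev1D, bins=50)
plt.title("elev cell values")
plt.show()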
In [ ]: #
In [ ]: #
Converting there and back again, using np map algebra along the way
First we need to get the spatial reference, lower left corner and cell size from elev :
sr = elev.spatialReference
lowleft = elev.extent.lowerLeft
cellsize = elev.meanCellHeight
In [ ]: #
Out[ ]: 60.0
elev2D = arcpy.RasterToNumPyArray(elev)
In [ ]: #
In [ ]: #
Then we'll do some map algebra, but to the ndarray with numpy:
highthreshold = 300
highElev_np = elev2D > highthreshold
In [ ]: #
fig = plt.figure()
plt.imshow(highElev_np)
plt.colorbar()
plt.title("high elev (elev > {})".format(highthreshold))
plt.show()
In [ ]: #
So it looks like this worked fine. The 1s represent True, and 0s False. Interestingly, however, these are stored as
True and False in the ndarray, though the legend shows them as numeric.
We could have just done this operation in ArcGIS, but the process of there and back again is the same, so it will
serve as an example. To bring this back into ArcGIS, however, this boolean format is not recognized, so I figured
out a trick of converting the Trues and Falses to 1s and 0s.
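A hedged guess at that trick, casting the booleans to integers (the name highElev_int is hypothetical):

highElev_int = highElev_np.astype(int)   # True/False become 1/0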
In [ ]: #
Then we can bring it back in and apply the various environment settings we grabbed earlier:
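A sketch using the sr, lowleft, and cellsize values saved earlier; highElev0 matches the name used below, and applying the spatial reference with DefineProjection is an assumption:

highElev0 = arcpy.NumPyArrayToRaster(highElev_int, lowleft, cellsize, cellsize)
DM.DefineProjection(highElev0, sr)   # apply the spatial reference we grabbed from elev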
In [ ]: #
For some reason, we need to make the result discrete, so the following works for that:
highElev = Con(highElev0==1,1,0)
highElev.save("highElev_np")
highElev
In [ ]: #
Out[ ]:
Here's a simpler example that doesn't need unique values, and serves to represent a process involving
continuous numerical data:
In [ ]: #
Out[ ]:
⌨ Start by running the boilerplate we used in the last notebook with module imports and shortcuts. Also initially
set the workspace to hmb.gdb .
In [ ]: #
Describe
➦ Start by exploring the help system to get a sense of the scope of ArcGIS objects you can get information
about. Googling ArcGIS Pro Describe should get you to somewhere like this (you may want to change to
the version of ArcGIS Pro we're using, though there won't be many differences in Describe):
https://pro.arcgis.com/en/pro-app/latest/arcpy/functions/describe.htm
In contrast to geoprocessing tools, where we can use help() from a code cell to get what we
need to run the tool, help(arcpy.Describe) doesn't provide us with much help because what
we need to know about are the objects we want to describe and what their properties are. The
link to the help system is thus the best way we have to access this information. It's similar to
environment settings. For both objects and the environment there are many settings, though I'm
usually just looking for a few of them, like the extent (which is part of the environment and
associated with objects like feature classes and rasters) and its various parts, like XMin, etc.
There you should be able to find your way to explore properties of various data types that will be useful for us:
Dataset
FeatureClass
File
Folder
Layer
Raster Band
Raster Dataset
Table
and lots of others. You should use this resource as we're learning about various Describe properties. We'll start
with a problem needing to get the properties of a raster dataset in order to be able to trim it.
In [ ]: #
➦ Go to the same Describe help page we were just looking at, and go to the Raster Dataset properties. (There is
no set of properties for anything called simply a "raster", so Raster Dataset seems the closest.)
bandCount
compressionType
format
permanent
sensorType
These all look useful, but aren't what we're looking for. But we see in the heading of the Raster Dataset
properties section a message that says that Dataset properties and Raster Band properties are also supported.
Datasets include rasters and feature classes, and both have an extent (something that is also apparent if you
go to the Geoprocessing Environments dialogs in ArcGIS Pro.)
⌨ We can access the extent property for any data set, including rasters, via the Describe object. We'll start
by assigning that object to a variable:
dsc = arcpy.Describe("elev")
In [ ]: #
Out[ ]:
catalogPath C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy\hmb.gdb\elev
dataType RasterDataset
bandCount 1
format FGDBR
fields
spatialReference
spatialReference.GCS
⌨ You should see displayed nothing of clearly immediate use, simply that it's a "geoprocessing describe data
object", however we can go one step further by first referring to the Describe help system and seeing that
extent is one of them, so we'll create another variable ext and then display its value with:
ext = dsc.extent
ext
In [ ]: #
Out[ ]:
XMin (Left) 545692.537124
spatialReference
In [ ]: #
... and if we don't want to use the variables we could get them with:
arcpy.Describe("elev").extent.XMin
etc., although that's slower since it has to call Describe for each of the values we're looking for.
In [ ]: #
There are similar ways of accessing other Describe properties, and this will be very useful for
working with data in arcpy. Extent properties just happen to be one that I need the most. In the
following code we need to write, we'll use these four extent variables ext.XMin etc. to trim a
raster.
⌨ We'll do the above first. It should be pretty easy to understand as well as code. End the code cell by
displaying rect to see if the numbers and format looks right.
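A sketch of building the rect string; the inset amount (here 100 m per side) is an arbitrary illustration of trimming, not the original's value:

ext = arcpy.Describe("elev").extent
rect = f"{ext.XMin + 100} {ext.YMin + 100} {ext.XMax - 100} {ext.YMax - 100}"
rect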
In [ ]: #
In [ ]: #
Out[ ]:
⌨ Finally use the rect string to Clip the elev raster to that extent, producing "trimmed_elev" , and use
Raster() to create a raster object to display it (no need to save the raster object -- we're done with it, so we just
want to display it). Use help() with DM.Clip to get the parameters. The first three parameters are all you'll need,
but use the explicit method for clarity.
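A sketch with the first three parameters named explicitly:

DM.Clip(in_raster="elev", rectangle=rect, out_raster="trimmed_elev")
Raster("trimmed_elev")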
⛬ Interpret what the above is doing, including comparing with what you see with Raster("elev") .
In [ ]: #
Out[ ]:
⌨ Still with the workspace set to hmb.gdb , see what you get with:
arcpy.Describe("streams").DataType
arcpy.Describe("landing").ShapeType
arcpy.Describe("landuse").DataType
arcpy.Describe("geology").DataType
In [ ]: #
FeatureClass
Point
RasterDataset
RasterDataset
⛬ Interpret what the above is showing us. Can you see how it might be useful? Remember how a program
needs eyes.
Create a new map from the image to see what it looks like. You can change which bands display as red,
green or blue using Symbology settings. You should also be able to understand what is meant by
"landsatHMB201707.tif/Band_1" below from exploring the data.
Use Describe to see how many bands each has and what the cell size is, and fill in the blanks below. The
key Describe properties are .bandCount , a property of the image itself, and .meanCellWidth or
.meanCellHeight , properties of an individual band. For instance, to get the cell size of the Landsat
imagery you'll use:
arcpy.Describe("imagery/landsatHMB201707.tif/Band_1").meanCellWidth
arcpy.Describe("landsatHMB201707.tif").bandCount
In [ ]: #
30.0
7
Then display sr to see what spatial reference we assigned, and use arcpy.ListFeatureClasses() to
confirm it got created.
In [ ]: #
Out[ ]:
name (Projected Coordinate System) NAD_1983_UTM_Zone_10N
In [ ]: #
Exists
An even more basic piece of information about a dataset is whether it exists or not. Testing for the existence of a
dataset can also help us avoid errors, since setting overwriteOutput to True (1) doesn't work for some tools (or at
least hasn't always worked in the past). A very useful technique is to detect the existence of a particular dataset
using arcpy.Exists , which can be used to detect any type of data.
if arcpy.Exists("empty"): DM.Delete("empty")
In [ ]: #
if arcpy.Exists("stbuff200"): DM.Delete("stbuff200")
AN.Buffer("streams", "stbuff200", 200)
Write code that does the above (in the hmb.gdb workspace), and confirm that it works by running it a couple of
times. Include some statements that display the feature classes ( print(arcpy.ListFeatureClasses()) )
after deleting and then after creating it anew.
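A sketch of that sequence:

if arcpy.Exists("stbuff200"):
    DM.Delete("stbuff200")
print(arcpy.ListFeatureClasses())   # after deleting
AN.Buffer("streams", "stbuff200", 200)
print(arcpy.ListFeatureClasses())   # after creating it anew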
In [ ]: #
stbuff200 exists
In [ ]: #
In [ ]: #
if not arcpy.ListFields("streams","stream_class"):
DM.AddField('streams',"stream_class", 'LONG')
We'll use it later in the Data Management and Cursors section to first check whether a field exists
before we try to create it.
In [ ]: #
OBJECTID
Shape
FNODE_
TNODE_
LPOLY_
RPOLY_
LENGTH
STREAMS_
STREAMS_ID
ST_CODE
Shape_Length
stream_class
The approach we just used was to only create the field if it doesn't exist. What if we wanted to
delete it if it does exist? To avoid errors, we'll similarly need to confirm it exists before we delete
it.
⌨ Modify the above code to create the stream_class field after first deleting it if exists. The basic usage with
explicit parameter names is DM.DeleteField(in_table, drop_field) . To confirm that it's working, use the
for loop twice: after deleting it and then after creating it anew.
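A sketch of the modified code:

if arcpy.ListFields("streams", "stream_class"):
    DM.DeleteField(in_table="streams", drop_field="stream_class")
for fld in arcpy.ListFields("streams"):
    print(fld.name)                       # list after deleting
DM.AddField("streams", "stream_class", "LONG")
for fld in arcpy.ListFields("streams"):
    print(fld.name)                       # list after creating it anew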
In [ ]: #
OBJECTID
Shape
FNODE_
TNODE_
LPOLY_
RPOLY_
LENGTH
STREAMS_
STREAMS_ID
ST_CODE
Shape_Length
OBJECTID
Shape
FNODE_
TNODE_
LPOLY_
RPOLY_
LENGTH
STREAMS_
STREAMS_ID
ST_CODE
Shape_Length
stream_class
⌨ We'll start by creating a new geodatabase HMBcity.gdb in the project folder, so parallel to our current
hmb.gdb . For this we can use the path to the project folder stored as the variable proj and also define a
newPath variable which holds the path to the new geodatabase, and delete it first if it already exists before
creating a new one.
newWS = "HMBcity.gdb"
proj = os.getcwd()
newPath = proj + "/" + newWS
if arcpy.Exists(newPath):
DM.Delete(newPath)
DM.CreateFileGDB(proj, newWS)
Note that in this process, we'll keep hmb.gdb as our workspace, and we'll then reference
newPath for where we want to send the outputs.
In [ ]: #
⌨ Now we'll loop through a list of rasters created with the ListRasters method, assigning each to ras (hint:
for ras in ...), and use the ExtractByMask tool to clip, storing the clipped city-area raster in the HMBcity.gdb
geodatabase.
Inside the for loop, use an if structure to only process the single band rasters if
arcpy.Describe(ras).bandCount == 1: and in the if structure:
ExtractByMask using input ras and mask raster "city" and assigning to newras
Save newras using the output name derived as outputname = newPath + "/" + ras . To save time,
just loop through arcpy.ListRasters("e*") .
In the loop, print the raster name that is being processed so you can see its progress. A sketch follows this list.
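A sketch of the loop (ExtractByMask comes from arcpy.sa, imported above):

for ras in arcpy.ListRasters("e*"):
    if arcpy.Describe(ras).bandCount == 1:   # only process single-band rasters
        print(ras)
        newras = ExtractByMask(ras, "city")
        outputname = newPath + "/" + ras
        newras.save(outputname)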
In [ ]: #
elev
elev30
In [ ]: #
Hints:
You can borrow a lot of the logic used in the previous script to create the parallel workspace. It's simpler
because there aren't bands.
You'll want to have: from arcpy import conversion as CV
Since you'll be using it 3 times and it has to be the same, set a string variable to hold the folder name:
shapefolder = "penshapes" and you can build its path as shapesPath = proj + "/" +
shapefolder , and start by deleting it if it already exists.
In [ ]: #
In [ ]: #
C:\Users\900008452\Box\course\625\exer\Ex07_08_arcpy/penshapes
⌨ Continuing...
Use CreateFolder to create the penshapes folder. (We won't use a geodatabase because our goal is to
create shapefiles.)
Look up the usage for CreateFolder -- it needs both a folder to put the folder in and the name of
the folder you want to create.
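A sketch of the folder setup described in the hints:

shapefolder = "penshapes"
shapesPath = proj + "/" + shapefolder
if arcpy.Exists(shapesPath):
    DM.Delete(shapesPath)
DM.CreateFolder(proj, shapefolder)   # folder to put it in, then the new folder's name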
In [ ]: #
⌨ Continuing...
In your loop through the feature classes, you'll need a tool from the conversion toolbox:
FeatureClassToShapefile and the output folder would be the shapesPath created above.
If you store a feature class into a folder, it will be created as a shapefile.
Note that since the output is specified as the folder name, you don't need to specify the shapefile
name as the output, and it will simply be named the same as the original.
As before, check to see what you get by setting the workspace to shapesPath and printing
the list of feature classes (shapefiles are also feature classes).
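A sketch of the conversion; FeatureClassToShapefile accepts a list of inputs and the output folder:

with arcpy.EnvManager(workspace=proj + "/pen.gdb"):
    fcs = arcpy.ListFeatureClasses()
    CV.FeatureClassToShapefile(fcs, shapesPath)
with arcpy.EnvManager(workspace=shapesPath):
    print(arcpy.ListFeatureClasses())   # shapefiles are also feature classes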
In [ ]:
Out[ ]: 'C:\\Users\\900008452\\Box\\course\\625\\exer\\Ex07_08_arcpy\\hmb.gdb'
In [ ]: #
Feature to raster
To create rasters from feature classes, you need to specify a field to get the Value to assign to the raster.
⌨ To do this, create two Python lists to hold the dataset pairs: geol with TYPE_ID , and landusePen with
LU_CODE , in pen.gdb.
Then convert each feature class in pen.gdb to a raster (to store back in the same gdb), with the corresponding
field, and use 60 as the cell size. Hint: to connect the indices for the feature class and the field, you'll want to
loop through the indices, as in for i in range(len(feat)) ; a sketch follows.
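A sketch of the paired-list loop (the try...except example later in this notebook uses the same pattern):

feat = ["geol", "landusePen"]
fld = ["TYPE_ID", "LU_CODE"]
with arcpy.EnvManager(workspace=proj + "/pen.gdb"):
    for i in range(len(feat)):
        CV.FeatureToRaster(feat[i], fld[i], feat[i] + "ras", 60)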
In [ ]: #
Splitting cells
A useful way of debugging in Jupyter is to break up your code into multiple code cells. This is also a good idea
for providing more documentation to your code so you will remember what it's doing and for anyone who reads
your code to understand it. You can either write your code this way from the beginning or split it by inserting your
cursor where you want it split and using Edit/Split Cell .
⌨ Copy one of your fully working code cells here, and then split it into code cells at likely locations.
In [ ]:
Print statements
The tried and true method used by programmers since the dawn of time: simply printing the current value of
variables, or printing information about the progress of the script. Simply add a print statement to tell the user
where the program has gotten to, and once you're done debugging, you can leave it as a comment statement so
you'll remember later on what you're doing. Some good places to put print statements include:
at the end of the code cell, where you've arrived at a product of that cell. This is a lot like a function, when
you have one final result.
during each step of a loop, where a print statement can show progress. If an operation takes a lot of time,
this can show you that it's still running (and isn't hung up) and give you a sense of how much time it'll take to
complete (you could probably also include some code to derive an estimated time remaining to complete the
task).
You could keep print statements in your code but just turn them off with a #, as shown here:
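A minimal stand-in example of the idea:

total = 0
for ras in arcpy.ListRasters():
    total += 1
    # print(f"processing {ras}")   # debug print, disabled with a #
print(f"{total} rasters processed")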
⌨ For one of the programs you've already run, add print statements to tell the user what the program is doing,
especially during a loop.
In [ ]:
try...except blocks
One way of handling errors is to provide some code to run in case an "exception" (error) is raised. This allows
you to put things back in order and also display any messages that the offending code produced. A try...except
block can go anywhere in your code, and code within the try block is run until an exception is raised where it
shifts execution to the except block, which only runs if an exception is raised.
⌨ For your feature to raster code created earlier, copy it into the next code cell. Then insert a try: statement
shortly before the code where you're creating the feature and field codes, and then at the end of the section to
evaluate (where the exception might be raised), insert an except: statement. Indent all of the code in between,
as well as the code to run if the exception is raised. Make sure to create an exception with a typo -- a misspelling
of the field name perhaps.
Here's what that part of the code might look like; note that LU-CODE is typed instead of LU_CODE ,
the correct field name in the attribute table:
try:
feat = ["geol", "landusePen"]
fld = ["TYPE_ID", "LU-CODE"]
env.workspace = projdir + "\pen.gdb"
for i in range(len(feat)):
CV.FeatureToRaster(feat[i], fld[i], feat[i] + "ras", 60)
except:
print(arcpy.GetMessages())
Note that what's in the exception section is printing arcpy.GetMessages() which is what is produced from the
most recently executed geoprocessing function. You'll find that the message displayed is very clear about what
the problem is.
In [ ]: #
➦ Explore the GetMessages section of the help system. You'll find for instance that different types of messages
can be selected to display, like warnings, general messages and errors; by default, all messages are displayed.
You may also see .AddMessage which doesn't really do anything in Jupyter, but is useful when running script
tools and provides information to the user when they run the script tool (which print() does not).
Data management
Processing data in tables, creating fields to store those data, and other data operations are important in GIS work. A script is a good choice when you need to perform a sequence of data management and analysis steps involving data tables. In some cases we need to process data fields, and there is an array of tools we can use to, for example, add fields, calculate values for fields, delete fields, and join tables to bring in additional data fields via a relate field. Some of these tools also create new summary tables, where input data fields are summarized using various statistics. In other cases we need to work with rows of data, which might be individual features for vector data or values for raster data; we'll learn about using cursors to process rows one at a time.
Toolbox           Tool            Description                                                        Output
Analysis          Frequency       Calculates frequency statistics for field(s) in the input table    table
Analysis          Statistics      Calculates summary statistics for field(s) in the input table      table
Data Management   AddField        Adds a field to a data table                                       field in existing table
Data Management   CalculateField  Calculates a value for a field using an expression                 values in an existing field
Data Management   CopyFeatures    Copies selected features to a new feature class                    new feature class
Start with your boilerplate including working with Spatial Analyst and the DM shortcut to the data
management toolbox. As before, save the script in the project folder and use the relative path method.
To avoid creating a problem due to inconsistent names, before running the statements that need them,
assign names of fields and datasets to string variables. This is good practice (assign hard-coded values only
once in a script) and also prepares you for getting these as inputs in a script tool later on. Here are some
you'll want to use:
contourfcl = "contours10m" (for the contour feature class)
elevras = Raster("elev") (for the elevation raster)
elevfeet = "ContourFeet" (for the field name)
Use Exists and Delete to delete contourfcl if it already exists.
Create the contour feature class using the Contour tool in Spatial Analyst. Note that since this creates a feature class, not a raster object, it doesn't work in map algebra. However, since you've imported everything (*) from arcpy.sa, you can run the tool with some shorthand as: Contour(elevras, contourfcl, 10) to create 10 m contours.
To display the resulting field names with one line of code, you might use something like: print([fld.name for fld in arcpy.ListFields(contourfcl)])
Add the elevfeet field as type "DOUBLE" after first making sure it doesn't already exist, with if not arcpy.ListFields(contourfcl, elevfeet):
Use CalculateField from the management toolbox to assign "!Contour! / 0.3048" to the elevfeet field.
Finally check out the results in ArcGIS. (A consolidated sketch of these steps follows below.)
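Here's one way the whole sequence might look; a minimal sketch assuming the boilerplate and names above, and that the Spatial Analyst extension is available:
import arcpy, os
from arcpy import env
from arcpy import management as DM
from arcpy.sa import *

projdir = os.getcwd()
env.workspace = projdir            # relative-path method

# Assign hard-coded values only once
contourfcl = "contours10m"
elevras = Raster("elev")
elevfeet = "ContourFeet"

if arcpy.Exists(contourfcl):
    DM.Delete(contourfcl)

Contour(elevras, contourfcl, 10)   # 10 m contours from the elevation raster
print([fld.name for fld in arcpy.ListFields(contourfcl)])

if not arcpy.ListFields(contourfcl, elevfeet):
    DM.AddField(contourfcl, elevfeet, "DOUBLE")
DM.CalculateField(contourfcl, elevfeet, "!Contour! / 0.3048")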
In [ ]: #
In [ ]: #
Note the use of !Contour! to represent the Contour field. This is required in this type of
expression.
AddXY
⌨ Create code that uses the AddXY tool in the management toolbox (features toolset) to add x & y values to
"samples" in the marbles workspace.
In [ ]: #
Allison has poor distance vision but we've found from independent observers that there are actually twice as
many whales visible as she reports.
Sydney exaggerates counts, and from independent observers we've found that it's best to reduce his counts
by half.
First we'll give you a taste of how we might use cursors to print an attribute table as a pandas dataframe:
In [ ]: import arcpy, os
import pandas as pd

proj = os.getcwd()
arcpy.env.workspace = proj

def table(tbl):
    # collect the field names
    flds = []
    for fld in arcpy.ListFields(tbl):
        flds.append(fld.name)
    # collect the rows of data
    data = []
    cur = arcpy.SearchCursor(tbl)     # old-style cursor; field values are read with getValue()
    for row in cur:
        rowList = []
        for fld in flds:
            rowList.append(row.getValue(fld))
        data.append(rowList)
    del cur
    df = pd.DataFrame(data, columns=flds)
    return df
So we'd like our code to multiply Allison's counts by 2, Sydney's by 0.5, and leave the rest the same -- so multiply
by 1. Our function is able to detect these conditions and come up with the adjusted count, stored in a new field in
the attribute table. See if you can correctly interpret it.
whalesWS = "whalesGGate"
if not arcpy.Exists(whalesWS):
    arcpy.CreateFolder_management(proj, whalesWS)
else:
    print(whalesWS + " exists.")
In [ ]: #
whalesGGate exists.
Out[ ]:
     FID  Shape                                Observer  Year  Month  Day  Hour  Minute   Latitude    Longitude
0      0  (-122.493592 37.794274 NaN)          Allison   2019      4    2    12      45  37.794274  -122.493592
1      1  (-122.459953 37.830697 NaN)          Allison   2019      5    1    11      30  37.830697  -122.459953
2      2  (-122.599558 37.81561 NaN)           Allison   2019      5    4    12      30  37.815610  -122.599558
3      3  (-122.604379 37.797698 NaN)          Allison   2019      5    4    15      30  37.797698  -122.604379
4      4  (-122.527239 37.7292360000001 NaN)   Allison   2019      5   14    11      30  37.729236  -122.527239
..   ...  ...                                  ...        ...    ...  ...   ...     ...        ...          ...
172  172  (-122.515109 37.8055350000001 NaN)   Allison   2019     10   13     8      30  37.805535  -122.515109
173  173  (-122.809715 37.6611310000001 NaN)   Allison   2019     10   13    10       0  37.661131  -122.809715
174  174  (-122.947331 37.730912 NaN)          Allison   2019     10   13     2       0  37.730912  -122.947331
175  175  (-122.620435 37.782788 NaN)          Allison   2019     11    9    12      30  37.782788  -122.620435
176  176  (-122.569333 37.768944 NaN)          Allison   2019     11    9     1       0  37.768944  -122.569333
Select
Selecting by attributes is a common need.
⌨ We'll want to set the workspace to "pen.gdb" , and use arcpy.ListFeatureClasses() to see what it
contains.
In [ ]: #
⌨ For the land use feature class in pen.gdb, find out what the fields are. You'll want to find the land-use code.
In [ ]: #
OBJECTID
Shape
AREA
PERIMETER
LANDUSE_
LANDUSE_ID
LU_CODE
Shape_Length
Shape_Area
⌨ Use the Select tool from the analysis toolbox with the input features being the land use feature class,
create "urban" as the output feature class, and use a where clause that gets all land-codes under 20.
In [ ]: #
🌎 Check the map to confirm that you got what you were hoping for.
Layers
Often you'll want or need to work with layers similar to how you often work in a map. Instead of creating a new
feature class with a selection, you might apply a selection query to a layer.
⌨ Let's do the same thing we just did but create the output as a layer named "Urban" by using the
MakeFeatureLayer tool in the data management toolbox. Use the same input feature class and where clause.
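Again a sketch, with the same assumed input feature class:
import arcpy, os
from arcpy import env

env.workspace = os.getcwd() + "/pen.gdb"
arcpy.management.MakeFeatureLayer("landusePen", "Urban", "LU_CODE < 20")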
In [ ]: #
🌎 Then check out the results on the map. In the contents, go to the properties of both this new layer "Urban"
and the previous feature class "urban" to check their source.
Cursors
Cursors give you access to values in your data fields, and allow you to loop through records in your data tables.
Since each record (row) might be a vector feature or a raster value, this gives you considerable power to process
your data, manipulate and create geometries, and develop tools that can do what isn't possible in the stock
ArcGIS GUI.
⌨ First we'll use a simple cursor that goes through the X & Y values of a feature class by getting the first part of the value in the shape field, after determining that field's name. Note that all geometries have parts, which makes sense for polylines and polygons; points have parts too, even if there's only one part, as there is here. Note that you need to delete the cursor at the end.
It's useful to remember that each feature is a row in the database. The following code loops through the
database as a "cursor" scrolling through the rows. The for ptf in cur: assigns each row feature to an
object we'll call ptf (for "point feature"), then getValue(shapefld) gets the shape field value from that
feature. The .getPart() gets the only part of the point geometry, which is then assigned to pnt , from which
we can get X and Y values with pnt.X and pnt.Y .
import arcpy, os
from arcpy import env

projdir = os.getcwd()
env.workspace = "marbles.gdb"
desc = arcpy.Describe("co2july95")
shapefld = desc.ShapeFieldName
cur = arcpy.SearchCursor("co2july95")
for ptf in cur:
    pnt = ptf.getValue(shapefld).getPart()
    print(f"{pnt.X}, {pnt.Y}")
del cur
In [ ]: #
484700.7000000002, 4601827.0
484164.2999999998, 4601425.0
484164.2999999998, 4601425.0
484164.2999999998, 4601425.0
484903.2999999998, 4599948.0
485286.0, 4600092.0
485126.0, 4600703.0
485126.0, 4600703.0
483986.0, 4600852.0
483454.0, 4601142.0
483454.0, 4601142.0
483479.0, 4601325.0
483614.0, 4601283.0
482971.7000000002, 4602427.0
482816.7000000002, 4602599.0
483486.0, 4601072.0
482983.2999999998, 4602562.0
483349.0, 4602502.0
483534.0, 4601546.0
483509.4000000004, 4601475.0
485077.0, 4599968.0
483662.2999999998, 4601293.0
⌨ Print out a list of all of the field names, so we can build the next code:
In [ ]: #
OBJECTID
Shape
DATE_
PDT
CO2_
SOIL°C
AIR°C
D_CM
ID
LOC
ELEV
PM
DESCRIPTIO
X
Y
⌨ Modify the above code that displays pnt.X and pnt.Y to also display the date, elevation, PM (parent material, such as rock type), and CO2 value, using the names we found for those fields.
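One way the modified loop might look, building on the previous code (field names taken from the list above):
cur = arcpy.SearchCursor("co2july95")
for ptf in cur:
    pnt = ptf.getValue(shapefld).getPart()
    print(f"{pnt.X}, {pnt.Y}, {ptf.getValue('DATE_')}, {ptf.getValue('ELEV')}, "
          f"{ptf.getValue('PM')}, {ptf.getValue('CO2_')}")
del cur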
In [ ]: #
Note that in the above output, the CO2 and X and Y values have excessive decimal places (an artifact of how floating-point numbers are stored: past a certain depth of decimal places they're an approximation).
⌨ Using either the round() function or (better) the f-string formatting method, round these to 2 decimal places for CO2 (how they were originally recorded) and 0 decimal places for X & Y (the approximate accuracy of our location method in the field).
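With f-string format specifiers, the print line in the loop might become something like:
print(f"{pnt.X:.0f}, {pnt.Y:.0f}, {ptf.getValue('CO2_'):.2f}")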
In [ ]: #
⌨ Try the following code, which demonstrates the use of the SearchCursor to simply display the values of two fields from a shapefile. Note that the fields are specified as a tuple and their values are accessed by index from each row.
import arcpy, os
from arcpy import env

projdir = os.getcwd()
ws = env.workspace = projdir + "/curdata"
cur = arcpy.da.SearchCursor("contour.shp", ("ID", "ELEV"))
for row in cur:
    print(f"{row[0]}, {row[1]}")
In [ ]: #
('ID', 'ELEV')
1, 100
2, 110
3, 120
4, 130
5, 140
6, 90
Change the code to display all of the fields. We'll detect the fields, and with a bit of extra coding create a comma-delimited string containing each row of data.
contours = "contour.shp"            # the same shapefile as above
flds = arcpy.ListFields(contours)
for i in range(len(flds)):
    flds[i] = flds[i].name
print(flds)
cur = arcpy.da.SearchCursor(contours, flds)
for row in cur:
    datastring = ""
    for i in range(len(flds) - 1):
        datastring = datastring + str(row[i]) + ","
    print(datastring + str(row[len(flds) - 1]))
In [ ]: #
⌨ Use a SearchCursor to read and print out values for all of the point features, including its id, Calcium
Carbonate concentrations and XY coordinates from samples.shp from the Marble Mountains, in curdata .
Start with the typical boilerplate, including the workspace setting to curdata we just used.
Set a variable ptfeats to have the path to samples.shp
Create a list of field names composed of "sample_id" , "CATOT" , and "SHAPE@XY" and assign it to a
variable fields
Use a with structure to enclose your cursor loop, which limits the cursor to the structure: with
arcpy.da.SearchCursor(ptfeats, fields) as cur:
Within that with structure, use a for loop structure to go through all of the point features in the list and print out the three fields (including shape), for instance with print(f"{row[0]}, {row[1]}, {row[2]}").
After the loop, delete the cursor with del cur , though inside a with structure that's not really necessary. (A sketch follows below.)
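A minimal sketch of these steps:
import arcpy, os
from arcpy import env

ws = env.workspace = os.getcwd() + "/curdata"
ptfeats = "samples.shp"
fields = ["sample_id", "CATOT", "SHAPE@XY"]
with arcpy.da.SearchCursor(ptfeats, fields) as cur:
    for row in cur:
        print(f"{row[0]}, {row[1]}, {row[2]}")
del cur    # not strictly necessary inside a with structure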
In [ ]: #
Add some code just after assigning the ws variable to create a new textfile (open it for writing), and first
delete it if it already exists:
txtpath = ws + "/samples.csv"
if arcpy.Exists(txtpath): arcpy.Delete_management(txtpath)
txtout = open(txtpath, "w")
Change the last item in the fields list from "SHAPE@XY" to "SHAPE@X" , "SHAPE@Y" . Why? We're going to
write the data to a text file, and to read it in again, it's going to be easier to parse out x and y as separate
items. What we're seeing is a lot of flexibility with the geometry.
Before the loop, write out the field names to the text file, to make it more useful (see the sketch below).
Within the loop, instead of printing to the screen, write out all of the data to the text file. Note the addition of "\n" since (unlike print) the write method doesn't otherwise insert a new line.
Then at the end, out of the for loop (so unindent), close the text file.
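Putting it together, a sketch of the modified script (the exact write format shown is one reasonable choice, not the only one):
import arcpy, os
from arcpy import env

ws = env.workspace = os.getcwd() + "/curdata"
txtpath = ws + "/samples.csv"
if arcpy.Exists(txtpath):
    arcpy.Delete_management(txtpath)
txtout = open(txtpath, "w")

fields = ["sample_id", "CATOT", "SHAPE@X", "SHAPE@Y"]
txtout.write(",".join(fields) + "\n")                 # header row of field names
with arcpy.da.SearchCursor("samples.shp", fields) as cur:
    for row in cur:
        txtout.write(f"{row[0]},{row[1]},{row[2]},{row[3]}\n")
txtout.close()                                        # outside the loop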
In [ ]: #
Later we may modify this script to run as a script tool, with the feature layer as an input, writing out a text file.
InsertCursor
We'll write an InsertCursor to populate a feature class we'll create.
⌨ Use an InsertCursor to insert some points that we'll hard-code into the script, after first creating the feature
class. We'll create a shapefile, but similar code could be used to create a feature class in a geodatabase. We'll
also create a data folder to put the new shapefile in.
Use the same boilerplate as the last two scripts to import everything you need and set the workspace.
Set up the shortcut to data management tools as DM (from arcpy import management as DM)
Assign "marblePts.shp" to ptfeatname .
Because of a bug, insert print(arcpy.ListFeatureClasses()) to "wake it up".
In [ ]: #
Step 2. Create the shapefile, borrowing an existing spatial reference, and add the fields we'll need.
If that shapefile (use the variable ptfeatname ) already exists in the workspace, DM.Delete it.
Use Describe to get the spatialReference from existing samples.shp data, and assign it to sr . Hint
on coding this: assign arcpy.Describe("samples.shp").spatialReference to sr .
Create a shapefile using the CreateFeatureclass tool from DM , with the following parameters: (ws,
ptfeatname, "POINT", "", "", "", sr)
For testing, see if it exists with print(arcpy.ListFeatureClasses())
Use the AddField tool to add the fields id of type LONG and then name of type TEXT to ptfeatname .
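A sketch of step 2, assuming ws and ptfeatname from step 1:
if arcpy.Exists(ptfeatname):
    DM.Delete(ptfeatname)
sr = arcpy.Describe("samples.shp").spatialReference
DM.CreateFeatureclass(ws, ptfeatname, "POINT", "", "", "", sr)
print(arcpy.ListFeatureClasses())    # for testing (and to "wake up" the workspace)
DM.AddField(ptfeatname, "id", "LONG")
DM.AddField(ptfeatname, "name", "TEXT")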
In [ ]: #
Step 3.
Create a list ptdata of points defined as tuples {hint: [(...),(...)] }, with each point definition as an id,
name, and x&y coordinates in UTM, using the following set of values (note that in the shapefile, a numeric Id
field is always created by default, so we'll just populate it):
(12,"Upper Meadow",(483473,4601523))
(42,"Sky High Camp",(485339,4600001))
In [ ]: #
Create an insert cursor named cur using your newly created ptfeatname feature class.
Loop through your ptdata list of points, assigning each as pt , and inside the loop:
cur.insertRow(pt)
In [ ]: #
del cur
DM.AddXY(ptfeatname)
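Pulling these steps together, a sketch; note that the cursor needs the three fields, with SHAPE@XY for the geometry (an assumption based on the tuple structure above):
ptdata = [(12, "Upper Meadow", (483473, 4601523)),
          (42, "Sky High Camp", (485339, 4600001))]
cur = arcpy.da.InsertCursor(ptfeatname, ["id", "name", "SHAPE@XY"])
for pt in ptdata:
    cur.insertRow(pt)
del cur
DM.AddXY(ptfeatname)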
In [ ]: #
Mapping XY points
If you are running this in ArcGIS Pro and had a map open, you will have been able to see the points displayed on
the map.
Geopandas alternative
But if (and only if) you're running this from Jupyter Notebooks, you'll want to use Geopandas to see the result:
marblepts.plot()
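For instance, something like this should work to read and plot the new shapefile (the GeoDataFrame name is assumed):
import geopandas as gpd

marblepts = gpd.read_file(ws + "/marblePts.shp")
marblepts.plot()
marblepts      # display the attribute table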
In [ ]: #
Out[ ]:
Id name POINT_X POINT_Y geometry
In [ ]:
The above script doesn't involve text files, and is pretty easy to change to work with a geodatabase: simply change the workspace to "marbles.gdb" and change "marblePts.shp" and "samples.shp" to "marblePts" and "samples".
However, we won't be able to look at it with geopandas, since that can only work with OpenGIS data like
shapefiles.
In [ ]: #
We'll work with the samples.csv file we wrote out earlier, and create a samplepts.shp output in the data
folder we just used for shapefiles. After you do this, you might want to try modifying it to work with the file
geodatabase.
Start with the code from Mb3_InsertCreatePts.py, but change the output feature class to "samplePts" and make other changes after that:
In [ ]: #
Assign to inputFile a path to the samples.csv file you created earlier, and then create a textfile read object with
textin = open(inputFile, "r")
Have a careful look at the samples.csv file you created earlier by opening it in a text editor like Notepad (not Excel). Note that it's comma-delimited, and the first line of text is the list of field names. Get the field names as a list with flds[0]
In [ ]: #
Continue borrowing code from Mb3_InsertCreatePts.py, but change the AddField statements to instead add
"sample_id" of type "LONG" and "CATOT" of type "DOUBLE". This should make sense from your review of the
samples.csv file.
For now, make sure to include the various print(arcpy.ListFeatureClasses()) statements to wake things up, due to
some kind of bug related to Pro.
In [ ]: #
You might be tempted to use the field names detected above, but we'll want to create SHAPE@XY instead of a separate SHAPE@X and SHAPE@Y , so set up the cursor accordingly (see the sketch below).
After the insert cursor creation line, create a Boolean flag to allow you to ignore the first row of text from textin:
firstrow = True
for txtrow in textin:
    if not firstrow:
to only create new point features when not the first row of text.
Read in the values from the row of text, splitting at the commas:
dta = txtrow.split(",")
Build the point geometric object with the X & Y values from the text file (notice how this is different from what we did before, and that we have to convert the text to floats), then insert it.
In [ ]: #
Close the text file, delete the cursor and use AddXY the same as before.
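Here's how that middle section might look as a sketch, assuming ptfeatname and textin from the earlier steps and the column order written to samples.csv:
cur = arcpy.da.InsertCursor(ptfeatname, ["sample_id", "CATOT", "SHAPE@XY"])
firstrow = True
for txtrow in textin:
    if not firstrow:
        dta = txtrow.split(",")
        # id, CATOT, then the XY pair floated from the text values
        cur.insertRow((int(dta[0]), float(dta[1]), (float(dta[2]), float(dta[3]))))
    firstrow = False
textin.close()
del cur
DM.AddXY(ptfeatname)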
In [ ]: #
Out[ ]:
Messages
In an ArcGIS map you should see the results (refresh the map if necessary). If you're running in Jupyter instead, read the shapefile into a geopandas GeoDataFrame (say, samplepts) and plot it:
samplepts.plot(column="CATOT", cmap="YlOrBr")
In [ ]: #
Out[ ]:
Id sample_id CATOT POINT_X POINT_Y geometry
61 rows × 6 columns
In [ ]: #
1. Start with the same boilerplate you've been using to reference the curdata workspace, importing DM, etc. Add random to the list of imported modules.
In [ ]: #
2. Also as we've done above, use Describe and its spatialReference property to get the spatial reference of samples.shp, and assign it to sr.
In [ ]: #
3. Then also from the Describe object, get XMin , XMax , YMin , and YMax from the Extent property and assign these to simple variables xmin , xmax , ymin and ymax . We'll use these to make sure our lines end up in the study area.
In [ ]: #
4. Assign xrg ("x range") as xmax-xmin, and yrg as ymax-ymin, then xstp as xrg/20 and ystp as yrg/20. We're going to use these to create visible spacing of vertices on our polylines.
In [ ]: #
In [ ]: #
5. Assign an output filename (e.g. "randomlines.shp") to featfile, then create a feature class in ws with that filename, using the parameters: (ws, featfile, "POLYLINE", "", "", "", sr)
In [ ]: #
6. Create an InsertCursor called cur using id and SHAPE@ as fields. So assign to cur the following: arcpy.da.InsertCursor(ws + "/" + featfile, ("id", "SHAPE@"))
In [ ]: #
7. Create a method variable rnd from the random method of the random module. This method returns a random floating-point number between 0 and 1: rnd = random.random
In [ ]: #
8. In order to create 12 polylines, create a for loop that repeats 12 times using i as index. Within that loop:
Create an empty list called pointlist . Polylines and polygons are built from series of vertices, and
we'll need to create each vertex as a point object.
Create an initial point that will fit within the extent (study the method used to see how it fits within the
extent), using the rnd method assigned above: p = arcpy.Point(rnd() * xrg + xmin, rnd() *
yrg + ymin)
Append the point to pointlist
Then create an interior for j loop that goes 30 times to create plenty of vertices for each polyline. Note that
vertices may extend a bit beyond the extent. We could test for that and avoid it, but we'll just go with it for
simplicity. Note also that we'll reuse the p point feature, which doesn't cause any problems.
Create a new p point that diverges from the previous point by a random distance from zero to xstp and
ystp distances: p = arcpy.Point(p.X + xstp*rnd() - xstp*rnd(), p.Y + ystp*rnd() -
ystp*rnd())
Append that point to pointlist .
Then create an array from pointlist, make a polyline from it, and insert it (a consolidated sketch follows below):
array = arcpy.Array(pointlist)
polyline = arcpy.Polyline(array)
cur.insertRow([i, polyline])
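Assembled into one piece, the whole script might look like this (featfile is an assumed name):
import arcpy, os, random
from arcpy import env, management as DM

ws = env.workspace = os.getcwd() + "/curdata"
desc = arcpy.Describe("samples.shp")
sr = desc.spatialReference
ext = desc.extent
xmin, xmax, ymin, ymax = ext.XMin, ext.XMax, ext.YMin, ext.YMax
xrg = xmax - xmin
yrg = ymax - ymin
xstp = xrg / 20
ystp = yrg / 20

featfile = "randomlines.shp"                   # assumed output name
if arcpy.Exists(featfile):
    DM.Delete(featfile)
DM.CreateFeatureclass(ws, featfile, "POLYLINE", "", "", "", sr)
cur = arcpy.da.InsertCursor(ws + "/" + featfile, ("id", "SHAPE@"))
rnd = random.random

for i in range(12):
    pointlist = []
    p = arcpy.Point(rnd() * xrg + xmin, rnd() * yrg + ymin)    # start inside the extent
    pointlist.append(p)
    for j in range(30):
        p = arcpy.Point(p.X + xstp * rnd() - xstp * rnd(),
                        p.Y + ystp * rnd() - ystp * rnd())     # random walk from the last vertex
        pointlist.append(p)
    array = arcpy.Array(pointlist)
    polyline = arcpy.Polyline(array)
    cur.insertRow([i, polyline])
del cur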
In [ ]: #
In [ ]: #
You should see the result displayed in ArcGIS (but you may need to refresh the map). Outside Pro, read the shapefile into geopandas as randomlines and plot it:
randomlines.plot(column="Id", cmap="Spectral")
In [ ]: #
Out[ ]:
Id geometry
In [ ]: #
Random polygons
Creating polygons is almost identical to creating polylines, except that they close from the last point back to the first. Take the code you just wrote and convert it in various places to create polygons. These are the changes:
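Based on the polyline version, the changes would presumably be along these lines (the output name is hypothetical):
featfile = "randompolys.shp"                                     # new output name
DM.CreateFeatureclass(ws, featfile, "POLYGON", "", "", "", sr)   # "POLYGON" instead of "POLYLINE"
# ...and inside the loop:
polygon = arcpy.Polygon(array)                                   # Polygon instead of Polyline
cur.insertRow([i, polygon])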
In [ ]: #
Geopandas display
Again, only if you're running this outside ArcGIS Pro, you can display your results in Geopandas with:
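For example (assuming you read the result into a GeoDataFrame named randompolys; both names are hypothetical):
import geopandas as gpd

randompolys = gpd.read_file(ws + "/randompolys.shp")
randompolys.plot(column="Id", cmap="Spectral")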
In [ ]: #
Earlier we looked at the Python data analysis library pandas, and there we used csv files as input. We can also use it together with arcpy, its cursors, and geodatabases.
We'll explore creating various data types using search cursors on a feature class and then loading that data into a pandas dataframe. Then we'll explore various methods and properties available to us to access data within our dataframes. We'll conclude with converting a table to a numpy array.
import pandas as pd
import arcpy, os
ws = os.getcwd()
In [ ]: #
⌨ Next, we'll want to assign the schools variable to our "SF_Schools" feature class:
schools = ws + r"\SF.gdb\SF_Schools"
In [ ]: #
fields = []
for fld in arcpy.ListFields(schools):
    fields.append(fld.name)
fields
In [ ]: #
Out[ ]: ['OBJECTID',
'Shape',
'X',
'Y',
'Match_addr',
'Name',
'District',
'County',
'Street',
'City',
'State',
'ZIP9',
'DistType',
'Type',
'Latitude',
'Longitude',
'Grades',
'Status_1']
⌨ Once we've defined the schools feature class we're going to go through multiple examples of loading
different data types into a pandas dataframe. Below are examples where we can generate a list, dictionary,
numpy array, etc. all with using arcpy search cursors. First, let's examine how we can build a list of lists and
load that into a dataframe.
We'll just list the first few list members, and continue that practice in later code chunks...
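A sketch of how that list of lists might be built, using schools from above (field names taken from the output below):
fields = ["OBJECTID", "X", "Y", "Name", "District", "City", "Type"]
school_list = []
with arcpy.da.SearchCursor(schools, fields) as cur:
    for row in cur:
        school_list.append(list(row))   # each row tuple becomes a list
school_list[:3]                         # show the first few members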
In [ ]: #
Out[ ]: [[1,
-122.419992341,
37.7766494543,
'Alternative/Opportunity',
'San Francisco County Office of Education',
'San Francisco',
'ALTERNATIVE'],
[2,
-122.463582,
37.763352,
'Cross Cultural Enviromental Leadership (xcel) Acad',
'San Francisco Unified',
'San Francisco',
'HIGH SCHOOL'],
[3,
-122.395977,
37.7192,
'KIPP Bayview Academy',
'San Francisco Unified',
'San Francisco',
'MIDDLE']]
❔ What data type is returned from the above block of code? What is the data type of school_list?
⛬
⌨ We can create a dataframe from the school_list object above. In this situation we'll need to define the corresponding column names when we load the data into the dataframe.
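One way to do that (names as above):
df_schools = pd.DataFrame(school_list, columns=fields)
df_schools.head(3)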
In [ ]: #
Out[ ]:
   OBJECTID           X          Y                                                 Name                                   District           City         Type
0         1 -122.419992  37.776649                              Alternative/Opportunity  San Francisco County Office of Education  San Francisco  ALTERNATIVE
1         2 -122.463582  37.763352  Cross Cultural Enviromental Leadership (xcel) Acad                      San Francisco Unified  San Francisco  HIGH SCHOOL
2         3 -122.395977  37.719200                                 KIPP Bayview Academy                      San Francisco Unified  San Francisco       MIDDLE
⌨ Next, we can also create a list of tuples and load that into a dataframe. When we create a search cursor, each row object returned is already a tuple.
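A sketch, paralleling the list-of-lists version:
school_list_tuples = []
with arcpy.da.SearchCursor(schools, fields) as cur:
    for row in cur:
        school_list_tuples.append(row)   # rows are already tuples
school_list_tuples[:3]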
In [ ]: #
Out[ ]: [(1,
-122.419992341,
37.7766494543,
'Alternative/Opportunity',
'San Francisco County Office of Education',
'San Francisco',
'ALTERNATIVE'),
(2,
-122.463582,
37.763352,
'Cross Cultural Enviromental Leadership (xcel) Acad',
'San Francisco Unified',
'San Francisco',
'HIGH SCHOOL'),
(3,
-122.395977,
37.7192,
'KIPP Bayview Academy',
'San Francisco Unified',
'San Francisco',
'MIDDLE')]
⌨ Then we'll load this list of tuples object we've built into a pandas dataframe.
In [ ]: #
Out[ ]:
   OBJECTID           X          Y                                                 Name                                   District           City         Type
0         1 -122.419992  37.776649                              Alternative/Opportunity  San Francisco County Office of Education  San Francisco  ALTERNATIVE
1         2 -122.463582  37.763352  Cross Cultural Enviromental Leadership (xcel) Acad                      San Francisco Unified  San Francisco  HIGH SCHOOL
2         3 -122.395977  37.719200                                 KIPP Bayview Academy                      San Francisco Unified  San Francisco       MIDDLE
⌨ Now, let's build a dictionary using a search cursor and then load that dictionary into a dataframe. What's important is that you understand the structure of the dictionary we're loading into the dataframe: each key corresponds to a field name, and its value is the list of all of that column's values from the attribute table.
school_dict = {}
fields = ["OBJECTID", "Name", "District", "City", "Type"]
with arcpy.da.SearchCursor(schools, fields) as cur:
    for row in cur:
        for field_name, data_row in zip(fields, row):
            school_dict.setdefault(field_name, []).append(data_row)
school_dict.keys()
In [ ]: #
df_school_dict = pd.DataFrame(school_dict)
df_school_dict.keys()
In [ ]: #
⌨ Print out each of the data types we created in the above steps:
print(type(school_list))
print(type(school_dict))
print(type(school_list_tuples[0]))
In [ ]: #
<class 'list'>
<class 'dict'>
<class 'tuple'>
⌨ To conclude, we'll convert the table to a numpy array, referencing the full path to the feature class:
schools = ws + r"\SF.gdb\SF_Schools"
In [ ]: #
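The array itself is presumably built with arcpy.da.TableToNumPyArray; a sketch, with the field list assumed from the output below:
arr = arcpy.da.TableToNumPyArray(schools, ["X", "Y", "Name", "District", "City", "Type"])
type(arr)    # a numpy structured ndarray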
In [ ]: #
schools_df = pd.DataFrame(arr)
schools_df
In [ ]: #
Out[ ]:
              X          Y                                                Name                                   District           City         Type
0   -122.419992  37.776649                             Alternative/Opportunity  San Francisco County Office of Education  San Francisco  ALTERNATIVE
1   -122.463582  37.763352  Cross Cultural Enviromental Leadership (xcel) ...                      San Francisco Unified  San Francisco  HIGH SCHOOL
..          ...        ...                                                 ...                                        ...            ...          ...
190 -122.426158  37.754743                              Edison Charter Academy               SBE - Edison Charter Academy  San Francisco   ELEMENTARY
In [ ]: #
import contextily as cx

sf_schools_webm = sf_schools.to_crs("EPSG:3857")   # sf_schools: the GeoDataFrame read earlier
basemap, basemap_extent = cx.bounds2img(*sf_schools_webm.total_bounds, zoom=12, ll=False)
In [ ]: #
⌨ Then use similar code to what we've used before to plot the school locations
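A sketch of a plot along those lines, using matplotlib with the basemap image and extent from above:
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(basemap, extent=basemap_extent)                  # the downloaded tiles
sf_schools_webm.plot(ax=ax, color="red", markersize=10)    # school points on top
ax.set(title="SF Schools")
plt.show()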
In [ ]: #
Out[ ]: (-13638811.83098057,
-13619243.951739563,
4529964.044292687,
4559315.8631541915)