
Above: text in a friend group chat that inspired this :D
We can accomplish this by using the Python pandas library!
First, we need to get the data we want to parse. We’re using the list of mayors of NYC from Wikipedia. Since we’re scraping Wikipedia, we want to make sure our request complies with their User-Agent policy.
import pandas as pd
# getting the data from Wikipedia
wikiurl = "https://en.wikipedia.org/wiki/List_of_mayors_of_New_York_City"
wikitables = pd.read_html(wikiurl, storage_options={"User-Agent": "SmallProject/0.0 ([email protected])"})
When we print out our wikitables, we get a really long list of every table on the Wikipedia page…
[ No.[2] Name \
0 1 Thomas Willett (1st term)
1 2 Thomas Delavall (1st term)
2 3 Thomas Willett (2nd term)
3 4 Cornelius Van Steenwyk (1st term)
4 5 Thomas Delavall (2nd term)
5 6 Matthias Nicoll
6 7 John Lawrence (1st term)
7 8 William Dervall
8 9 Nicholas De Mayer
9 10 Stephanus Van Cortlandt (1st term)
10 11 Thomas Delavall (3rd term)
11 12 Francis Rombouts
12 13 William Dyre
13 14 Cornelius Van Steenwyk (2nd term)
14 15 Gabriel Minvielle (*)
15 16 Nicholas Bayard (*)
16 17 Stephanus Van Cortlandt (2nd term)
17 18 Peter Delanoy (only popularly-elected mayor be...
18 19 John Lawrence (2nd term *)
19 20 Abraham de Peyster
20 21 Charles Lodwik
21 22 William Merritt
...
We want to pull out just the tables we care about (in this case, the pre- and post-consolidation mayor lists). By counting the tables on the actual Wikipedia article, we can see that we want the 2nd and 3rd tables, which are indices 1 and 2 of the list.
pre_table = wikitables[1]
post_table = wikitables[2]
We want to clean these tables down to just the names (no birth or death dates, none of the extra numbers or footnote markers that Wikipedia pages tend to have), so we use some regular expressions (regex) here:
import re # regular expression library
# gathering all names from both tables
names = []
# parsing pre-consolidation table
for mayor in pre_table["Mayor"]:
    y = re.sub(r'\(.*?\)', '', mayor) # getting rid of everything in parentheses
    y = re.sub(r'[^a-zA-Z .]', '', y).strip() # keeping only letters, spaces, and periods (.)
    y = y.removesuffix("b.").strip() # removing birth-date fragments that fell through the cracks
    names.append(y)
# parsing post-consolidation table
for mayor in post_table["Name (Birth–Death)"]:
    y = re.sub(r'\(.*?\)', '', mayor)
    y = re.sub(r'[^a-zA-Z .]', '', y).strip()
    y = y.removesuffix("b.").strip()
    names.append(y)
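To see what those two substitutions actually do, here they are applied to a couple of sample strings written to mimic Wikipedia's formatting (the helper name is mine, just for illustration):

```python
import re

def clean_name(raw: str) -> str:
    # drop parenthesized notes, then anything that isn't a letter, space, or period
    no_parens = re.sub(r'\(.*?\)', '', raw)
    return re.sub(r'[^a-zA-Z .]', '', no_parens).strip()

print(clean_name("Thomas Willett (1st term)"))    # Thomas Willett
print(clean_name("John P. O'Brien (1873–1951)"))  # John P. OBrien (apostrophe stripped too)
```

Note that the second substitution also strips apostrophes, which is why names like O’Brien come out as “OBrien” below.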
names = ['James Duane', 'Richard Varick', 'Edward Livingston', 'DeWitt Clinton', 'Marinus Willett', 'DeWitt Clinton', 'Jacob Radcliff', 'DeWitt Clinton', 'John Ferguson', 'Jacob Radcliff', 'Cadwallader D. Colden', 'Stephen Allen', 'William Paulding Jr.', 'Philip Hone', 'William Paulding Jr.', 'Walter Bowne', 'Gideon Lee', 'Cornelius Lawrence', 'Aaron Clark', 'Isaac L. Varian', 'Robert H. Morris', 'James Harper', 'William Frederick Havemeyer', 'Andrew H. Mickle', 'William V. Brady', 'William Frederick Havemeyer', 'Caleb Smith Woodhull', 'Ambrose Kingsland', 'Jacob Aaron Westervelt', 'Fernando Wood', 'Daniel F. Tiemann', 'Fernando Wood', 'George Opdyke', 'Charles Godfrey Gunther', 'John T. Hoffman', 'Thomas Coman', 'A. Oakey Hall', 'William Frederick Havemeyer', 'Samuel B. H. Vance', 'William H. Wickham', 'Smith Ely Jr.', 'Edward Cooper', 'William Russell Grace', 'Franklin Edson', 'William Russell Grace', 'Abram Hewitt', 'Hugh J. Grant', 'Thomas Francis Gilroy', 'William Lafayette Strong', 'Robert Anderson Van Wyck', 'Seth Low', 'George B. McClellan Jr.', 'William Jay Gaynor', 'Ardolph L. Kline', 'John Purroy Mitchel', 'John Francis Hylan', 'William T. Collins', 'Jimmy Walker', 'Joseph V. McKee', 'John P. OBrien', 'Fiorello LaGuardia', 'William ODwyer', 'Vincent R. Impellitteri', 'Robert F. Wagner Jr.', 'John Lindsay', 'Abraham Beame', 'Ed Koch', 'David Dinkins', 'Rudy Giuliani', 'Michael Bloomberg', 'Bill deBlasio', 'Eric Adams']
Right now, we have a big list of names. We want to make sure that we don’t repeat names (since some of these people have served for multiple terms).
unique_names = []
for name in names:
    if name not in unique_names:
        unique_names.append(name)
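Since dicts preserve insertion order in Python 3.7+, the same order-preserving de-duplication can also be done in one line with dict.fromkeys (shown here on a small hand-picked sample):

```python
names = ["DeWitt Clinton", "Jacob Radcliff", "DeWitt Clinton", "John Ferguson"]
unique_names = list(dict.fromkeys(names))  # keeps the first occurrence of each name
print(unique_names)  # ['DeWitt Clinton', 'Jacob Radcliff', 'John Ferguson']
```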
Now we can finally get the first names!
freq = {}
for name in unique_names:
    first_name = name.split()[0]
    if first_name not in freq:
        freq[first_name] = 1
    else:
        freq[first_name] += 1
sorted_freq = dict(sorted(freq.items(), key=lambda item: item[1], reverse=True))
print(sorted_freq)
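As an aside, the standard library's collections.Counter handles both the tallying and the sorting; a shorter equivalent, shown on a small hand-picked sample:

```python
from collections import Counter

unique_names = ["William Paulding Jr.", "William V. Brady", "John Lindsay"]
freq = Counter(name.split()[0] for name in unique_names)  # count first names
print(freq.most_common())  # [('William', 2), ('John', 1)]
```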
Turns out that William is by far the most common with 9 (!!!) Williams.
{'William': 9,
'John': 6,
'Robert': 3,
'James': 2,
'Edward': 2,
'Jacob': 2,
'George': 2,
'Thomas': 2,
'Richard': 1,
'DeWitt': 1,
'Marinus': 1,
'Cadwallader': 1,
'Stephen': 1,
'Philip': 1,
'Walter': 1,
'Gideon': 1,
'Cornelius': 1,
'Aaron': 1,
'Isaac': 1,
'Andrew': 1,
'Caleb': 1,
'Ambrose': 1,
'Fernando': 1,
'Daniel': 1,
'Charles': 1,
'A.': 1,
'Samuel': 1,
'Smith': 1,
'Franklin': 1,
'Abram': 1,
'Hugh': 1,
'Seth': 1,
'Ardolph': 1,
'Jimmy': 1,
'Joseph': 1,
'Fiorello': 1,
'Vincent': 1,
'Abraham': 1,
'Ed': 1,
'David': 1,
'Rudy': 1,
'Michael': 1,
'Bill': 1,
'Eric': 1}
To extend this, we can do a silly natural language processing exercise to “calculate” your chance of becoming mayor. I used the fuzzy string matching library fuzzywuzzy and the phonetic/string comparison library jellyfish to compute the American Soundex value of a string and compare it against the mayoral first names we found above.
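American Soundex encodes a name as its first letter plus three digits describing the consonant sounds that follow, so similar-sounding names get the same code. jellyfish computes this for us, but here's a minimal stdlib sketch of the algorithm (my own simplified version, not jellyfish's code, and it assumes purely alphabetic input):

```python
def soundex(name: str) -> str:
    """Simplified American Soundex: first letter + three digits."""
    digits = {}
    for letters, d in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in letters:
            digits[ch] = d
    name = name.lower()
    code = name[0].upper()
    prev = digits.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":          # h and w are transparent: they don't reset prev
            continue
        d = digits.get(ch, "")  # vowels get "" and do reset prev
        if d and d != prev:     # skip repeated consonant codes
            code += d
        prev = d
    return (code + "000")[:4]   # pad/truncate to 4 characters

print(soundex("William"))  # W450
print(soundex("Jasmine"))  # J255
```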
First, I assigned a “name chance value” to each first name based on its share of the frequency dictionary above:
# applying percentages to each name
name_chance = {}
tot_names = sum(freq.values())
for name in freq:
    name_chance[name] = freq[name] / tot_names
print(name_chance)
and we get something like this:
{'James': 0.03125, 'Richard': 0.015625, 'Edward': 0.03125, 'DeWitt': 0.015625, 'Marinus': 0.015625, 'Jacob': 0.03125, 'John': 0.09375, 'Cadwallader': 0.015625, 'Stephen': 0.015625, 'William': 0.140625, 'Philip': 0.015625, 'Walter': 0.015625, 'Gideon': 0.015625, 'Cornelius': 0.015625, 'Aaron': 0.015625, 'Isaac': 0.015625, 'Robert': 0.046875, 'Andrew': 0.015625, 'Caleb': 0.015625, 'Ambrose': 0.015625, 'Fernando': 0.015625, 'Daniel': 0.015625, 'George': 0.03125, 'Charles': 0.015625, 'Thomas': 0.03125, 'A.': 0.015625, 'Samuel': 0.015625, 'Smith': 0.015625, 'Franklin': 0.015625, 'Abram': 0.015625, 'Hugh': 0.015625, 'Seth': 0.015625, 'Ardolph': 0.015625, 'Jimmy': 0.015625, 'Joseph': 0.015625, 'Fiorello': 0.015625, 'Vincent': 0.015625, 'Abraham': 0.015625, 'Ed': 0.015625, 'David': 0.015625, 'Rudy': 0.015625, 'Michael': 0.015625, 'Bill': 0.015625, 'Eric': 0.015625}
and now we can calculate it based on these chances:
import jellyfish
from fuzzywuzzy import fuzz
your_name = "YOUR_NAME_HERE" # REPLACE WITH YOUR NAME
code1 = jellyfish.soundex(your_name)
sum_chance = 0
for name in name_chance:
    code2 = jellyfish.soundex(name)
    sum_chance += name_chance[name] * fuzz.ratio(code1, code2)
print(f"your chance is {sum_chance}")
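If you'd rather skip the fuzzywuzzy dependency, fuzz.ratio on plain strings is essentially difflib.SequenceMatcher's ratio scaled to 0-100, so a small stdlib helper can stand in for it (a sketch, not a byte-for-byte reimplementation):

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    # 0-100 similarity score, comparable to fuzz.ratio for plain strings
    return round(100 * SequenceMatcher(None, a, b).ratio())

print(ratio("J255", "J255"))  # 100 (identical Soundex codes)
print(ratio("abc", "xyz"))    # 0 (nothing in common)
```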
My name is “Jasmine” and I have a 23.4375% chance. You can try it out yourself by opening this link, running all the cells, and replacing the your_name variable with your name in the last cell.