The increasing ubiquity of satellite and aerial data has been fueled by technological advances that make this kind of imaging cheaper to acquire. This inspired me to turn my ML knowledge toward learning what inequality in our cities looks like from above. While census data helps us understand which areas of our cities suffer the most inequality, these surveys take time and are cost prohibitive, limiting frequent and widespread collection. If we can produce a method to study and predict areas that currently suffer from inequality using aerial imaging, maybe we can transfer this model to quickly identify other areas covered by satellite that would not otherwise be frequented by surveys, giving us much faster updates on how a place evolves. Maybe further research can also tell us about physical markers in Landsat imagery that signify areas of inequality?
Before we can address any of these more ambitious questions, a good place to start is to ask how we can quantify inequality. How does the data available to us constrain our choice of measures, and do our modeling choices, simplifications, and assumptions make sense in terms of our overall goal?
"All models are wrong, but some are useful" — George Box
I hardly need to tell the reader that many brilliant minds have thought about the variety of ways to quantify inequality. The measures most relevant to our available data center on income, the Gini coefficient being the most common (a token of gratitude to Adam Wearne for helping me assess the usefulness of the Gini coefficient in the context of this project). Each measure focuses on some aspect of inequality, but all have deficiencies particular to the measure, and the Gini coefficient has its own share of problems.
The Gini coefficient is attractive because of its simplicity. It is a measure that asks: if we sort the income of all people in a segment of interest, lowest to highest, what proportion of total income would any cumulative share of that population possess? For example, if 50% of the population were responsible for only 1% of the total income, that would seem very unequal. If instead the graph showed that 10% of the population were responsible for 10% of the total income, 20% of the population for 20% of the income, and so on, forming a straight line, that would be the most equal income distribution. The actual calculation of the Gini coefficient compares the area under this ideal line of equality with the area under the actual cumulative-income curve (the Lorenz curve). Refer to the infographics below:
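To make the construction concrete, here is a minimal sketch of computing the Gini coefficient from raw incomes (a toy function of my own, not the binned-data estimator our survey data would actually require):

```python
import numpy as np

def gini(incomes):
    """Gini coefficient via the sorted-income (Lorenz curve) construction."""
    x = np.sort(np.asarray(incomes, dtype=float))
    n = x.size
    # Equivalent to twice the area between the line of perfect
    # equality and the Lorenz curve.
    i = np.arange(1, n + 1)
    return (2 * i - n - 1) @ x / (n * x.sum())

print(gini([10, 10, 10, 10]))  # perfectly equal -> 0.0
print(gini([0, 0, 0, 100]))    # one person holds everything -> 0.75
```

Note that with only four people the coefficient tops out at 0.75; it approaches 1 only as the population grows.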
Our income data is organized into wide bins, which means there is also wide uncertainty in computing the Gini coefficient, and that uncertainty would propagate through our model. Instead, we can make an even simpler assumption: the ratio of households below the poverty threshold might serve as the simplest proxy for inequality. This method suffers many limitations. First and foremost, like the Gini coefficient, looking at relative proportions obscures absolute changes in income: it is entirely possible for overall income to rise for everyone, reducing the number of people living in poverty, while the level of inequality still rises. For this particular application, however, we are not yet looking at changes over time, but rather a snapshot of incomes from a single period. We are checking whether there is signal in the noise before we dive into improving measures and methodology (and given this is a two-week project with many new GIS tools to learn, there is a time constraint). Lastly, to justify using a poverty ratio, we have reason to believe that real-world application might be more lenient than worst-case hypothetical scenarios.
Block Group Poverty Ratio
The U.S. Department of Health and Human Services (HHS) gives a guideline income of approximately $25k as the poverty threshold for a household of four. For our purposes, this should be a good enough approximation to let us estimate the ratio of households in some defined area living under that threshold. This raises the question: what is the appropriate area to consider? There are a number of limiting factors: 1) what data is available, and at what area size is it recorded, and 2) what is the maximum zoom level at which we can get satellite images? Fortunately, we have some choices. There are census surveys that record income at the state level, county level, ZCTA level (similar to, but different from, zip codes), and block group level (as it sounds, a collection of blocks housing about 600-3,000 people). The more granular the data, the better, and luckily the smallest unit, the block group, should be sufficient (American Community Survey data doesn't go more granular, for anonymity's sake). Also fortunate for us, we can find hi-res orthorectified images of San Francisco in USGS's database, where at the level of block groups and beyond, visual details are not fuzzy.
The following heatmap was produced by extracting tabular data from the 2016 American Community Survey geodatabase. For each block group, we simply calculated the ratio of households earning under $24,999 over the total number of recorded households, and performed a geospatial join. Lighter colors represent a lower proportion of households in poverty, and darker blues a higher proportion. Areas such as the Tenderloin, or the general Market area, exemplify from everyday experience the inequality in SF.
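The tabular part of this step is simple division; a sketch with a hypothetical ACS-style table (the GEOIDs and column names here are illustrative, not the actual geodatabase schema):

```python
import pandas as pd

# Hypothetical ACS-style household counts per block group;
# the real geodatabase uses different field names.
acs = pd.DataFrame({
    "GEOID": ["060750101001", "060750102001", "060750103001"],
    "hh_under_25k": [120, 40, 90],
    "hh_total": [400, 500, 300],
})
acs["poverty_ratio"] = acs["hh_under_25k"] / acs["hh_total"]
# This column is then joined onto the block group geometries
# to draw the choropleth heatmap.
```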
Looking at the distribution of these block group poverty ratios below, the red line marks the median: half of our block groups have a poverty ratio above 0.174, or 17.4%, and half below. This is a good natural boundary for a binary classification problem. Can satellite images help identify whether a block group has a poverty ratio above or below 17.4%? The number seems a little ad hoc; we have no good reason to believe it is the optimal ratio to classify on, other than that it conveniently gives us a balanced class split. We'll come back and check this assumption in a bit. One last note on this point: we might also justify this split as the best decision we can currently make coming from a prior of ignorance, conditioned on the data.
Before jumping into the nitty-gritty of modeling and results, let's take a quick tour of preparing orthorectified image data. The data for this project came from a USGS asset. In total there are 81 square tiles of RGBA image data that need to be mosaicked together into a single image file (~12 GB). The aerial image of SF below is visualized without the alpha band after stitching. Next, we need to convert both the single orthorectified image and the block group shapefile from the ACS into the same projection. This allows us to take the shape of an individual block group and mask it out: imagine pressing a cookie cutter in the shape of each block group into the aerial image.
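In practice the clipping is done with GIS tooling, but conceptually the cookie-cutter step is just applying a rasterized polygon mask. A toy numpy version of the idea:

```python
import numpy as np

mosaic = np.arange(25, dtype=float).reshape(5, 5)  # stand-in single-band mosaic
footprint = np.zeros((5, 5), dtype=bool)
footprint[1:4, 1:4] = True                         # stand-in block group shape

# Pixels outside the block group become 0 (black), just as in the real masks.
clipped = np.where(footprint, mosaic, 0.0)
```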
We are almost there! The last step is to take square samples of each block group to feed into the random forest model. The sampling algorithm takes 40 square samples per block group; if a sample has more than 10% of its pixels black, it randomly samples again. To reduce the complexity of the data, the samples, originally in four bands (RGBA), undergo a luminance-preserving transformation (Y' = 0.299 R + 0.587 G + 0.114 B), with the fourth band (near-infrared, stored in the alpha channel) excluded. Finally, each 175x175-pixel greyscale image is flattened into a 1-D array and tracked with its corresponding block group poverty ratio for training.
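A sketch of this sampling-and-preprocessing step (the patch size, sample count, and resample rule follow the description above; the function and variable names are my own):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_patches(clipped, n_samples=40, size=175, max_black=0.10):
    """Draw square patches; resample whenever more than `max_black`
    of a patch is black (i.e., falls outside the block group mask)."""
    h, w, _ = clipped.shape
    patches = []
    while len(patches) < n_samples:
        r = rng.integers(0, h - size + 1)
        c = rng.integers(0, w - size + 1)
        patch = clipped[r:r + size, c:c + size]
        black = np.all(patch[..., :3] == 0, axis=-1).mean()
        if black <= max_black:
            # Drop the fourth band, convert RGB to luminance, flatten.
            y = patch[..., :3] @ np.array([0.299, 0.587, 0.114])
            patches.append(y.ravel())
    return np.array(patches)
```

Each row of the returned array is one flattened 30,625-dimensional training example.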
Model and Discussion
Getting to the meat of our question: a binary random forest classifier (n_estimators = 200, min_samples_split = 10) was trained to detect whether each 175x175 greyscale patch belonged to an area above or below a poverty ratio of 0.174. The resulting ROC curve is displayed below. In this situation, we care about correctly identifying the positive class (poverty ratio above 0.174), so we are concerned primarily with recall. With minor hyperparameter tuning, our recall is 0.59. In a balanced binary classification, we would expect a random guess to achieve a recall of 0.5. This tells us there is signal in this method, and perhaps with better modeling assumptions, we can improve this recall.
OOB Error: 0.67
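The training setup can be sketched as follows; the patch data here is a small synthetic stand-in, but the classifier hyperparameters match those stated above:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-in for flattened greyscale patches (the real ones are 30,625-dim)
X = rng.random((400, 100))
# Stand-in binary labels: above/below the 0.174 poverty-ratio cut-off
y = (X[:, 0] > 0.5).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=200, min_samples_split=10,
                             oob_score=True, random_state=0)
clf.fit(X_tr, y_tr)
recall = recall_score(y_te, clf.predict(X_te))
```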
Going back to our question of whether 0.174 is an optimal threshold: here we define optimal as the threshold that yields the greatest recall. Intuitively, if we looked at neighborhoods and sorted them into low and high poverty-ratio groups, is there a poverty ratio at which we segment these groups best? It might be very hard to differentiate neighborhoods when the cut-off is at 0.01. We can ask the same of our algorithm by starting at a poverty ratio of 0.10, incrementing by 0.09, up to a maximum of 0.81, and comparing performance:
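The sweep itself is a simple loop: relabel at each candidate cut-off, retrain, and record recall. A self-contained sketch on synthetic data (the real version retrains on the image patches at every threshold):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
X = rng.random((300, 20))
ratios = X[:, 0]  # toy: pretend the poverty ratio shows up in one feature

def recall_at(cut):
    y = (ratios > cut).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
    clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_tr, y_tr)
    return recall_score(y_te, clf.predict(X_te))

thresholds = [0.10, 0.19, 0.28, 0.37, 0.46, 0.55, 0.64, 0.73, 0.81]
scores = [recall_at(t) for t in thresholds]
best = thresholds[int(np.argmax(scores))]
```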
Indeed, we find that the maximum recall over the nine runs occurs at a poverty ratio of 0.19, suggesting that our original threshold of 0.174 was already at or near the optimal place to set the cut-off.
A word of warning to conclude this post. We started off asking what we can learn about inequality using aerial imagery. That question evolved into a binary classification problem under specific assumptions that allowed us to use existing data to see whether there is enough signal to pursue this direction further. We had to make many compromises along the way. It would be great hubris, especially within the scope of a two-week project, to immediately conclude that, yes, satellite imagery can definitely tell us what inequality looks like. In fact, using an algorithm such as a random forest obscures what it is the algorithm 'sees' in the pixel data to make its determination. On the bright side, there is hope that signal exists within the data and that this modeling process holds some potential.
There is so much room for exploration and modification. The assumptions along the way can be vastly improved, as can the choice of modeling methods. We could also reframe the problem to use state-of-the-art instance segmentation algorithms to better understand what the algorithm 'sees'. Additionally, the scope of this project was limited by the small sample size available to us: a 175x175-pixel image yields 30,625 features (recall that we threw out the alpha channel), while we had only approximately 3k block group areas from which to sample these images, and due to file size limitations it was difficult to include areas beyond SF.
Despite the challenges and limitations, this is a project that I'm very excited to revisit in some future time to improve upon! Check out my Github to get the code to this project or other projects.