Using R to find the “remotest” place in the world

For each of us, there exists a place on Earth that is the “remotest” place for us: given all the places we have visited, it is the point on the globe farthest away from everywhere we have ever been.

How do we find that place? We can use R’s geospatial libraries for that.

Here’s the core idea:

Step 1: We list all the places we’ve visited (the Lat/Long of those points). Let’s call that set V.
Step 2: For any point P on the globe, we can calculate its “remoteness” to us: the distance from P to the closest point in our visited set. For each v in V, we calculate its great-circle distance to P and take the minimum.
Step 3: We do this for a large set of points P that spans the globe (at some regular interval; the code below uses 10 degrees).
Step 4: The point with the greatest “remoteness” (the maximum) is the point we are looking for.
This is sometimes called a “maximin” problem: the maximum of all the minimum distances.
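
In code, the whole maximin computation boils down to something like this sketch (it uses the gcd() distance helper defined in Step 1 below, and assumes visited and candidates are data frames of Long/Lat points; both names are placeholders):

# Sketch only: remoteness of each candidate point = distance to its nearest visited point
remoteness <- sapply(seq_len(nrow(candidates)), function(i) {
  min(gcd(visited$Long, visited$Lat, candidates$Long[i], candidates$Lat[i]))
})
remotest <- candidates[which.max(remoteness), ]  # the maximin point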

The idea is illustrated in this picture:

[Figure: illustration of the maximin idea]

Step 1: Write a few utility functions

The first utility calculates the great-circle distance (in miles) between two points on Earth. For this I relied on Mario Pineda-Krch’s excellent post on r-bloggers; I just took the 3 functions I needed for this exercise.

# Code from http://www.r-bloggers.com/great-circle-distance-calculations-in-r/
# Convert degrees to radians
deg2rad <- function(deg) return(deg*pi/180)


# Calculates the geodesic distance between two points specified by radian latitude/longitude using the
# Spherical Law of Cosines (slc)
gcd.slc <- function(long1, lat1, long2, lat2) {
  # R <- 6371 # Earth mean radius [km]
  R <- 3961   # Earth mean radius [miles]
  d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
  return(d) # Distance in miles
}
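
The r-bloggers post also has a haversine variant (gcd.hf), which is more numerically stable than the law of cosines for very short distances. For reference, here is a sketch of it adapted to miles and vectorized (pmin instead of min):

# Haversine formula, adapted to miles; expects radian inputs like gcd.slc
gcd.hf <- function(long1, lat1, long2, lat2) {
  R <- 3961 # Earth mean radius [miles]
  delta.long <- (long2 - long1)
  delta.lat  <- (lat2 - lat1)
  a <- sin(delta.lat/2)^2 + cos(lat1) * cos(lat2) * sin(delta.long/2)^2
  d <- 2 * asin(pmin(1, sqrt(a))) * R
  return(d) # Distance in miles
}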

gcd <- function(long1, lat1, long2, lat2) {
  
  # Convert degrees to radians
  long1 <- deg2rad(long1)
  lat1 <- deg2rad(lat1)
  long2 <- deg2rad(long2)
  lat2 <- deg2rad(lat2)     
  
  return(gcd.slc(long1, lat1, long2, lat2)  )  
}
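
As a quick sanity check of these helpers, here is a small usage example (the coordinates are approximate, just for illustration):

# Approximate coordinates (degrees): Sunnyvale, CA and Chicago, IL
gcd(-122.04, 37.37, -87.63, 41.88)
# great-circle distance in miles; should come out to roughly 1,800 to 1,850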

Step 2: Create a vector of places visited

I found a list of 250 world cities, with their lat/long values, and then created a copy of that with a 0/1 marking the places I had visited. This gist has a working copy if you want to try it for the cities you have visited.

Step 3: Generate a list of points on the globe

Latitude: from 90N to 90S
Longitude: from 180W to 180E

#Create a data frame of Long/Lat points that spans the globe
getGlobespan <- function() {
  latsteps = seq(-90,90, by=10) #in deg
  lonsteps = seq(-180,180, by=10)
  
  df <- NULL
  for (lat in latsteps) {
    for (lon in lonsteps) {  
      #print(paste(lon,"deg", lat))
      df <- rbind(df, c(lon, lat))
    }
  }
  
  df <- as.data.frame(df)
  names(df) <- c("Long", "Lat")
  return(df)
}
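
As an aside, the same grid can be built in one call with expand.grid, which avoids growing a data frame inside a loop; a sketch of an equivalent version:

# Equivalent, more idiomatic version of getGlobespan()
getGlobespan <- function(step = 10) {
  df <- expand.grid(Long = seq(-180, 180, by = step), Lat = seq(-90, 90, by = step))
  return(df)
}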

Step 4: Find the remotest point

Now it is just a matter of calculating the “remoteness” of each point on the globe, and then taking the maximum.


unif.pts <- getGlobespan()
nearlist <- NULL
pt.closeness <- NULL
remote.dist <- 0
for (p in 1:nrow(unif.pts) ) {
  ptlong <- unif.pts[p,1]
  ptlat <- unif.pts[p,2]
  # nearlist = c(distance to nearest visited point, its Long, its Lat)
  nearlist <- getMinDistanceFromOnePointToAllVisited(ptlong, ptlat, vdf)
  pt.closeness[p] <- nearlist[1]
  closest.to.p <- nearlist[1]
  if(remote.dist < closest.to.p) { 
    remote.dist <- closest.to.p
    remote.long <- ptlong
    remote.lat  <- ptlat
    closest.vlong  <- nearlist[2]
    closest.vlat   <- nearlist[3]
    print(c("remotest", ptlong,ptlat, round(remote.dist,2), closest.vlong, closest.vlat ))
  }
}  
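
The helper getMinDistanceFromOnePointToAllVisited() is not shown in the post. Here is a sketch of what it could look like, assuming vdf is the data frame of visited places with Long and Lat columns:

# Sketch: distance from (ptlong, ptlat) to the nearest visited point,
# returned as c(distance, Long of nearest visited point, Lat of nearest visited point)
getMinDistanceFromOnePointToAllVisited <- function(ptlong, ptlat, vdf) {
  dists <- gcd(vdf$Long, vdf$Lat, ptlong, ptlat)  # gcd() vectorizes over the visited set
  nearest <- which.min(dists)
  return(c(dists[nearest], vdf$Long[nearest], vdf$Lat[nearest]))
}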

Step 5: Plotting

Plotting these steps makes it very clear what’s going on.

First, on the world map, let’s plot all the points at 10-degree separations. This will be our proxy for “all” points on the globe.

library(ggplot2)  # borders() also needs the 'maps' package installed

gx <- unif.pts$Long  # longitudes of the grid points
gy <- unif.pts$Lat   # latitudes of the grid points

mp <- NULL
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("All points in the world, at 10 degree separation")
uniform.pts <- geom_point(aes(x=gx, y=gy), color="black")
mp <- ggplot() + mapWorld + title + uniform.pts
mp

world

Similarly, we can plot all the visited points (in blue)

mp <- NULL
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("Plotting Places Visited")
# visitx, visity: longitudes and latitudes of the visited places (e.g. pulled from vdf)
visited.pts <- geom_point(aes(x=visitx, y=visity), color="blue")
mp <- ggplot() + mapWorld + title + visited.pts
mp

visited

By adding a couple of lines, we can highlight the remotest point.

mp <- mp+ geom_point(aes(x=remote.long, y=remote.lat) , color="red", size=3 )
mp <- mp + geom_segment(aes(x=remote.long, y=remote.lat, xend=closest.vlong, yend=closest.vlat))
mp

remotest

Color-coding every point by its “remoteness”

One last thing: we can take *every* point on the globe and color-code it by its remoteness. This serves to highlight the regions that haven’t been visited at all. The key line is the one with color=pt.closeness.

mp <- NULL
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("Every point on Earth color coded by its 'remoteness'")
mp <- ggplot() +   mapWorld + title

#Layer other points on top
gx <- unif.pts[,1]
gy <- unif.pts[,2]

#use closeness to set the color and the square size (shape 15 = filled square)
mp <- mp + geom_point(aes(x=gx, y=gy, color=pt.closeness, size=pt.closeness), shape=15)
mp <- mp + scale_size(range=c(1,8))
mp <- mp + scale_color_gradient(high="red", low="green")
mp

color_coded_remoteness

Which City has the Ideal Temperature for me?

This is a side project that I did for fun. It started with my wife making what sounded like a preposterous claim: she felt that Chicago, IL had better weather than Sunnyvale, CA.

So I did the following to convince both of us that she was wrong. You can interpret the results yourself.

The Setup

I needed to do several things first.

  1. I needed to get historical hourly temperature data, one or two full years’ worth.
  2. I needed a way to summarize a city’s temperature into one number. Average over what? This led to lots of discussions, and eventually to a “Personal Comfort Scale” that could vary from person to person.

I also used this analysis to get myself better acquainted with several aspects of ggplot2 in R. You can choose to skip the parts that are not of particular interest to you.

Steps in the Analysis

  1. Getting Historical Weather Data from the web

  2. Agreeing on a way to Summarize temperature Data

    • Setting up a personal Weather Comfort Scale
  3. Comparing Temperatures across Cities

    1. Initial Results
    2. Plotting Densities – Comparing City Temperatures
    3. Comparing City Temperatures – Color Strips
    4. Wikipedia Style Comparisons
    5. Comparing Mean Temperatures – Across Cities by Month
  4. Comparing “Comfort Scores” across Cities

  5. Putting it all together

This post is about looking at the historical temperatures for many, many cities, mapping them to a person’s comfort factor, and finding the “best” city for them, temperature-wise.

Set a Personal Comfort Scale based on a City’s Temperature

Using ggplot2

library(ggplot2)

df <- read.csv("input_personal_preference.csv")
#plot the preference scale as a bar chart
p <- ggplot(df, aes(x=High, y=Score))
p <- p + labs(title="Temperature Preference Scale")
p <- p + xlab("Temperature") + ylab("Personal Comfort Score")
p <- p + geom_bar(stat="identity") + aes(fill=Score)
p <- p + scale_x_continuous(breaks=seq(40,120,5))
p <- p + scale_fill_gradient("score", low="blue", high="orange")
p
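
In case it helps to see the shape of the input, here is a purely hypothetical comfort scale with the same columns (High, Score) that input_personal_preference.csv is expected to have; the actual scores are personal and will differ:

# Hypothetical example of the expected structure, not the actual scale used in this post
example_scale <- data.frame(
  High  = seq(40, 100, by = 5),                      # temperature buckets (deg F)
  Score = c(1, 2, 3, 5, 7, 9, 10, 9, 7, 5, 3, 2, 1)  # comfort score for each bucket
)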

So here’s what the personal scale looks like:
personal_preference

Fetch Historical Temperature Data
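
The scraping code itself is not shown here. Assuming the hourly data has already been downloaded as one CSV per city (with Date, Hour and Temperature columns), combining the files could look roughly like this sketch (the data/hourly folder name is just an assumption):

# Sketch: combine one-CSV-per-city hourly files into out_cityTemperatures.csv
files <- list.files("data/hourly", pattern = "\\.csv$", full.names = TRUE)
all_hours <- do.call(rbind, lapply(files, function(f) {
  one_city <- read.csv(f)                           # Date, Hour, Temperature
  one_city$City <- sub("\\.csv$", "", basename(f))  # take the city name from the file name
  one_city[, c("City", "Date", "Hour", "Temperature")]
}))
write.csv(all_hours, "out_cityTemperatures.csv", row.names = FALSE)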

Verify Data Quality

#read in the data about number of data points
df = read.csv("out_numDataPoints.csv")
#create a new column for the Gaps in the data
df$Gap <- (365 - df$Dpts)

#subset data frame to highlight big gaps
df_big = subset(df, Gap>200)

square = 15 #the shape "square" is mapped to the number 15 in geom_point

#plotting
p <- ggplot() + ggtitle("Missing Data Points, by City, by Hour")
p <- p + scale_x_continuous(breaks=0:23) #place tick marks on each hour of day
p <- p + geom_point(data=df_big, aes(Hour, City), color="red", shape=square, size=5)
p

Leads to:

MissingGT200

If we want to focus on the cities/hours with a smaller gap, we can zoom in on a different subset:

df_small = subset(df, Gap<50)
p <- ggplot()
p <- p + geom_point(data=df_small, aes(Hour,City, color=Gap),shape=square, size=3)
#give it a continuous color scale
p <- p + scale_colour_continuous(low="blue", high="orange")
p

Which produces this plot:

missing_dpts_small

Initial Results

df <- read.csv("out_cityTemperatures.csv")
names(df) <- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temperature)) #how many colors do we need?

redgreenrange <- colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0)))
#one histogram panel per city, bars colored by temperature
m <- ggplot(df, aes(x=Temperature, fill=factor(Temperature)))
m <- m + geom_histogram(binwidth=1)
m <- m + scale_fill_manual(values=redgreenrange(numcolors))

m <- m + facet_grid(City ~ .)
m <- m + ylab("Count of Hours")
m <- m + labs(title = "Comparing Temperature Distributions Across Cities (2011)")
m

This produces:

Hourly Temperature Distribution

Altering Bin Width

We can also experiment with changing the bin width (the thickness of the bars). In the example above, binwidth was 1 degree. What if we made it 10 degrees?

m <- m + geom_histogram(binwidth=10)

We lose some detail, but we gain a different perspective that makes it a little easier to see the bigger picture:
Temperature_10_degrees

Plotting as Densities

We can also plot multiple cities on one graph, using densities (kernel density estimates) to aid the comparison. Note that the colors are simply to differentiate cities; they have no inherent meaning, and the cities are sorted alphabetically.

#To plot multiple cities on one graph, as densities
colorRange<-colorRampPalette(c(rgb(0,0,1), rgb(1,0.7,0) ))
p<-NULL
p<- ggplot(df, aes(Temperature, color=City)) + geom_density(size=2)
p<- p + scale_color_manual(values=colorRange(5))
p <- p + ylab("Density of Hours")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")
p

This produces:

5_cities_KDE

Comparing City Temperatures – Color Strips

by binning Hourly Temperatures

#Bin the temperatures into 10-degree buckets, using the "cut" function
brk = c(seq(-10,100,10), 1000)
label10s = c(as.character(seq(0,100,10)), ">100")
df$TempBucket = cut(df$Temperature, breaks=brk, labels=label10s)

#now render barplots, colored by TempBucket
wx_range<-colorRampPalette(c(rgb(0,0.5,1), rgb(1,0.35,0) ))
p<-NULL
p<- ggplot(df, aes(City, fill=TempBucket )) + geom_bar()
p<- p + scale_fill_manual(values=wx_range(11))
p <- p + ylab("Number of Hours in that Temperature Range")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")
p

This produces:
5city_colorstrip

Comparing Mean Temperatures – Across Cities by Month

We can also calculate the mean temperatures across a subset (say by Month) and then color code the values.

library(plyr)  # for ddply

df$Month <- months(as.Date(df$Date)) #Create a new column in df

#Summarize the data into a new df first
smalldf <- ddply(df, .(City, Month), summarize,
                 mean = mean(Temperature),
                 max  = max(Temperature),
                 min  = min(Temperature))

p <- NULL
p <- ggplot(data=smalldf, aes(factor(Month, levels=month.name), City, color=mean))
p <- p + geom_text(size=10, label=as.character(round(smalldf$mean, 0)))
p <- p + scale_color_gradient(low="blue", high="orange")
p <- p + theme(panel.background = element_rect(fill="transparent"))
p <- p + xlab("Month")
p <- p + labs(title="Mean Temperatures For Cities, by Month")
p

This produces:

Mean_by_month

If we want a more traditional x-y graph, we can plot the mean temperatures across months, using a line to trace each city.

p <- ggplot(data=smalldf, aes(factor(Month, levels=month.name), mean, group=City, color=City))
p <- p + geom_point(size=5)
p <- p + geom_line(size=1, alpha=0.2)
p <- p + geom_hline(yintercept=c(50,60,70))
p <- p + theme(panel.background = element_rect(fill="transparent"))
p <- p + xlab("Month 2011")
p <- p + labs(title="Mean Temperatures For Cities, by Month")
p

This produces:
mean_line

Wikipedia Style Comparisons

We can compare the mean, min and max Temperatures for various cities, by Month. (Wikipedia style)

#Summarize the data into a new df first
smalldf <- ddply(df, .(City, Month), summarize,
                 meanT = round(mean(Temperature), 1),
                 maxT  = round(max(Temperature), 0),
                 minT  = round(min(Temperature), 0))


ordered_month = factor(smalldf$Month, levels=month.name)

p <-NULL
p <- ggplot(data=smalldf, aes(x = ordered_month, y=meanT,
                              ymin=minT, ymax=maxT))
p <- p + geom_crossbar(width=0.2, fill="red")
p <- p + geom_text(data=smalldf, aes(y=maxT+5, label=maxT), color="red")
p <- p + geom_text(data=smalldf, aes(y=minT-5, label=minT), color="blue")
p <- p + facet_grid(City ~ .)
#Plot Aesthetics
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="City Mean, Max & Min Temperatures, by Month")
p

which produces:
citymmm

We can also compare any two cities. We could declare a “winner” for, say, each month, and see which city wins for having the more salubrious temperature. (In the following, the “winner” is the city with the higher mean temperature for the month.)

Comparing Mean Temperatures – 2 Cities, by Month

df <- read.csv("out_cityTemperatures.csv")
names(df) <- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temperature))
df$Month <- months(as.Date(df$Date))

library(reshape)  # melt() and cast() come from the reshape package

#Summarize the data into a new df first
smalldf <- ddply(df, .(City, Month), summarize, mean=round(mean(Temperature), 1))
bymonthdf <- cast(smalldf, Month ~ City, value="mean")


calc_higher_mean <- function(df, colA, colB) {
  vec = ifelse(df[[colA]] > df[[colB]], "A", "B")
  return(vec)
}


cities <- unique(df$City)
A = 4  # index of the first city to compare
B = 2  # index of the second city to compare
hi_mean = calc_higher_mean(bymonthdf, cities[A], cities[B])


p <- NULL
p <- ggplot(data=bymonthdf, aes(x = factor(Month, levels=month.name), fill=factor(hi_mean)))
p <- p + geom_bar()
#Adjust Plot Color Scheme
coldCol <- rgb(0.1, 0.5, 0.2)
hotCol <- rgb(0.8, 0.3, 0)
p <- p + scale_fill_manual(values= c(coldCol,hotCol),
                            name="City with Higher Mean Temp",
                            labels=c(cities[A], cities[B])
                           )

#Plot Aesthetics
p <- p + theme(panel.background = element_rect(fill= "transparent"))
p <- p + theme(axis.text.y = element_blank() )
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="PianoGram of City with Higher Mean Temperatures, by Month")
p

And this produces the following PianoGram (with 12 “Piano keys”)
pg1

Calculate Personal scores
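
The scoring code is not shown in the post. As a sketch of one way it might work: look up each hourly temperature in the personal preference table (the High/Score columns from earlier) and average the resulting scores per city. The file and column names below follow the ones used elsewhere in this post; the lookup logic itself is an assumption.

# Sketch: map hourly temperatures to personal comfort scores, then average by city
pref <- read.csv("input_personal_preference.csv")  # columns: High, Score (assumed sorted by High)

idx <- findInterval(df$Temperature, pref$High)     # index of the largest "High" <= each temperature
idx[idx == 0] <- 1                                 # clamp temperatures below the bottom of the scale
df$Comfort <- pref$Score[idx]

# uses ddply from plyr, loaded earlier
cityscores <- ddply(df, .(City), summarize, Score = mean(Comfort, na.rm = TRUE))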

Plot the scores

Let’s plot it so that each city has a bar, color-coded to indicate the temperature range. I got the idea for this from Hadley’s help page: http://docs.ggplot2.org/current/geom_histogram.html, see the colorful example just after:

hist_cut + geom_bar(position="fill")

#read the cumulative scores data
cityscore <-read.csv("data/city_scores_curves.csv")

names(cityscore)
head(cityscore)

#In order to plot STACKED BARPLOTS, the data frame has to be melted

# melt the dataframe so that each measured variable is its own row
cs.melt <- melt(cityscore, id="bucket")  # melt() is from the reshape package
names(cs.melt)[2] = "City"
names(cs.melt)[3] = "Score"

#Using RampPalette to choose a suitable color range
# Using a slightly muted green
redgreenrange<-colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0) ))

po <- ggplot(cs.melt, aes(x=factor(City), y=Score, fill=factor(bucket)))
po <- po + xlab("City Temperature Scores")
po <- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values=redgreenrange(11), name="Temperature")
po
City Temperature Strips

Sorting the ColorStrips

This looks good, but the cities are in an arbitrary order: the order in which the data was read in. We can change that using:

#first, we take one reading for each city, its 0 bucket score
zeros <- subset(cs.melt, bucket=="0")

#reorder the rows by the scores.
zeros.sort<- zeros[order(zeros$Score),]
# zeros.sort$City becomes the ordering we want

#invoke aes with the levels specified
aes(x=factor(City, levels=zeros.sort$City))
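
Putting those pieces together, the re-plot with the sorted city order might look like this (same layers as before, just with the ordered factor on the x-axis):

po <- ggplot(cs.melt, aes(x=factor(City, levels=zeros.sort$City), y=Score, fill=factor(bucket)))
po <- po + xlab("City Temperature Scores (sorted by increasing 0-bucket size)")
po <- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values=redgreenrange(11), name="Temperature")
po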

After sorting the cities by increasing order of the “0” bucket size, we get the following plot.

city_strips

If we want to plot around a mid-line:

#### Now let's try to center the graph around a score of 50
#0-40 goes to one side, 50-100 goes above center line
lowbuckets = c('0','10','20','30','40')

#subset the lower half of the dataframe, rows with small scores
lowcs = subset(cs.melt,bucket %in% lowbuckets)

#Group that smaller df, by one row per City, score is sum of the scores. 100-x becomes the offset
#create a subset of df, one for each city, with score = 100-low_scores, and call that bucket "-"
newc <- ddply(lowcs, .(City), summarize, Score = 100-sum(Score), bucket="-" )
newcs <- rbind(newc,cs.melt)
newcs

#order the cities, based on the Score of the lower half
newc[order(newc$Score),]$City

# we plot the x as a factor, sorted by the '0' bucket score
# (an alternative ordering, by the offset computed above, would use
#  levels=newc[order(newc$Score),]$City instead)
po <- ggplot(newcs, aes(x=factor(City, levels=zeros.sort$City), y=Score, fill=factor(bucket)))
po <- po + xlab("City Temperature Scores (sorted by increasing 0-score bucket size)")
po <- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values=c(rgb(1,1,1), redgreenrange(11)), name="Temperature")
po

which produces the following:

Rplot

Plotting Comfort scores on Maps

ggmap is a package that allows data to be plotted on maps, as new layers, which is perfect for this exercise.

To try that, the first thing we need is an input file with Cities and their scores.

# Load libraries
library('ggplot2')
library('plyr')
library(ggmap)
library(mapproj)

#read the cumulative scores data
cityscore <-read.csv("out_city_final.csv")

cities <- as.character(cityscore$City)
score <- cityscore$Score
#geocode is a convenient function that fetches lats and longs, needed for plotting
ll = geocode(cities)

#bind it all into one useful data frame (ll)
ll <- cbind(ll, score)

mtype = "roadmap"
mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
#If you prefer b/w maps so that the data can be viewed better:
#mp = get_map(location = 'USA', zoom = 4, maptype = mtype, color="bw")
mp <- ggmap(mp)

mp<- mp+ geom_point( data=ll, aes(x=lon, y=lat, color=score), size=score/10)
mp<- mp + scale_colour_continuous(low="blue", high="red")

#Label the data points, using city names. The 'lat-1' moves the names down a bit
mp<- mp+geom_text(data=ll,aes(y=lat-1,label=cities), size=4)
#would be useful to display the Scores as well
mp<- mp+geom_text(data=ll,aes(y=lat+1.5,label=round(score),color=score), size=5)
mp

All of that produces this map:

Comfort Scores of US Cities

Plot the cities on a map of India

Thanks to ggmap, we only need to change one line of code to switch from a US map to one of India:

Change:

mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
To:
mp = get_map(location = 'India', zoom = 5, maptype = mtype)

And we get:

Comfort Scores for Cities in India

Putting it all together

Once all the computations are done, a simple graph can tell the story. Mapping, while pretty, is not necessarily the best medium for certain information. Simple tables do the job better in many cases.

df = read.csv("out_city_final.csv")
p <- ggplot(df)
p <- p + aes(x=Score, y=reorder(City, Score))
p <- p + geom_point()
p

city_final_ordered

Keywords: R, R Statistics, Weather Underground, Web Data Scraping, ggplot2, ggmap

Prettying up data through visualization

Here’s a good article by Fernanda Viegas and Martin Wattenberg, who lead Google’s Big Picture project. In the article they talk about visualization becoming more and more mainstream, and they advocate grabbing and keeping the viewer’s attention.

These same two folks developed IBM’s Many Eyes effort, which I am hoping to try out for myself.