Using R to find the “remotest” place in the world

For each of us, there exists a place on Earth that is the "remotest": given all the places we have visited, it is the point on the globe farthest away from everywhere we have ever been.

How do we find that place? We can use R’s geospatial libraries for that.

Here’s the core idea:

Step 1: We list all the places we've visited (the Lat/Long of those points). Let's call that set V.
Step 2: For any point P on the globe, we can calculate its "remoteness" to us. That is the distance from P to the closest point in our visited set. For each v in V, we calculate its great-circle distance to P and take the minimum.
Step 3: We do this for a large set of points P spanning the globe (say, at 10-degree intervals).
Step 4: The point with the greatest "remoteness" (the max) is the point we are looking for.
This is sometimes called a "maximin" problem: the maximum of all the minimum distances.
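A tiny numeric sketch of the maximin idea, with made-up distances (not real data): each row is a candidate point P, each column a visited place v.

```r
# Toy "maximin": 3 candidate points x 2 visited places (hypothetical miles)
dist.matrix <- rbind(c(100, 400),   # candidate 1
                     c(250, 300),   # candidate 2
                     c(500,  50))   # candidate 3
remoteness <- apply(dist.matrix, 1, min)  # distance to *closest* visited place
which.max(remoteness)                     # candidate 2 is the remotest (min dist 250)
```

Candidate 3 has the single largest distance (500), but its nearest visited place is only 50 away, so the maximin winner is candidate 2.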

The idea is illustrated in this picture:


Step 1: Write a few utility functions

Calculating great-circle miles between two points on Earth: for this I relied on Mario Pineda-Krch's excellent post on r-bloggers. I just took the 3 functions I needed for this exercise.

# Code adapted from Mario Pineda-Krch's post on r-bloggers
# Convert degrees to radians
deg2rad <- function(deg) return(deg*pi/180)

# Calculates the geodesic distance between two points specified by radian latitude/longitude
# using the Spherical Law of Cosines (slc)
gcd.slc <- function(long1, lat1, long2, lat2) {
  #  R <- 6371 # Earth mean radius [km]
  R <- 3961 # Earth mean radius [miles]
  d <- acos(sin(lat1)*sin(lat2) + cos(lat1)*cos(lat2) * cos(long2-long1)) * R
  return(d) # Distance in miles
}

gcd <- function(long1, lat1, long2, lat2) {
  # Convert degrees to radians
  long1 <- deg2rad(long1)
  lat1 <- deg2rad(lat1)
  long2 <- deg2rad(long2)
  lat2 <- deg2rad(lat2)
  return(gcd.slc(long1, lat1, long2, lat2))
}

Step 2: Create a vector of places visited

I found a list of 250 world cities, with their lat/long values, and then created a copy of that with a 0/1 marking the places I had visited. This gist has a working copy if you want to try it for the cities you have visited.
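Step 4 below calls a helper, getMinDistanceFromOnePointToAllVisited, that isn't listed in the post. Here is a minimal sketch of what it might look like, assuming vdf is the visited-places data frame with Long and Lat columns; a condensed copy of the great-circle function from Step 1 is inlined so the snippet runs on its own.

```r
# Condensed great-circle distance (miles), Spherical Law of Cosines
deg2rad <- function(deg) deg * pi / 180
gcd <- function(long1, lat1, long2, lat2) {
  R <- 3961  # Earth mean radius [miles]
  acos(sin(deg2rad(lat1)) * sin(deg2rad(lat2)) +
       cos(deg2rad(lat1)) * cos(deg2rad(lat2)) * cos(deg2rad(long2 - long1))) * R
}

# Sketch of the helper: distance from point P to the *closest* visited place,
# plus the coordinates of that closest place
getMinDistanceFromOnePointToAllVisited <- function(ptlong, ptlat, vdf) {
  dists <- mapply(gcd, vdf$Long, vdf$Lat,
                  MoreArgs = list(long2 = ptlong, lat2 = ptlat))
  i <- which.min(dists)
  c(dists[i], vdf$Long[i], vdf$Lat[i])
}

# e.g. with two visited places on the equator:
vdf <- data.frame(Long = c(0, 90), Lat = c(0, 0))
getMinDistanceFromOnePointToAllVisited(10, 0, vdf)  # closest is (0,0), ~691 miles
```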

Step 3: Generate a list of points on the globe

Latitude: From 90N to 90S
Longitude: From 180W to 180E (via 0)

#Create a data frame of Long/Lats that spans the globe
getGlobespan <- function() {
  latsteps <- seq(-90, 90, by=10) #in deg
  lonsteps <- seq(-180, 180, by=10)
  df <- NULL
  for (lat in latsteps) {
    for (lon in lonsteps) {
      #print(paste(lon, "deg", lat))
      df <- rbind(df, c(lon, lat))
    }
  }
  df <- as.data.frame(df)
  names(df) <- c("Long", "Lat")
  return(df)
}
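As an aside, the same grid can be built without loops using base R's expand.grid; this is just an equivalent, faster alternative:

```r
# The same globe-spanning grid, built without loops
unif.pts <- expand.grid(Long = seq(-180, 180, by = 10),
                        Lat  = seq(-90, 90, by = 10))
nrow(unif.pts)  # 37 longitudes x 19 latitudes = 703 points
```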

Step 4: Find the remotest point

Now it is just a matter of calculating the "remoteness" of each point on the globe, and then taking the maximum.

unif.pts <- getGlobespan()
nearlist <- NULL
pt.closeness <- NULL
remote.dist <- 0
for (p in 1:nrow(unif.pts)) {
  ptlong <- unif.pts[p, 1]
  ptlat <- unif.pts[p, 2]
  nearlist <- getMinDistanceFromOnePointToAllVisited(ptlong, ptlat, vdf)
  pt.closeness[p] <- nearlist[1]
  if (remote.dist < nearlist[1]) {
    remote.dist <- nearlist[1]
    remote.long <- ptlong
    remote.lat <- ptlat
    closest.vlong <- nearlist[2]
    closest.vlat <- nearlist[3]
    print(c("remotest", ptlong, ptlat, round(remote.dist, 2), closest.vlong, closest.vlat))
  }
}

Step 5: Plotting

Plotting these steps makes it very clear what's going on.

First, on the world map, let’s plot all points – at 10 degree separations: This will be our proxy for “all” points on the globe.

mp <- NULL
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("All points in the world, at 10 degree separation")
gx <- unif.pts[,1]
gy <- unif.pts[,2]
uniform.pts <- geom_point(aes(x=gx, y=gy), color="black")
mp <- ggplot() + mapWorld + title + uniform.pts


Similarly, we can plot all the visited points (in blue)

mp <- NULL	 
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("Plotting Places Visited")
visited.pts <- geom_point(aes(x=visitx, y=visity) ,color="blue")	 
mp <- ggplot() +   mapWorld + title + visited.pts


By adding a couple of lines, we can highlight the remotest point.

mp <- mp + geom_point(aes(x=remote.long, y=remote.lat), color="red", size=3)
mp <- mp + geom_segment(aes(x=remote.long, y=remote.lat, xend=closest.vlong, yend=closest.vlat))


Color-coding every point by its “remoteness”

One last thing: We can actually take *every* point on the globe and color-code by its remoteness. This serves to highlight the regions that haven’t been visited at all. The key line is the one that has “color = pt.closeness”

mp <- NULL
mapWorld <- borders("world", colour="gray70", fill="gray70")
title <- ggtitle("Every point on Earth color coded by its 'remoteness'")
mp <- ggplot() +   mapWorld + title

#Layer other points on top
gx <- unif.pts[,1]
gy <- unif.pts[,2]

#use closeness for both color and square size (shape 15 = filled square)
mp <- mp + geom_point(aes(x=gx, y=gy, color=pt.closeness, size=pt.closeness), shape=15)
mp <- mp + scale_size(range=c(1,8))
mp <- mp + scale_color_gradient(high="red", low="green")



Which City has the Ideal Temperature for me?

This is a side project that I did for fun. It started with my wife making what sounded like a preposterous claim: she felt that Chicago, IL had better weather than Sunnyvale, CA.

So I did the following to convince both of us that she was wrong. You can interpret the results yourself.

The Setup

I needed to do several things first.

  1. I needed to get historical hourly temperature data, one or two full years' worth.
  2. I needed a way to summarize a city's temperature into one number. Average over what? This led to lots of discussions, and eventually to a "Personal Comfort Scale" that could vary from person to person.

I also used this analysis to get myself better acquainted with several aspects of ggplot2 in R. You can choose to skip the parts that are not of particular interest to you.

Steps in the Analysis

  1. Getting Historical Weather Data from the web

  2. Agreeing on a way to Summarize temperature Data

    • Setting up a personal Weather Comfort Scale
  3. Comparing Temperatures across Cities

    1. Initial Results
    2. Plotting Densities – Comparing City Temperatures
    3. Comparing City Temperatures – Color Strips
    4. Wikipedia Style Comparisons
    5. Comparing Mean Temperatures – Across Cities by Month
  4. Comparing “Comfort Scores” across Cities

  5. Putting it all together

This post is about looking at the historical temperatures for many, many cities, mapping them to a person's comfort factor, and finding the "best" city for them, temperature-wise.

Set a Personal Comfort Scale based on a City’s Temperature
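If you want to experiment without the input CSV, an inline table of the same shape works: one High temperature per row plus a 0-100 Score. The scores below are hypothetical placeholders, not my actual preferences.

```r
# Hypothetical comfort scores (NOT the real data), one row per 5-degree High value
df <- data.frame(High  = seq(40, 100, by = 5),
                 Score = c(10, 20, 35, 55, 75, 90, 100, 95, 80, 60, 40, 20, 10))
```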

Using ggplot2

df <-read.csv("input_personal_preference.csv")
#plot a simple histogram
p <- ggplot(df, aes(x=High, y=Score))
p <- p + labs(title="Temperature Preference Scale")
p <- p + xlab("Temperature") + ylab("Personal Comfort Score")
p <- p + geom_bar(stat="identity") + aes(fill=Score)
p <- p + scale_x_continuous(breaks=seq(40,120,5))
p <- p + scale_fill_gradient("score",low="blue", high="orange")

So here’s what the personal scale looks like:

Fetch Historical Temperature Data

Verify Data Quality

#read in the data about number of data points
df = read.csv("out_numDataPoints.csv")
#create a new column for the Gaps in the data
 df$Gap <- (365 - df$Dpts)

#subset data frame to highlight big gaps
df_big = subset(df, Gap>200)

square=15 #the shape "square" is mapped to the number 15 in geom_point

p <- ggplot() + ggtitle("Missing Data Points, by City, by Hour")
p <- p + scale_x_continuous(breaks=0:23) #place tick marks on each hour of day
p <- p + geom_point(data=df_big, aes(Hour,City),color="red", shape=square, size=5)

Leads to:


If we want to focus on the cities/hours with a smaller gap, we can zoom in on a different subset:

df_small = subset(df, Gap<50)
p <- ggplot()
p <- p + geom_point(data=df_small, aes(Hour,City, color=Gap),shape=square, size=3)
#give it a continuous color scale
p <- p + scale_colour_continuous(low="blue", high="orange")

Which produces this plot:


Initial Results

df <- read.csv("out_cityTemperatures.csv")
names(df)<- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temperature)) #how many colors do we need?

redgreenrange<-colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0) ))
# More complex
m <- ggplot(df, aes(x=Temperature, fill=factor(Temperature)))
m <- m + geom_histogram(binwidth=1)
m <- m + scale_fill_manual(values=redgreenrange(numcolors))

m <- m + facet_grid(City ~ .)
m <- m + ylab("Count of Hours")
m <- m + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:

Hourly Temperature Distribution

Altering Bin Width

We can also experiment with changing the bin width (the thickness of the bars). In the example above, binwidth was 1 degree. What if we made it 10 degrees?

m <- m + geom_histogram(binwidth=10)

We lose some detail, but we gain a different perspective; it's a little easier to see the bigger picture:

Plotting as Densities

We can also plot multiple cities on one graph, using densities to aid in the comparisons.
(We use kernel density estimates.) Note that the colors simply differentiate the cities; they have no inherent meaning, and the cities are sorted alphabetically.

#To plot multiple cities on one graph, as densities
colorRange<-colorRampPalette(c(rgb(0,0,1), rgb(1,0.7,0) ))
p<- ggplot(df, aes(Temperature, color=City)) + geom_density(size=2)
p<- p + scale_color_manual(values=colorRange(5))
p <- p + ylab("Density of Hours")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:


Comparing City Temperatures – Color Strips

by binning Hourly Temperatures

#Bin the temperatures into 10 degree buckets, using the "cut" function
brk = c(seq(-10,100,10),1000)
label10s = c(as.character(seq(0,100,10)),">100")
TempBucket =cut(df$Temperature, breaks=brk, labels=label10s)
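As a quick sanity check of how cut() assigns those labels: intervals are right-closed by default, so 23 degrees falls in the (20,30] bucket, labeled "30".

```r
# Quick check of cut() with the same breaks and labels
temps <- c(-5, 23, 57, 101)
brk <- c(seq(-10, 100, 10), 1000)
label10s <- c(as.character(seq(0, 100, 10)), ">100")
as.character(cut(temps, breaks = brk, labels = label10s))  # "0" "30" "60" ">100"
```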

#now render barplots, colored by TempBucket
wx_range<-colorRampPalette(c(rgb(0,0.5,1), rgb(1,0.35,0) ))
p<- ggplot(df, aes(City, fill=TempBucket )) + geom_bar()
p<- p + scale_fill_manual(values=wx_range(length(label10s))) #one color per bucket
p <- p + ylab("Number of Hours in that Temperature Range")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:

Comparing Mean Temperatures – Across Cities by Month

We can also calculate the mean temperatures across a subset (say by Month) and then color code the values.

#Summarize the data into a new df first (ddply is from the plyr package)
df$Month <- months(as.Date(df$Date)) #Create a new column in df

smalldf <- ddply(df, .(City, Month), summarize,
                 mean = mean(Temperature),
                 max  = max(Temperature),
                 min  = min(Temperature))

p <- ggplot(data=smalldf, aes(x=factor(Month, levels=month.name), y=City, color=mean))
p <- p + geom_text(size=10, label=as.character(round(smalldf$mean, 0)))
p<- p + scale_color_gradient(low="blue", high="orange")
p <- p + theme(panel.background = element_rect(fill= "transparent"))
p<- p+xlab("Month")
p<- p+labs(title="Mean Temperatures For Cities, by Month")

This produces:


If we wanted a more traditional x-y graph, we can plot the Mean temperatures across months, and use a line to connect the different cities.

p <- ggplot(data=smalldf, aes(x=factor(Month, levels=month.name), y=mean, group=City, color=City))
p<- p+geom_point(size=5)
p<- p+geom_line(size=1, alpha=0.2)
p<- p + geom_hline(yintercept=c(50,60,70))
p<- p + theme(panel.background = element_rect(fill= "transparent"))
p<- p+xlab("Month 2011")
p<- p+labs(title="Mean Temperatures For Cities, by Month")

This produces:

Wikipedia Style Comparisons

We can compare the mean, min and max Temperatures for various cities, by Month. (Wikipedia style)

#Summarize the data into a new df first
smalldf <- ddply(df,.(City, Month), summarize, 
                 meanT= round(mean(Temperature),1) ,
                 maxT = round(max(Temperature),0) ,     
                 minT = round(min(Temperature),0)      )

ordered_month = factor(smalldf$Month, levels=month.name)

p <-NULL
p <- ggplot(data=smalldf, aes(x = ordered_month, y=meanT,
                              ymin=minT, ymax=maxT))
p <- p + geom_crossbar(width=0.2, fill="red")
p <- p + geom_text(data=smalldf, aes(y=maxT+5, label=maxT), color="red")
p <- p + geom_text(data=smalldf, aes(y=minT-5, label=minT), color="blue")
p <- p + facet_grid(City ~ .)
#Plot Aesthetics
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="City Mean, Max & Min Temperatures, by Month")

which produces:

We can also compare any two cities. We could declare a "winner" for, say, each month, and see which city wins for having the more salubrious temperature. (In the following, the "winner" is the city with the higher mean temperature for the month.)

Comparing Mean Temperatures – 2 Cities, by Month

df <- read.csv("out_cityTemperatures.csv")
names(df)<- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temperature))
df$Month <- months(as.Date(df$Date))

#Summarize the data into a new df first
smalldf <- ddply(df,.(City, Month), summarize, mean=round(mean(Temperature),1) )
bymonthdf <- cast(smalldf, Month ~ City, value="mean") #cast is from the reshape package

calc_higher_mean <- function(df, colA, colB) {
  vec = ifelse(df[colA] > df[colB], "A", "B")
  return(vec)
}
cities <- unique(df$City)
A = 4
B = 2
hi_mean = calc_higher_mean(bymonthdf,  cities[A] , cities[B])

p <-NULL
p <- ggplot(data=bymonthdf, aes(x = factor(Month, levels=month.name), fill=factor(hi_mean)))
p <- p + geom_bar()
#Adjust Plot Color Scheme
coldCol <- rgb(0.1, 0.5, 0.2)
hotCol <- rgb(0.8, 0.3, 0)
p <- p + scale_fill_manual(values= c(coldCol,hotCol),
                            name="City with Higher Mean Temp",
                            labels=c(cities[A], cities[B]))

#Plot Aesthetics
p <- p + theme(panel.background = element_rect(fill= "transparent"))
p <- p + theme(axis.text.y = element_blank() )
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="PianoGram of City with Higher Mean Temperatures, by Month")

And this produces the following PianoGram (with 12 “Piano keys”)

Calculate Personal scores

Plot the scores

Let's plot it so that each city has a bar, color-coded to indicate the temperature range. I got the idea for this from Hadley's help page; see the colorful example just following:

hist_cut + geom_bar(position="fill")

#read the cumulative scores data
cs <- read.csv("data/city_scores_curves.csv")

#In order to plot STACKED BARPLOTS, the data frame has to be melted
# melt the dataframe so that each measured variable is its own row
cs.melt <- melt(cs, id="bucket")
names(cs.melt)[2] = "City"
names(cs.melt)[3] = "Score"

#Using RampPalette to choose a suitable color range
# Using a slightly muted green
redgreenrange<-colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0) ))

po <- ggplot(cs.melt, aes(x=factor(City), y=Score, fill = factor(bucket)) )
po<- po + xlab("City Temperature Scores")
po<- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values = redgreenrange(11), name="Temperature")

City Temperature Strips

Sorting the ColorStrips

This looks good, but the cities are in an arbitrary order: the order in which the data was read in. We can change that using:

#first, we take one reading for each city, its 0 bucket score
zeros <- subset(cs.melt, bucket=="0")

#reorder the rows by the scores.
zeros.sort<- zeros[order(zeros$Score),]
# zeros.sort$City becomes the ordering we want

#invoke aes with the levels specified
po <- ggplot(cs.melt, aes(x=factor(City, levels=zeros.sort$City), y=Score, fill=factor(bucket)))
After sorting the cities (by increasing order of the "0" bucket size), we get the following plot.


If we want to plot around a mid-line:

#### Now let's try to center the graph around a score of 50
#0-40 goes to one side, 50-100 goes above center line
lowbuckets = c('0','10','20','30','40')

#subset the lower half of the dataframe, rows with small scores
lowcs = subset(cs.melt,bucket %in% lowbuckets)

#Group that smaller df, by one row per City, score is sum of the scores. 100-x becomes the offset
#create a subset of df, one for each city, with score = 100-low_scores, and call that bucket "-"
newc <- ddply(lowcs, .(City), summarize, Score = 100-sum(Score), bucket="-" )
newcs <- rbind(newc,cs.melt)

#order the cities, based on the Score of the lower half

# we plot the x as a factor, with cities ordered by their lower-half score
po <- ggplot(newcs, aes(x=factor(City, levels=newc[order(newc$Score),]$City), y=Score, fill=factor(bucket)))
po <- po + xlab("City Temperature Scores (sorted by increasing 0-score bucket size)")
po <- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values = c(rgb(1,1,1), redgreenrange(11)), name="Temperature")
which produces the following:


Plotting Comfort scores on Maps

ggmap is a package that allows data to be plotted on maps, as new layers, which is perfect for this exercise.

To try that, we first need an input file with cities and their scores.

# Load libraries
library(ggmap)

#read the cumulative scores data
cityscore <- read.csv("out_city_final.csv")

cities <- as.character(cityscore$City)
score <- cityscore$Score
#geocode is a convenient function that fetches Lats and Longs, needed for plotting
ll = geocode(cities)

#bind it all into one useful data frame (ll)
ll <- data.frame(ll, cities, score)

mtype = "roadmap"
mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
#If you prefer bw maps so that the data can be viewed better

#mp = get_map(location = 'USA', zoom = 4, maptype = mtype, color="bw")
mp <- ggmap(mp)

mp<- mp+ geom_point( data=ll, aes(x=lon, y=lat, color=score), size=score/10)
mp<- mp + scale_colour_continuous(low="blue", high="red")

#Label the data points, using city names. The 'lat-1' moves the names down a bit
mp<- mp+geom_text(data=ll,aes(y=lat-1,label=cities), size=4)
#would be useful to display the Scores as well
mp<- mp+geom_text(data=ll,aes(y=lat+1.5,label=round(score),color=score), size=5)

All of that produces this map:

Comfort Scores of US Cities

Plot the cities on a map of India

Thanks to ggmap, we only need to change one line of code to switch from a US map to one of India:


#mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
mp = get_map(location = 'India', zoom = 5, maptype = mtype)

And we get:

Comfort Scores for Cities in India

Putting it all together

Once all the computations are done, a simple graph can tell the story. Mapping, while pretty, is not necessarily the best medium for certain information. Simple tables do the job better in many cases.

df = read.csv("out_city_final.csv")
p <- ggplot(df)
p <- p + aes(x=Score, y=reorder(City, Score))
p <- p + geom_point(size=3) # one dot per city, sorted by Score

Keywords: R, R Statistics, Weather Underground, Web Data Scraping, ggplot2, ggmap

R Language – Useful links

I found this simple tutorial by UChicago's Andrew Strayer to be a good way to get started with R.

R Study Group

O'Reilly's R book – R in a Nutshell.

R on the Cloud –

RStudio – seems to be better than the native interface.

R Cookbook:

Here's a Google TechTalk that I really liked. Prof. Hadley gives a really nice intro, and you can get a flavor of what's possible with R.

Excellent post (and a quick R tutorial) by Jeff Breen.

Lots of cool (and easy/powerful) R tricks.

Quick-R has lots of helpful notes:

Here's a nice 'vignette' by twitteR's Jeff Gentry on how to use his package.

Screencasting Software

“Screencasting” is simply the act of capturing parts of your laptop screen as a video file and sharing it with others. I have only recently started experimenting with it, but am impressed with what is possible. It has become really, really easy to produce a short educational training session and to post it on YouTube or Vimeo.

There is plenty of software for it, paid as well as free. Start experimenting with the free options and buy a professional version if you have sustained interest. All you need is a microphone to get started. You can also make "silent" movies, but those are not much fun to view or learn from.

What is ScreenCasting? (Explanation from O’Reilly)

Here’s a Comparison of screencasting software (wikipedia article)

The following links have reviews of the different “screencasting” software options available.

I myself use CamStudio and it works just fine for my needs.


iPad Apps That I Regularly Use

If you own an iPad, you should seriously consider getting these apps. They are all free.

iBooks: I still don't understand why this doesn't ship with the device. I have downloaded the Kindle and Nook apps, but this one is great. I use it mostly to read PDF files. I love the feature where double-clicking a PDF page grows the font to fit the whole page; it's easier on my eyes.

Flipboard: I love the way this application presents everything. It is “my” newspaper, with the very best possible layout. I especially like reading my Twitter feeds in Flipboard, because the pages linked to are right below the tweet itself.

Zite: Something that I discovered recently. It “learns” of my preferences and keeps dishing up better and better articles. I especially love the “custom” sections for topics that are not that popular. I love the simple click to tell it to serve me “more articles like this.”

RSS reader: Mine is called MobileRSS and it links to my Google Reader feeds. It is not the greatest of interfaces, but I have grown used to it.

The above are the “must-haves.”  The following Apps are ones that I like and frequently use.

NY Times: This is the App that I often use to ‘show off’ the iPad. I think they have only the Times Editor’s choice articles, and the presentation is crisp, spectacular even.

StumbleUpon: True surfing, a.k.a. whiling away time while feeling that we are doing something productive. I find that it works better if I focus on the topics I want to "stumble upon" instead of letting it pick. The photos section is good, but many of the most popular articles are too sentimental and syrupy for my taste.


  TED: The TED Talk app lays out the talks in a very appealing way. The "Themes" are a good way to select a talk that sounds appealing. The save-talks feature is good, but it seems to take forever on my broadband connection. Come to think of it, the iPad Safari browser should suffice, but I like having this app stream the talks to me.

  BrainPop: Check out these short, information-packed animations, especially if you have kids who are 5+, though they are good for all ages. They are entertaining and educational. (Topics range from the history of wars to botany, animal life, and the lives of scientists and freedom fighters. I watch quite a few of them myself, and then take the 10-question multiple-choice quiz to make sure that I have learned something.)

Games: After a year of owning the device, the only games I go back to are: Blue Block (I am still figuring out optimal algorithms) and Gears. And if you like cricket, for your cricket updates (IPL) consider an app called Cricitch. It seems to be enough for my needs.


Pandora is the ‘jukebox’ App that I sometimes use. (If you happen to be of Indian descent and enjoy Hindi or Tamil film music, you should download Tandora. Thanks to Mukund for the pointer to Tandora.)

For TV Guides and local movie listings, I have been using “Zap2It’s What’s On.”

Let me know (via email or comment below) if I am missing any Apps that you absolutely love.


Last updated on May 16th, 2011.

Prettying up data through visualization

Here's a good article by Fernanda Viegas and Martin Wattenberg, who lead Google's Big Picture project. In the article they talk of visualization becoming more and more mainstream, and they advocate grabbing and keeping the viewer's attention.

These same two folks developed IBM’s Many Eyes effort, which I am hoping to try out for myself.

Slides analyzing Steve Jobs’ presentation secrets

A lot of thought has gone into these slides. The creator has dissected Jobs' presentations, and there is quite a bit for us to learn. (View them in full-screen mode.)