Which City has the Ideal Temperature for me?

This is a side project that I did for fun. It started with my wife making what sounded like preposterous claim: She felt that Chicago IL had better weather than Sunnyvale CA.

So I did the following to convince both of us that she was wrong. You can interpret the results yourself.

The Setup

I needed to do several things first.

  1. I needed to get historical hourly temperature data. One or two full years worth
  2. How to summarize a City’s Temperature into one number? Average over what? This led to lots of discussions, and eventually to a “Personal Comfort Scale, that could vary from person to person

I also used this analysis to get myself better acquainted with several aspects of ggplot2 in R. You can choose to skip the parts that are not of particular interest to you.

Steps in the Analysis

  1. Getting Historical Weather Data from the web

  2. Agreeing on a way to Summarize temperature Data

    • Setting up a personal Weather Comfort Scale
  3. Comparing Temperatures across Cities

    1. Initial Results
    2. Plotting Densities – Comparing City Temperatures
    3. Comparing City Temperatures – Color Strips
    4. Wikipedia Style Comparisons
    5. Comparing Mean Temperatures – Across Cities by Month
  4. Comparing “Comfort Scores” across Cities

  5. Putting it all together

This post is about looking at the historical temperatures for many, many cities, mapping it to a person’s comfort factor and finding the “best” city for them, temperature-wise.

Set a Personal Comfort Scale based on a City’s Temperature

Using ggplot2

df <-read.csv("input_personal_preference.csv")
#plot a simple histogram
p <- ggplot(df, aes(x=High, y=Score))
p <- p + labs(title="Temperature Preference Scale")
p <- p + xlab("Temperature") + ylab("Personal Comfort Score")
p <- p + geom_bar(stat="identity") + aes(fill=Score)
p <- p + scale_x_continuous(breaks=seq(40,120,5))
p <- p + scale_fill_gradient("score",low="blue", high="orange")

So here’s what the personal scale looks like:

Fetch Historical Temperature Data

Verify Data Quality

#read in the data about number of data points
df = read.csv("out_numDataPoints.csv")
#create a new column for the Gaps in the data
 df$Gap <- (365 - df$Dpts)

#subset data frame to highlight big gaps
df_big = subset(df, Gap>200)

square=15 #the shape "square" is mapped to the number 15 in geom_point

p <- ggplot() + aes(title= "Missing Data Points, by City, by Hour")
p <- p + scale_x_continuous(breaks=0:23) #place tick marks on each hour of day
p <- p + geom_point(data=df_big, aes(Hour,City),color="red", shape=square, size=5)

Leads to:


If we want to focus on the cities/hours with a smaller gap, we can zoom in on a different subset:

df_small = subset(df, Gap<50)
p <- ggplot()
p <- p + geom_point(data=df_small, aes(Hour,City, color=Gap),shape=square, size=3)
#give it a continuous color scale
p <- p + scale_colour_continuous(low="blue", high="orange")

Which produces this plot:


4. Initial Results

df <- read.csv("out_cityTemperatures.csv")
names(df)<- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temp)) #how many colors do we need?

redgreenrange<-colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0) ))
# More complex
m <- ggplot(df, aes(x=Temperature, fill=factor(Temperature)))
m <- m + geom_histogram(binwidth=1)
m <- m + scale_fill_manual(values=redgreenrange(numcolors))

m <- m + facet_grid(~City ~ .)
m <- m + ylab("Count of Hours")
m <- m + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:

Hourly Temperature Distribution

Altering Bin Width

We can also experiment with changing the bin width (the thickness of the bars). In the example above, binwidth was 1 degree. What if we made it 10 degrees?

m <- m + geom_histogram(binwidth=10)

We lose some detail, but we gain a different perspective, a little easier to see the bigger picture:

Plotting as Densities

We can also plot multiple cities on one graph, using densities to aid in the comparisons.
(We use KDEs) Note that the colors are simply to differential cities, they do not have any inherent meaning and the cities are sorted Alphabetically.

#To plot multiple cities on one graph, as denisities
colorRange<-colorRampPalette(c(rgb(0,0,1), rgb(1,0.7,0) ))
p<- ggplot(df, aes(Temperature, color=City)) + geom_density(size=2)
p<- p + scale_color_manual(values=colorRange(5))
p <- p + ylab("Density of Hours")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:


Comparing City Temperatures – Color Strips

by binning Hourly Temperatures

#Bin the temperatures into 10 degree buckets, using the "cut" funtion
brk = c(seq(-10,100,10),1000)
label10s = c(as.character(seq(0,100,10)),">100")
TempBucket =cut(df$Temperature, breaks=brk, labels=label10s)

#now render barplots, colored by TempBucket
wx_range<-colorRampPalette(c(rgb(0,0.5,1), rgb(1,0.35,0) ))
p<- ggplot(df, aes(City, fill=TempBucket )) + geom_bar()
p<- p + scale_fill_manual(values=wx_range(11))
p <- p + ylab("Number of Hours in that Temperature Range")
p <- p + labs(title = "Comparing Temperature Distributions Across Cities (2011) ")

This produces:

Comparing Mean Temperatures – Across Cities by Month

We can also calculate the mean temperatures across a subset (say by Month) and then color code the values.

#Summarize the data into a new df first
df$Month <- months(as.Date(df$Date)) #Create a new column in df

smalldf <- ddply(df,.(City, Month), summarize, 
                 mean=mean(Temperature) ,
                 max = max(Temperature) ,     
                  min = min(Temperature)      )

p<- ggplot(data=smalldf, aes(factor(smalldf$Month, levels=month.name), City, color=mean) )  
p<-p + geom_text(size=10, label=as.character(round(smalldf$mean, 0)))
p<- p + scale_color_gradient(low="blue", high="orange")
p <- p + theme(panel.background = element_rect(fill= "transparent"))
p<- p+xlab("Month")
p<- p+labs(title="Mean Temperatures For Cities, by Month")

This produces:


If we wanted a more traditional x-y graph, we can plot the Mean temperatures across months, and use a line to connect the different cities.

p<- ggplot(data=smalldf, aes(factor(smalldf$Month, levels=month.name), mean, group=City, color=City)  )
p<- p+geom_point(size=5)
p<- p+geom_line(size=1, alpha=0.2)
p<- p + geom_hline(yintercept=c(50,60,70))
p<- p + theme(panel.background = element_rect(fill= "transparent"))
p<- p+xlab("Month 2011")
p<- p+labs(title="Mean Temperatures For Cities, by Month")

This produces:

Wikipedia Style Comparisons

We can compare the mean, min and max Temperatures for various cities, by Month. (Wikipedia style)

#Summarize the data into a new df first
smalldf <- ddply(df,.(City, Month), summarize, 
                 meanT= round(mean(Temperature),1) ,
                 maxT = round(max(Temperature),0) ,     
                 minT = round(min(Temperature),0)      )

ordered_month = factor(smalldf$Month, levels=month.name)

p <-NULL
p <- ggplot(data=smalldf, aes(x = ordered_month, y=meanT,
                              ymin=minT, ymax=maxT))
p <- p + geom_crossbar(width=0.2, fill="red")
p <- p + geom_text(data=smalldf, aes(y=maxT+5, label=maxT), color="red")
p <- p + geom_text(data=smalldf, aes(y=minT-5, label=minT), color="blue")
p <- p + facet_grid(City ~ .)
#Plot Aesthetics
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="City Mean, Max & Min Temperatures, by Month")

which produces:

We can also compare any two cities. We could declare a “winner” for say each month, and see which city wins for having a more salubrious temperature. (In the following, the “winner” is the city with a higher mean temperature for the month.)

Comparing Mean Temperatures – 2 Cities, by Month

df <- read.csv("out_cityTemperatures.csv")
names(df)<- c("City","Date","Hour","Temperature")
numcolors <- length(unique(df$Temp))
df$Month <- months(as.Date(df$Date))

#Summarize the data into a new df first
smalldf <- ddply(df,.(City, Month), summarize, mean=round(mean(Temperature),1) )
bymonthdf <- cast(smalldf, Month ~ City, value="mean")

calc_higher_mean <- function(df, colA, colB) {
  vec = ifelse(df[colA] > df[colB], "A", "B" )

cities <- unique(df$City)
A = 4
B = 2
hi_mean = calc_higher_mean(bymonthdf,  cities[A] , cities[B])

p <-NULL
p <- ggplot(data=bymonthdf, aes(x = factor(bymonthdf$Month, levels=month.name), fill=factor(hi_mean))  )
p <- p + geom_bar()
#Adjust Plot Color Scheme
coldCol <- rgb(0.1, 0.5, 0.2)
hotCol <- rgb(0.8, 0.3, 0)
p <- p + scale_fill_manual(values= c(coldCol,hotCol),
                            name="City with Higher Mean Temp",
                            labels=c(cities[A], cities[B])

#Plot Aesthetics
p <- p + theme(panel.background = element_rect(fill= "transparent"))
p <- p + theme(axis.text.y = element_blank() )
p <- p + xlab("Month 2011") + ylab("")
p <- p+labs(title="PianoGram of City with Higher Mean Temperatures, by Month")

And this produces the following PianoGram (with 12 “Piano keys”)

Calculate Personal scores

Plot the scores

Let’s plot it so that each city has a bar, color-coded to indicate the temperature range.  I got the idea for this from Hadley’s help page here.http://docs.ggplot2.org/current/geom_histogram.html, see the colorful example just following:

hist_cut + geom_bar(position=”fill”)

#read the cumulative scores data
cityscore <-read.csv("data/city_scores_curves.csv")


#In order to plot STACKED BARPLOTS, the data frame has to be melted

# melt the dataframe so that each measured variable is its own row
cs.melt <- melt(cs, id="bucket")
names(cs.melt)[2] = "City"
names(cs.melt)[3] = "Score"

#Using RampPalette to choose a suitable color range
# Using a slightly muted green
redgreenrange<-colorRampPalette(c(rgb(1,0,0), rgb(0,0.7,0) ))

po <- ggplot(cs.melt, aes(x=factor(City), y=Score, fill = factor(bucket)) )
po<- po + xlab("City Temperature Scores")
po<- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values = redgreenrange(11), name="Temperature")

City Temperature Strips

Sorting the ColorStrips

This looks good, but the cities are in some random order, the order in which the data was read in. We can change that using:

#first, we take one reading for each city, its 0 bucket score
zeros <- subset(cs.melt, bucket=="0")

#reorder the rows by the scores.
zeros.sort<- zeros[order(zeros$Score),]
# zeros.sort$City becomes the ordering we want

#invoke aes with the levels specified

After sorting the cities (by increasing order of “0” bucket size, we get the following plot.)


If we want to plot around a mid-line:

#### Now let's try to center the graph around a score of 50
#0-40 goes to one side, 50-100 goes above center line
lowbuckets = c('0','10','20','30','40')

#subset the lower half of the dataframe, rows with small scores
lowcs = subset(cs.melt,bucket %in% lowbuckets)

#Group that smaller df, by one row per City, score is sum of the scores. 100-x becomes the offset
#create a subset of df, one for each city, with score = 100-low_scores, and call that bucket "-"
newc <- ddply(lowcs, .(City), summarize, Score = 100-sum(Score), bucket="-" )
newcs <- rbind(newc,cs.melt)

#order the cities, based on the Score of the lower half

# we plot the x as a factor, sorted by the '0' bucket score
po <- ggplot(newcs, aes(x=factor(City,levels=newc[order(newc$Score),]$City), y=Score, fill = factor(bucket)) )
po <- ggplot(newcs, aes(x=factor(City,levels=zeros.sort$City), y=Score, fill = factor(bucket)) )
po <- po + xlab("City Temperature Scores (sorted by Increasing 0-score bucket size)" )
po <- po + geom_bar(stat="identity")
po <- po + scale_fill_manual(values = c(rgb(1,1,1),redgreenrange(11)), name="Temperature"")

which produces the following:


Plotting Comfort scores on Maps

‘GGMap’ is a package that allows data to be plotted on maps, as new layers, which is perfect for this exercise.

In order to try that we have to do a few things: 1. Have an input file with Cities and their scores.

# Load libraries

#read the cumulative scores data
cityscore <-read.csv("out_city_final.csv")

cities <- as.character(cityscore$City)
score <- cityscore$Score
#geocode is a convenient function that fetches Lats and longs, needed for plotting
ll = geocode(cities)

#bind it all into one useful data frame (ll)

mtype = "roadmap"
mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
#If you prefer bw maps so that the data can be viewed better

#mp = get_map(location = 'USA', zoom = 4, maptype = mtype, color="bw")
mp <- ggmap(mp)

mp<- mp+ geom_point( data=ll, aes(x=lon, y=lat, color=score), size=score/10)
mp<- mp + scale_colour_continuous(low="blue", high="red")

#Label the data points, using city names. The 'lat-1' moves the names down a bit
mp<- mp+geom_text(data=ll,aes(y=lat-1,label=cities), size=4)
#would be useful to display the Scores as well
mp<- mp+geom_text(data=ll,aes(y=lat+1.5,label=round(score),color=score), size=5)

All of that produces this map:

Comfort Scores of US Cities

Plot the cities on a map of India

Thanks to ggmap, we only need to change one line of code to switch from a US map to one of India:


mp = get_map(location = 'USA', zoom = 4, maptype = mtype)
mp = get_map(location = 'India', zoom = 5, maptype = mtype)

And we get:

Comfort Scores for Cities in India

Putting it all together

Once all the computations are done, a simple graph can tell the story. Mapping, while pretty, is not necessarily the best medium for certain information. Simple tables do the job better in many cases.

df = read.csv("out_city_final.csv")
p<- ggplot(df)
p<- p+aes(x=Score,y=reorder(City, Score))


Keywords: R, R Statistics, Weather Underground, Web Data Scraping, ggplot2, ggmap