
(R) Alternatives to overlapped geom_point() / continuous / discrete / stat_bin2d() / stat_binhex() / position="jitter"

by jangpiano 2020. 8. 31.

<The limits of the scatter plot - geom_point()>


When many points fall in the same region of a scatter plot, they overlap, and we cannot tell how dense that region is or how many points sit at a given position. Changing the alpha (transparency) or the shape of the points helps a little, but neither fully solves the problem.


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+geom_point()

> ggplot(data=diamonds,aes(x=x*y*z,y=price))+geom_point(alpha=0.1)

> ggplot(data=diamonds,aes(x=x*y*z,y=price))+geom_point(shape=1)
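
To get a rough feeling for how much overlap hides in the plots above, you can count how many rows share exactly the same coordinates. This is only a minimal sketch in base R: the data frame `d` and its column `volume` are names made up here for illustration, and exact duplicates undercount the real problem, since points that are merely close together also cover each other.

```r
library(ggplot2)  # only needed for the diamonds dataset

# Hypothetical helper data frame: the same x*y*z product used on the x-axis above
d <- data.frame(volume = diamonds$x * diamonds$y * diamonds$z,
                price  = diamonds$price)

# A row whose (volume, price) pair already appeared earlier is drawn
# exactly on top of another point
n_overlapped <- sum(duplicated(d))

n_overlapped  # number of points hidden exactly behind another point
nrow(d)       # total number of points
```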





<Alternative for a continuous variable - graphs showing frequencies> 

<stat_bin2d()> 

With stat_bin2d() you can show how many values fall in each region: the plot area is divided into rectangles, and the fill color of each rectangle represents the frequency (count) inside it. 
By default, stat_bin2d() uses 30 bins along x and 30 bins along y, so the grid has about 900 rectangles. 


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_bin2d()


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_bin2d(bins=50)


In this example, setting bins=50 divides each axis into 50 bins, so the grid has about 2500 rectangles in total. 

By setting the number of bins yourself, you can make the frequency plot finer or coarser. 
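
If you want to see the numbers behind the rectangles, layer_data() from ggplot2 returns the data frame that a layer actually draws. The sketch below is only an illustration (the object names `p30`, `p50`, `b30`, `b50` are made up here). Note that only non-empty rectangles are drawn, so the number of rows is usually much smaller than the full grid.

```r
library(ggplot2)

p30 <- ggplot(data = diamonds, aes(x = x * y * z, y = price)) + stat_bin2d()
p50 <- ggplot(data = diamonds, aes(x = x * y * z, y = price)) + stat_bin2d(bins = 50)

# One row per rectangle that is drawn; the count column holds the frequency
b30 <- layer_data(p30)
b50 <- layer_data(p50)

nrow(b30)        # number of non-empty rectangles with the default bins
nrow(b50)        # number of non-empty rectangles with bins = 50
range(b50$count) # smallest and largest count, useful when choosing legend limits
```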


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_bin2d(bins=50)+scale_fill_gradient(low="lightblue",high="black")



> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_bin2d(bins=50)+scale_fill_gradient(low="lightblue",high="black",limits=c(0,9000))


The range of the legend (the fill scale) can be set manually by adding 'limits=c()' to scale_fill_gradient(). 


<stat_binhex() - hexagon bins>


> install.packages("hexbin")

> library(hexbin)


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_binhex()


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_binhex()+scale_fill_gradient(low="lightblue",high="red",limits=c(0,12000))

> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_binhex()+scale_fill_gradient(low="lightblue",high="red",breaks=c(1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000),limits=c(0,12000))

> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_binhex()+scale_fill_gradient(low="lightblue",high="red",breaks=c(seq(from=0,to=12000,by=1000)),limits=c(0,12000))


seq(from=0,to=12000,by=1000) is a shorter way to write the breaks. Compared with breaks=c(1000,2000,3000,4000,5000,6000,7000,8000,9000,10000,11000,12000), it also adds a break at 0. 
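
The relationship between the two ways of writing the breaks can be checked directly in the console. This is a small base R sketch; the variable names `manual_breaks` and `seq_breaks` are made up for illustration.

```r
# Explicit vector from the earlier call: starts at 1000
manual_breaks <- c(1000, 2000, 3000, 4000, 5000, 6000,
                   7000, 8000, 9000, 10000, 11000, 12000)

# seq() version from the last call: starts at 0
seq_breaks <- seq(from = 0, to = 12000, by = 1000)

length(manual_breaks)               # 12
length(seq_breaks)                  # 13
setdiff(seq_breaks, manual_breaks)  # 0, the only extra break

# Starting the sequence at 1000 reproduces the explicit vector
all.equal(seq(from = 1000, to = 12000, by = 1000), manual_breaks)  # TRUE
```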


<The reason for gray bins when setting the legend manually>


> ggplot(data=diamonds,aes(x=x*y*z,y=price))+stat_binhex()+scale_fill_gradient(low="lightblue",high="red",limits=c(0,2000))


You can set the range of the legend manually by adding 'limits=c()'. 

In that case, you should check the range of the counts carefully. 

A gray hexagon means that the frequency (count) in that bin is outside the range of the legend; here, it is above 2000. 
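
If the gray bins are not what you want, one option (not used in the examples above) is to tell the scale what to do with out-of-range counts. The sketch below assumes the scales package, which is installed together with ggplot2. By default, out-of-range values are censored (turned into NA, which is drawn in gray); passing scales::squish through the oob argument clamps them to the nearest limit instead, so bins above 2000 are drawn in the "high" color.

```r
library(ggplot2)

# What the two out-of-bounds rules do to a count inside and a count above the limits
counts <- c(500, 2500)
scales::censor(counts, range = c(0, 2000))  # 500   NA  -> NA is drawn in gray
scales::squish(counts, range = c(0, 2000))  # 500 2000  -> clamped to the upper limit

p <- ggplot(data = diamonds, aes(x = x * y * z, y = price)) +
  stat_binhex() +
  scale_fill_gradient(
    low = "lightblue", high = "red",
    limits = c(0, 2000),
    oob = scales::squish  # replace the default censoring
  )

# stat_binhex() needs the hexbin package at drawing time
if (requireNamespace("hexbin", quietly = TRUE)) print(p)
```

With squish, the legend no longer distinguishes between a count of 2000 and a much larger one, so widening the limits is still the better choice when the exact counts matter.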



<Jittering overlapped dots>

As shown above, the weakness of a plain dot plot is that it cannot show frequencies. 

Jittering the dots, that is, adding a small random offset to each one, can help with this problem. 

However, this method is only really useful when x or y is a discrete variable.  


> ggplot(data=mpg,aes(x=class,y=cty))+geom_point()


With geom_point() alone, you cannot tell how many dots sit at each position. 

> ggplot(data=mpg,aes(x=class,y=cty))+geom_point(position="jitter")


> ggplot(data=mpg,aes(x=class,y=cty))+geom_point(position=position_jitter(width=0.3,height=0))
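
In the last call, width=0.3 moves each dot sideways by a random amount of at most 0.3, and height=0 keeps the cty values exactly where they are. Because the offsets are random, the plot looks slightly different every time; the sketch below fixes this with the seed argument of position_jitter() and uses layer_data() to check what the jitter did. The object names `p` and `jittered` are made up for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg, aes(x = class, y = cty)) +
  geom_point(position = position_jitter(width = 0.3, height = 0, seed = 1))

# Coordinates after jittering, one row per dot
jittered <- layer_data(p)

# class is discrete, so it is placed at positions 1, 2, 3, ...;
# each dot is shifted horizontally by at most 0.3 from its position
x_jit <- as.numeric(jittered$x)
max(abs(x_jit - round(x_jit)))

# height = 0 leaves the cty values unchanged
all(sort(jittered$y) == sort(mpg$cty))
```

Keeping height at 0 matters here: cty is the measured value, so jittering it vertically would make the dots show values that are not in the data.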




