Friday, April 27, 2012

Mapping other lives - part 2

(Continued from part 1)

"One cannot observe without a theory, and what seems the simplest of ornithological tasks - to go out of doors and look out for something worth recording - is in reality one of the hardest… It is a mistake to imagine that complete impartiality and freedom from preconceived ideas is the qualification for the perfect observer. The cow has a remarkably open mind, yet it has never been found to reach a high degree of civilisation." - Max Nicholson (1929)

Every data gathering exercise should have clear aims and a clear method. Max Nicholson captured the essence of this principle in a chapter aptly titled 'How to Observe' in his 'The Study of Birds' (1929).

Observing a single individual organism

If one were to get periodic observations of an individual organism whose biology is unknown, over a period of say a month, one might see a picture of dots distributed in space. Here is a hypothetical map, the red spots being locations where our mystery organism was found and the blue line indicating a river.

If you knew that these dots were sampled over a few weeks and had some idea of the scale
  • You could tell that your organism is reasonably mobile
    • you might be able to get some idea of how far they travel each day
  • If you knew the time at which those points were taken, you might be able to say something about their habit
    • perhaps they go regularly to water
    • perhaps there is a place where it rests
    • perhaps it never crosses the river 
All those are hypotheses, and everyone, including so-called non-scientists, comes up with them and unconsciously estimates how reasonable each idea might be without using any pompous terms for those actions. (Actually many other animals demonstrate "scientific reasoning".)
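The idea of judging how far the organism travels each day can be sketched in a few lines of R. This is only a rough illustration: the data frame `fixes`, its day numbers and its positions (in km on a local grid) are all invented for the example.

```r
# Hypothetical fixes for one animal: day number and position in km on a local grid
fixes <- data.frame(
  day = c(1, 2, 4, 5, 8),
  x   = c(0.0, 1.2, 1.0, 3.0, 3.5),
  y   = c(0.0, 0.5, 2.5, 2.5, 0.5)
)

# straight-line distance between successive fixes
step_km <- sqrt(diff(fixes$x)^2 + diff(fixes$y)^2)
# days elapsed between successive fixes
gap_days <- diff(fixes$day)
# distance covered per day between fixes
km_per_day <- step_km / gap_days

print(round(km_per_day, 2))
```

The figures are lower bounds, since the animal may have wandered a good deal between two sightings. The longer the gap between fixes, the less the estimate is worth.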

It is clear that our map is a rather crude representation of reality and that we might be jumping to conclusions. So let us continue recording the locations of our mystery organism.

Based on the additional data we might now be reasonable in assuming that our mystery organism moves along fairly definite routes. Because we still do not know much about the times at which they are present at various points we cannot tell much about their habits. Perhaps they follow a fixed daily routine. If it went up to the river edges in the mornings, one could hypothesize that the animal goes there to drink water. We still know so little about the terrain. Remember what I mentioned in the first part about maps simplifying things by hiding reality.

So what happens when our map reveals that there is a large rock in the centre? We now refine our idea of the animal and its movement. It seems to avoid moving over the rock, perhaps that would expose it to the view of predators.

Observing many individuals of a species

Individual organisms usually have a home range and will sometimes defend a territory. The home range is the region in which they live and obtain what they need. No organism will spend more energy than is worthwhile. No species, not even a bird, will fly 100 km just to feed on a berry. The net gain has to be positive. When too many individuals live in the same area there are conflicts, and if the distribution of resources is uniform enough they will space themselves out by defending individual territories and pushing others out to the edges of their own.

Take our mystery organism example again. Suppose we looked at a large area and mapped all the places where our mystery organism occurs; we might get something like this.
Now we might actually be able to strengthen our case for the animal preferring to be close to rivers. We may also be able to make a stronger case for their inability to cross the river.

What might we be able to say about the species as a whole? What is its distribution? Here are just two possibilities based on the earlier map.
A bounding polygon

Based on distance to river or habitat suitability

The regions that we marked are not places where the species occurs; rather, they specify a region within which the species is highly likely to occur. The second map takes into account something about the animal's life history, proximity to water, which the first map (the bounding polygon) ignores.
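A map based on distance to a river can be roughed out without any GIS. The sketch below is an assumption-laden illustration: the river vertices, the grid and the decay scale `scale_km` are invented, and the exponential decay is just one convenient way of turning distance into a score between 0 and 1.

```r
# distance from point (px, py) to the segment (ax, ay)-(bx, by)
dist_to_segment <- function(px, py, ax, ay, bx, by) {
  dx <- bx - ax
  dy <- by - ay
  len2 <- dx^2 + dy^2
  # position of the closest point along the segment, clamped to [0, 1]
  t <- if (len2 == 0) 0 else ((px - ax) * dx + (py - ay) * dy) / len2
  t <- max(0, min(1, t))
  sqrt((px - (ax + t * dx))^2 + (py - (ay + t * dy))^2)
}

# distance from a point to a river given as a matrix of vertices (one per row)
dist_to_river <- function(px, py, river) {
  n <- nrow(river)
  min(sapply(seq_len(n - 1), function(i)
    dist_to_segment(px, py, river[i, 1], river[i, 2],
                    river[i + 1, 1], river[i + 1, 2])))
}

river <- cbind(x = c(0, 5, 10), y = c(0, 0, 5))  # hypothetical river, km
scale_km <- 2                                     # assumed decay scale
grid <- expand.grid(x = seq(0, 10, by = 1), y = seq(0, 10, by = 1))
grid$d <- mapply(dist_to_river, grid$x, grid$y, MoreArgs = list(river = river))
# score is 1 on the river and falls away with distance
grid$score <- exp(-grid$d / scale_km)
```

Plotting `grid$score` in shades gives a map of the second kind. Whether the decay scale is right is a question about the animal, not about the software.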

The likelihood of occurrence of a species is not uniform. Indeed most modern maps no longer use a single flat colour and instead use shades. The situation is a lot more difficult to deal with when we look at the distribution of animals at the scale of a country the size of India.

Observing over time

There are also great changes in the distributions of organisms over time and seasons. There might have been a time when the distribution map of a tiger or elephant could well have shown all of India shaded. Today we have hemmed such species into pockets. So the habitat per se could well be good for a species but its likelihood of occurrence is altered by anthropogenic factors. There can also be seasonal changes.
Shrinking distribution of Sarus Crane (Sundar, KSG et al. 2000)
Pretty but not accurate

We decided from our example that distribution maps essentially indicated likelihood. How is that probability figure arrived at? Why are maps in black and white if probability can vary continuously from 0 to 1? The reality is that most maps that one sees are more artistry than science. This is rarely admitted but perhaps understandable given that most of these books are targeted at touring birders rather than those interested in longer-term aspects of ecology or conservation.

Here is a comparison of the distribution maps of a "fairly common" species (within its range, whatever that may be!) - now Parus cinereus (earlier included in Parus major as P. m. cinereus, P. m. stupae etc.). This is one of my favourites, simply because it should be a good target to test "citizen science" ideas. One can see a range of ideas on the distribution of the species from fairly simple ideas from 1969 to very complex ones in recent times. One would expect the more complex maps to be better justified. Kazmierczak and Van Perlo (2000) indicate spots for what they think of as "outliers". None of these published maps actually share the underlying point records, making it hard to make any judgement on accuracy.

Ali & Ripley (1969) - "Handbook". Volume 9

Grimmett, Inskipp & Inskipp (1998)
Kazmierczak & Van Perlo (2000)

Rasmussen & Anderton (2005)

The nature of spatio-temporal data

Data associated with locations and time can be of a range of types. In the previous part on history we noted the early geological mapping work of "Strata Smith". What he noted was that the pattern of strata in the soil showed continuity. If location A and location B had identical strata, then a point halfway between A and B was most likely to match the pattern as well. This is something that simply does not happen in animal distributions.

There can be good habitats with a high density of a particular animal and similar habitat nearby with only a few, but suitable habitats between them do not necessarily have an intermediate density. The mechanics of animal population movements are more complex. A population that is breeding at one location could result in younger animals dispersing out of it in search of good habitats in the vicinity. So it would be a bit like diffusion. Animal populations can also change over a fairly short time period. The original population being present itself is a part of biogeographic history. The whole idea of population density is a derived one and the only way to make that measure appear smooth over space is through the use of computational approaches. There is a whole slew of computational approaches for this - the commonest involving the use of triangulation, nearest neighbours, kernels or splines (the 3D version known as a thin-plate spline). Because the choice of method affects the result, it makes sense for raw data to be preserved for posterity with any work in this area.
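To see how much the choice of method matters, here is a small sketch using inverse distance weighting. It is a simple relative of the nearest-neighbour and kernel approaches rather than one of the methods named above, and the sample sites, counts and the `power` parameter are all made up for illustration.

```r
# inverse distance weighted estimate at (x0, y0) from observations (x, y, z)
idw <- function(x0, y0, x, y, z, power = 2) {
  d <- sqrt((x - x0)^2 + (y - y0)^2)
  # exactly on a sample point: return the observed value there
  if (any(d == 0)) return(mean(z[d == 0]))
  w <- 1 / d^power
  sum(w * z) / sum(w)
}

# hypothetical counts at four sample sites
obs <- data.frame(x = c(0, 10, 0, 10), y = c(0, 0, 10, 10), z = c(2, 4, 6, 8))

# same location, two choices of the power parameter
near_p1 <- idw(1, 1, obs$x, obs$y, obs$z, power = 1)
near_p4 <- idw(1, 1, obs$x, obs$y, obs$z, power = 4)
# a point equidistant from all four sites
centre_p2 <- idw(5, 5, obs$x, obs$y, obs$z, power = 2)

print(c(near_p1 = near_p1, near_p4 = near_p4, centre_p2 = centre_p2))
```

The two estimates for the same spot differ only because of a parameter the analyst chose, which is exactly why the raw points need to survive alongside any smoothed surface.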

Rainfall is a feature that is best displayed as an interpolated surface and not as points
Weather and other such large-scale phenomena are good candidates for smoothing. Interpolation of weather data on a large scale has been found to be reasonable except in hilly areas where there are extremely local effects due to topography. Those working on systems like this would benefit from a reading of the methodology used in the creation of WorldClim. The India Biodiversity Portal has some rather confused maps in this regard. When a climatic layer is selected, it produces a scatter of points. One would have thought that there would be some continuous data layer (at least monthwise averages associated with those points, although this is already done by WorldClim). If this site is to be usable, the best option would be for the system to take raw data from the met stations and generate interpolated grids. These grids should be precalculated and one should be able to query the weather parameter for any day-month-year, month-year, or month (average over a year range). Interpolated layers based on minimal-bending thin-plate splines would be particularly nice to have. Precomputing them at a resolution appropriate to the density of stations would be nice. If such layer data is available, it would be possible for a researcher to examine relationships between species occurrence and climatic factors.

Because spatial data can vary over time, any system that works on them should allow for a time period to be specified. So a typical website would have sliders for start year (date) and end year (date) that allow one to make use of a filtered set of records. Biological data also shows cyclic annual patterns, and so data aggregation by month and week within year would be desirable.
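The slider-and-aggregate idea amounts to very little code. The sketch below uses an invented table of dated records, and `filter_years` is a hypothetical helper name, not part of any package.

```r
# hypothetical dated records
records <- data.frame(
  date = as.Date(c("1998-01-15", "2003-12-02", "2005-01-20",
                   "2005-06-11", "2010-01-05", "2011-07-30")),
  long = c(77.5, 72.6, 77.2, 75.0, 77.6, 79.5),
  lat  = c(27.5, 23.0, 28.5, 26.5, 13.0, 29.4)
)

# keep only records whose year falls within [start_year, end_year]
filter_years <- function(df, start_year, end_year) {
  yr <- as.integer(format(df$date, "%Y"))
  df[yr >= start_year & yr <= end_year, ]
}

recent <- filter_years(records, 2003, 2010)

# count records per calendar month, pooling years, to expose seasonal patterns
month <- factor(as.integer(format(recent$date, "%m")), levels = 1:12)
by_month <- table(month)
print(by_month)
```

Months with no records still appear with a count of zero because the factor carries all twelve levels, which keeps reporting gaps visible instead of silently dropping them.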

The mighty sparrow

A classic work by Edward Tufte warns us of the lie factor in graphics. The size of markers can lie about the density of sampling. A sparrow is a tiny bird, and using a large marker obscures how sparsely distributed the samples are in this 2012 website created ostensibly to study the House Sparrow in India.

One should be cautious about the size of markers used.
The website does not make the data available (although it claims it will), nor does it indicate a hypothesis, method used or underlying assumptions. Given the data being gathered and the non-standard comments received, it is hard to imagine what the expected outcomes are. The site claims that the exercise has been organized by the BNHS. The BNHS is a rather ancient (if not archaic) organization that even at the time of founding decided to keep away the average Briton living in India. The opening page of the first Journal issue is telling:
...was founded on the 15th September 1882 by seven gentlemen interested in natural history, who proposed to meet monthly and exchange notes, exhibit interesting specimens, and otherwise encourage one another. The subscription was purposely made little more than nominal, ...
As a club largely for gun-wielding English officers, its Indian members were chiefly from the royalty belonging to princely states within India (who were allowed to hunt while ordinary folk rarely came in touch with the wildlife). After Independence, it remained largely a club of upper-class Mumbai residents. During its existence, there have been few major activities outside of its member-clique. Despite this elite membership, its journal has been subsidised by tax-payer funds (MoEF). In the years of its existence, numerous surveys have been conducted and along with its spin-off organization, SACON, many questionnaires have been posted to its members. The aims of these surveys, the data collected and the results have rarely been made available. Of course one cannot question private surveys circulated within a private club. The matter however becomes serious when public money is spent on them.


In the case of this sparrow website, the about link has little information to offer. The potential contributor is not provided any education, training or insight into the project, which apart from other considerations can lead to the collection of data of dubious quality. A sparrow is a small flocking bird and the distributions of flocks are patchy. If the project aims to find habitat associations, then one should have been asking for more specifics on the habitat where they were seen. Because this survey is also looking at historic data, it would require historic conditions to be captured. This is not an easy problem and at the very minimum the aims and methods followed could have been discussed. Improvements could have been suggested had it been open to citizen comment. The project appears not to be a long term one either and in spite of being so badly designed has actually been funded by the Ministry of Environment and Forests (the extent of which is not even revealed). And it is rather unfortunate that the BNHS does not even add its own museum records to the database! The spatial display merely hides the poor design and even this display leaves too much to be desired. Compare the quality of display in a system like eBird or even the Indian Rail monitoring website.

Google Maps is very often used for showing point records. While this might arguably be easy to program, it is something that should be avoided. In fact if a system was merely required to store point records and show them on a map, a very straightforward way would be to install MediaWiki and transfer some of the basic templates needed; one can then have spot maps like the one here on White-tailed Iora. MediaWiki security settings can of course be altered. An extra benefit of using something like this is that the traceability of records and their modification are automatically handled.


Simpler citizen engagement projects for the BNHS to attempt


The BNHS could try small exercises in science before attempting something complex. In terms of data collation, there are numerous things that they can do on their own - things that any other organization of its kind would have taken up as a routine activity. Here are a few that the organization should have done a long time ago.
  • Digitize all the specimens held in the collections - make a big Excel File / consider using Google documents - spreadsheet - with species, collector, date, location, determiner, sex, other information, length, culmen, tarsus, wing, tail etc. - a line per specimen
  • Photograph all the specimens and labels carefully and make these available online as well
  • Digitize all the ringing records and ring recoveries - georeference locations - as above
  • Scan all the old archival literature and records, save them for posterity and make them available online - convert to PDF and upload to the Internet Archive

All the above activities can easily be done as "citizen science" - one would just have to let in interested citizens into the premises and they could take up the tasks above. This would cost nothing and earn the organization some credit and would help change it from a colonial club to a serious modern organization. Making such data available would be more helpful than conducting badly designed surveys.

Observing many species over time

Keeping track of the geographical distributions of all bird species over time should help track many other patterns. Do species co-occur? Are there species showing a mutually exclusive pattern? What is the species richness at various places?
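Questions about co-occurrence and richness reduce to counting once records are binned into grid cells. A minimal sketch follows, assuming a made-up table of records for three species called A, B and C and an arbitrary cell size of one degree.

```r
# hypothetical records: species name and position
obs <- data.frame(
  species = c("A", "B", "A", "C", "B", "C", "A"),
  long    = c(77.2, 77.4, 78.6, 78.7, 77.1, 78.9, 80.3),
  lat     = c(12.9, 12.6, 12.3, 12.8, 12.2, 12.1, 12.5)
)

cell_size <- 1  # degrees; an assumed resolution
# label each record with the grid cell it falls in
obs$cell <- paste(floor(obs$long / cell_size),
                  floor(obs$lat / cell_size), sep = "_")

# presence/absence matrix: one row per cell, one column per species
presence <- unclass(table(obs$cell, obs$species)) > 0
# species richness: number of species recorded in each cell
richness <- rowSums(presence)
# number of cells shared by each pair of species
cooccur <- crossprod(presence * 1)

print(richness)
print(cooccur)
```

Species B and C never share a cell in this toy data, which is the mutually exclusive pattern asked about above. With real records one would first have to ask whether that reflects the birds or merely where the observers went.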

What about abundance? Can one get a relative measure using non-standardized and low-effort approaches? Does the reliability vary across species?

Naturally, there are many questions to be asked. Many attempts have been made to do such studies. SACON once sent out questionnaires to everyone asking them to count birds on the birthday of Salim Ali. Many of us sent in data but nothing useful came out of it and in fact it generated only scepticism in future survey participation.

Working with raw data

Just to show what an average computer user with access to the Internet can do, I consider about 260 georeferenced point occurrence records for Parus cinereus in the BirdSpot database - either sight reports published in journals, egroups or locations extracted from the labels of specimens in museums (museums in India refuse to share such data even upon request - and that is another story!). There are definite biases in reporting: more records are noted in winter, and southern India has a better published record of the species. Additionally there is no information on non-occurrence - no data points recording the absence of the species. Proving absence is a lot harder. The data is spread out over a century or more and they are all taken together here.


The map here is made using basic high school geography techniques and does not use a GIS. And here is a little experiment on the way to see how markers can mislead the observer about the sampling intensity. I have Google Earth installed and so using the button "Google Earth" and changing the marker size we get.


It is also very easy to work with the open source statistical system R. If you have good Internet access, you can always add new analytical packages.

I have the point records for Parus cinereus in a comma-separated file with the format:

Long, Lat
72.6,23
77.5,27.5
77.58,12.98
77.2,28.5
75,26.5
...
79.47,29.4
Inside the R console you provide commands as follows to do a quick plot.
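# Do this once to get the extra packages
# (rgdal and maptools have since been retired from CRAN; an archived copy may be needed)
install.packages(c('raster', 'rgdal', 'dismo', 'rJava', 'fossil', 'vegan', 'ape', 'maptools'))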

# load the library at the start of the session
library(maptools)
# use the world map outlines
data(wrld_simpl)
# plot the outlines (don't care about your country borders, the other animals don't either)
plot(wrld_simpl, xlim=c(60,100), ylim=c(0,40), axes=TRUE)
parus<-read.csv("PATH/TO/DATA/points.csv", header=TRUE)
points(parus$Long, parus$Lat, col='red', pch=20, cex=0.75)

Now coming back to the Parus cinereus occurrence records. What can be extracted from such a set of points? The first and roughest approach to finding the boundary distribution of a species can be derived using what is called a convex hull. You can see that it can greatly overestimate the distribution range of a species.

With the data set above in R one can quickly plot it using the following snippet:
# we need to provide the data as a matrix to find the hull
hull<-chull(cbind(parus$Long, parus$Lat))
# add the initial point at the end to allow a closed polygon
hull <- c(hull, hull[1])
lines(cbind(parus$Long[hull], parus$Lat[hull]))

Minimum bounding polygon / convex hull for Parus cinereus
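The overestimate can be put into numbers by comparing areas. The sketch below uses an invented L-shaped scatter of points on a flat grid; `polygon_area` is a hypothetical helper implementing the standard shoelace formula, and with real longitudes and latitudes the result would be in square degrees, not a true area.

```r
# area of a polygon from its vertices in order (shoelace formula)
polygon_area <- function(x, y) {
  n <- length(x)
  j <- c(2:n, 1)  # index of the next vertex, wrapping around
  abs(sum(x * y[j] - x[j] * y)) / 2
}

# hypothetical records scattered over an L-shaped region
pts <- cbind(x = c(0, 4, 4, 1, 1, 0, 0.5, 3),
             y = c(0, 0, 1, 1, 4, 4, 2.0, 0.5))

# convex hull of the records and its area
h <- chull(pts)
hull_area <- polygon_area(pts[h, 1], pts[h, 2])

# area of the L-shaped outline that actually contains the records
l_area <- polygon_area(c(0, 4, 4, 1, 1, 0), c(0, 0, 1, 1, 4, 4))

print(c(hull = hull_area, outline = l_area))
```

The hull covers a good deal more ground than the outline that actually holds the points, simply because a convex shape cannot follow a bend.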

The following diagram, described as a "minimum convex bounding polygon" (Sundar et al. 2000), is not one. It is one of many possible bounding polygons but it is not convex. And although it looks like a much tighter boundary, it is not unique. Indeed one could push in many other edges until they touch points. It is not based on an algorithmic approach and is therefore not "repeatable", although it is possible that this region is more accurate and guided by the intuition of the author(s).

Sarus distribution - spot records and a "manually chosen" bounding polygon

Another technique to look at animal distributions is through the use of what is called a "minimum spanning tree". This allows one to think of how the populations of organisms could have movements between them. Note that this is simplistic and does not take into account rivers or other hurdles that an organism may not be able to cross. It could however help identify disjunct distributions through automated procedures. A long connecting edge in the MST that exceeds a certain distance threshold (dependent on the mobility of the organism) could be cut out to identify population clusters. A computationally more complex approach but one worth trying is a Steiner tree - which would introduce new nodes (=potentially new areas to conserve) that help identify a better network of connections between populations or protected areas. These algorithms require a distance matrix and are affected by zeros in it. A quick fix therefore involves the removal of duplicate records. So here we go with R again:

# remove duplicated points
dups<-duplicated(cbind(parus$Long, parus$Lat))
parusuniq<-cbind(parus$Long[!dups], parus$Lat[!dups])
# calculate distance matrix (here Euclidean)
dm<-dist(parusuniq)
# identify the minimum spanning tree using library ape
library(ape)
tree<-mst(dm)
# use library fossil to draw the mst over the existing map
library(fossil)
mstlines(tree, parusuniq)
Which yields this spot map with an overlaid minimum spanning tree.

A minimum spanning tree is a way of indicating potential gene-flow between populations
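Cutting long edges out of the tree to find population clusters needs no special package. Single-linkage clustering joins groups along exactly the edges of the minimum spanning tree, so cutting its dendrogram at a given height gives the same groups as removing every MST edge longer than that. The points and the `threshold` below are invented for the sketch.

```r
# hypothetical point records forming two groups far apart
pts <- cbind(long = c(77.0, 77.3, 77.6, 77.2, 85.0, 85.4, 85.1),
             lat  = c(12.0, 12.4, 12.1, 12.8, 26.0, 26.3, 26.7))

# assumed maximum distance (in degrees) that the organism can bridge
threshold <- 3

# single-linkage clustering on Euclidean distances
hc <- hclust(dist(pts), method = "single")
# groups that remain connected when links longer than the threshold are cut
cluster <- cutree(hc, h = threshold)

print(table(cluster))
```

The threshold carries all the biology here: a value that suits a crane would be absurd for a snail, and the clusters change accordingly.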

One can also see spatial density using smoothed plots. Note that this merely gives the density of records available, but could be more representative if we have some indicator of sampling intensity incorporated into this.


smoothScatter(parus,xlim=c(60,100),ylim=c(0,40), axes=TRUE)
plot(wrld_simpl, xlim=c(60,100), ylim=c(0,40), axes=TRUE,add=T)
A simple density scatter plot. The kernel smoothing procedure bleeds into the sea but that can be masked.


These methods above do not take into account the habitat of the species. There are a number of approaches to checking associations of species with their habitat and coming up with propositions on what they like or what limits their distribution. This requires additional data for climate, vegetation and abiotic measurements across the region of interest. These are usually available as what are called raster files. One well known set that is used widely in ecology is called BioClim. The datafiles are large and split into tiles. About 4 tiles need to be merged together to cover the Indian region. You need to know a bit about how to convert band-interleaved data files into grid files (you can do this easily and also explore this with the free GIS - DIVA-GIS). Note that the environmental data layers may not be entirely suitable for examining distribution. Unfortunately such environmental data layers are not found at sufficiently high resolution, especially for the Indian region. Additionally, environmental conditions change and historic data would ideally need to be compared with historic conditions. Satellite and weather data for the Indian region are not easily available even to researchers. The government charges for some of this unlike the situation in many other parts of the world. The BioClim dataset is one that is free but its limitations should be understood and projects like the India Biodiversity Portal need to improve the quality of data layers of this kind.


The mechanical parts of running the analysis are a lot easier than grasping the theory, its benefits and pitfalls. Once you have it all in place, you can use the R package dismo to continue our exploration of mapping the possible distribution of Parus cinereus. As mentioned earlier, distribution maps have traditionally been shown in black-and-white when in fact they should be in shades-of-grey due to the probabilistic nature of the data. To give an idea of it, I have used the dismo package to generate a "bioclim" model and done a prediction for just a part of India (due to memory constraints). Now I am sure most will agree that making such tools easier to use via a website would make sense. Note here that exploratory data analysis is an important aspect of science and one that should be shared with citizens in so-called "citizen science" projects. Showing cooked results is what scientists do when they submit papers to journals. Citizen science requires the recipe to be available.

# once you have downloaded the bioclim data for your region
layers<-c('PATH/TO/FILEbio1_asia.grd',
'PATH/TO/FILEbio2_asia.grd',
'PATH/TO/FILEbio3_asia.grd',
'PATH/TO/FILEbio4_asia.grd',
'PATH/TO/FILEbio5_asia.grd',
'PATH/TO/FILEbio6_asia.grd',
'PATH/TO/FILEbio7_asia.grd',
'PATH/TO/FILEbio8_asia.grd',
'PATH/TO/FILEbio9_asia.grd',
'PATH/TO/FILEbio10_asia.grd',
'PATH/TO/FILEbio11_asia.grd',
'PATH/TO/FILEbio12_asia.grd',
'PATH/TO/FILEbio13_asia.grd',
'PATH/TO/FILEbio14_asia.grd',
'PATH/TO/FILEbio15_asia.grd',
'PATH/TO/FILEbio16_asia.grd',
'PATH/TO/FILEbio17_asia.grd',
'PATH/TO/FILEbio18_asia.grd',
'PATH/TO/FILEbio19_asia.grd'
)
# make a stack of the bioclim layers
# note that they all have to be of the same region
library(raster)
library(dismo) # provides bioclim()
# the file names were stored in 'layers' above
predictors<-stack(layers)
# build a bioclimatic model
bc<-bioclim(predictors,parus)
# predict for southern India alone
ext<-extent(71,83,8,17)
pb <- predict(predictors, bc, ext=ext)
plot(pb) # plot the probabilistic distribution
library(maptools)
# add the map boundary
plot(wrld_simpl,add=TRUE, border='dark grey')
# mark the actual point records
points(parus)

A probabilistic distribution map predicted using bioclimatic variables
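The idea behind a climate envelope model is simple enough to try by hand. The sketch below is my own simplified reading of the percentile-envelope approach and is not the code used inside dismo; the climate values and the helper name `envelope_score` are invented for illustration.

```r
# percentile-based envelope score for one variable
envelope_score <- function(train, new) {
  # rank of each new value among the training values, as a fraction
  r <- sapply(new, function(v)
    (sum(train < v) + 0.5 * sum(train == v)) / length(train))
  # fold so that the median scores highest and both tails score lowest
  r <- ifelse(r > 0.5, 1 - r, r)
  2 * r
}

# hypothetical climate at presence sites: temperature (C) and rainfall (mm)
train <- data.frame(temp = c(22, 24, 25, 26, 28),
                    rain = c(600, 800, 900, 1000, 1400))
# hypothetical candidate sites to score
sites <- data.frame(temp = c(25, 28, 35),
                    rain = c(900, 700, 900))

# a site is only as suitable as its least suitable variable
scores <- pmin(envelope_score(train$temp, sites$temp),
               envelope_score(train$rain, sites$rain))
print(round(scores, 2))
```

A site at the median of every variable scores highest, and a site outside the observed range of any one variable scores zero however good the others are. That is why the result should be read in shades and not as a hard boundary.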
There are numerous other techniques and a particularly elegant technique is one based on the Mahalanobis distance. Unfortunately the R dismo implementation appears to be poorly done and even when unique points are provided the algorithm runs into trouble as it finds the covariance matrix near singular. An iterative power/deflation/NIPALS method would have been much faster and just a few principal components should have been computed. The method of course is easy to understand and implement on your own. Needless to say, the whole technique which is so ubiquitous in industrial statistics across the world seems to be hardly known within Indian academia. An irony given that the method itself was born in India, when Nelson Annandale, the director of the ZSI came across a problem and suggested it to P C Mahalanobis, the famous statistician, who came up with an elegant solution.
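For anyone wanting to try the idea on their own, base R already ships a `mahalanobis()` function in the stats package. The following is a minimal sketch with invented climate values at presence sites; it illustrates the distance itself and is not a replacement for a full distribution model.

```r
# hypothetical climate at presence sites: temperature (C) and rainfall (mm)
train <- cbind(temp = c(22, 24, 25, 26, 28, 23, 27),
               rain = c(700, 850, 900, 1000, 1300, 800, 1100))

# centre and spread of the conditions where the species was found
centre <- colMeans(train)
covmat <- cov(train)

# two hypothetical sites: one at the centre, one hot but dry
sites <- rbind(typical = c(25, 950), odd = c(28, 700))

# stats::mahalanobis returns the squared distance
d2 <- mahalanobis(sites, centre, covmat)
print(round(d2, 2))
```

The "odd" site has a temperature and a rainfall that each fall within the observed range, yet it gets a large distance because hot sites in the training data were also the wet ones. Taking the correlation between variables into account is what sets this distance apart from a simple envelope.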


The R package "dismo" for distribution modeling is something that all biodiversity website designers should examine - that is of course assuming that they have already received a training in the basics of ecology, particularly those topics dealt with under the umbrella of macroecology. This should be compulsory reading even for software developers involved in any such project. From an end-user perspective such functionality has to be present on any biodiversity website/portal.

Summary

Numerous techniques exist for examining organism distributions over large scales using large amounts of data. The data is hard to collect and cannot be done within the limited world of academia where faculty strive to keep their careers and students flee by in their quest for careers. With failing achievements within that system the only hope then is outside of it - and it is little wonder that "citizen science" has become a buzzword. It however has many connotations to it - in particular - the use of citizens as instruments to gather data for professional scientists to advance their own careers is not one that can be seen positively. I personally suspect that low HDI countries will not be able to work effectively in cooperative projects. One has to be in a situation where self-preservation is not important in order to do good science and the extreme skew in the availability of resources to citizens in India does not do any good. Many projects claim to be equitable but the language used often belies the claim. (I have written about this in the past with regard to the Cornell eBird project. The response to it can also be found, and the eBird website represents a system that should be considered as the minimum standard for anyone to achieve in their own projects.) Attitudes of superiority of compiler over contributor can be detected in the design of user-interfaces, the features and policies used. Such attitudes have no place if citizens are to be employed. Non-inclusive "collaboration" is doomed to fail in the long run and inclusiveness cannot coexist with "exclusive membership" in club-like/clique-oriented organizations such as the BNHS. Public money spent on private clubs has to be carefully monitored.

The idea of taking so much space here is to point out that there is so much to the life of an organism that can be told by putting a collar on an animal or using an electronic gizmo to spy on it. Instruments have a way of generating a large amount of data. Humans can also be instruments for gathering such information. The gathering of data is itself fraught with risks and every detail counts. Gathering and looking at large amounts of data may require the use of computers - but - reasoning will always have a place and that has to be shared freely. Science requires clear reasoning even if it has to be labelled as "citizen science". If traditional science restricted reasoning within universities, then citizen science would necessarily have to make that reasoning public. As citizens we cannot afford to be "open-minded cows" and we certainly should not be treating anyone claiming to be a scientist as a holy one.

Postscript

(May 2, 2012)
In examining the history of informed consent in medical practice one comes across a guiding principle 

“Every human being of adult years and sound mind has a right to determine what shall be done with his own body; and a surgeon who performs an operation without his patient’s consent commits an assault for which he is liable in damages.” - Schloendorff

In essence I think "scientists" working with "citizens" (the two classes being an entirely arbitrary division by misguided professionals who have forgotten that science is an essentially egalitarian enterprise) should think of a similar contract in large scale data gathering exercises going under the banner of "citizen-science":

“Every human being of adult years and sound mind has a right to determine what shall be done with his own knowledge contribution; and a person who performs an operation on that knowledge without the consent of the contributor commits an assault for which he is liable in damages.”

(July 3, 2012)
Here is a BTO report on sparrow declines which gives the kind of approaches one expects a typical study to use and the kind of results to examine. An average citizen like me can expect scientists to aspire to the level set in this study.

21 December 2014: Someone recently wrote to me asking about the use of Voronoi tessellations. If you are mapping a species at a very high resolution, this can be useful to examine home-ranges from point locations. This would be especially useful for territorial species. Let us say you had the locations of all singing Magpie Robins, you could generate a rough division of their home-ranges/territories even without having to establish the boundaries by assuming that they are divided mid-way between neighbouring birds. If that assumption works you could use the library deldir to plot dividing lines. As an example

library(deldir)
x <- 1000*runif(10)  # your longitudes can go here
y <- 1000*runif(10)  # your latitudes can go here
vt <- deldir(x, y)
# add=TRUE overlays the tessellation on an existing plot; drop it to start a new plot
plot(vt, wlines="tess", lty="solid", add=TRUE)

References
  • Ali, S and S D Ripley (1968-) Handbook of the Birds of India and Pakistan. Edition 2. Volumes 1-10.
  • Brown, J.H. and M.V. Lomolino. 1998. Biogeography (2nd ed). Sinauer
  • Brown, J.H. 1995. Macroecology. University of Chicago Press, Chicago.
  • Richard Grimmett, Carol Inskipp and Tim Inskipp (1998) Birds of the Indian Subcontinent. Oxford University Press.
  • Kazmierczak, K and Van Perlo, B. (2000) A field guide to the Birds of India, Sri Lanka, Pakistan, Nepal, Bhutan, Bangladesh and the Maldives. OM Book Service
  • Rasmussen PC and JC Anderton (2005) Birds of South Asia. The Ripley Guide. Volume 1 and 2. Smithsonian Institution and Lynx Edicions.
  • Sundar, KSG; Kaur, J; Choudhury, BC (2000). "Distribution, demography and conservation status of the Indian Sarus Crane (Grus antigone antigone) in India". J. Bombay Nat. Hist. Soc. 97 (3): 319–339.
  • R package "dismo" (along with an introduction to distribution modeling)
  • An older presentation made to the BNHS (on the need to trash elitism and clique-ish behaviour) 
  • A biodiversity mapping project from Australia
  • PlotKML package - http://gsif.isric.org/doku.php?id=wiki:tutorial_plotkml
