Wednesday, August 29, 2007

Data, data, everywhere, enough to drown a fish…

The Growlery has linked to his article in Scientific Computing, Making child's play of power tools.

I found this article both interesting and stimulating. It is a good summary of data-crunching software, with particular emphasis on how the big-league programs have developed modules aimed at students. It raises a number of issues, and it also brings back some memories to this feeble old mind.

Firstly, it got me thinking about education, particularly scientific education, how it has changed over the years, and then its extension into scientific research. Secondly, it got me thinking, once again, about kinetic data and computing. I’ll simply reminisce about the first topic, but the second is nearer and dearer to my heart, since it reflects back on our discussion of information transfer.

Reminiscences:
When I was in high school, Sputnik had just gone up. If you had the least aptitude for science or math, you were lavishly encouraged. I will admit that we all thought this was the way the world was heading and found it inconceivable that we would ever live in a country whose President was scared of science and thought his intuition was better. Worse yet, a President who states that he talks to God. (If that doesn’t make an atheist out of you, nothing will.) Science was pretty near the most exciting thing out there. In the mid-’50s we got a chance to see a computer first-hand and to try our hand at programming. It was the IBM 650, which was the first real workhorse for business.
IBM introduced a smaller, more affordable computer in 1954 that proved very popular. The IBM 650 weighed over 900 kg, the attached power supply weighed around 1350 kg and both were held in separate cabinets of roughly 1.5 meters by 0.9 meters by 1.8 meters. It cost $500,000 or could be leased for $3,500 a month. Its drum memory was originally only 2000 ten-digit words, and required arcane programming for efficient computing.
This “memory” consisted, as it says, of 2000 addresses on a fast-rotating drum. Data was read into it from punch cards (punch cards continued as the preferred mode of input up into the ’70s) and its output was on a teletype-like printer. To program this computer, one had to place instructions on the drum in such a way that you took advantage of its rotation. That is, by the time the computer read the first instruction and executed it, the drum had rotated by, I think, three positions. The only things the computer could do were add, subtract, multiply, and do logical tests like AND and OR. Our final task in the introductory course was to program the 650 to do a square root. No hand-held calculators here, just a machine costing some $3 million in 2007 dollars! I think we used Newton’s Method.
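For anyone who never had to coax a square root out of a rotating drum, the whole exercise fits in a few lines of a modern language. Here is a minimal sketch of Newton’s Method in Python (which, of course, we did not have); the starting guess and the tolerance are my choices, not IBM’s:

    # Newton's Method for the square root of n: repeatedly average the current
    # guess with n / guess until the guess stops changing meaningfully.
    def newton_sqrt(n, tolerance=1e-10):
        guess = n / 2.0 if n > 1 else 1.0   # starting guess (my choice)
        while abs(guess * guess - n) > tolerance:
            guess = (guess + n / guess) / 2.0
        return guess

    print(newton_sqrt(2.0))   # roughly 1.41421356

On the 650 we had to worry about where each instruction sat on the drum; here the only worry is when to stop iterating.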

Moving onward to undergraduate school. The university where I went had a lot of pre-med students (I wasn’t one then). They all had to take Analytical Chemistry with us Chem majors. While there were some electronic instruments available, this was mostly a course of titrating one chemical with another. The data was read off a burette and analyzed using logarithms out of the Chemical Rubber Company handbook. Thus, all computations were done by hand, without benefit even of an adding machine. (Slide rules, which we all carried in belt holsters, were, of course, not precise enough for this work.) You were graded on accuracy. What you gained from this exercise, which should have been appreciated by the pre-meds (it wasn’t), was precision. Also, in the context of the Growlery’s article, one really, really, really got close to the data in a way that is impossible today. In this course we had to personally collect each little tidbit of information by reading the meniscus on a burette. Compare that to what surely must be the current situation: students inundated with thousands of pieces of data collected by a machine. There is an inflicted remoteness from the primary data source. I think you can argue that this trivializes data in a way, but I’m open to other views.

Further onward to graduate school. Computers were just in the offing. We had a big IBM 360 mainframe, but it was hard to get time on it.


Graduate chemistry was an experience with an effusion of data. However, it was still collected in a very primal way. The most common form was analogue data (peaks on a chromatogram, infrared spectroscopy, NMR, etc.), which we digitized by hand. That is, we measured distances and, believe it or not, cut out paper peaks and weighed them on a triple-beam balance. At this time, during the late sixties, crystallographers went from determining their structures by reading diffraction film, measuring distances by hand and doing the calculations on a big adding machine, to automated machines that collected data with a Geiger counter and generated punch cards. The cards still had to be carried to the “big” computer (e.g. the 360) where the data was processed. But, in the end, we even got a drawn structure from the CalComp plotter. This was graphics in its infancy.

Was the exponential increase in the quantity of data worth the removal of the researcher from the first-hand collection of that data? I am sure that it was. None of the current technical advances would have been possible without it (e.g. sequencing the human genome). However, there was something earthy in all those notebooks of numbers put down by hand that must be missing in this age of printouts and automatically generated spreadsheets. The question becomes: has this had an effect on the possibility of a researcher making an insight? Can such data overwhelm and drown the investigation? This might be true particularly in the field of neurobiology, where one is dealing with 100 billion neurons.

Kinetics:
Much later, I encountered a problem with data that baffled me. It wasn’t a very profound problem, but it occupied me for years while my main job was doing medicine. One of the principal anticancer drugs used in oncology over the past 30 years has been cisplatinum. The story of how an inorganic complex got to be where it is in the treatment of cancer is fascinating, but I’ll leave it for a future thread. In any case, I had been puzzled by the fact that everyone believed that cisplatinum achieved its anticancer action by reacting with DNA. This just didn’t make chemical sense to me (it still doesn’t), since platinum complexes are much more reactive with sulfur compounds than with the predominantly nitrogen-containing nucleotides of DNA. In any case, we did some experiments with cisplatinum and glutathione (one of the principal sulfhydryl-containing biomolecules). To make a long story short, it is a two-stage reaction, with the complex reacting with glutathione and then that product reacting further. We never identified the products; we were mainly interested in the rate of reaction. Sounds trivial, I know. Mathematically, this two-stage reaction is represented by simultaneous differential equations, sketched below. Being a good Scout, I tried to solve them. It turns out that the kinetics of competing reactions is difficult to treat in an exact manner.
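In schematic form, calling the three species the drug, the intermediate, and the final product (and sweeping the stoichiometric details under the rug), they look something like this:

    d[drug]/dt         = -k1·[drug][GSH]
    d[intermediate]/dt =  k1·[drug][GSH] - k2·[intermediate]
    d[product]/dt      =  k2·[intermediate]

where k1 and k2 are the rate constants we were after.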

We finally solved the problem by modeling it in Excel (actually Lotus 1-2-3; it was that long ago). We (re)discovered that any complex kinetic problem can be divided into smaller and smaller time segments (Newton strikes again). If the time segments are small enough, then one can have each reaction run sequentially and use the products of the reactions as the starting point of the next, tiny, time segment. To determine the rate constants, you had to compare the reaction curves by hand. I think this is similar to what is called Euler’s method, but, remember, I am not a mathematician. Eventually, I discovered that this technique had already been computerized for me, in a nice graphical manner, by the people at ScientificSoftware with their program Stella. I didn’t see it in the list of programs discussed by the Growlery, probably because it is a different field.
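Today you would not even need the spreadsheet; the small-time-segment trick is a dozen lines of code. A rough sketch in Python, with rate constants and starting concentrations that are purely illustrative (not our measured values):

    # Euler-style time stepping for the two-stage reaction:
    #   drug + glutathione -> intermediate -> product
    # All numbers below are made up for illustration.
    k1, k2 = 0.5, 0.1                              # rate constants
    drug, gsh, inter, prod = 1.0, 2.0, 0.0, 0.0    # starting concentrations
    dt = 0.01                                      # the tiny time segment

    for step in range(10000):
        r1 = k1 * drug * gsh     # how fast the first reaction runs right now
        r2 = k2 * inter          # how fast the second reaction runs right now
        drug  -= r1 * dt
        gsh   -= r1 * dt
        inter += (r1 - r2) * dt
        prod  += r2 * dt

    print(round(drug, 3), round(gsh, 3), round(inter, 3), round(prod, 3))

Shrink dt and the answer stops changing, which is really all the method asks of you; the rate constants are then found by comparing the computed curves with the measured ones, exactly as we did by hand.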

One of the first things that ScientificSoftware points out when introducing students to Stella is that the human brain has a very hard time predicting something in the future when there is more than one input (i.e. competing reactions). The example they use is wolves and deer in the forest above the Grand Canyon (the Kaibab Plateau). Predicting how the population of deer changes depending on the number of wolves and the birth rates of wolves and deer is hard enough. Throw in a rough winter, human hunters, etc., and it becomes impossible. Stella, on the other hand, solves the problem nicely. It is an excellent kinetic “what if” program. I guess this is the field of system dynamics. (I am also pretty sure that this is the way computer games are programmed, so my tiny effort has been lost in an ever-expanding supernova of Tomb Raiders and The Sims.)
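To make that concrete, here is the same sort of stepping applied to a deer-and-wolf toy model. These are just the textbook predator-prey equations with numbers I invented, not Stella’s actual Kaibab model, but they show how quickly a “what if” (halve the wolves, add hunters) can be asked and answered:

    # A toy deer/wolf simulation in the same small-time-segment spirit.
    # Textbook predator-prey equations; every number is invented.
    deer, wolves = 2500.0, 40.0
    deer_birth, predation = 0.5, 0.01
    wolf_gain, wolf_death = 0.0001, 0.2
    dt = 0.01

    for step in range(3000):                     # thirty "years" in steps of 0.01
        d_deer   = (deer_birth * deer - predation * deer * wolves) * dt
        d_wolves = (wolf_gain * deer * wolves - wolf_death * wolves) * dt
        deer, wolves = deer + d_deer, wolves + d_wolves
        if step % 500 == 0:                      # a snapshot every five "years"
            print(round(step * dt), round(deer), round(wolves))

Change one number (a harsh winter knocking down deer_birth, say) and rerun; that is the whole charm of a kinetic “what if” program.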

Kinetic modelling definitely bears on discussions we have had before. One of the defining characteristics of the human brain, as opposed to much smaller mammalian brains, is its ability to construct future scenarios. While we have just mentioned the limitations of theoretical projections (e.g. more than one kinetic input), in practice, in the everyday world, humans make predictions on a minute-by-minute basis. Driving a car in busy traffic requires an awful lot of intuitive computing of trajectories.

However, as we have previously described, in the end all mental processes are reducible to basic electrochemical reactions. Millions of them, to be sure, but still there. This set of reactors (computers, if you will) is presented with millions of pieces of sequential data from, for instance, the eye. The processing of this data must involve kinetics. For instance, I have just been reading about the period gene, a highly conserved gene in a group responsible for the internal clock of everything from the fruit fly to the human. This clock probably operates by having an oscillating concentration of a biomolecule, actually a protein, with high levels shutting down its own production and low levels stimulating it (by transcription of the per gene and/or generation of RNA).
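A toy version of such a clock is easy to sketch: a protein that shuts down its own production, but only after a lag standing in for transcription, translation and transport. Every number below is invented for illustration and has nothing to do with the real per protein, but with these settings the level swings up and down instead of settling:

    # A delayed negative-feedback "clock": high protein levels (one lag ago)
    # suppress production; low levels let it run. All numbers are invented.
    dt = 0.01
    lag_steps = 500                      # a delay of 5 time units
    production, decay = 1.0, 0.5
    threshold, steepness = 1.0, 4

    protein = 0.2
    history = [protein] * lag_steps      # protein levels over the last lag

    for step in range(20000):
        delayed = history[0]             # the level one lag ago
        rate = production / (1 + (delayed / threshold) ** steepness) - decay * protein
        protein += rate * dt
        history = history[1:] + [protein]
        if step % 2000 == 0:
            print(round(step * dt), round(protein, 2))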

Finally, it occurred to me when reading this paper that many of the examples were of static data, and that it is very hard to perform statistics on moving data. Many of us are familiar with Kaplan-Meier plots of survival data in oncology. These are used to compare survival data for patients on various treatment regimens, +/- surgery and +/- radiation therapy. While diverging survival curves are frequently pounced upon as showing the superiority of a treatment regimen, such plots must be taken with a grain of salt because of the increasing statistical uncertainty of the data as it matures. One of the advantages of static data (I assume that most of the data referred to in the Growlery’s paper is static data) is that very sophisticated statistics can be done on it.
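For anyone who has not met them, the arithmetic behind a Kaplan-Meier curve is simple enough to show in a few lines. The follow-up times below are invented, just to show the mechanics; notice that once only one or two patients remain at risk, a single event swings the estimate wildly, which is exactly the maturing-data problem:

    # Bare-bones Kaplan-Meier estimate. Each patient is (follow-up time, died);
    # died=False means censored (alive at last follow-up). Times are invented.
    patients = [(3, True), (5, False), (7, True), (7, True), (10, False), (12, True)]

    # By convention, deaths are counted before censorings at tied times.
    ordered = sorted(patients, key=lambda p: (p[0], not p[1]))

    at_risk = len(ordered)
    survival = 1.0
    for time, died in ordered:
        if died:
            survival *= 1 - 1 / at_risk   # multiply by the fraction surviving this event
            print(f"t={time}: {at_risk} at risk, survival estimate {survival:.2f}")
        at_risk -= 1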

However, for moving data, i.e. kinetic data, all sorts of new problems arise. Taking all the data at a single time point, one can apply well-known statistical analyses. However, if that data moves on to a new time point, one cannot treat the data at that point as independent, since it is constrained in some way by where it started at the previous point in time. I have never understood how statistical analysis takes care of this, other than by connecting the dots one to another. Maybe these new programs can do this.

1 comment:

Anonymous said...

What a beautiful post! I spend so much time railing at my trainees to "get intimate with your data", and they just look at me like I am out of my fucking mind. You really hit the nail on the head.