Wednesday, May 25, 2011

More Open Melting Points from EPI and other sources: on the path to ultimate curation

As recently as 2008, Hughes et al. published a paper asking: "Why Are Some Properties More Difficult To Predict than Others? A Study of QSPR of Solubility, Melting Point, and Log P":
The question then is: why do QSPR models consistently perform significantly worse with regard to melting point? In the Introduction, we proposed three reasons for the failure of QSPR models: problems with the data, the descriptors, or the modeling methods. We find issues with the data unlikely to be the only source of error in Log S, Tm, and Log P predictions. Although the accuracy of the data provides a fundamental limit on the quality of a QSPR model, we attempted to minimize its influence by selecting consistent, high quality data... With regards to the accuracy of Tm and Log P data, both properties are associated with smaller errors than Log S measurement. Moreover, the melting point model performed the worst, yet it is by far the most straightforward property to measure...We suggest that the failure of existing chemoinformatics descriptors adequately to describe interactions in the crystalline solid phase may be a significant cause of error in melting point prediction.
Indeed, I have often heard that melting point prediction is notoriously difficult. This paper attempted to discover why and suggested that it is more likely that the problem is related to a deficiency in available descriptors rather than data quality. The authors seem to argue that taking a melting point is so straightforward that the resulting dataset is almost self-evidently high quality.

I might have thought the same before we started collecting melting point datasets.

It turns out that validating melting points can be very challenging and we have found enormous errors - even cases where the same compound in the same dataset is assigned very different melting points. Under such conditions it is mathematically impossible to obtain high correlations between predicted and "measured" values.

Since we have no additional information to go on (no spectral proof of purity, reports of heating rate, observations of melting behavior, etc.) the only way we can validate data points is to look for strong convergence from multiple sources. For example, consider the -130 C value for the melting point of ethanol (as discussed previously in detail). It is clearly an outlier from the very closely clustered values near -114 C.


This outlier value is now highlighted in red to indicate that it was explicitly identified to not be used in calculating the average. Andrew Lang has now updated the melting point explorer to allow a convenient way to select or deselect outliers and indicate a reason (service #3). For large separate datasets - such as the Alfa Aesar collection - this can be done right on the melting point explorer interface with a click. For values recorded in the Chemical Information Validation sheet, one has to update the spreadsheet directly.

This is the same strategy that we used for our solubility data - in that case by marking outliers with "DONOTUSE". This way, we never delete data so that anyone can question our decision to exclude data points. Also by not deleting data, meaningful statistical analyses of the quality of currently available chemical information can be performed for a variety of applications.
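The flag-rather-than-delete approach can be sketched in a few lines. This is only an illustration, not the code behind the melting point explorer: the `Measurement` class, its field names and the source labels are hypothetical, and the ethanol numbers are made-up values that mimic a tight cluster near -114 C plus a -130 C outlier.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class Measurement:
    source: str                           # hypothetical source label
    value_c: float                        # melting point in degrees Celsius
    exclude_reason: Optional[str] = None  # set this instead of deleting the row


def curated_mean(measurements: List[Measurement]) -> float:
    """Average only the values that have not been flagged for exclusion."""
    kept = [m.value_c for m in measurements if m.exclude_reason is None]
    if not kept:
        raise ValueError("no usable measurements")
    return mean(kept)


# Illustrative values only, not actual dataset entries.
ethanol = [
    Measurement("source A", -114.1),
    Measurement("source B", -114.5),
    Measurement("source C", -114.0),
    Measurement("source D", -130.0, exclude_reason="outlier vs. cluster near -114 C"),
]

average = curated_mean(ethanol)  # the flagged value stays in the list but is not averaged
```

Because the flagged record is still in the list, anyone can inspect the stated reason and revisit the decision, and the full set remains available for statistics on data quality.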

The donation of the Alfa Aesar dataset to the public domain was instrumental in allowing us to start systematically validating or excluding data points for practical or modeling applications. We have also just received confirmation that the entire EPI (PhysProp) melting point dataset can be used as Open Data. Many thanks to Antony Williams for coordinating this agreement and for approval and advice from Bob Boethling at the EPA and Bill Meylan at SRC.

In the best case scenario, most of the melting point values will quickly converge as in the ethanol case above. However, we have also observed cases where convergence simply doesn't happen.

Consider the collection of reported melting points for benzylamine.


One has to be careful when determining how many "different" values are in this collection. Identical values are suspicious since they may very well originate from the same ultimate source. Convergence for the ethanol value above is credible because most of the values are very close but not completely identical, suggesting truly independent measurements.
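One rough way to express this reasoning is to collapse exactly identical values before asking whether the remaining ones cluster. The sketch below is an assumption-laden illustration: the function names, the 2 C tolerance and the 60% threshold are arbitrary choices made for the example, not part of our validation procedure, and the input values are illustrative.

```python
from statistics import median
from typing import List


def independent_values(values_c: List[float]) -> List[float]:
    """Collapse exactly identical values, which may share one ultimate source."""
    return sorted(set(values_c))


def converges(values_c: List[float], tolerance_c: float = 2.0,
              min_fraction: float = 0.6) -> bool:
    """Return True if enough distinct values lie within tolerance_c of their median."""
    distinct = independent_values(values_c)
    if len(distinct) < 2:
        return False  # a single distinct value cannot show independent agreement
    centre = median(distinct)
    near = [v for v in distinct if abs(v - centre) <= tolerance_c]
    return len(near) / len(distinct) >= min_fraction


# Illustrative inputs: a tight cluster with one outlier, and a scattered set.
ethanol_like = [-114.1, -114.5, -114.0, -114.1, -130.0]
benzylamine_like = [10.0, -10.0, -30.0, -43.0, -46.0, 10.0]

ethanol_ok = converges(ethanol_like)          # most distinct values agree
benzylamine_ok = converges(benzylamine_like)  # values are spread over tens of degrees
```

With these arbitrary thresholds the first set passes and the second does not. A check like this cannot say which value is correct; it only shows where multiple sources fail to agree.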

In this case the values actually diverge into clusters at about +10 C, -10 C, -30 C or -45 C. If you want to play the "trusted source" game, which do you trust more: the Sigma-Aldrich value at +10 C or the Alfa Aesar value at -43 C?

Let's try looking at the peer-reviewed literature. A search on SciFinder gives the following ranges:


The lowest melting point listed there is the +10C value we already have in our collection but these references are to other databases. The lowest value from a peer-reviewed paper is 37-38 C.

This is strange because I have a bottle of benzylamine in my lab and it is definitely a liquid. Investigating the individual references reveals a variety of errors. In one, benzylamine is listed as a product but from the context of the reaction it should be phenylbenzylamine:


(In a strange coincidence, the actual intermediate - benzalaniline - is the imine that Evan Curtin has synthesized recently in order to measure its solubility.)

In another example, the melting point of a product is incorrectly associated with the reactant benzylamine:

The erroneous melting points range all the way up to 280 C and I suspect that many of these are for salts of benzylamine, as I reported previously for the strychnine melting point results from SciFinder.

With no other obvious recourse from the literature to resolve this issue, Evan attempted to freeze a sample of benzylamine from our lab (UC-EXP265).


Unfortunately, the benzylamine sample proved to be too impure (<85% by NMR) and didn't solidify even down to -78 C. We'll have to try again with a much purer sample. It would be useful to get reports from a few labs that happen to have benzylamine handy and can provide proof of purity by NMR and a picture demonstrating solidification.

As most organic chemists will attest, amines are notorious for appearing as oils below their melting points in the presence of small amounts of impurities. I wonder if the divergence of melting points in this case is due to this effect. By providing NMR data from various samples subjected to freezing, it might be possible to quantify the effect of purity on the apparent freezing point. I think the images of the solidification are also important because some may mistake very high viscosity for actual formation of a solid. At -78 C we observed the sample to exhibit a viscosity similar to that of syrup.

Our model predicts a melting point of about -38 C for benzylamine and so I suspect that the values of -43 C and -46 C are most likely to be close to the correct range. Let's find out.

Tuesday, May 10, 2011

La Science par Cahier de Laboratoire Ouvert à l'Acfas

On May 9, 2011 I presented remotely for the French-Canadian Association for the Advancement of Science (ACFAS). This was the first time I gave a talk about Open Notebook Science in French. In fact the last time I gave a scientific talk in French was probably in 1995, when I was doing a postdoc at the Collège de France in Paris. I remember being teased for my French Canadian accent back then, so happily that wasn't an issue this time. Even though I was a bit rusty I think I managed to communicate the key points well enough. (At least I hope I did.)

My presentation was a good fit for the theme of the conference: Une autre science est possible : science collaborative, science ouverte, science engagée, contre la marchandisation du savoir. (Another science is possible: collaborative science, open science, engaged science, against the commercialization of knowledge.) I would like to thank the organizers (Mélissa Lieutenant-Gosselin and Florence Piron) for inviting me to participate.

I was able to record most of the talk (see below) but very near the end Skype decided to install an update and shut down so the recording ends somewhat abruptly. Given what people use Skype for, that default setting for updates really doesn't make much sense.




Sunday, May 08, 2011

Breast Cancer Coalition talk on ONS and Taxol solubility

On May 1, 2011 I presented "Accelerating Discovery by Sharing: a case for Open Notebook Science" at the National Breast Cancer Coalition Annual Advocacy Conference in Arlington, VA. This was the first year where they had a session on an Open Science related theme and the organizers invited me to highlight some of the tools and practices in chemistry which might be applicable to cancer research.

I was really touched by the passion from those in the audience as well as the other speakers and conference participants I met afterward. For many, their deep connection with the cause was strongly rooted in a personal experience as breast cancer survivors themselves or their loved ones. Several expressed a frustration with the current system of sharing results from scientific studies. They felt that knowledge sharing is much slower than it needs to be and that potentially useful "negative" results are generally not disclosed at all.

The NBCC has ambitiously set 2020 as the deadline to end breast cancer (including a countdown clock). It seems reasonable to me that encouraging transparency in research is a good strategy to accelerate progress. Of course, great care must be exercised wherever patient confidentiality is a factor. But health care researchers are already experienced with following protocols to anonymize datasets for publication. Opting to work more openly would not change that but it might affect when and how results are shared. Also there is a great deal of science related to breast cancer that does not directly involve human subjects.

One initiative that particularly impressed me was The Susan G. Komen for the Cure Tissue Bank, presented by Susan Clare from Indiana University and moderated by Virginia Mason from the Inflammatory Breast Cancer Research Foundation. As a result of this effort, thousands of women have donated healthy breast tissue to create a comprehensive database richly annotated with donor genetics and medical history. The idea of trying to tackle a disease state by first understanding normal functioning in great detail was apparently somewhat of a paradigm shift for the cancer research community and it was challenging to implement. According to Dr. Clare, data from the Tissue Bank have shown that the common practice of using apparently unaffected tissue adjacent to a tumor as a control may not be valid.

This example highlights one of the key principles of Open Science: there is value in everyone knowing more - even if it isn't immediately clear how that knowledge will prove to be useful.

In my experience, this is a fundamental point that distinguishes those who are likely to favor Open Science from those who reject its value. If two researchers are discussing Open Science and only one of them views this philosophy as being self-evident the conversation will likely be about why someone would want (or not want) to share more and the focus will fall on extrinsic motivators such as academic credit, intellectual property, etc. If both researchers view this philosophy as self-evident the conversation will probably gravitate towards how and what to share.

I refer to this philosophy as being self-evident because I don't think people can become convinced through argumentation (I've never seen that happen). Within the realm of Open Notebook Science I have been involved in countless discussions about the value of sharing all experimental details - even when errors are discovered. I can think of a few ways in which this is useful - for example telegraphing a research direction to those in the field or providing data for researchers who study how science is actually done (such as Don Pellegrino). But even if I couldn't think of a single application I believe that there is value in sharing all available data.

A good example of this philosophy at work is the Spectral Game. Researchers who uploaded spectral data to ChemSpider as Open Data did not anticipate how their contribution would be used. They didn't do it for extrinsic motives such as traditional academic credit. Assuming that their motivation was similar to our group's, they did it because they believed it was an obviously useful thing to do. It is only much later - after a critical mass of open spectra were collected - that the idea arose to create a game from the dataset.

With this mindset, I explored what contribution we might make to breast cancer research by performing a phrase search strategy. Doing a simple Google search for "breast cancer" solubility generated mainly two types of results.

The first type involves the solubility behavior of biomolecules within the cellular environment. An example would be the observed increased solubility of gamma-tubulin in cancerous cells.
The second type of results address the difficulty in preparing formulations for cancer drugs due to solubility problems. A good example of this is Taxol (paclitaxel), where existing excipients are not completely satisfactory - in the case of Cremophor EL some patients experience a hypersensitivity.
Since our modeling efforts thus far have focused on non-aqueous solubility, there is possibly an opportunity to contribute by exploring the solubility behavior of paclitaxel. By inputting solubility data from a paper by Singla 2002 into our solubility database, Abraham descriptors for paclitaxel are automatically calculated and the solubilities in over 70 solvents are predicted.
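For readers unfamiliar with the approach, the commonly cited general form of the Abraham solvation equation is shown below. This is given only as background: the specific solvent coefficients and the fitted paclitaxel descriptors used by our service are not reproduced here.

```latex
\log_{10} SP = c + e\,E + s\,S + a\,A + b\,B + v\,V
```

Here E, S, A, B and V are the solute descriptors (excess molar refraction, dipolarity/polarizability, hydrogen-bond acidity, hydrogen-bond basicity and McGowan volume), while c, e, s, a, b and v are coefficients specific to each solvent. For solubility work, SP is typically taken as the ratio of the solute's molar solubility in the organic solvent to that in water, so once the five descriptors are fitted from a handful of measured solubilities, a prediction follows for any solvent with known coefficients.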

In addition, by simply adding the melting point of paclitaxel, we automatically predict its solubility at any temperature where these solvents are liquids (see for example water).

Because of the way we expose our results to the web, a Google search for "paclitaxel solubility acetonitrile" now returns the actual value in the Google summary on the first page of results (currently 7th on the first page). The other hits have all 3 keywords somewhere in the document but one has to click on each link then perform a search within the document to find out if the acetonitrile solubility for paclitaxel is actually reported. (Note that clicking on our link ultimately takes you to the peer-reviewed paper with the original measurement.)

To be clear about what we are doing here - we are not claiming to be the first to predict the solubility of paclitaxel in these solvents using Abraham descriptors or any other method. Nor are we claiming that we have directly made a dent in the formulation problem of paclitaxel. We are not even indicating that we have done a thorough search of the literature - that would take a lot more time than we have had given the enormous amount of work on paclitaxel and its derivatives.

All we are doing is fleshing out the natural interface between the knowledge space of the UsefulChem/ONS Challenge projects and that of breast cancer research - AND - we are exposing the results of that intersection through easily discoverable channels. By design, these results are exposed as self-contained "smallest publishable units" and they are shared as quickly (and as automatically) as possible. The traditional publication system does not have a mechanism to disseminate this type of information. (Of course, when enough of these are collected and woven into a narrative that fits the criteria for a traditional paper, they can and should be submitted for peer-reviewed publication.)

Here is a scenario for how this could work in this specific instance. A graduate student (who has never heard of Open Science or UsefulChem, the ONS Challenge, etc.) is asked to look for new formulations for paclitaxel (or other difficult to solubilize anti-cancer agents). They do a search on commercial databases offered by their university for various solubilities of paclitaxel and cannot find a measurement for acetonitrile. They then do a search on Google and find a hit directly answering their query, as I detailed above. This leads them to our prediction services and they start using those numbers in their own models.

That is a good outcome - and that is exactly what has been happening (see the gold nanodot paper and the phenanthrene soil contamination study as examples). But the real paydirt would come from the graduate student recognizing that we've done a lot of work collecting measurements and building models for solubility and melting points, and contacting us about a collaboration. As long as they are comfortable with working openly we would be happy to actively work together.

I'm using the formulation of paclitaxel as an example but I'm sure that there are many more intersections between solubility and breast cancer research. With a bit of luck I hope we can find a few researchers who are open to this type of collaboration.

As another twist to this story, I will briefly mention here too that Andrew Lang has started to screen our Ugi product virtual library for docking with the site where paclitaxel binds to gamma-tubulin (D-EXP018). This might shed some light on some much cheaper alternatives to the extremely expensive paclitaxel and derivatives. The drug binds through 3 hydrogen bonds, shown below - rendered in 2D and 3D representations (obtained from the PDB ligand viewer)


The slides and recording of my talk are embedded below:



Collaboration using Open Notebook Science in Academia book chapter

I am very pleased to report that the book chapter that I co-wrote with Andrew Lang, Steve Koch and Cameron Neylon is now available online: Collaboration using Open Notebook Science in Academia. This is the 25th chapter of Collaborative Computational Technologies for Biomedical Research, edited by Sean Ekins, Maggie Hupcey, Antony Williams and Alpheus Bingham.

Our chapter provides some fairly detailed examples of how Open Notebook Science can be used to enhance collaboration between researchers from either similar or distant fields. It also suggests certain paths towards machine/human collaboration in science. Hopefully it will encourage researchers who have an interest in Open Science to experiment with some of the tools and strategies mentioned.

I am also grateful to Wiley for choosing our chapter as the free online sample for the book!
This book discusses the state-of-the-art collaborative and computing techniques for the pharmaceutical industry, the present and future implications and opportunities to advance healthcare research. The book tackles problems thoroughly, from both the human collaborative and the data and informatics side, and is very relevant to the day-to-day activities running a laboratory or a collaborative R&D project. It can be applied to help organizations make critical decisions about managing drug discovery and development partnership. The book follows a “man-methods-machine” format with sections on how to get people to collaborate, collaborative methods, and computational tools for collaboration. This book offers the reader a “getting started guide” or instruction on “how to collaborate” for new laboratories, new companies, and new partnerships, as well as a user manual for how to troubleshoot existing collaborations.



Saturday, May 07, 2011

Evan Curtin is the May 2011 RSC ONS Challenge Winner

Evan Curtin, a chemistry freshman student working under the supervision of Jean-Claude Bradley at Drexel University, is the May 2011 Royal Society of Chemistry Open Notebook Science Challenge Award winner. He wins a cash prize from the RSC.

Evan's primary focus has centered on synthesizing aromatic imines and measuring their solubility in a number of organic solvents. This will allow us to generate Abraham descriptors for this class of compounds in order to predict their solubility in 70+ solvents. Coupled with our new model to include temperature dependent solubility, this should greatly facilitate optimal solvent prediction for this and related reactions.

Imine formation is of particular interest to the UsefulChem group because it is the first step of the Ugi reaction, which we have used to synthesize compounds with anti-malarial activity. But it is also a simple convenient reaction in itself to test our Solvent Selector's ability to predict optimal conditions (solvent and temperature) for isolation of products by precipitation.

Evan's synthesis experiments are available here:
http://usefulchem.wikispaces.com/Exp263
http://usefulchem.wikispaces.com/Exp262
http://usefulchem.wikispaces.com/Exp261


and his solubility experiments are listed here:

http://onschallenge.wikispaces.com/Exp207
http://onschallenge.wikispaces.com/Exp206
http://onschallenge.wikispaces.com/Exp205
http://onschallenge.wikispaces.com/Exp204
http://onschallenge.wikispaces.com/Exp201
http://onschallenge.wikispaces.com/Exp198
http://onschallenge.wikispaces.com/Exp197

Three more RSC ONS Awards will be made during 2011. Submissions from students in the US and the UK are still welcome.
For more information see:
http://onschallenge.wikispaces.com
http://onschallenge.wikispaces.com/RSCAwards2010


Creative Commons Attribution Share-Alike 2.5 License