
Sunday, June 11, 2017

You are what you do, or how people got to see me as an engineer

Source, Wikicommons, CC-BY-SA.
Over the past 20 years I have had endless discussions about what the research is that I do. Many see my work as engineering, but I vigorously disagree. But some days it's just too easy to give up and explain things yet again. The question came up several times again over the past few months, and I have been advised to make a choice. That's modern academia for you: you have to excel in something tiny, and complex, hard-to-explain ambition is losing out to a system based on funding, buzz words, "impact", and such. So, again, I am trying to make up my defense as to why my research is not engineering. You know what is ironic? It's all the fault of Open Science! Darn Open Science.

In case you missed it (no worries, many of the people I talk to in depth about these things do, IMHO), my research is of a theoretical nature (I tried bench chemistry, but my back is not strong enough for that): I am interested in how to digitally represent chemical knowledge. I get excited about Shannon entropy and books from Hofstadter. I do not get excited about "deep learning" (boring! In fact, the only fun I get out of that is pointing you to this). So, arguably, I am in the wrong field of science. One could argue I am not a biologist or chemist, but a computer scientist, or maybe a philosopher (mind you, I have a degree in philosophy).

And that's actually where it starts getting annoying. Because I do stuff on a computer, people associate me with software. And software is generally seen as something that Microsoft does... hello, engineering. The fact that I publish papers on software (think CDK, Bioclipse, Jmol) does not help, of course.

That's where that darn Open Science comes in. Because I have a varied set of skills, I actually know how to instruct a computer to do something for me. It's like writing English, just to a different person, um, thingy. Because of Open Science, I can build the machines that I need to do my science.

But a true scientist does not make their own tools; they buy them (of course, that's an exaggeration, but just realize how poorly we value data and software citations at this time). They get loads of money to do so, just so that they don't have to make machines. And just because I don't ask for loads of money, or ask for a bit of money to actually make the tools I need, I am tagged as an engineer. And I, I got tricked by Open Science into fixing things, adding things. What was I thinking??

Does this resonate with the experience of others? Are you also upset about it? What can we do about this?

(So, one of my next blog posts will be about the new scientific knowledge I have discovered. I have to say, not as much as I wanted, mostly because we did not have the right tools yet, which I have to build first, but that's what this post is about...)

Saturday, June 10, 2017

New paper: "The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching"

This paper was long overdue. But software papers are not easy to write, particularly not follow-up papers. That actually seems a lot easier for databases. Moreover, we already publish too much. However, the scholarly community does not track software citations (nor data citations, but there seems to be a bit more momentum there; larger user group?). So, we need these kinds of papers, and just a versioned, archived software release (e.g. on Zenodo) is not enough. But, there it is, the third CDK paper (doi:10.1186/s13321-017-0220-4). Fifth, if you include the two papers about ring finding, also describing CDK functionality.


Of course, I could detail what this paper has to offer, but let's not spoil the article. It's CC-BY so everyone can readily access it. You don't even need Sci-Hub (but are allowed for this paper; it's legal!).

A huge thanks to all co-authors, to John for his work as release manager and his great performance improvements, code improvements, code clean-up, etc., and to all developers who are not co-authors on this paper but contributed bigger or smaller patches over time (plz check the full AUTHOR list!). That list does not include the companies that have been supporting the project in kind, tho. Also huge thanks to all the users, particularly those who have used the CDK in downstream projects, many of which are listed in the introduction of the paper.

And, make sure to follow John Mayfield's blog with tons of good stuff.

Saturday, May 20, 2017

May 29, Delft, The Netherlands: "Open Science: the National Plan and you"

In less than ten days, a first national meeting is organized in Delft, The Netherlands, where researchers can meet researchers to talk about Open Science. Mind you, researcher is very broad: it is anyone doing research, at home (e.g. citizen science, or as a hobby), at work (company or research institute), or in an educational setting (university, HBOs, ...). After all, anyone benefits from Open Science (at least from that by others! "Standing on the shoulders of Open Science, ...").

The meeting is part of the National Plan Open Science (see also Open Science is already a thing in The Netherlands), which is a direct result of the Open Science meeting in Amsterdam during the Dutch presidency which resulted in the Amsterdam Call for action on Open Science.

The program for the #npos2017 meeting is very interactive. It starts with obligatory introductions, explaining how Open Science fits into the national future research landscape, but quickly moves to practical experiences from researchers, a Knowledge Commons session where everyone can show and discuss their Open Science work (with a free lunch: yes, #OpenScience and free lunches are compatible), a number of breakout sessions where the "but how" can be discussed and answered (topics in the image below), a panel to wrap up the breakout sessions, and a free drink afterwards.

During the Knowledge Commons I will join Andra Waagmeester (Micelio) and Yaroslav Blanter (Delft University) to show Wikidata, and how I have been using this for data interoperability for the WikiPathways metabolism pathways (via BridgeDb).

The meeting is free and you can sign up here. Looking forward to meeting you there!


Sunday, April 16, 2017

GenX spill, national coverage, but where is the data

First (I have never blogged much about risk and hazard), I am neither a toxicology expert nor a regulator. I have the deepest respect for both, as these studies are among the most complex ones I am aware of. It makes rocket science look dull. However, I have quite some experience in relating chemical structure to properties and with knowledge integration, which is a prerequisite for understanding that relation. Nothing I do says what the right course of action is. Any new piece of knowledge (or technology) has pros and cons. It is science that provides the evidence to support finding the right balance. It is science I focus on.

The case
The AD national newspaper reported spilling of the compound with the name GenX in the environment and reaching drinking water. This was picked up by other newspapers, like de VK. The chemistry news outlet C2W commented on the latter on Twitter:


Translated, the tweet reports that we do not know if the compound is dangerous. Now, to me, there are then two things: first, any spilling should not happen (I know this is controversial, as people are more than happy to repeatedly pollute the environment, just because of self-interest and/or laziness); second, what do we know about the compound? In fact, what is GenX even? It certainly won't be "generation X", though we don't actually know the hazard of that either. (We have IUPAC names, but just like with the ACS disclosures, companies like to make up cryptic names.)

But having worked on predictive toxicology and data integration projects around toxicology, and simply having a chemical interest, I started out searching for what we know about this compound.

Of course, I need an open notebook for my science, but I tend to be sloppy and mix up blog posts like this, with source code repositories, and public repositories. For new chemicals, as you could read earlier this weekend, Wikidata is one of my favorites (see also doi:10.3897/rio.1.e7573). Using the same approach as for the disclosures, I checked if Wikidata had entries for the ammonium salt and the "active" ingredient FRD-903 (to be fair, chemically they are different, and so may be their hazard and risk profiles). Neither existed, so I added them using Bioclipse and QuickStatements (a wonderful tool by Magnus Manske): GenX and FRD-903. So, a seed of knowledge was planted.
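
For readers who want to try the "check, then add" step themselves, here is a minimal sketch in Python. It is not the Bioclipse script I used, just an illustration under a few assumptions: that you look up the compound by its InChIKey (Wikidata property P235), that a new item gets "instance of" (P31) "chemical compound" (Q11173), and that the statements are written in the tab-separated QuickStatements format. The function names and the InChIKey value are made-up placeholders, not real data for FRD-903.

```python
from urllib.parse import urlencode

# Public SPARQL endpoint of the Wikidata Query Service.
WDQS_ENDPOINT = "https://query.wikidata.org/sparql"


def lookup_query(inchikey: str) -> str:
    """SPARQL asking Wikidata for items that carry this InChIKey (P235)."""
    return f'SELECT ?item WHERE {{ ?item wdt:P235 "{inchikey}" }}'


def lookup_url(inchikey: str) -> str:
    """URL that runs the lookup query and asks for JSON results."""
    params = {"query": lookup_query(inchikey), "format": "json"}
    return WDQS_ENDPOINT + "?" + urlencode(params)


def quickstatements(label: str, inchikey: str) -> str:
    """Tab-separated QuickStatements commands that create a new compound item."""
    rows = [
        ["CREATE"],
        ["LAST", "Len", f'"{label}"'],       # English label
        ["LAST", "P31", "Q11173"],           # instance of: chemical compound
        ["LAST", "P235", f'"{inchikey}"'],   # InChIKey
    ]
    return "\n".join("\t".join(row) for row in rows)


# Placeholder only: replace with the real InChIKey of the compound.
PLACEHOLDER_KEY = "XXXXXXXXXXXXXX-XXXXXXXXXX-X"

print(lookup_url(PLACEHOLDER_KEY))
print(quickstatements("FRD-903", PLACEHOLDER_KEY))
```

If the lookup returns no item, the printed commands can be pasted into QuickStatements. Doing the lookup first is what keeps you from creating a duplicate entry.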
    A side topic... if you have not looked at hypothes.is yet, please do. It allows you to annotate (yes, there are more tools that allow that, but I like this one), which I have done for the VK article:


I had a look around on the web for information, and there is not a lot. A Wikidata page with further identifiers then helps with tracking your steps. Antony Williams, previously of ChemSpider fame, now working on the EPA CompTox Dashboard, added the DTX substance IDs, but the entries in the dashboard will not show up for another bit of time. For FRD-903 I found growth inhibition data in ChEMBL.

But Nina Jeliazkova pointed me to her LRI AMBIT database (poster abstract doi:10.1016/j.toxlet.2016.06.1469, PDF links) that makes (public) data from ECHA available from REACH dossiers in a machine readable way (see this ECHA press release), using their AMBIT software (doi:10.1186/1758-2946-3-18). (BTW, this makes the legal hassle Hartung had last year even more interesting, see doi:10.1038/nature.2016.19365). After creation of a free login, you can find a full (public) dossier with information about the toxicology of the compound (toxicity, ecotoxicity, environmental fate, and more):


I reported this slide, as the worry seems to be about drinking water, so oral toxicity seems appropriate (note, this is only acute toxicity). The LD50 is the median lethal dose, but is only measured for mouse and rat (these are models for human toxicity, but only models, as humans are just not rats; well, not literally, anyway). Also, >1 gram per kilogram body weight ("kg bw"; assumption) seems pretty high. In my naive understanding, the rat may be the canary in the coal mine. But let me refrain from drawing any conclusions. I leave that to the experts on risk management!

Experts like those from the Dutch RIVM, which wrote up this report. One piece of information they say is missing is that of biodistribution: "waar het zich ophoopt", or in English, where the compound accumulates.

Friday, April 14, 2017

The ACS Spring disclosures of 2017 #2: some history

Bethany Halford adds some history about the sessions (see part #1):
    I believe Stu Borman was the first to cover the Division of Medicinal Chemistry’s First Time Disclosures symposium for C&EN, but it was Carmen Drahl who began the practice of hand-drawing and tweeting the clinical candidates as they were disclosed in real time. This seems like an oddball practice to folks who aren’t at the meeting. Why not just take a picture of the relevant slide? Well, that’s against the rules: There are signs all over the ACS National Meeting stating that photos, video, and audio recording of presentations are strictly prohibited. In San Francisco, symposium organizer Jacob Schwarz repeatedly reminded attendees that this was the case. Carmen’s brilliant idea to get around this rule was to simply draw the structures as they were presented, snap a photo, and then tweet it out.

    I’ve inherited the task since Carmen left the magazine a couple of years ago. I find it incredibly stressful. For an event that’s billed as a disclosure, the actual disclosing is fairly fleeting. The structures are often not on the screen for very long, and I’m never confident that I’ve got it 100% right. Last year in San Diego I tweeted out one structure and I heard the following day from Anthony Melvin Crasto, a chemist in India, that based on the patent literature he thought I had an atom wrong. I was certain that I had written this structure correctly, so I contacted the presenting scientist. He had disclosed the wrong structure!

    I agree that there should be some sort of database established afterwards, and I think you all have done great work on that front. I think you’ll find the pharmaceutical companies reluctant to help you out in any way. They guard these compounds so fiercely that it often makes me wonder why we have this symposium to begin with.