Tuesday, November 24, 2015

Databasing nanomaterials: substance APIs

 Cell uptake of gold nanoparticles in human cells. Source. CC-BY 4.0
Nanomaterials are quite interesting from a science perspective: first, they are materials and, as such, not very well-defined. They can best be described as a distribution of similar nanoparticles. That is unlike small compounds, which we commonly describe as pure materials: nanomaterials have a size distribution, surface differences, etc. But, akin to the QSAR paradigm, because the particles are similar enough, we can expect similar interaction effects and thus treat them as the same. A nanomaterial is basically a large collection of similar nanoparticles.

Until they start interacting, of course. Cell membrane penetration is studied at the level of a single nanoparticle, and that makes for interesting pictures (see top left). The same holds when we do computation: then too, we typically study a single material. On the other hand, many nanosafety studies work with the materials, at a certain dosage, and study cell death, transcriptional changes, etc., when the material is brought into contact with some biosample.

The synthesis is equally interesting. Because of the nature of many manufacturing processes (and the literature on synthesizing new materials is enormous), it is typically not well understood what the nanomaterial or even the nanoparticle looks like. This is overcome by studying the bulk properties and reporting some physicochemical properties, like the size distribution, measured with methods like DLS and TEM. The field just lacks the equivalent of what NMR is for (small) (organic) compounds.
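To make "reporting a size distribution" concrete, here is a minimal R sketch. The diameters are simulated from a log-normal distribution purely for illustration (they are not real DLS or TEM measurements), and summarizeSizes() is a hypothetical helper name.

```r
# Simulated particle diameters (nm) for one hypothetical nanomaterial sample.
# Assumption: a log-normal shape with a median of 20 nm; not real data.
set.seed(42)
diameters <- rlnorm(500, meanlog = log(20), sdlog = 0.25)

# Summarize a vector of particle diameters with a few bulk descriptors
summarizeSizes <- function(d) {
  q <- quantile(d, probs = c(0.05, 0.95), names = FALSE)
  c(n = length(d), mean = mean(d), sd = sd(d),
    median = median(d), p05 = q[1], p95 = q[2])
}

print(round(summarizeSizes(diameters), 2))
```

The point is that the material is recorded by the spread of its particles (mean, standard deviation, percentiles), not by one exact structure.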

Now, try capturing this in a unified database. That's exactly what eNanoMapper is doing. And with a modern approach: it's a database project, not a website project. We develop APIs and test all aspects of the database extensively using test data. Of course, using the API we can easily create websites (there currently are JavaScript and R client libraries), and we have done so at data.enanomapper.net. It's great to be working with so many great domain specialists who get things done!

There is a lot to write and discuss about this, but I will end now by just pointing you to our recent paper outlining much of the cheminformatics of this new nanosafety database solution.

Of course, in our group we study nanosafety and nanoresponse (think nanomedicine) at a systems biology level. So, here's the obligatory screenshot of work by one of our interns (Stan van Roij). It is not fully integrated with the database yet, though.

Jeliazkova, N., Chomenidis, C., Doganis, P., Fadeel, B., Grafström, R., Hardy, B., Hastings, J., Hegi, M., Jeliazkov, V., Kochev, N., Kohonen, P., Munteanu, C. R., Sarimveis, H., Smeets, B., Sopasakis, P., Tsiliki, G., Vorgrimmler, D., Willighagen, E., Jul. 2015. The eNanoMapper database for nanomaterial safety information. Beilstein Journal of Nanotechnology 6, 1609-1634.
http://dx.doi.org/10.3762/bjnano.6.165

Sunday, November 22, 2015

I have been happily tweeting the BioMedBridges meeting in Hinxton last week using the #lifesciencedata hashtag, along with more than 100 others, though only a small subset was really active. A lot has been published about using Twitter at conferences, like the recent paper by Ekins et al (doi:10.1371/journal.pcbi.1003789).

The backchannel discussions only get better when more and more people join, and when complementary information is passed around. For example, I tend to tweet links to papers that appear on slides, chemical/protein structures that are mentioned, etc. I have also started tweeting ORCID identifiers of the speakers if I can find them, in addition to adding them to a Lanyrd page.

Like at most meetings, people ask me about this tweeting. Why do I do it? Doesn't it distract you from the presentation? I understand these questions.

First, I started recording my notes of meetings electronically during my PhD, because I needed to write a summary of each meeting for my funder. So, when Twitter came along, and after I had already built up some experience blogging summaries of meetings, I realized that I might as well tweet my notes. And since I was looking up DOIs of papers anyway, the step was not big. The effect, however, was huge. People started replying, some at the conference itself, some not. This resulted in a lot of meetings with people at the conference. Tweetups do not regularly happen anymore, but it's a great first line for people, "hey, aren't you doing all that blogging", and before you know it, you are talking science.

Second, no, it does not significantly distract me from the talk. First, like listening to a radio while studying, it keeps me focused. Yes, I very much think this differs from person to person, and I am not implying that it generally is not distracting. But it keeps me busy, which is very useful during some talks, when people in the audience otherwise start reading email. If I look up details (papers, project websites, etc) from the talk, I doubt I am more distracted than some others.

Third: what about keeping up? Yes, that's a hard one, and I was beaten in coverage speed by others during this meeting. That was new to me, but I liked that. Sadly, some of the most active people left the meeting after the first day. So, I was wondering how I could speed up my tweeting, or, alternatively, how it could take me less time so that I can read more of the other tweets. Obvious candidates for speeding up are the tweets with additional information, like papers, etc.

So, I started working on some R code to help me tweet faster, and using the great collection of rOpenSci packages, I have come up with the following first two helper functions. In both examples, I am using an #example hashtag.

Tweeting papers
This makes use of the rcrossref package to fetch the name of the first author and title of the paper.

Tweeting speakers
Or perhaps, tweeting people. This bit of code makes use of the rorcid package.

Of course, you are more interested in the code than in the screenshots, so here it is (public domain; I may make a package out of this):

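library(rorcid)
library(rcrossref)
library(twitteR)  # provides tweet(); authenticate first, e.g. with setup_twitter_oauth()

# Tweet a speaker's name and ORCID; the field names follow the as.orcid() output used here
tweetAuthor = function(orcid=NULL, hashtag=NULL) {
  person = as.orcid(orcid)
  # Hyphenated list names must be quoted, otherwise R parses them as subtraction
  details = person[[1]]$"orcid-bio"$"personal-details"
  firstName = details$"given-names"$value
  surname = details$"family-name"$value
  orcidURL = person[[1]]$"orcid-identifier"$uri
  tweet(
    paste(firstName, " ", surname, " orcid:",
      orcid, " ", orcidURL, " #", hashtag, sep="")
  )
}

# Tweet the first author, a shortened title, and the DOI link of a paper
tweetPaper = function(doi=NULL, hashtag=NULL) {
  info = cr_cn(dois=doi, format="citeproc-json")
  tweet(
    paste(
      info$author[[1]]$family, " et al. \"",
      substr(info$title, 1, 60), "...\" ", "http://dx.doi.org/", info$DOI, " #", hashtag, sep=""
    )
  )
}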

Getting your twitteR to work (the authentication, that is) may be the hardest part. I do plan to add further methods like: tweetCompound(), tweetProtein(), tweetGene(), etc...
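Since tweetPaper() cuts the title at a fixed 60 characters, a long author name or DOI could still push the tweet over the limit. Here is a minimal sketch of composing the text separately, so it can be checked before sending. composePaperTweet() and its arguments are hypothetical names, not part of twitteR or rcrossref. It counts raw characters against an assumed 140-character limit and ignores Twitter's link shortening.

```r
# Build the text of a paper tweet without sending it.
# Assumption: a plain 140-character limit, counting the full URL.
composePaperTweet <- function(family, title, doi, hashtag, limit = 140) {
  link <- paste0("http://dx.doi.org/", doi)
  prefix <- paste0(family, " et al. \"")
  suffix <- paste0("\" ", link, " #", hashtag)
  # Characters left for the title once the fixed parts are accounted for
  room <- limit - nchar(prefix) - nchar(suffix)
  if (room < 4) stop("no room left for the title")
  # Only shorten the title when it does not fit, marking the cut with "..."
  if (nchar(title) > room) {
    title <- paste0(substr(title, 1, room - 3), "...")
  }
  paste0(prefix, title, suffix)
}

text <- composePaperTweet(
  "Jeliazkova",
  "The eNanoMapper database for nanomaterial safety information",
  "10.3762/bjnano.6.165",
  "example"
)
print(text)
print(nchar(text))
```

The resulting string could then be passed to tweet() once authentication works.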

Got access to literature?

Only yesterday I discovered that resolving some Nature Publishing Group DOIs does not necessarily lead to useful information. High quality metadata about literature is critical for the future of science. Elsevier just showed how creative publishers can be in interpreting laws and licenses (doi:10.1038/527413f).

So, it may be interesting to regularly check your machine readable Open Access metadata. ImpactStory helps here with their Open Access Badge. New to me was what Daniel pointed me to: dissemin (@disseminOA). Just pass your ORCID and you end up with a nice overview of what the world knows about the open/closed status of your output.

I would not say my report is flawless, but that nicely shows how important it is to get this flow of metadata right! For example, some data sets and abstracts are detected as publications; to be fair, I think this is to a good extent due to my inability to annotate them properly in my ORCID profile.

WikiPathways: capturing the full diversity of pathway knowledge

 Figure from the new NAR paper.
Biology is a complex matter. Biological matter indeed involves many different chemicals in very many temporospatial forms: small compounds may be present in different charge states (proteins too, of course), tautomers, etc. Proteins may exhibit isoforms, various post-translational modifications, etc. Genes show structures we are only now starting to see: the complex structures in the nucleus were invisible to mankind until not so long ago. Likewise, the biological processes, encoded as pathways, cover an equal amount of complexity.

WikiPathways is a community-run pathway database, similar to KEGG, Reactome, and many others. One striking difference is the community approach of WikiPathways: anyone can work on or extend the content of the database. This makes WikiPathways exciting to me: it encodes very different bits of biological knowledge, and it is a key reason why I joined Chris Evelo's team almost four years ago. Importantly, this community is supported by a lively and reasonably sized (>10 people and growing) curation team, primarily located at Maastricht University and the Gladstone Institutes.

The newest paper in NAR (doi:10.1093/nar/gkv1024) outlines some recent developments and the growth of the database. There is still so much to do, and given the current speed at which we learn new biological patterns, the amount of work will not diminish any time soon.

Want to help? Sign up, enlist your ORCID! Need ideas for what you can do? Why not take a recent paper you published (or read), pick a new biological insight from it, look up an appropriate pathway, and add that paper? If you have published a novel pathway or an important new insight in a paper, why not convert the figure from that paper into a machine readable pathway?

Kutmon, M., Riutta, A., Nunes, N., Hanspers, K., Willighagen, E. L., Bohler, A., Mélius, J., Waagmeester, A., Sinha, S. R., Miller, R., Coort, S. L., Cirillo, E., Smeets, B., Evelo, C. T., Pico, A. R., Oct. 2015. WikiPathways: capturing the full diversity of pathway knowledge. Nucleic Acids Research.

RRegrs: exploring the space of possible regression models

Machine learning is a field of science that focusses on mathematically describing patterns in data. Chemometrics does this for chemical data. Examples are (nano)QSAR models, where structural information is related to biological activity. During my PhD I studied the interaction between the statistics and machine learning on the one hand, and how you computationally (numerically) represent the question on the other. The right combination is not obvious, and it has become common to try various modelling methods, though support vector machines (SVM/SVR) and, more recently, neural networks (deep learning) have become popular. A simpler model, however, has its benefits too and is frequently not significantly worse than more complex models. That said, exploring all machine learning methods manually takes a lot of time, as each comes with its own parameters, which need varying.

Georgia Tsiliki (NTUA partner in eNanoMapper), Cristian Munteanu (former postdoc in our group), and others developed RRegrs, an R package to explore the various models and automatically calculate a number of statistics so that they can be compared (doi:10.1186/s13321-015-0094-2). That said, following my thesis, you must never rely on performance statistics, but the output of RRegrs may help you explore the full set of models.
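To give an idea of what exploring several models and comparing their statistics means in practice, here is a minimal base R sketch. It does not use the RRegrs API. The mtcars data set, the three candidate formulas, and the cvRmse() helper are stand-ins chosen only for illustration.

```r
# k-fold cross-validated RMSE for one linear model formula (base R only)
cvRmse <- function(formula, data, k = 5, seed = 1) {
  set.seed(seed)
  # Randomly assign every row to one of k folds; a fixed seed gives all
  # candidate models the same folds, so their scores are comparable
  fold <- sample(rep(seq_len(k), length.out = nrow(data)))
  response <- all.vars(formula)[1]
  sqErr <- numeric(0)
  for (i in seq_len(k)) {
    fit <- lm(formula, data = data[fold != i, ])
    held <- data[fold == i, ]
    sqErr <- c(sqErr, (held[[response]] - predict(fit, newdata = held))^2)
  }
  sqrt(mean(sqErr))
}

# Hypothetical candidate models, from simple to slightly more complex
candidates <- list(
  linear = mpg ~ wt,
  quadratic = mpg ~ wt + I(wt^2),
  twoVars = mpg ~ wt + hp
)
scores <- sapply(candidates, cvRmse, data = mtcars)
print(sort(scores))
```

A table like this only ranks the models on one statistic; it is a starting point for exploring them, not a verdict.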

Tsiliki, G., Munteanu, C. R., Seoane, J. A., Fernandez-Lozano, C., Sarimveis, H., Willighagen, E. L., Sep. 2015. RRegrs: an R package for computer-aided model selection with multiple regression models. Journal of Cheminformatics 7 (1), 46.