Pages

Tuesday, December 30, 2008

Editing and Validation of CML documents in Bioclipse

One advantage of using XML is that one can rely on good library support for common functionality. When parsing XML, one does not have to take care of the syntax and can focus on the data and its semantics. This comes at the expense of verbosity, but being able to express semantics explicitly is a huge benefit for flexibility.
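As a minimal illustration of that point (my sketch, not from the original post), parsing a small, made-up CML-like fragment with the standard JDK DOM API shows how the library deals with the syntax while the code only touches the data:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

public class ParseExample {
    public static void main(String[] args) throws Exception {
        // a toy fragment; element and attribute names are illustrative
        String xml = "<molecule id=\"m1\"><atom elementType=\"C\"/></molecule>";
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));
        Element root = doc.getDocumentElement();
        // the parser took care of well-formedness; we only look at the data
        System.out.println(root.getTagName() + " " + root.getAttribute("id"));
    }
}
```

The parser rejects anything not well-formed before our code ever sees it, which is exactly the point of leaving syntax to the library.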

So, when Peter and Henry put their first documents about the Chemical Markup Language (CML) online, I was thrilled, even though it actually was still SGML when I encountered it; the work predates the XML recommendation. As I recently blogged, in '99 I wrote patches for Jmol and JChemPaint to support CML, which were published as a preprint on the Chemical Preprint Server and in 2000 as a paper in the Internet Journal of Chemistry. Neither of the two has survived.

Anyway, the Chemistry Development Kit makes heavy use of CML, and Bioclipse supports it too. Now, Bioclipse is based on the Eclipse Rich Client Platform architecture, for which quite a few XML tools exist in the Web Tools Platform (WTP). Among these is a validating, content-assisting XML editor. This means I get red markings when I make my XML document not-well-formed or invalid. Just a quick recap: well-formedness means that the XML document has a proper syntax: one root node, properly closed tags, quotes around attribute values, etc. Validity, however, means that the document is not only well-formed, but also hierarchically organized according to some specification.

Enter CML. CML is such a specification, first with DTDs, and, after the introduction of XML Namespaces, with XML Schema (see There can be only one (namespace)). The WTP can use this XML Schema for validation, which is of great help when learning the CML language. Pressing Ctrl-space in Bioclipse will now show you what content is allowed at the current character position.
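The same schema-driven validation the WTP editor does can be sketched with the JDK's built-in javax.xml.validation API. The tiny schema and instance document below are invented stand-ins for the real CML schema, just to show the mechanism:

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateExample {
    public static void main(String[] args) throws Exception {
        // toy schema standing in for the real CML schema (illustrative only)
        String xsd =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
            " <xs:element name='atom'>" +
            "  <xs:complexType>" +
            "   <xs:attribute name='elementType' type='xs:string'/>" +
            "  </xs:complexType>" +
            " </xs:element>" +
            "</xs:schema>";
        String xml = "<atom elementType='C'/>";
        SchemaFactory factory =
            SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
        Validator validator = schema.newValidator();
        // throws SAXException for invalid documents; silence means valid
        validator.validate(new StreamSource(new StringReader(xml)));
        System.out.println("valid");
    }
}
```

An editor like the WTP's does this continuously while you type, turning schema violations into the red markings mentioned above.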

Yes, Bioclipse can do this now (in SVN, at least). This has been on my wishlist for at least two years, but I never really found the right information. Then, three days ago, David wrote about End of Year Cramps, in which he describes some of his work on the WTP on autocomplete for XPath queries. He see[s] a brighter future for XML at eclipse over the next year. I hope that those in the eclipse and XML community will help to continue to improve the basic support, so that first class commercial quality applications that leverage this support can continue to be built.

That was enough of a statement for me to ask in the comments how to make the WTP XML editor aware of the CML XML Schema. It already picked up XML Schemas via xsi:schemaLocation, but I needed something that worked without such statements in the XML document itself. David explained to me that I could use the org.eclipse.wst.xml.catalog extension. This was really easy, and committed to Bioclipse SVN as:
<extension
    point="org.eclipse.wst.xml.core.catalogContributions">
  <catalogContribution>
    <uri name="http://www.xml-cml.org/schema"
         uri="schema24/schema.xsd"/>
  </catalogContribution>
</extension>
However, that does not make the WTP XML editor available in the Bioclipse application yet. Not even in the "Open With..." menu. So, I set up a CML Feature. After a follow-up question, it turned out that the CML content type of Bioclipse was already a subtype of the XML type:
<extension
    point="org.eclipse.core.runtime.contentTypes">
  <content-type
      base-type="org.eclipse.core.runtime.xml"
      id="net.bioclipse.contenttypes.cml"
      name="Chemical Markup Language (CML)"
      file-extensions="cml,xml"
      priority="normal">
  </content-type>
</extension>
So, the only remaining problem was to actually get the WTP XML editor into the Bioclipse application. The new CML Feature takes care of that (I hope the export and the building of the update site work too, but that's yet untested) by importing the relevant plugins and features. Last night, however, I ended up with one stacktrace which gave me little clue about which plugin I was still missing.

Therefore, I headed to #eclipse and actually met David, author of the blog post that started all this. He asked nitind to think about it too, and they helped me pin down the issue. The relevant bit of the stacktrace turned out to be:
Caused by: java.lang.IllegalStateException
at org.eclipse.core.runtime.Platform.getPluginRegistry(Platform.java:774)
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry$ResourceFactoryRegistryReader.&lt;init&gt;(WTPResourceFactoryRegistry.java:275)
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry.&lt;init&gt;(WTPResourceFactoryRegistry.java:61)
at org.eclipse.wst.common.componentcore.internal.impl.WTPResourceFactoryRegistry.&lt;init&gt;(WTPResourceFactoryRegistry.java:55)
... 37 more
This referred to this bit of code in Eclipse's Platform.java:
Bundle compatibility = InternalPlatform.getDefault()
    .getBundle(CompatibilityHelper.PI_RUNTIME_COMPATIBILITY);
if (compatibility == null)
    throw new IllegalStateException();

So, the plugin I turned out to have missing was org.eclipse.core.runtime.compatibility. Apparently, some parts of the WTP that the XML editor uses still rely on Eclipse 2.x technology.

This screenshot shows the WTP XML editor in action in Bioclipse on a CML file. The 'Design' tab shows the document contents, including the allowed content as derived from the XML Schema for CML. Also note that the Outline and Properties views come for free, allowing for more detail and navigation of the content.

This screenshot shows the 'Source' tab for the same file, where I deliberately changed the value of the @id attribute of the first atom. The value does not validate against the regular expression defined in the CML schema for @id attribute values. It also shows the content assisting in action: at any location in the CML file, I can hit Ctrl-space, and the editor will show me which content I can add at that location.
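That regular-expression check can be mimicked in a few lines of plain Java. The pattern below is a hypothetical stand-in; the actual CML schema defines its own regular expression for @id values:

```java
import java.util.regex.Pattern;

public class IdCheck {
    // hypothetical stand-in for the id pattern in the CML schema:
    // a letter, followed by letters, digits or underscores
    private static final Pattern ID = Pattern.compile("[A-Za-z][A-Za-z0-9_]*");

    public static boolean isValidId(String id) {
        return ID.matcher(id).matches();
    }

    public static void main(String[] args) {
        System.out.println(isValidId("a1"));  // starts with a letter: valid
        System.out.println(isValidId("1a"));  // starts with a digit: invalid
    }
}
```

A schema-aware editor runs exactly this kind of pattern match on every attribute value and flags the mismatch in place.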

This makes Bioclipse a perfect tool to craft CML documents and learn the language.

11 Years of Debian

11 years ago, give or take a day, I bought the special issue of CHIP which shipped Debian 1.3.1. I think I had tried SuSE and RedHat earlier that year, but this Debian release made me switch away from proprietary products for 98% of my work (taxes I still had to do with Windows 98). Right now, I am mostly running Ubuntu, which leans heavily on the work of the Debian project.

I celebrated by installing a prerelease of Lenny, Debian's next stable release (still testing right now), in a virtual machine with VirtualBox. Works like a charm, and will allow me in 2009 to finally pick up some packaging work for Debian and maybe, finally, get Jmol available in Debian main.

Friday, December 26, 2008

State of CDK 1.2.0...

The reason why I have not blogged in more than two weeks is that I was hoping to blog about the CDK 1.2.0 release. This was originally aimed at September, then slipped into October, November and then December. There were only three show stoppers (see this wiki page), one of which was that the IChemObject interfaces were not properly tested.

The problem was that the unit tests for methods in superinterfaces were not applied to implementations of subinterfaces. For example, the unit test for IElement.getSymbol() was not applied to the class Atom, which implements IAtom, which is a subinterface of IElement.

In fixing this, I had to take some hurdles. For example, the unit test classes used a set-up that followed the implementations; CDK 1.2.x has three implementations of the interfaces: data, datadebug and nonotify. The last does not send around update notifications, and rough tests indicate it is about 10% faster. The second implementation sends messages to the debugger for every modification of the data classes, which is clearly useful for debugging purposes.

However, the JUnit4 test classes were basically doing the same. The unit test DebugAtomTest inherited from AtomTest, and only overrode customizations. AtomTest itself inherited from ElementTest. That's where things got broken. In a single-implementation set-up this would have been fine, but to allow testing of all three implementations, getBuilder() had to be used.

And when I implemented that, I did not realize that ElementTest would do a test like:

IElement element = builder.newElement();
// test IElement functionality
However, while the use of the builder ensures testing of all three implementations, it does not run these tests on IAtom implementations.

There followed a long series of patches to get this fixed. One major first patch was to define unit test frameworks like AbstractElementTest, which formalized running unit tests on any implementation, as I noticed that quite a few tests were still testing one particular implementation. This allowed DebugElementTest to extend AbstractElementTest, instead of ElementTest, which would now extend AbstractElementTest too.

OK, with that out of the way, it was time to fix running the unit test for IElement.getSymbol() on IAtom implementations, which required the removal of the use of IChemObjectBuilder implementations. So, I introduced newChemObject(), which returns a fresh instance of the actually tested implementation. That is, DebugAtomTest would return a new DebugAtom, and the getSymbol() test would now run on DebugAtom and not DebugElement. Good.

No, not good. The actual implementation I was using looked like:

public class DebugElementTest extends AbstractElementTest {
    @BeforeClass public static void setup() {
        setChemObject(new DebugElement());
    }
}

public abstract class AbstractElementTest extends AbstractChemObjectTest {
    @Test public void testGetSymbol() {
        IElement element = (IElement)newChemObject();
        // do testing
    }
}

public abstract class AbstractChemObjectTest {
    private static IChemObject testedObject;
    public static void setChemObject(IChemObject object) {
        testedObject = object;
    }
    public IChemObject newChemObject() {
        return (IChemObject)testedObject.clone();
    } // just imagine it has try/catch here too

    // and here the tests for the IChemObject API
    @Test public void testGetProperties() {
        IChemObject element = (IChemObject)newChemObject();
        // do testing
    }
}
Excellent! No.

Well, yes. The above system works, but made many unit tests fail because of bugs in clone() methods. The full scope has yet to be explored, but at least IPolymer.clone() is not doing what I would expect it to do. Either I am wrong, and need to override the clone unit tests of superinterfaces in AbstractPolymerTest, or the implementations need fixing. I emailed the cdk-devel mailing list and filed a bug report. But having about 1000 unit tests fail because of broken clone() methods is something I did not like, as it makes bug fixing more difficult.

So, the next step was to find an approach that did not require clone(), which gave some interesting insights into the Java language. JUnit4 requires @BeforeClass methods to be static. This means I cannot have a non-static DebugElementTest method return an instance. And you cannot override a static method! That had never occurred to me in the past: DebugElementTest.newChemObject() does not override AbstractChemObjectTest.newChemObject(), which is somewhere upstream.
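That static methods are hidden rather than overridden can be shown with a small stand-alone example (the class names only mimic the CDK test classes; this is not CDK code):

```java
public class StaticHiding {
    static class AbstractTestBase {
        static String newChemObject() { return "base"; }
        // this call is resolved at compile time, in this class
        String viaInstance() { return newChemObject(); }
    }

    static class DebugElementTestLike extends AbstractTestBase {
        // this HIDES the base method; it does not override it
        static String newChemObject() { return "debug"; }
    }

    public static void main(String[] args) {
        // the inherited non-static method still calls the base class's
        // static method, so this prints "base", not "debug"
        System.out.println(new DebugElementTestLike().viaInstance());
    }
}
```

This is exactly why a static newChemObject() in a subclass never reaches tests written against the superclass.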

But, after discussing matters with Carl, I ended up with this approach:

public abstract class AbstractChemObjectTest extends CDKTestCase {
    private static ITestObjectBuilder builder;
    public static void setTestObjectBuilder(ITestObjectBuilder builder) {
        AbstractChemObjectTest.builder = builder;
    }
    public static IChemObject newChemObject() {
        return AbstractChemObjectTest.builder.newTestObject();
    }
}

public interface ITestObjectBuilder {
    public IChemObject newTestObject();
}

public class DebugAtomTest extends AbstractAtomTest {
    @BeforeClass public static void setUp() {
        setTestObjectBuilder(new ITestObjectBuilder() {
            public IChemObject newTestObject() {
                return new DebugAtom();
            }
        });
    }
}
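Stripped of JUnit and the CDK interfaces, the mechanism can be sketched in plain Java: the static builder field is installed once per concrete test class, and all inherited tests create their objects through it (class names and the string stand-ins here are illustrative, not the real CDK types):

```java
public class BuilderSketch {
    // mirrors ITestObjectBuilder
    interface TestObjectBuilder { Object newTestObject(); }

    // mirrors AbstractChemObjectTest
    static class AbstractTest {
        private static TestObjectBuilder builder;
        static void setTestObjectBuilder(TestObjectBuilder b) { builder = b; }
        static Object newChemObject() { return builder.newTestObject(); }
    }

    public static void main(String[] args) {
        // what DebugAtomTest's @BeforeClass does: install a builder
        AbstractTest.setTestObjectBuilder(() -> "DebugAtom");
        // what every inherited test does: ask for a fresh object
        System.out.println(AbstractTest.newChemObject());
        // switching implementation only means installing another builder
        AbstractTest.setTestObjectBuilder(() -> "NNAtom");
        System.out.println(AbstractTest.newChemObject());
    }
}
```

Because the indirection goes through a static field instead of static-method dispatch, the hiding problem described above disappears.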

Monday, December 08, 2008

Peer reviewed Cheminformatics #2: Code review for the Chemistry Development Kit

Peer review is an important component of open source development, and recently there was the discussion the other way around: whether open source is required for peer review. That depends on your definition of peer review. No, if you restrict peer review to what it is in publishing (see Re: Open Source != peer review); yes, if we really want to speed up cheminformatics evolution and assume unrestricted, open peer review, where reviewers can openly publish their review report with all the greasy details (see Peer reviewed Chemoinformatics: Why OpenSource Chemoinformatics should be the default).

The CDK has a strong history of peer review. Patches have been available from SVN from the start, and later we set up a mailing list so that people could easily monitor code changes. I have actually been doing this since the start, scanning the code patches, knowing that a lot of code is backed up by unit tests to detect regressions. Anyone can review CDK code in this manner, just by subscribing to the cdk-commits mailing list. If one has questions or comments on a patch, a reply to cdk-devel is all that is needed to get things going.

About a year ago, CDK development had become so extensive that code review in this manner was no longer the way forward (though still possible, and still used). It turned out that it was all too easy to overlook a patch or just click it away in busy times; this was experienced by some developers who previously monitored the cdk-commits messages sketched above. So, we moved to a more formal patching system where any non-trivial patching is done in an SVN branch. Once the primary developer is happy with the branch, (s)he requests a review by other developers. These can leave comments in the source code, reply to the mailing list, or leave comments in the CDK patch tracker. This more formal work habit has been in action for about half a year already.

A recent message from Stefan makes clear that this tracker has some room for improvement. For example, there is no automatic email to cdk-devel when a patch has not been tended to for a longer period of time, and I do not see a simple way of doing this with the SourceForge bug tracking system.

But, what I can do, is define a number of groups to represent the state of the patch. So, I defined:
  • Needs Review: this patch has not been reviewed (sufficiently) yet
  • Accepted: but not yet applied to SVN. When applied, the patch report is simply closed
  • Needs Revision: the reviewers like to see changes made to the patch
Not perfect, but a step forward in tracking the state of patches.

Friday, December 05, 2008

Cheminformatics Benchmark Project #1

Yesterday's blog about Who says Java is not fast?!? caused quite some feedback (thanx to all commenters!) with several good points. Of course, a table like that in the cinfony paper invites such discussion (see also the comments in the blogs by Noel (the author) and Rich). Many things may determine why the CDK is fastest in that table for SDF iterating. Suggestions have been that OpenBabel and RDKit may be doing much more than simple reading, and that Java might actually take advantage of the second core for caching file content.

ZZ observed something I overlooked: calculating the molecular mass with the CDK is by far the slowest of all three toolkits, though people have suggestions on why that may be.

Benchmarking
The correct way to compare toolkits (open source, proprietary, free, commercial) is to have a proper benchmark suite for cheminformatics. That's what I am suggesting here: a project to define simple and fair benchmarks. It's an open project, and anyone can contribute, in order to keep tests balanced and impartial towards any tested toolkit.

Thursday, December 04, 2008

Who says Java is not fast?!?

While performance tests actually show that even for core numerical calculations Java is on par with C in terms of speed, and sometimes even hits Fortran-like speeds, people keep thinking that Java is not fast. I can only invite you to test that yourself.

Meanwhile, I would like to take the opportunity to advertise Noel's cinfony paper in CCJ (doi:10.1186/1752-153X-2-24), which features these speed measurements (from the paper, CC-BY license). I have to say that these numbers surprised me, as the CDK is hardly optimized for speed at all...

Short variables and lack of comments...

... a source code reviewer's nightmare. The must-read lwn.net ran a nice open letter to a Linux kernel developer. I'd like to cite this bit about code review (see also Re: Open Source != peer review):
    [Andrew Morton] had a number of concrete requests - such as documenting the user-space ABI and the network protocol - which have not been satisfied. He also asked for better code documentation in general:
      So please. Go through all the code and make it tell a story. Ask yourself "how would I explain all this to a kernel developer who is sitting next to me". It's important, and it's an important skill.
This is important indeed! This is also why CDK quality assurance tends to complain about short variables. While a for-loop index i is clear enough, ac for an IAtomContainer is quite useless, as it does not explain what the purpose of the container is. BTW, a longer name like atomContainer does not really help here either. Maybe I will write a unit test for that...
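A very naive version of such a check could scan source text for short local variable names with a regular expression; a real implementation would of course need a proper Java parser. This sketch is mine, not CDK code, and the pattern is deliberately simplistic:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShortNameCheck {
    // matches declarations like "IAtomContainer ac = ..." where the
    // variable name is only one or two lowercase letters
    private static final Pattern DECL =
        Pattern.compile("\\b[A-Z]\\w*\\s+([a-z]{1,2})\\s*=");

    public static List<String> shortNames(String source) {
        List<String> names = new ArrayList<>();
        Matcher m = DECL.matcher(source);
        while (m.find()) names.add(m.group(1));
        return names;
    }

    public static void main(String[] args) {
        String source = "IAtomContainer ac = builder.newAtomContainer();\n"
                      + "IAtomContainer reactants = other;\n";
        // flags "ac" but not "reactants"
        System.out.println(shortNames(source));
    }
}
```

Hooked into the test suite, such a check would turn the style complaint into an automatically enforced rule.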

Sunday, November 30, 2008

Parallel building the CDK

Some time ago, I added parallel build targets to CDK's Ant build.xml. Now that I am setting up a Nightly for the jchempaint-primary branch, and really only want to report on the CDK modules control and render, I need the build system to use a properties file to define which modules should be compiled.

So, I hacked a bit on the build system and made use of two ant-contrib tasks, if and foreach, which in the first place reduce the size of build.xml, but also provide means for parallelization. Earlier, it used the parallel task of Ant itself for this (see CDK Module dependencies #2).

The build dependencies between CDK modules are fairly complex, and typically this complexity increases with bug fixing etc. Ideally, the build dependencies would be calculated at runtime, instead of being hard-coded as they are right now, and I will explore this in the near future.

These dependencies can be used to build some of the modules in parallel, but not all. This causes the compilation speed-up not to scale linearly with the number of threads or cores. The build times below are calculated for three replicates, on a four-core machine:

Going from one to two threads certainly pays off, but going to 4 shows only a three-second speed-up. The four processor cores were not utilized 100%, so I also attempted 2 threads per core, but that showed zero improvement.
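These diminishing returns fit the usual Amdahl's law picture: if a fraction p of the build is parallelizable, the speedup on n threads is 1/((1-p) + p/n). A quick sketch; the 70% parallelizable fraction is a made-up illustration, not measured on the CDK build:

```java
public class Amdahl {
    // expected speedup for parallelizable fraction p on n threads
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.7; // assumed parallelizable fraction; illustrative only
        for (int n : new int[] {1, 2, 4, 8}) {
            double s = Math.round(speedup(p, n) * 100) / 100.0;
            System.out.println(n + " threads: " + s + "x");
        }
    }
}
```

With p = 0.7, doubling from 2 to 4 threads gains far less than the first doubling did, which matches the build-time pattern above.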

Monday, November 24, 2008

Software is a Method (Meme)


  1. it provides a recipe to approach (scientific) questions

  2. lets you cook up a (scientific) answer

  3. you can use it as a black box (like an orbitrap)

  4. you can refine existing methods (well, some can, others don't)

  5. it has an error (but I do not believe it is normally distributed)

Now, to me it's trivial to work out how Open Source supports this.

Thursday, November 20, 2008

Scripting JChemPaint

Today and tomorrow, Stefan, Gilleain, Arvid and I are having a JChemPaint Developers Workshop in Uppsala, to sprint the development of JChemPaint3, for which Niels laid out the foundation a long time ago already.

Gilleain and Arvid are merging their branches into a single code base, while Stefan is working on the Swing application and applet. The Bioclipse SWT-based widget is being developed for Bioclipse2.

The new design separates widget/graphics-toolkit specifics from the chemical drawing and editing logic. Regarding the editing functionality, this basically comes down to having a semantically meaningful edit API. This allows us to convert both Swing and SWT mouse events into things like addAtom("C", atom), which adds a carbon to an already existing atom. However, without too much fantasy, it also allows adding a scripting language. This is what I have been working on. Right now, the following API is available from the Bioclipse2 JavaScript console (via the jcp namespace, in random order):
  • ICDKMolecule jcp.getModel()
  • IAtom getClosestAtom(Point2d)
  • setModel(ICDKMolecule) (for really fancy things)
  • removeAtom(IAtom)
  • IBond getClosestBond(Point2d)
  • updateView() (all edit commands issue this automatically)
  • addAtom(String,Point2d)
  • addAtom(String,IAtom) (which works out coordinates automatically)
  • Point2d newPoint2d(double,double)
  • updateImplicitHydrogenCounts()
  • moveTo(IAtom, Point2d)
  • setSymbol(IAtom,String)
  • setCharge(IAtom,int)
  • setMassNumber(IAtom,int)
  • addBond(IAtom,IAtom)
  • moveTo(IBond,Point2d)
  • setOrder(IBond,IBond.Order)
  • setWedgeType(IBond,int)
  • IBond.Order getOrder(int)
  • zap() (sort of sudo rm -Rf /*)
  • cleanup() (calculate 2D coordinates from scratch)
  • addRing(IAtom,int)
  • addPhenyl(IAtom)
This API (many more methods will follow) is not really aimed at the end user, who will simply point and click. The goal of this scripting language is, at least at this moment, to test the underlying implementation using Bioclipse. Future applications, however, may include simple scripts which use some logic to convert the editor content, for example replacing a t-butyl fragment with a pseudo atom "t-Bu". The key thing to remember is that this will allow Bioclipse to have non-CDK-based programs act on the JChemPaint editor content (e.g. using getModel() and setModel(ICDKMolecule)). More on that later.

A simple script could look like this; or, as a screenshot:

Tuesday, November 18, 2008

Solubility Data in Bioclipse #1

I am working on converting Jean-Claude's solubility data to RDF (after Pierre's model; see here, here, and here, here for first data exploration), so that I can integrate it with data from DBPedia, Freebase, rdf.openmolecules.net, etc. Bioclipse will be the workbench in which this will be visualized, and it just got graph depiction online using Zest. The screenshot does not show the RDF yet, but that will follow soon:

Next stops:
  1. create an Eclipse package for Jena
  2. read the Solubility data (does anyone know a Java library to read from Google Docs?)
  3. create a virtual database of Solubility compounds (possibly StructureDB-based)
  4. Use the CDK to autoextract chemical triples

Wednesday, November 12, 2008

Re: Open Source != peer review

Andrew has an interesting thread on the content of a slide from a recent presentation. In the comments you can read the back and forth; indeed, there are very many aspects to this, and he asked a very complex question, of which he assumed that I understood what he was asking, and I indeed assumed that I understood it too:
    Some argue that doing good computational-based science requires open source. The argument is that scientists need to review the source code in order to verify that it works correctly. How, they argue, can you review someone else's paper if you can't review the source code used to make that paper?

    I like open source. (My talk goes into the philosophical differences between "open source" and "free software.") I think there should be support for peer review. But I don't understand why the ability to see the source code, in order to review it for scientific quality, requires the right to redistribute the source code to others.
So, I assumed he was interested in hearing why people think open source benefits peer review. Misinterpreting the last two words, I thought of access to the code and the ability to redistribute code I find bad in my peer review. There was another incorrect assumption on my side: I had open peer review in mind, which I like so much about open source projects, instead of peer review as in paper peer review, prior to the preprint-server age. Another thing I understood incorrectly was that he was only referring to computational packages, not cheminformatics in general. My mistake. The talk being from a GCC meeting, I assumed the latter.

So, a lot of miscommunication. I agree to a large extent with Andrew's analysis: peer review is certainly possible without Open Source. Actually, this matches closely with the discussion of Cathedral versus Bazaar open source projects (see my post earlier this week). He argues that current open source (cheminformatics) projects do not have enough eyeballs, and indicates that money buys eyeballs. Indeed it does.

However, the original argument I wanted to make, but failed, is that Open Source (any kind of access to the source code) is a strict requirement for reviewing the implementation. We do not want black boxes.

How you organize this access to the source code is another thing, and the topic of much of the discussion on Andrew's blog. There are many solutions, but all include some sort of access to the source code. Redistribution is not a requirement, though, if the review is only sent upstream, as is common in reviewing papers.

I feel that Open Source is a solution worth fighting for, but I do understand the argument that funding this approach remains a problem. Open Source cheminformatics is the equivalent of a preprint server: one solution to peer review, and a good one, I think, but not the only one. The parallels are seemingly even stronger: you cannot review a paper by just reading the abstract and the conclusion; a paper is not a black box either.

Anyway... just a tip of the iceberg touched in the discussion. Feel free to join in.

Monday, November 10, 2008

Finding the commit that causes the regressions...

CDK 1.1.x releases are well in progress, but a recent commit broke a number of unit tests. Here comes git-bisect.
$ git checkout -b my-local-1.2 cdk1.2.x
$ git bisect start
$ git bisect bad
$ git bisect good 8219139e9236ab8036e9d08c13fcd0482d500c79
These lines indicate that the current version (HEAD) is broken, and that revision 8219139e9236ab8036e9d08c13fcd0482d500c79 was OK. Now, git-bisect does the proper thing and starts in the middle, allowing me to run my tests and issue a git bisect bad or git bisect good depending on whether my test fails or passes. The test I am running is:
$ ant clean dist-all test-dist-all jarTestdata
$ ant -Dmodule=smarts test-module
$ git bisect [good|bad]
So, if I had to inspect 1024 commits, I'd find the bad commit after running this test suite 10 times. For the culprit I was after, it took 6 runs. The outcome was this commit, which I already suspected and had emailed about to the cdk-devel mailing list:
[fa49ac603c36908f341b25d52a78435cdb8ca4d3] atomicNumber set as default (Integer) CDKConstants.UNSET
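That a broken commit among 1024 candidates is found in about 10 runs follows from bisection halving the candidate range each time; a self-contained simulation of the process (a toy model, not the actual CDK history):

```java
public class BisectSim {
    // simulate git-bisect: commits before 'firstBad' are good, the rest bad;
    // count how many test runs are needed to pinpoint firstBad
    static int runsNeeded(int commits, int firstBad) {
        int lo = 0;              // known good
        int hi = commits - 1;    // known bad
        int runs = 0;
        while (hi - lo > 1) {
            int mid = (lo + hi) / 2;
            runs++;              // one full test-suite run per bisect step
            if (mid < firstBad) lo = mid; else hi = mid;
        }
        return runs;             // hi now equals firstBad
    }

    public static void main(String[] args) {
        // about log2(1024) = 10 runs, wherever the bad commit sits
        System.out.println(runsNeeded(1024, 700)); // prints 10
    }
}
```

Each test run halves the suspect range, so the number of runs grows only logarithmically with the number of commits.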

Friday, November 07, 2008

Open{Data|Source|Standards} is not enough: we need Open Projects

The Blue Obelisk mantra ODOSOS (Open Data, Open Source, Open Standards) is well known, and much cited too. Jean-Claude Bradley popularized Open Notebook Science (ONS). This has always been nagging me a bit, because the CDK, Jmol, JChemPaint and other chemistry projects have done that for much longer, though we did not use notebooks as much, and so just called it an open source project. It really is no different, IMO, though surely there are differences.

Anyway, the key thing which ONS and the CDK and Jmol share is that they use an open notebook. Not every Open Source or Open Data project does. Actually, many scientific Open Source projects are not Open Projects! They are more like the Cathedral than the wished-for Bazaar (see The Cathedral and the Bazaar). So, Open Source (science) projects are certainly not ONS projects by default!

Now, the CDK actually is ONS; it is a Bazaar. The notebooks we use include: What more would you wish for? That's not a rhetorical question. Remember that every reader of this blog is on my advisory board!

Unfortunately, I do not do work at a workbench myself, so I do not produce new knowledge myself, other than what is extracted from existing data. That's really a shame, and I really do hope that Jean-Claude or Cameron will send me a box to measure solubilities (see here, here, and here, here for first data exploration), even though I cannot participate in the challenge. (hint, hint :)

From Cathedral to Bazaar in Life Sciences
One Cathedral we ran into with Bioclipse was BioCatalogue, which will serve as a website where people can annotate and categorize (web) services. While the project has been around for a while, the website was rather uninformative. Fortunately, the project is going to open up and be more Bazaar-like. For example, they have now started a wiki and a mailing list. I hope these efforts will continue, so that I can contribute from my point of view!

The EMBRACE Registry is a project with similar goals and a rather nice outcome (which I learned about on Monday). It is actually anticipated to be replaced by, or merged with, BioCatalogue. So, all the data I entered, cheminformatics workflows (look, no 'o'), will later be available from BioCatalogue too. That is already my first contribution to BioCatalogue. One enormously interesting feature of the Registry is that it allows uploading of code to test the service. This means the Registry will not only poll whether the service is still online (by checking the WSDL file), it will also test whether the service behaves properly. Now, immediate thoughts go to mashups with MyExperiment: each WSDL entry in the Registry points to MyExperiment workflows that use it, and the workflow page would indicate the status of all used WSDL services. This integration was already anticipated long before I thought about it, as the involved Cathedrals are conveniently located on the same floor in Manchester.

Below is a screenshot from the EMBRACE Registry for the ChemSpider WSDL entry, for a workflow I uploaded about a year ago to MyExperiment:

BTW, ChemSpider has an Advisory Board of which I am a member, but it is also a classical (and intentional) Cathedral project. We do share common interests, though, which makes us collaborate.

Why Important?
One recurrent theme in Open Source is given enough eyeballs, all bugs are shallow. This surely applies to science as well. The difference between the two is that in current science the eyes only inspect with a delay of at least 6 months. Current practice is that research is finished (delay), then, when deemed publishable, written up as a paper (delay, and valuable information is lost in the process, as you can read in my blog all the time), and published (even more delay). ONS changes that, and so do Bazaar-like open source projects, such as the CDK, Jmol and Bioclipse. The bugs are present whether we like it or not, not just in source code, but in science too. Theories get overthrown, but why should we like the long delays of current scientific good practice? Hate it! Work around it. Use the Bazaar. Use ONS!

Now, ONS actually needs Open Source, allowing practitioners to deal effectively with the data they produce, and to allow extraction of new scientific knowledge from the measurements. If Rajarshi and Pierre had not made their efforts, others could not easily have joined in, leading to those much-hated delays. Bugs should be shallow, and openness allows us to make those bugs visible. We can prove that there is a bug without having to reproduce data ourselves, which would lead to those nasty delays again. Just copy the data, compare it to your own, and do your analysis.

One recent project in open source chemistry dealing with making bugs visible, is the web page set up by Andreas Tille for the DebiChem project. His page summarizes the bugs listed for the chemistry in Debian (which includes the Blue Obelisk projects Avogadro, BODR, CDK, Chemical MIME Data, Kalzium and OpenBabel):

This data analysis helps the projects being analyzed.

Packaging
This brings me to a last topic for this blog: packaging using Open Standards. In order to allow those eyeballs to spot bugs, it is of the utmost importance to package your results in Open Standards, and not just one, but likely many. For Open Source projects this ultimately means distribution packages (deb or rpm). If that goal has been achieved, you know your results can be read by anyone. Software should be installable (make, ant, cmake, etc), and data should be readable (no PDF, but RDF, XML, JSON, or whatever standard). Preferably not Excel, as its format is too free (as Rajarshi also indicated), but with some added conventions it may do well. Blue Obelisk projects are generally doing well in terms of packaging.

For the CDK, which is already reasonably well packaged, I am currently working on Eclipse and Maven2 packages. The former is already being used by Bioclipse, while the latter aims at Jumbo (which has just seen a new release; Jim, I'm happy to see the CMLDOM/Jumbo split!), CDK-Taverna, and possibly a third (Paula, what do you plan to use it for?). The POM export is not fully working yet, but with four research sites involved in this Open Project, I'm sure we'll work it out.

The bottom line is, scientific progress would benefit so much from a Bazaar approach. And the key thing is not collaboration; that's something you can do in a Cathedral-like fashion too. No, the key thing is to be Open and allow anyone, even your worst nightmare, to comment on what you do. Let him prove you wrong, openly, that is.

OK, there it is. My open notebook entry for this week. Now you know what I have been up to this week.

Tuesday, November 04, 2008

Next generation asynchronous webservices #2

Getting back to some webservice stuff (see part #1 of this series)... actually, I'll use cloud service from now on, since web service is reserved for SOAP/WSDL (see my EMBRACE presentation). Let me present this bit of JavaScript I just ran in Bioclipse2:
xws.connect();
service = xws.getService("cdk.ws1.bmc.uu.se");
service.discoverSync(9000);
service.getFunctions();
f = service.getFunction("calculateMass");
ios = f.getIoSchemataSync(9000);
iof = xws.getIoFactory(ios);
smiDoc = iof.createSmilesDocument();
smiDoc.setSmiles("CCC");
result = f.invokeSync(smiDoc.toString(), 9000);
obj = iof.getOutputObject(result);
print("Mass: " + obj.getStringValue())
At first, it might look a bit verbose to just calculate the mass of a molecule, and it is, and it is not even written in XML. Hahahaha

Anyway, the code rocks, thanx to Johannes' great work on his xws4j library! I'll explain the script. The first line gets Bioclipse online using a Jabber account, which you set via Bioclipse' preferences pages. The next few lines allow you to connect to a cloud service, this one running on ws1.bmc.uu.se and called cdk. With the getFunctions() method we query which functions are available (called ports in WSDL, if I am not mistaken), from which we pick the calculateMass one.

And then the action joins in. One nice feature of the IO-DATA proposal is that the function itself defines the XML Schema it uses for input and output, and does not rely on WSDL to do that (or maybe recent SOAP specs allow that too). So, we query the function for its schemata, and then something funky happens: we order the xws4j library to create a data model on the fly for this service! From this we get a Java data model for the service. This allows us to use createSmilesDocument() and setSmiles(). That's function-specific stuff!

Of course, we do not have to do that. For example, the second function I wrote (generate3Dcoordinates) eats and spits CML, and I'd rather rely on CMLDOM or the CDK as data model instead. But more on that later...

The Bioclipse xws4j plugin actually puts the data model in my workspace, so that I can easily introspect the API:



The last three lines invoke the function (synchronously, as it's cheap), and get the mass from the function output. BTW, I should stress that a function does not require any specific implementation regarding synchronous or asynchronous calls. You write one function, and can call it either way you like. The library hides all IO-DATA details around that.

Monday, November 03, 2008

EMBRACE workshop in Uppsala

This Monday and Tuesday I will attend the EMBRACE workshop Understanding, creating and deploying EMBRACE compliant WebServices. I will present there the ongoing work in Bioclipse to support services, and web services in particular. The slides of the presentation look like:

Friday, October 31, 2008

Next generation asynchronous webservices

Johannes joined a Bioclipse Workshop a long time ago, and introduced the participants to the idea of using XMPP (aka Jabber) for asynchronous web services. SOAP is commonly used to run webservices over HTTP, but running them via (SMTP) email or XMPP is possible too (see SOAP over XMPP). Using HTTP as transport layer has problems. The biggest problem is possibly that HTTP connections are timed out, e.g. by intermediate routers. This makes HTTP rather unsuited for long-running jobs. Workarounds are easy to come up with, and polling is a common solution.

Johannes' ideas solve this limitation by using the general XMPP protocol for chatting:
client
hey, can you do something for me?
service
sure, I can do generate3Dcoordinates and generateSMILES.
client
ah, nice! what input does generateSMILES take? and the output?
service
input: CML, output a simple string.
client
ok, here's the CML
service
I'm done now. sorry that it took 10 minutes, but I'm running Vista...
client
excellent, please send me the results
service
ok, here is the SMILES for lacosamide: CC(=O)N[C@H](COC)C(=O)NCC1=CC=CC=C1

Well, the important bit is in the last line. A job may take long, even on clusters. The client might have to reboot in the meantime (possibly because of critical security updates)... the service will just continue, and send you a message when done. If you just happen to be offline, it will send the message when you are back online.

Johannes' ideas led to the IO-DATA proposal (XEP-0244), which is currently marked experimental and being discussed on the ws-xmpp mailing list. He gathered a few people around him to get it going, resulting in working stuff! Yeah!

Chemistry Development Kit XWS
Besides contributing to the proposal, I am also involved in this project by writing XMPP webservices for the CDK. This brings me to cdk-xws, the project to bring CDK functionality online as webservices using IO-DATA.

This shows three nodes, the first being the CDK service, with two functions, of which I have implemented only one so far.

For the curious, this is what the XMPP messages look like:
<iq from="egonw@ws1.bmc.uu.se/home" 
id="JSO-0.12.5-6"
to="cdk.ws1.bmc.uu.se"
type="set">
<command xmlns="http://jabber.org/protocol/commands"
action="execute"
node="calculateMass">
<iodata xmlns="urn:xmpp:tmp:io-data"
type="input">
<in>
<smiles xmlns="urn:xws:cdk:input">CCC</smiles>
</in>
</iodata>
</command>
</iq>
<iq from="cdk.ws1.bmc.uu.se"
id="JSO-0.12.5-6"
to="egonw@ws1.bmc.uu.se/home"
type="result"
xml:lang="en">
<command xmlns="http://jabber.org/protocol/commands"
node="calculateMass"
sessionid="XWS-1"
status="completed">
<iodata xmlns="urn:xmpp:tmp:io-data"
type="output">
<out>
<mass>36.032207690364004</mass>
</out>
</iodata>
<note type="info">Done</note>
</command>
</iq>

Embedding Gists in blogs

Mark pointed me to the embed functionality of Gist, a product on GitHub, where I host some todo software and a git mirror of CDK 1.2.x.

So, the other day, when I blogged about Bioclipse2 scripts, I should have embedded the script like this:

Saturday, October 25, 2008

Bioclipse2 Scripting #1: from SMILES to a UFF optimized structure in Jmol

After some difficulties this week with making an export of CDK plugins in the Bioclipse2 Cheminformatics feature with the cdk-eclipse software, I got the following cute Bioclipse2 script up and running:
dimethylether = cdk.fromSMILES( "COC" );
cdk.addExplicitHydrogens( dimethylether );
cdk.generate3dCoordinates( dimethylether );

// save as CML
cdk.saveCML( dimethylether, "/Virtual/dimethylether.cml" );
ui.open( "/Virtual/dimethylether.cml" ); // this should open a JmolEditor

jmol.minimize();
You can see four of my favorite cheminformatics tools integrated: the CDK is used to convert a SMILES into a connection table, to add explicit hydrogens, and to create initial 3D coordinates (with the code from Christian Hoppe, and thanx to Stefan for fixing that code in the CDK 1.1.x branch!). Then, CMLDOM is used to create and save a CML document, which is then opened in a Jmol editor in Bioclipse.

A variation of this script is visible in the following screenshot:

This and other Bioclipse2 scripts I will post in Gist, a sort of pastebin supporting version history, and I'll tag them with bioclipse gist on delicious, so that you can always browse them, comment on them, or add your own gists at http://delicious.com/tag/bioclipse+gist.

Friday, October 24, 2008

Git-Eclipse integration

Recently, I have been blogging about Git: One concern expressed by people was the lack of integration with IDEs. Now, an Eclipse plugin seems well on its way:

With an experimental update site (http://www.jgit.org/update-site), the plugin is just an Eclipse reboot away.

Now, the plugin is still in its early stages and has many open feature requests, but it is actively developed, and fortunately its bug tracker can easily be integrated with Mylyn.

Cheers to Shawn and Robin for their work!

Monday, October 20, 2008

Bugzilla Eclipse IDE integration: Mylyn

A new environment means new tools. Bioclipse is Eclipse RCP-based, so my colleagues work with Eclipse and are much more into Eclipse tooling too. For example, into Mylyn. Mylyn is a tool to track tasks and assign context to them. The task I am interested in (for this blog item) is fixing bug reports. Mylyn is rather suited for this, as it allows linking Java source files to bug reports. With a growing list of projects in my navigator, browsing them becomes difficult because the list is way too long. Mylyn allows me to show only those source files which are actually related to the bug I am fixing. Cool!

However, SourceForge, our bug tracker, does integrate, but with too limited functionality. Bugzilla, though, has excellent integration. Curious about what that would look like, I installed Bugzilla on an Ubuntu system. Which failed. Due to a bug known for two years already! Anyway, two tweaks to the system got it working:
  1. Work around the password in the postinstall script (see here)
  2. Set up a /bugs/ link (see here)
This is Bugzilla as viewed in Mylyn:

(The bug content is derived from Ubuntu bug #1.)

GitToDo support for Freemind: graphical mapping of important things on my schedule

About a week ago, I hooked up my GitToDo software with Freemind. This allows me to organize the projects I am working on, without having to code this in GitToDo directly. I also immediately take advantage of visualization, for example, adding an icon for projects with one or more TODO items marked TODAY or URGENT:

Keeping my GitToDo repository synchronized is as easy as typing:
gtd-freemind-update
gtd-freemind-show

Chemical Editing...

As you might have seen, we (Uppsala and the EBI) are working on the next generation JChemPaint. JChemPaint is an editor, and therefore consists of a model (IChemModel), a view (IRenderer) and a controller (IController). See the many posts in Gilleain's blog.

For the renderer I have set up a wiki page, which I'll be hacking on in the next days, showing how IChemObject content should be rendered in JChemPaint. It looks like:



The IController is a rather important part too, and like the IRenderer bit of JChemPaint, needs a major overhaul. The new design, discussed by Gilleain here and here, should, IMHO, look like:

In this diagram, the gestures can come from any input device (mouse, tracking ball, Wiimote) and will result in events in some widget library (SWT and AWT are shown). The old JChemPaint converted the Swing MouseEvents directly into IChemObject modifications, making the code incompatible with SWT. This is why the Chemical Editing Events layer must be added.

Events in this layer look like addAtom(attachmentAtom, coordinates) and setFormalCharge(atom, newCharge). The link to scripting should be clear now, and will help us write unit tests for this layer.
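To make the idea concrete, here is a minimal sketch of what such a widget-independent layer could look like. Note that the interface name IChemicalEditEvents and the recording implementation are my own illustrations, not the actual JChemPaint API:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the Chemical Editing Events layer; the interface
// and method signatures are illustrative only, not the real JChemPaint API.
interface IChemicalEditEvents {
    /** Adds an atom bonded to the attachment atom at the given 2D coordinates. */
    void addAtom(String attachmentAtom, double x, double y);
    /** Sets the formal charge of the given atom. */
    void setFormalCharge(String atom, int newCharge);
}

// A trivial implementation that records the events it receives. Because it
// depends on no widget toolkit, it can be driven from unit tests or scripts.
class RecordingEditor implements IChemicalEditEvents {
    final List<String> log = new ArrayList<String>();

    public void addAtom(String attachmentAtom, double x, double y) {
        log.add("addAtom " + attachmentAtom + " @ " + x + "," + y);
    }

    public void setFormalCharge(String atom, int newCharge) {
        log.add("setFormalCharge " + atom + " -> " + newCharge);
    }
}
```

An SWT listener and a Swing listener would each translate their toolkit-specific mouse events into calls on this one interface, which is exactly what makes the layer testable without any GUI.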

Chem-bla-ics turns 3!

Five days ago, my chem-bla-ics blog turned 3. Here's the first post. It defined:
    chemblaics is the application of open source software in cheminformatics, chemometrics, proteochemometrics, etc, making experimental results reproducible and validatable.
Much has changed in the field since that post, for the better of the chemical sciences.

Saturday, October 18, 2008

Chemoinformatics p0wned by cheminformatics... #2

Some time ago Noel ran a poll on chemoinformatics and cheminformatics, so I set up a poll too in part #1 of this series. The outcome is clear:



The Obernai meeting strongly suggested chemoinformatics [1], but the start of the open access Journal of Cheminformatics is the killer. I can no longer resist: I'll follow the wish from my advisory board, and the general trend around the world (except India).

The journal's editor-in-chief is David Wild, while Christoph Steinbeck seems set to lead the European branch. People seem to like the idea. The journal will clearly be in direct competition for market share with the JCIM, QSAR & Combinatorial Science, and even the open access Chemistry Central Journal. Interesting to see where this is going...

Tuesday, October 07, 2008

Jmol 11.6 RC 18 in Bioclipse

Just updated Bioclipse2 with Jmol 11.6 RC 18:

Now working in Uppsala makes Bioclipse my default life sciences platform, and I'll be porting older Bioclipse1 plugins to Bioclipse2, which has a much better architecture.

Bioclipse2 does not have a native Jmol Console, but script commands can easily be run with jmol.run() (written by Jonathan). I wonder if it would be hard to have a JmolScript view like this JavaScript Console... The outline on the right (written by Ola) allows me to navigate the Jmol data model.

Monday, October 06, 2008

pKa prediction, or, how to convert a JCIM paper into Java

Lee et al. published last week a paper on pKa prediction (doi:10.1021/ci8001815). As the paper says, the pKa, and in particular the ionic state of a molecule at physiological pH, affects pharmacokinetics and pharmacodynamics. The paper describes a (binary) decision tree using presence or absence of SMARTS substructures to traverse the tree, allowing prediction of monoprotic molecules.

Now, the paper's Supplementary Information contains the full model. I'd rather rebuild the model, but the full training set does not seem to be available. Still, the paper's model shows predictive power comparable to commercial models, so I'd say it would be a welcome addition to the CDK.

And as the CDK already has a SMARTS parser, adding this model should be easy enough. So, here goes :) First, let us outline the API:
/* $Revision$ $Author$ $Date$
*
* Copyright (C) 2008 Egon Willighagen <egonw@users.sf.net>
*
* Contact: cdk-devel@list.sourceforge.net
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU Lesser General Public License
* as published by the Free Software Foundation; either version 2.1
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU Lesser General Public License for more details.
*
* You should have received a copy of the GNU Lesser General Public License
* along with this program; if not, write to the Free Software
* Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA.
*/
package org.openscience.cdk.charges.pka;

import org.openscience.cdk.exception.CDKException;
import org.openscience.cdk.interfaces.IMolecule;

/**
* Tool to predict a molecule's pKa. The class implements
* the algorithm published by Lee et al. {@cdk.cite Lee2008}
* which is based on a SMARTS-based decision tree, trained
* with 1693 monoprotic compounds.
*
* @cdk.module extra
*/
public class PkaPredictor {

/**
* Predicts the pKa value of a molecule.
*
* @param container IMolecule to predict the pKa for
* @return The predicted pKa
*
* @throws CDKException upon failure of the prediction algorithm
*/
public static float predict(IMolecule container) throws CDKException {
// the actual implementation is discussed below
return 0.0f;
}

}
The first line is picked up by SVN, which will fill in the revision number, the last committer, and when the last commit happened. The third line is important: it indicates who has the right to modify the license, or needs to be asked permission to do so, if ever needed. If people provide patches to the code, they are added to this list. The rest of the source file header includes a general contact email address, and the LGPL v2.1 license the CDK uses. The package declaration puts it in the cdk.charges.pka package, which seemed appropriate.

The class JavaDoc contains two CDK-specific tags. The tag {@cdk.cite Lee2008} is used to point to the literature reference database in doc/refs/cheminf.bibx. When the HTML JavaDoc is compiled, the full reference gets included in the HTML. The other tag, @cdk.module, is used by the CDK build system to determine in which CDK module the class should end up; extra in this case. The method's JavaDoc is pretty standard.

Next, we need some logic to traverse the decision tree and look up the predicted pKa, which I implemented as:
public static float predict(IMolecule container) throws CDKException {
if (node1 == null) initialize();

DecisionTreeNode node = node1;
// traverse down the tree until we end up in a leaf
while (!node.isTerminal()) {
node = node.decide(container);
}
return node.getValue();
}
The root node of the tree is called node1, and I explain its initialization later. Then, the code traverses the tree by asking each node to decide whether its SMARTS substructure is present or not; this returns the child DecisionTreeNode matching the presence or absence. At some point, a terminal node is reached, and we can ask this node for the associated prediction value.

The Java version of the Decision Tree

The paper's supplementary information contains the tree encoded like this:
1,0,,,5.9131093
2,1,[#G6H]C(=O),1,3.6849957
3,1,[#G6H]C(=O),0,7.206913
That is, each line lists the node identifier, the parent identifier, the SMARTS query, presence (1) or absence (0), and the node value. Actually, a bit more, but these are the important bits for now. The first line is the root node node1, and the second and third line the two children of the root node. If the [#G6H]C(=O) substructure is present, then node2 applies, and the predicted value would be 3.6849957; if the substructure is absent, then node3 applies, and pKa 7.206913.
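As an illustration of that conversion step, one such line could be parsed in Java like this. This is my own sketch, not the code actually used, and it assumes just the five fields listed above; the class name TreeLine is hypothetical:

```java
// Hypothetical sketch: parse one line of the supplementary decision-tree
// table into its fields, using the layout described above:
// nodeId,parentId,SMARTS,presence,value
// The numeric fields are parsed from both ends of the line, so a SMARTS
// pattern containing commas (e.g. [#8,#16]) would still be read correctly.
class TreeLine {
    final int nodeId;
    final int parentId;
    final String smarts;
    final boolean present;
    final float value;

    TreeLine(String line) {
        int c1 = line.indexOf(',');                      // after nodeId
        int c2 = line.indexOf(',', c1 + 1);              // after parentId
        int last = line.lastIndexOf(',');                // before value
        int secondLast = line.lastIndexOf(',', last - 1); // before presence
        nodeId = Integer.parseInt(line.substring(0, c1));
        parentId = Integer.parseInt(line.substring(c1 + 1, c2));
        smarts = line.substring(c2 + 1, secondLast);     // may be empty (root)
        present = "1".equals(line.substring(secondLast + 1, last));
        value = Float.parseFloat(line.substring(last + 1));
    }
}
```

For the root line 1,0,,,5.9131093 this yields an empty SMARTS and the value 5.9131093, matching the description above.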

Now, these nodes and their interdependencies are encoded in the initialize() method as:
private static void initialize() throws CDKException {
node1 = new DecisionTreeNode(5.9131093f, 17.32f);
DecisionTreeNode node2 = new DecisionTreeNode(3.6849957f, 5.9569998f);
DecisionTreeNode node3 = new DecisionTreeNode(7.206913f, 17.32f);
node1.setChildNodes("[#G6H]C(=O)", node2, node3);
}
The second argument in the DecisionTreeNode constructor is the value range for the node, and is an indication of the variance of the prediction value.
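The DecisionTreeNode class itself is not shown in this post; below is a minimal sketch of what it could look like. Note that this is my own simplification, not the actual CDK code: the container parameter is reduced to Object, and the SMARTS matching is stubbed out where the real code would run the CDK SMARTS engine.

```java
// Hypothetical simplification of the DecisionTreeNode class (not the actual
// CDK code): the molecule parameter is reduced to Object, and the SMARTS
// matching is a stub where the real code would run the CDK SMARTS engine.
class DecisionTreeNode {
    private final float value;  // predicted pKa when this node is the end point
    private final float range;  // value range, indicating prediction variance
    private String smarts;
    private DecisionTreeNode ifPresent;
    private DecisionTreeNode ifAbsent;

    DecisionTreeNode(float value, float range) {
        this.value = value;
        this.range = range;
    }

    void setChildNodes(String smarts, DecisionTreeNode ifPresent, DecisionTreeNode ifAbsent) {
        this.smarts = smarts;
        this.ifPresent = ifPresent;
        this.ifAbsent = ifAbsent;
    }

    boolean isTerminal() {
        return ifPresent == null && ifAbsent == null;
    }

    float getValue() {
        return value;
    }

    float getRange() {
        return range;
    }

    DecisionTreeNode decide(Object container) {
        return matches(container, smarts) ? ifPresent : ifAbsent;
    }

    // stub: the real implementation would run the SMARTS query against the
    // molecule, e.g. with the CDK SMARTS engine
    private boolean matches(Object container, String smarts) {
        return false;
    }
}
```

Separating the decision logic (decide) from the matching engine (matches) is what makes the tree traversal in predict() so short.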

A simple Perl script can convert the file from the supplementary information into Java source code. With more than 1500 nodes in the tree, this beats manual hacking up of the tree.

The JUnit4 test

The unit test now looks like:
package org.openscience.cdk.charges.pka;

// imports added for completeness (assuming the CDK 1.2 package layout)
import org.junit.Assert;
import org.junit.Test;
import org.openscience.cdk.NewCDKTestCase;
import org.openscience.cdk.config.Elements;
import org.openscience.cdk.interfaces.IAtom;
import org.openscience.cdk.interfaces.IMolecule;
import org.openscience.cdk.nonotify.NoNotificationChemObjectBuilder;

/**
* Unit test to test the functionality of the {@link PkaPredictor}.
*
* @author egonw
* @cdk.module test-extra
*/
public class PkaPredictorTest extends NewCDKTestCase {

@Test public void testThrowsNoException() throws Exception {
IMolecule methane = NoNotificationChemObjectBuilder.getInstance().newMolecule();
IAtom carbon = methane.getBuilder().newAtom(Elements.CARBON);
methane.addAtom(carbon);

float result = PkaPredictor.predict(methane);
// the actual value depends on the number of nodes I actually added,
// but I *do* know the min and max without having to have all nodes
// implemented
Assert.assertTrue(result < 15.526);
Assert.assertTrue(result > -0.6659999);
}
}
Note that I cannot assert the real prediction value until the full decision tree has been implemented in the class, but I do know the full range and thus test for that. You may have noted that several methods throw CDKExceptions, which would be caused by SMARTS expressions the CDK cannot handle...

SMARTS problems...

Now, the SMARTS used in the supplementary information indeed do not work with the CDK SMARTS engine; the paper indicates that they used MOE which extends the original Daylight SMARTS. So, if you ever wondered about the forking risk of Open Standards...

So far, I have identified these three patterns used in the paper's model, but not parsable by the CDK engine:
  1. [i] an SP2-hybridized carbon (aromatic or delocalized)
  2. [#G6] matches carbon and sulfur, so seems to indicate a group in the periodic table
  3. [#X] no idea... (no internet at home yet, so cannot Google either)
The #G syntax can be rewritten in an OR form, and possibly the others too. However, I'd rather see the CDK SMARTS engine support these industry-adopted extensions.

Conclusion

The CDK shows its power as development kit, and allowed me to hack up the code of the paper on a casual Saturday evening (sitting on the couch next to a fire in our kacheloven with a glass of beer). Writing up this blog was done the next day.

Once the missing SMARTS patterns have been added to the CDK (or proper replacements have been defined), I'll compare the test set results of the paper with the CDK implementation. I'll probably also convert the test set results from the supplementary information into unit tests (the SI contains SMILES, experimental and predicted values).

Thursday, October 02, 2008

JChemPaint history: CML patches in 1999

There was some talk about the history of chemoinformatics toolkits by Noel and Andrew, which made me wonder about the exact history of Jmol and JChemPaint. Below is the email Christoph dug up from his archives:
X-Mozilla-Status: 1011                                       
X-Mozilla-Status2: 00000000
Message-ID: <372ECD5E.53A49584@ice.mpg.de>
Date: Tue, 04 May 1999 12:35:10 +0200
From: Christoph Steinbeck
Reply-To: steinbeck@ice.mpg.de
Organization: Max-Planck-Institute of Chemical Ecology
X-Mailer: Mozilla 4.51 [en] (WinNT; I)
X-Accept-Language: en
MIME-Version: 1.0
To: Egon Willighagen
Subject: Re: Participating in JChemPaint
References: <000701be9613$34cf52e0$8e74ae83@catv6142.extern.kun.nl>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit



> Egon Willighagen wrote:
>
> Dear Christoph Steinbeck,
>
> Yesterday I visited your site on JChemPaint. I like to contribute some
> of my expertise on
> Java and CML (1).
>
> CML is a markup language that is able to contain chemical information.
> It can contain for example physical properties, for which I use CML in
> my Dictionary on Organic Chemistry (2).
> But is also might contain spectra, bibliographic references etc. And
> of course 2D and 3D
> structural information.
>
> Therefore I propose to write both CML-input and -output procedures for
> the JChemPaint project.
>
> I hope to hear from you soon.
>
> Yours sincerely,
>
> Egon Willighagen
>
> 1. http://www.xml-cml.org/
> 2. http://www.sci.kun.nl/sigma/Chemisch/Woordenboek/

Dear Egon,

thanks very much for your mail and your offer to write CML-input and
output routines for JChemPaint.
That really sounds great to me and I will give you access to our CVS
tree as soon as we have discussed the details.

Cheers,

Chris

--C. S.
Dr. Christoph Steinbeck (http://www.ice.mpg.de/~stein)
MPI of Chemical Ecology, Tatzendpromenade 1a, 07745 Jena, Germany
Tel: +49(0)3641 643644 - MoPho: +49(0)177 8236510 - Fax: +49(0)3641
643665

What is man but that lofty spirit - that sense of enterprise.
.. Kirk, "I, Mudd," stardate 4513.3..

Now, my email must have been triggered by the announcement of JChemPaint on FreshMeat.net, which is the oldest public record of JChemPaint I have found so far:

Who likes my FriendFeed posts most...

Felix has a small tool on his website to show me (or anyone else) who likes what I post on my FriendFeed account:

Which actually is Deepak...

Wednesday, October 01, 2008

Cherry-picking commits from CDK trunk: how to make a reasonable commit message

Some of you heard me complain about commit messages resulting from git cherry-pick, which allows me to apply patches from CDK trunk to a branch without needing to do a full merge of what happens in trunk. The commit messages would be identical, which made it seem that those original messages were mine.

However, this is how I can modify those messages:
    $ git commit --amend
This allows me to convert a mere refactored a method into Applied patch from trunk (rev 12479): [shk3] refactored a method.

Tuesday, September 30, 2008

Git mirror for the CDK

While slowly settling into Sweden, with ADSL that should reach my house in some two weeks, I am enjoying my new office space and using Git to upload patches to the CDK. Christoph wondered if we should switch the CDK from SVN to Git. A few developers objected, for various reasons: no native Windows client (though msysgit might be the solution), no (stable) plugins for Eclipse, IDEA(?), etc.

I made the switch, and am really happy about it.

Anyway, one remaining issue before switching the full CDK project would be a central place where we could host our Git repository. Now, GitHub does just that, and after inquiring with them about the 100MB limit, Tom emailed me:
    Hi Egon,

    We'd love to have your open source project on GitHub. The 100MB is currently a soft limit, so you won't have any problems uploading a larger repo. We hope you enjoy GitHub!

    Tom Preston-Werner
    github.com/mojombo
So, I created an account (I'm happy there are so few Egon's in the world :), and uploaded the CDK 1.2 branch, which, for now at least, will serve as mirror only, while SVN will be the primary repository.

You can easily check it out with:
    $ git clone git://github.com/egonw/cdk.git
I am not sure yet how you can email me your patches, but I know it is possible and will report on this later. This mirror is important to those who want to play with Git, as one no longer requires git-svn, dropping one dependency.

Now, it does add some extra work on my side, as I need to keep the cdk SVN repository (or, better, my git-svn copy of it) synchronized with the git repository, but this turned out to be fairly easy:
    $ cd GitHub/cdk
    $ git pull ../../SourceForge/git-svn/cdk my-local-1.2
    $ git push
So, does this mean no goodies for people who stick to SVN? No, there are some, like this PunchCard:

Wednesday, September 24, 2008

Moved to Sweden: Post-doc in the Bioclipse group of Prof. Jarl Wikberg

The reason why I have not been able to blog much lately is that my family and I have been moving to Uppsala, Sweden, where I'll start a postdoc in the group of Jarl Wikberg @ BMC @ Uppsala University, working on chemoinformatics in drug design, and the use of the CDK and Bioclipse in particular.

More blogging when I have more frequent internet access again...

Tuesday, September 09, 2008

FriendFeed for the Chemistry Development Kit

FriendFeed is a nice aggregation service allowing discussion of items posted from delicious, blogs, and any other RSS-based feed (e.g. my feed). It also has a room concept, where people can post stuff around a topic, such as a conference (e.g. Science Blogging 2008 London) or the CDK:

I have associated the RSS feeds of the CDK bug tracker and the CDK News ASAP, and will shortly add the commit messages feed.

Sunday, September 07, 2008

CDK development with branches using Git

Christoph pointed me to a video on Git by Linus. The CDK is now using branches extensively in development, and we just set up a branch for the upcoming 1.2.0 release later this year (end of October, see cdk-1.2.x). Christoph has just reviewed the branch containing the API move to Iterable. This patch now allows one to do this (which would really deserve a blog item by itself):
for (IAtom atom : molecule.atoms()) {
System.out.println("Symbol: " + atom.getSymbol());
}
Now, while branching in SVN is easy (svn copy), merging is a pain, something Miguel and I found out in the last half year, when we experimented with using branches in development (see also Comparing Branches). We discovered that porting bug fixes from trunk to a branch, or just keeping the branch synchronized with trunk, simply does not work. And merging itself, after a while, became a tedious process. So, when watching Linus' movie on Git, where he mentions being able to merge several branches a day, I knew I had to switch. A full switch for the CDK depends on an always accessible repository (I have been thinking about GitHub; anyone with an opinion on that?).

However, you can start using Git without a central Git repository, including branch support. This blog by Bart has the juicy details, which I'll apply here to CDK, for easy copy/pasting. This replaces the earlier writing on Offline CDK development using git-svn.

First step is to get yourself a Git mirror of SVN (which will take a long time; do it overnight(s)):
$ git svn clone https://cdk.svn.sourceforge.net/svnroot/cdk/cdk/ -T trunk -b branches -t tags
$ git gc
The second command compresses commits to reduce the size of your local Git copy, resulting in a cdk folder of about 300MB. Enter the directory, and check that it has the default master branch:
$ git branch
* master
In SVN one must always do a svn update before one starts coding. Similarly, in git you do (and I found this important to keep your local repository consistent):
$ git svn rebase
Committing has not changed, and a simple change would go via:
$ nano build.xml
$ git commit -m "Changed something, but too lazy to write up what I actually changed" build.xml
$ git svn dcommit
Branches
Now, before we move on to setting up branches, one must realize that there are SVN branches and (local) Git branches; keep that in mind, and remember that we have Git to keep them synchronized. To check the Git branches one uses git branch as shown above; to view the SVN branches, however, we type (which should produce a quite long list for the CDK; only a few are listed below):
$ git branch -r
trunk
tags/cdk-2003-Oct-17
cdk-1.2.x
mesprague-iterators
Here, the first is CDK trunk, the second a tag tags/cdk-2003-Oct-17, and the last two are the branches cdk-1.2.x and mesprague-iterators (no longer existing). I am not sure why the branches/ is missing here; some git-svn magic I presume.

Now, to create local Git branches that are synchronized with the SVN cdk-1.2.x and cdk-1.0.x branches, we type:
$ git checkout -b my-local-1.2 cdk-1.2.x
$ git checkout -b my-local-1.0 cdk-1.0.x
$ git branch
* master
my-local-1.0
my-local-1.2
You can now easily change branches with git checkout <BRANCH>, and check which SVN path you are working against with git log -1:
$ git checkout my-local-1.2
$ git log -1
commit 93bd0b22bbad31897eed6686e5b208c5e23505f7
Author: egonw
Date: Sun Sep 7 08:13:38 2008 +0000

Fixed inline citation (closes #1987947)


git-svn-id: https://cdk.svn.sourceforge.net/svnroot/cdk/cdk/branches/cdk-1.2.x@12215 eb4e18e3-b210-0410-a6ab-dec725e4b171
Inspection of the output shows the git-svn-id line which indicates that that patch was indeed commited against cdk/branches/cdk-1.2.x.

With this set up, I can easily changes between trunk and branches, and backport patches from trunk to the cdk-1.2.x branch (using git cherry-pick) and merge all commits to the branch into trunk using:
$ git checkout master
$ git merge cdk-1.2.x

Git does an excellent job here. It recognizes when the branch was last merged with trunk, and will not attempt to apply patches twice. Even better, it also recognized patches that were backported from trunk to the branch, and will not attempt to merge that either.

The result: I can easily merge branches now, generally speeding up CDK development! For example, it reduces the time between someone submitting a patch and me applying it to trunk (or cdk-1.2.x in case of a bug fix). I just set up a local branch, apply the patch, and tune until I am happy; I do not keep trunk unstable, as I am doing this in a separate branch. Similarly, if people develop their patch in an SVN branch, I can just as easily switch branches (as described above) and check things before I merge.

Setting up new SVN branches
As far as I know, git-svn cannot create or delete SVN branches. But this is easy enough with the SVN commands:
$ svn copy https://cdk.svn.sourceforge.net/svnroot/cdk/cdk/trunk https://cdk.svn.sourceforge.net/svnroot/cdk/cdk/branches/egonw-mynewbranch
$ git svn fetch
$ git checkout -b my-local-newbranch egonw-mynewbranch
$ # hack in my-local-newbranch
$ git commit -a
$ git svn dcommit
$ git checkout master
$ git merge my-local-newbranch
$ svn remove https://cdk.svn.sourceforge.net/svnroot/cdk/cdk/branches/egonw-mynewbranch
Enough for now.