Digitising Chemistry: A Handy Chemical Calculator for the Organic Chemist

Image courtesy vectorpocket at freepik.com

Author: Sandra Ionescu Edited by: Ines Barreios

Robots are taking over the world, and they’re coming for organic chemistry next! Sensationalist headlines such as ‘Are robots replacing chemists?’ [1] and ‘Organic synthesis: march of the machines’ [2] have made the digitisation of chemistry seem like a robot apocalypse rather than a natural progression of the automation occurring in most fields. Humanity is increasingly reliant on computers, and developments in hardware, software, and computer-controlled equipment have already pushed the boundaries of what science can achieve. [3,4,5] We no longer divide 10-digit fractions to 10 decimal places by hand; we use a calculator. In the same vein, we can use machine learning and other branches of artificial intelligence to improve the ease and efficiency of chemical synthesis. Digitising chemistry, that is, transforming chemistry into code to make it recordable, reproducible, and shareable, is not about replacing the organic chemist; it’s about giving them a handy chemical calculator.

Machine learning algorithms give computers knowledge through data and interactions with the world. The rules that the computer learns during training can then be used to make new predictions. [6] In the chemistry world, computers can be trained on a set of known chemical reactions, which then allows them to predict the outcomes of new reactions with high accuracy. Unlike in the early days of organic chemistry research, if you walk into a laboratory today you won’t see anyone smoking a pipe or mouth pipetting. New equipment dots the sterile white landscape, emitting the gentle whir of automation. However, chemists are still sketching out their synthesis routes on paper. Retrosynthetic planning, in which a synthesis pathway is devised by breaking down the target molecule step by step, remains skilled labour that relies on exhaustive literature searches and built-up knowledge of chemical rules. Databases such as Reaxys and SciFinder aid experimental planning by identifying papers and patents relevant to a target product or pathway, but even here modern computing power is underused, with searches and evaluation of results remaining largely manual. Time is a valuable resource, and too much of a chemist’s time is wasted on trivial tasks such as optimising reactions and troubleshooting, often because of irreproducible data in the literature.

Digitising chemistry, at least in part, means teaching machines how to explore and evaluate large repositories of chemical data in a few seconds to propose synthetic pathways. The ability to scan through large datasets means a machine can pick up trends that a human might miss or not be able to consider. The chemist’s job will then be to evaluate the options and tweak them in creative and intuitive ways, drawing on attributes still lacking in machines. However, chemistry is more complex than some processes that are already fully automated, such as the game of chess, where machines have outperformed humans. [7] Robust predictive power is more difficult to achieve when navigating chemical space, which requires the evaluation of hundreds of thousands of possible moves at each synthesis step, ideally with regard to details such as safety and overall yield. As structures become more complex, the number of synthesis pathways the computer has to evaluate increases exponentially. [8, 9] This raises the question of which criteria should be considered at each step to choose the ‘best’ next move. To answer it, different molecular attributes that affect the yield will have to be tested, such as stereochemistry, regiochemistry, protecting group information, and quantum mechanics. An important caveat is that optimal decision-making at each step in the synthesis depends on the data the machine has been given: what you get out is only as good as what you put in. Our culture of poor record-keeping and over-protectiveness can lead machines trained on published data to dead ends or erroneous results. Some scientists tend to be secretive about new research, especially in areas where there is much competition; a publication might leave out a critical step or condition to make it harder for someone else to reproduce the work.
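To get a feel for the exponential growth described above, here is a minimal sketch that assumes every synthesis step offers the same number of candidate moves. The figure of 100,000 moves per step is an illustrative stand-in for the ‘hundreds of thousands’ mentioned in the text, and the chess figure of roughly 35 moves per position is a commonly quoted average used here only for comparison. Real planners prune most branches, so this is an upper bound on a simplified model rather than a description of any particular program.

```python
def count_routes(moves_per_step: int, steps: int) -> int:
    """Count move sequences when every step offers the same number of options."""
    return moves_per_step ** steps


# Assumed, illustrative branching factors (not measured values).
CHESS_MOVES = 35            # rough average number of legal moves per position
CHEMISTRY_MOVES = 100_000   # stand-in for "hundreds of thousands" per step

for steps in (1, 2, 3, 5):
    chess = count_routes(CHESS_MOVES, steps)
    chemistry = count_routes(CHEMISTRY_MOVES, steps)
    print(f"{steps} steps: chess {chess:.1e}, chemistry {chemistry:.1e}")
```

Under these assumptions, even a five-step plan spans about 10^25 sequences, which is why the choice of scoring criteria at each step matters so much.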

To generate quality data for machine learning, we have to improve record keeping. Digitising chemistry also means creating an ‘immortal scientist’ that keeps track of published, unpublished, and failed experiments over generations. Even published syntheses frequently omit important details and present cleaned-up versions of what happened in the lab. To create this immortal scientist, electronic laboratory notebooks (ELNs) will have to be integrated with tools such as neural networks that can handle big data collection. [2] A machine capable of de novo synthesis will need to be able to predict both when a reaction will work and when it will fail, to avoid dead ends in the sea of combinatorial chemistry that it needs to navigate. In the future, automated synthesis platforms could send conditions and outcomes (inputs and outputs) to an ELN, which would upload details to an open-access database that could inform other machines. [10] If we knew the history of every chemical reaction ever tested in the laboratory, we would have amazing predictive capabilities.
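As a rough illustration of what a machine-readable notebook entry could look like, the sketch below records a failed experiment alongside its conditions and serialises it to JSON. The class name, field names, and example values are all hypothetical; they do not follow any real ELN product or database schema.

```python
import json
from dataclasses import asdict, dataclass
from typing import Dict, List, Optional


@dataclass
class ReactionRecord:
    """One experiment, successful or not (hypothetical schema)."""
    reactants: List[str]
    conditions: Dict[str, str]
    outcome: str                          # "success" or "failure"
    yield_percent: Optional[float] = None  # left empty when nothing was isolated
    notes: str = ""

    def to_json(self) -> str:
        """Serialise the record so another machine can read it."""
        return json.dumps(asdict(self), sort_keys=True)


# Placeholder entry: failures are stored with the same detail as successes.
failed = ReactionRecord(
    reactants=["reagent A", "reagent B"],
    conditions={"solvent": "THF", "temperature": "25 C"},
    outcome="failure",
    notes="no product observed",
)
print(failed.to_json())
```

The point of the design is that the failed run carries its full conditions, which is exactly the information a cleaned-up publication tends to drop.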

At the forefront of recent efforts to digitise chemistry is the Lee Cronin group at the University of Glasgow. Their 2018 Nature paper describes an organic synthesis robot that can learn about chemical reactivity by performing a small set of chemical reactions and then use this knowledge to predict the reactivity of untested combinations. [11] The machine learning algorithm was first trained on 72 reactive and non-reactive mixtures classified by a (measly human) chemist. Reactivity was assessed in real time by sensors fitted to the robot: NMR, MS, and IR spectrometers. Following training, the algorithm could predict the reactivity of about 1000 reaction combinations with >80% accuracy. These predictions were followed up manually by a chemist, leading to the discovery of four reactions. The robot was more efficient than the average chemist, performing 6 experiments in parallel and allowing for up to 36 experiments in one day. [11]
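The train-then-predict workflow can be sketched with a toy classifier. This is not the model used in the paper: it is a simple nearest-neighbour vote over which reagents two mixtures share, and the reagent names and labels are placeholders rather than experimental data.

```python
from typing import FrozenSet, List, Tuple

Mixture = FrozenSet[str]


def jaccard(a: Mixture, b: Mixture) -> float:
    """Similarity of two mixtures: shared reagents divided by all reagents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_reactive(
    training: List[Tuple[Mixture, bool]], query: Mixture, k: int = 3
) -> bool:
    """Majority vote among the k labelled mixtures most similar to the query."""
    ranked = sorted(training, key=lambda item: jaccard(item[0], query), reverse=True)
    votes = [label for _, label in ranked[:k]]
    return sum(votes) * 2 > len(votes)


# Placeholder training set: True = reactive, False = non-reactive.
training = [
    (frozenset({"A", "B"}), True),
    (frozenset({"A", "C"}), True),
    (frozenset({"A", "D"}), True),
    (frozenset({"E", "F"}), False),
    (frozenset({"E", "G"}), False),
    (frozenset({"F", "G"}), False),
]

print(predict_reactive(training, frozenset({"A", "B", "C"})))
print(predict_reactive(training, frozenset({"E", "F", "G"})))
```

A real system would replace the human-assigned labels with sensor readouts and use a far richer description of each mixture, but the shape of the loop (label a small set, then rank the untested combinations) is the same.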

Does this all sound expensive? That’s because it is. Existing automated systems, including flow chemistry and peptide and nucleic acid synthesisers, have not been widely adopted because they are costly ($25K to $500K USD), bulky, and highly specialised, i.e. capable of handling only a handful of molecular building blocks using few reactions. To address affordability and portability, the Cronin lab have devised a chemical computer: a 3D printer coupled with flow cell cytometry that first fabricates the reaction vessel and then synthesises a target molecule such as a drug. [12] To showcase the technology, the commercial muscle relaxant baclofen was synthesised with 39% yield and ≥95% purity by performing 3 reactions in 12 individual processing steps, including filtration and evaporation. Despite these achievements, automated synthesis systems struggle to produce a wide range of targets with acceptable yield, and flow systems can face problems including inadequate mixing and blockages. Further, the product must be compatible with the material of the 3D-printed vessel, currently polypropylene (PP). Cronin and colleagues are looking to expand the available materials beyond PP to allow broader compatibility and reduce sample loss on the rough PP surface.

However, a portable system, even with a limited range of outputs, would already be useful in developing countries and remote locations or for time-sensitive products, such as radioactively labelled compounds. Importantly, the 3D printer used in the published research paper (Ultimaker 2+ by Ultimaker) is commercially available and costs only about $2.5K USD. [12] Although printing yourself an aspirin any time you have a headache might be convenient, the biggest technological challenge is end-user safety: how will pharmaceuticals that can be produced anywhere in the world be regulated, and how will quality control be implemented? If quality control relies on built-in sensors such as MS and NMR spectrometers, then cost will increase and portability decrease. Furthermore, how will automated synthesis platforms deal with patent-protected chemicals?

Bartosz Grzybowski, currently a professor at UNIST (South Korea), is working on a solution to some of these problems. He is the mastermind behind Chematica [9, 13], a chemical network that links more than 7 million substances via reaction pathways. The platform has been trained with more than 50,000 chemical rules, far surpassing the knowledge of a typical organic chemist. The rules were pulled from a Reaxys database, which was scanned for incorrect entries or those lacking crucial information about reagent compatibility or reaction conditions. This effort underlines the necessity for quality data and the for-now indispensable role of humans in machine learning. Chematica can consider its knowledge of billions of reaction pathways in less than a second to choose one that is the most economical, is the most eco-friendly, or involves only commercially available chemicals. Grzybowski’s team tested 51 syntheses using the ‘economical’ criterion and collectively trimmed costs by more than 45%. [10] Chematica can also reduce the number of reaction steps to a product, and has predicted over 30 validated one-pot syntheses. [13] Additionally, the automated platform can be trained to avoid synthesis routes protected by patents; it proposed a synthesis for the blockbuster anti-arrhythmia drug dronedarone, which is protected by 46 synthesis patents, with the same overall yield as the patented route. [9]
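Choosing the ‘most economical’ route through a network of substances can be pictured as a cheapest-path search. The sketch below uses Dijkstra's algorithm on a tiny made-up network; it is a simplification for illustration and not Chematica's actual algorithm. In particular, it treats each reaction as a link from one substance to one product, whereas real reactions combine several reagents, and the substance names and costs are invented.

```python
import heapq
from typing import Dict, List, Tuple

# substance -> list of (product, cost of that reaction step); all values invented
Network = Dict[str, List[Tuple[str, float]]]


def cheapest_route(network: Network, start: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra's algorithm: return the lowest total cost and the path taken."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, start, [start])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for product, step_cost in network.get(node, []):
            if product not in settled:
                heapq.heappush(queue, (cost + step_cost, product, path + [product]))
    raise ValueError(f"no route from {start} to {target}")


network: Network = {
    "starting material": [("intermediate 1", 40.0), ("intermediate 2", 10.0)],
    "intermediate 1": [("target", 5.0)],
    "intermediate 2": [("intermediate 3", 10.0)],
    "intermediate 3": [("target", 15.0)],
}

cost, path = cheapest_route(network, "starting material", "target")
print(cost, path)
```

In this toy network the cheapest route (total cost 35) takes three steps, while the two-step route costs 45, which shows why ‘most economical’ and ‘fewest steps’ are separate criteria. Swapping the cost on each link for a different score, such as waste produced, would change which route wins without changing the search itself.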

Despite Chematica’s incredible power in accelerating retrosynthetic planning, the nascent ‘immortal scientist’ doesn’t give precise conditions for each reaction. This leaves room for trial and error and serendipity, which can play a big role in the discovery of new reactions. Like humans, the programme is challenged by predictions involving architecturally complex molecules where stereo-electronic subtleties arise, and the team is working on improving outcomes for complex natural products. [9] Chematica was purchased by Merck in 2017 and it is available for users in academia and industry. [14]

Over 10^60 stable molecules are thought to exist. [10] In the future, intelligent automated chemistry platforms may be capable of coming up with new useful molecules and reactions. For this to be possible, machines will have to: 1) access a highly accurate database of existing knowledge about how molecules can be built in a context-dependent manner, i.e. how a reagent will affect other parts of the molecule, and 2) feed this knowledge into an algorithm that can map out synthetic steps, with an emphasis on overall yield, similar to how a chess master plans a series of moves to win a game. A fully autonomous machine will also have to perform experiments and assess outcomes in real time. [11] To be generally applicable to a wide range of products and reactions, a universal programming language and scoring function will have to be implemented, and issues such as reaction vessel compatibility and safety, for both chemist and end-user, will have to be addressed.
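One ingredient of such a scoring function is easy to state: for a linear sequence of reactions, the overall yield is the product of the individual step yields, so losses compound quickly. The sketch below compares two hypothetical routes; the yields are invented for illustration, and a real scoring function would weigh cost, safety, and other criteria as well.

```python
from math import prod
from typing import Sequence


def overall_yield(step_yields: Sequence[float]) -> float:
    """Overall yield of a linear route: the product of its step yields (as fractions)."""
    return prod(step_yields)


# Invented example routes to the same target.
short_route = [0.90, 0.90, 0.90]   # three steps at 90% each
long_route = [0.80] * 5            # five steps at 80% each

print(f"short route: {overall_yield(short_route):.1%}")
print(f"long route:  {overall_yield(long_route):.1%}")
```

Because every extra step multiplies in another factor below one, a planner that looks several moves ahead can prefer a route whose first step looks worse but whose total product is higher, much like the chess analogy in the text.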

Automating chemistry has numerous advantages, including increased scale, improved precision, better reproducibility, and continuous feedback. [6] Rapid growth in both artificial intelligence technology and chemistry knowledge, coupled with the diminishing cost of computing power and sensor systems, will continue to drive automation. For the digitisation of chemistry to truly take off, developers will need to create user-friendly interfaces, and chemistry researchers will need to step outside their comfort zones and learn how to manage automated systems, perhaps in collaboration with computer scientists. In the meantime, machine learning in organic chemistry will likely make a small but real impact, as a new microscope or assay might, limited at first to data collection and analysis. Those who argue it will limit serendipitous discoveries should see machine learning as an asset to the bench chemist that keeps the practical and intuitive sides of organic chemistry together: a sort of ‘accelerated serendipity’. Decades may pass before an automated chemist as adept as a human is developed, but the automated systems outlined in this article are already pushing the frontiers of chemistry. For now, organic chemistry will remain a labour-intensive practice that relies heavily on training, planning, experience, observation, and interpretation. Whether digitising chemistry will lead to a new industrial revolution on the molecular scale or just shave a few hours off the average organic chemist’s workday remains to be seen.


[1]      B. Maryasin, P. Marquetand, and N. Maulide, “Machine Learning for Organic Synthesis: Are Robots Replacing Chemists?,” Angew. Chemie Int. Ed., vol. 57, no. 24, pp. 6978–6980, Jun. 2018. https://doi.org/10.1002/anie.201803562

[2]      S. V. Ley, D. E. Fitzpatrick, R. J. Ingham, and R. M. Myers, “Organic Synthesis: March of the Machines,” Angew. Chemie Int. Ed., vol. 54, no. 11, pp. 3449–3464, Mar. 2015. https://doi.org/10.1002/anie.201410744

[3]      J. Stilgoe, “Machine learning, social learning and the governance of self-driving cars,” Soc. Stud. Sci., vol. 48, no. 1, pp. 25–56, Feb. 2018. https://doi.org/10.1177/0306312717741687

[4]      Google, “Magenta.” [Online]. Available: https://magenta.tensorflow.org/. [Accessed: 21-Nov-2018].

[5]      D. A. Hashimoto, G. Rosman, D. Rus, and O. R. Meireles, “Artificial Intelligence in Surgery,” Ann. Surg., vol. 268, no. 1, pp. 70–76, Jul. 2018. https://doi.org/10.1097/SLA.0000000000002693

[6]      M. Trobe and M. D. Burke, “The Molecular Industrial Revolution: Automated Synthesis of Small Molecules,” Angew. Chemie Int. Ed., vol. 57, no. 16, pp. 4192–4214, Apr. 2018. https://doi.org/10.1002/anie.201710482

[7]      J. Roberts, “Thinking Machines: The Search for Artificial Intelligence | Science History Institute,” Distillations, 2016. [Online]. Available: https://www.sciencehistory.org/distillations/magazine/thinking-machines-the-search-for-artificial-intelligence. [Accessed: 21-Nov-2018].

[8]      S. Szymkuć, E. P. Gajewska, T. Klucznik, K. Molga, P. Dittwald, M. Startek, M. Bajczyk, and B. A. Grzybowski, “Computer-Assisted Synthetic Planning: The End of the Beginning,” Angew. Chemie Int. Ed., vol. 55, no. 20, pp. 5904–5937, May 2016. https://doi.org/10.1002/anie.201506101

[9]      T. Klucznik, B. Mikulak-Klucznik, M. P. McCormack, H. Lima, S. Szymkuć, M. Bhowmick, K. Molga, Y. Zhou, L. Rickershauser, E. P. Gajewska, A. Toutchkine, P. Dittwald, M. P. Startek, G. J. Kirkovits, R. Roszak, A. Adamski, B. Sieredzińska, M. Mrksich, S. L. J. Trice, and B. A. Grzybowski, “Efficient Syntheses of Diverse, Medicinally Relevant Targets Planned by Computer and Executed in the Laboratory,” Chem, vol. 4, no. 3, pp. 522–532, Mar. 2018. https://doi.org/10.1016/j.chempr.2018.02.002

[10]    M. Peplow, “Organic synthesis: The robo-chemist,” Nature, vol. 512, no. 7512, pp. 20–22, Aug. 2014.

[11]    J. M. Granda, L. Donina, V. Dragone, D.-L. Long, and L. Cronin, “Controlling an organic synthesis robot with machine learning to search for new reactivity,” Nature, vol. 559, no. 7714, pp. 377–381, Jul. 2018.

[12]    P. J. Kitson, G. Marie, J.-P. Francoia, S. S. Zalesskiy, R. C. Sigerson, J. S. Mathieson, and L. Cronin, “Digitization of multistep organic synthesis in reactionware for on-demand pharmaceuticals,” Science, vol. 359, no. 6373, pp. 314–319, Jan. 2018. https://doi.org/10.1126/science.aao3466

[13]    C. M. Gothard, S. Soh, N. A. Gothard, B. Kowalczyk, Y. Wei, B. Baytekin, and B. A. Grzybowski, “Rewiring Chemistry: Algorithmic Discovery and Experimental Validation of One-Pot Reactions in the Network of Organic Chemistry,” Angew. Chemie Int. Ed., vol. 51, no. 32, pp. 7922–7927, Aug. 2012. https://doi.org/10.1002/anie.201202155

[14]    Grzybowski Scientific Inventions, “Chematica.” [Online]. Available: http://chematica.net/#/. [Accessed: 21-Nov-2018].