Research software engineer for climate science
During my Bachelor's degree I realized that I really wanted to do research on how our world is evolving, a wish that proved even stronger than my delight in the beauty of the mathematics behind physics. Building on my theoretical background, I started programming and developing climate models. Since then I have taken great pleasure in data analysis and visualization: we can extract so much new knowledge from these large data sets through visual and statistical exploration. Hence I am always keen to learn and develop new techniques and to share them with others.
Dec. 2019: Helmholtz Coastal Data Center (HCDC)
Dec. 2015: Numerical Tools and Software Solutions for Palaeoclimate Analysis
Sep. 2013: Master in Integrated Climate System Sciences
Sep. 2012: Biodiversity and Climate team
Sep. 2009: Bachelor in Physics
Helmholtz Coastal Data Center
{"en"=>"Python framework for interactive data visualization", "de"=>"Python software zur interaktiven Datenvisualisierung"}
{"en"=>"Web portal to manage an academic community and to foster collaboration", "de"=>"Online portal um eine wissenschaftliche Community zu managen und die Zusammenarbeit zu fördern"}
{"en"=>"Software templates and Helm Charts for secure and sustainable deployment via Kubernetes", "de"=>"Software Templates und Helm Charts für ein sichere und nachhaltige Deployments über Kubernetes"}
{"en"=>"Holocene Climate Reconstruction for the Northern Hemisphere Extra-tropics", "de"=>"Klimarekonstruktionen der Nördlichen Hemisphäre während des Holozäns"}
{"en"=>"A software for a semi-automatic digitization of pollen diagrams or other types of stratigraphic diagrams using the command line or a graphical user interface.", "de"=>"Software für die semi-automatische Digitalisierung von Pollendiagrammen und anderen stratigrafischen Diagrammen das per Kommandozeile oder per graphischer Benutzeroberfläche bedient werden kann"}
{"en"=>"A global weather generator for daily data", "de"=>"Ein globaler Wettergenerator für tägliche Wetterdaten"}
{"en"=>"A model to simulate urban growth and transformation with the objective of minimising the energy required for transportation.", "de"=>"Ein Model zur Simulation urbaner Wachstums- und Transformationsprozesse bei gleichzeitiger Minimierung von benötigter Transport-Energie"}
{"en"=>"Python package for docstring repetition", "de"=>"Python Programm für die Wiederverwertung von Dokumentationen"}
Frequently in socio-environmental sciences, models are used as tools to represent, understand, project and predict the behaviour of these complex systems. Along the modelling chain, Good Modelling Practices have been evolving that ensure, among other things, that models are transparent and their results replicable. Whenever such models are represented in software, Good Modelling Practices meet Good Software Practices, such as a tractable development workflow, good code, collaborative development and governance, continuous integration and deployment; and they meet Good Scientific Practices, such as attribution of copyrights and acknowledgement of intellectual property, publication of a software paper and archiving. Too often in existing socio-environmental model software, these practices have been regarded as an add-on to be considered at a later stage only; modellers have shied away from publishing their model as open source out of fear that having to add good practices is too demanding. We here argue for making a habit of following a list of simple and not so simple practices early on in the implementation of the model life cycle. We contextualise cherry-picked and hands-on practices for supporting Good Modelling Practice, and we demonstrate their application in the example context of the Viable North Sea fisheries socio-ecological systems model.
The European Union has set ambitious CO2 reduction targets, stimulating renewable energy production and accelerating the deployment of offshore wind energy in northern European waters, mainly the North Sea. With the increasing size and clustering of offshore wind farms (OWFs), wake effects, which alter wind conditions and decrease the power generation efficiency of downwind farms, become more important. We use a high-resolution regional climate model with implemented wind farm parameterizations to explore offshore wind energy production limits in the North Sea. We simulate near-future wind farm scenarios considering existing and planned OWFs in the North Sea and assess power generation losses and wind variations due to wind farm wakes. The annual mean wind speed deficit within a wind farm can reach 2-2.5 m s−1 depending on the wind farm geometry. The mean deficit, which decreases with distance, can extend 35-40 km downwind during prevailing southwesterly winds. Wind speed deficits are highest during spring (mainly March-April) and lowest during November-December. The large size of wind farms and their proximity affect not only the performance of their own downwind turbines but also that of neighboring downwind farms, reducing the capacity factor by 20% or more, which increases energy production costs and economic losses. We conclude that wind energy can be a limited resource in the North Sea. The limits and potentials for optimization need to be considered in climate mitigation strategies, and cross-national optimization of offshore energy production plans is inevitable.
Freva – Free Evaluation System Framework for Earth system modeling is an efficient solution to handle evaluation systems of research projects, institutes or universities in the climate community. It is a scientific software framework for high performance computing that provides all its available features both in a shell and web environment. The main system design is equipped with the programming interface, history of evaluations, and a standardized model database. Plugin – a generic application programming interface allows scientific developers to connect their analysis tools with the evaluation system independently of the programming language. History – the configuration sub-system stores every analysis performed with the evaluation system in a database. Databrowser – an implemented metadata system with its advanced but easy-to-handle search tool supports scientists and their plugins to retrieve the required information from the database. The combination of these three core components increases the scientific outcome and enables transparency and reproducibility for research groups using Freva as their framework for evaluation of Earth system models.
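To illustrate the plugin idea described above, the following is a minimal, schematic sketch of an analysis tool wrapped as an evaluation-system plugin. It is not the actual Freva plugin API; the class, method and parameter names are invented for illustration only.

    class ExampleBiasPlugin:
        """Hypothetical analysis tool wrapped as an evaluation-system plugin."""

        # parameters that the framework would expose in the shell and web front-end
        parameters = {
            "variable": "tas",     # variable to evaluate
            "project": "cordex",   # entry of the standardized model database
            "outdir": "./plots",   # where to store the results
        }

        def run(self, config):
            """Called by the framework with the user's configuration.

            The framework records the configuration in the history database,
            so every analysis remains reproducible.
            """
            files = self.find_files(**config)   # stands in for the databrowser query
            return self.analyse(files)          # the tool itself can be any language

        def find_files(self, **facets):
            return []                           # placeholder for the metadata search

        def analyse(self, files):
            return {"n_files": len(files)}      # placeholder for the actual science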
A comprehensive database of paleoclimate records is needed to place recent warming into the longer-term context of natural climate variability. We present a global compilation of quality-controlled, published, temperature-sensitive proxy records extending back 12,000 years through the Holocene. Data were compiled from 679 sites where time series cover at least 4000 years, are resolved at sub-millennial scale (median spacing of 400 years or finer) and have at least one age control point every 3000 years, with cut-off values slackened in data-sparse regions. The data derive from lake sediment (51%), marine sediment (31%), peat (11%), glacier ice (3%), and other natural archives. The database contains 1319 records, including 157 from the Southern Hemisphere. The multi-proxy database comprises paleotemperature time series based on ecological assemblages, as well as biophysical and geochemical indicators that reflect mean annual or seasonal temperatures, as encoded in the database. This database can be used to reconstruct the spatiotemporal evolution of Holocene temperature at global to regional scales, and is publicly available in Linked Paleo Data (LiPD) format.
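As a rough illustration of the selection criteria listed above, the following sketch checks a single record against the coverage, resolution and dating thresholds. It assumes the record is given as arrays of sample ages and age-control-point ages in calendar years; the helper is illustrative and not the compilation code used for the database.

    import numpy as np

    def meets_criteria(sample_ages, age_control_ages,
                       min_span=4000, max_median_spacing=400, max_control_gap=3000):
        samples = np.sort(np.asarray(sample_ages, dtype=float))
        controls = np.sort(np.asarray(age_control_ages, dtype=float))
        spans_enough = samples[-1] - samples[0] >= min_span
        fine_enough = np.median(np.diff(samples)) <= max_median_spacing
        well_dated = controls.size > 1 and np.all(np.diff(controls) <= max_control_gap)
        return bool(spans_enough and fine_enough and well_dated)

    # a 10,000-year record sampled every 250 years and dated every 2,000 years
    print(meets_criteria(np.arange(0, 10001, 250), np.arange(0, 10001, 2000)))  # True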
An extensive new multi-proxy database of paleo-temperature time series (Temperature 12k) enables a more robust analysis of global mean surface temperature (GMST) and associated uncertainties than was previously available. We applied five different statistical methods to reconstruct the GMST of the past 12,000 years (Holocene). Each method used different approaches to averaging the globally distributed time series and to characterizing various sources of uncertainty, including proxy temperature, chronology and methodological choices. The results were aggregated to generate a multi-method ensemble of plausible GMST and latitudinal-zone temperature reconstructions with a realistic range of uncertainties. The warmest 200-year-long interval took place around 6500 years ago when GMST was 0.7 °C (0.3, 1.8) warmer than the 19th Century (median, 5th, 95th percentiles). Following the Holocene global thermal maximum, GMST cooled at an average rate of −0.08 °C per 1000 years (−0.24, −0.05). The multi-method ensembles and the code used to generate them highlight the utility of the Temperature 12k database, and they are now available for future use by studies aimed at understanding the Holocene evolution of the Earth system.
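The ensemble summary statistics quoted above (median with 5th and 95th percentiles) can be computed along the lines of the following sketch, here with randomly generated placeholder members instead of the actual reconstructions; the shapes and numbers are assumptions for illustration only.

    import numpy as np

    rng = np.random.default_rng(0)
    time = np.arange(0, 12001, 100)                          # years before present
    ensemble = rng.normal(0.3, 0.4, size=(500, time.size))   # placeholder GMST members

    median = np.median(ensemble, axis=0)
    p5, p95 = np.percentile(ensemble, [5, 95], axis=0)

    # e.g. locate the warmest 200-year window of the median reconstruction
    window = 2                                               # 2 x 100-year steps
    running = np.convolve(median, np.ones(window) / window, mode="valid")
    print("warmest 200-year window starts at", time[np.argmax(running)], "yr BP")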
The Eurasian (née European) Modern Pollen Database (EMPD) was established in 2013 to provide a public database of high-quality modern pollen surface samples to help support studies of past climate, land cover, and land use using fossil pollen. The EMPD is part of, and complementary to, the European Pollen Database (EPD) which contains data on fossil pollen found in Late Quaternary sedimentary archives throughout the Eurasian region. The EPD is in turn part of the rapidly growing Neotoma database, which is now the primary home for global palaeoecological data. This paper describes version 2 of the EMPD in which the number of samples held in the database has been increased by almost 70 % from 4826 to 8134. Much of the improvement in data coverage has come from northern Asia, and the database has consequently been renamed the Eurasian Modern Pollen Database to reflect this geographical enlargement. The EMPD can be viewed online using a dedicated map-based viewer at https://empd2.github.io and downloaded in a variety of file formats at https://doi.pangaea.de/10.1594/PANGAEA.909130 (Chevalier et al., 2019).
Fossil pollen records are well-established indicators of past vegetation changes. The prevalence of pollen across environmental settings including lakes, wetlands, and marine sediments, has made palynology one of the most ubiquitous and valuable tools for studying past environmental and climatic change globally for decades. A complementary research focus has been the development of statistical techniques to derive quantitative estimates of climatic conditions from pollen assemblages. This paper reviews the most commonly used statistical techniques and their rationale and seeks to provide a resource to facilitate their inclusion in more palaeoclimatic research. To this end, we first address the fundamental aspects of fossil pollen data that should be considered when undertaking pollen-based climate reconstructions. We then introduce the range of techniques currently available, the history of their development, and the situations in which they can be best employed. We review the literature on how to define robust calibration datasets, produce high-quality reconstructions, and evaluate climate reconstructions, and suggest methods and products that could be developed to facilitate accessibility and global usability. To continue to foster the development and inclusion of pollen climate reconstruction methods, we promote the development of reporting standards. When established, such standards should 1) enable broader application of climate reconstruction techniques, especially in regions where such methods are currently underused, and 2) enable the evaluation and reproduction of individual reconstructions, structuring them for the evolving open-science era, and optimising the use of fossil pollen data as a vital means for the study of past environmental and climatic variability. We also strongly encourage developers and users of palaeoclimate reconstruction methodologies to make associated programming code publicly available, which will further help disseminate these techniques to interested communities.
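One of the commonly used techniques reviewed here, the modern analogue technique, can be sketched in a few lines: the climate of a fossil sample is estimated from the observed climate of its closest modern pollen assemblages under squared-chord distance. The sketch below uses random placeholder data and illustrates the principle only, not any of the reviewed implementations.

    import numpy as np

    def squared_chord(fossil, modern):
        """Squared-chord distances between one fossil sample and all modern samples."""
        return np.sum((np.sqrt(fossil) - np.sqrt(modern)) ** 2, axis=1)

    def mat_reconstruct(fossil, modern_assemblages, modern_climate, k=5):
        distances = squared_chord(fossil, modern_assemblages)
        analogues = np.argsort(distances)[:k]
        weights = 1.0 / np.maximum(distances[analogues], 1e-12)
        return np.average(modern_climate[analogues], weights=weights)

    rng = np.random.default_rng(1)
    modern = rng.dirichlet(np.ones(20), size=1000)   # 1000 modern samples, 20 taxa
    climate = rng.normal(10, 5, size=1000)           # e.g. July temperature (degrees C)
    fossil = rng.dirichlet(np.ones(20))              # one fossil assemblage
    print(mat_reconstruct(fossil, modern, climate))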
Cities are fundamental to climate change mitigation, and although there is increasing understanding about the relationship between emissions and urban form, this relationship has not been used to provide planning advice for urban land use so far. Here we present the Integrated Urban Complexity model (IUCm 1.0) that computes “climate-smart urban forms”, which are able to cut emissions related to energy consumption from urban mobility in half. Furthermore, we show the complex features that go beyond the normal debates about urban sprawl vs. compactness. Our results show how to reinforce fractal hierarchies and population density clusters within climate risk constraints to significantly decrease the energy consumption of urban mobility. The new model that we present aims to produce new advice about how cities can combat climate change.
An international group of approximately 30 scientists with background and expertise in global and regional climate modeling, statistics, and climate proxy data discussed the state of the art, progress, and challenges in comparing global and regional climate simulations to paleoclimate data and reconstructions. The group focused on achieving robust comparisons in view of the uncertainties associated with simulations and paleo data.
While a wide range of Earth system processes occur at daily and even subdaily timescales, many global vegetation and other terrestrial dynamics models historically used monthly meteorological forcing both to reduce computational demand and because global datasets were lacking. Recently, dynamic land surface modeling has moved towards resolving daily and subdaily processes, and global datasets containing daily and subdaily meteorology have become available. These meteorological datasets, however, cover only the instrumental era of the last approximately 120 years at best, are subject to considerable uncertainty, and represent extremely large data files with associated computational costs of data input/output and file transfer. For periods before the recent past or in the future, global meteorological forcing can be provided by climate model output, but the quality of these data at high temporal resolution is low, particularly for daily precipitation frequency and amount. Here, we present GWGEN, a globally applicable statistical weather generator for the temporal downscaling of monthly climatology to daily meteorology. Our weather generator is parameterized using a global meteorological database and simulates daily values of five common variables: minimum and maximum temperature, precipitation, cloud cover, and wind speed. GWGEN is lightweight, modular, and requires a minimal set of monthly mean variables as input. The weather generator may be used in a range of applications, for example, in global vegetation, crop, soil erosion, or hydrological models. While GWGEN does not currently perform spatially autocorrelated multi-point downscaling of daily weather, this additional functionality could be implemented in future versions.
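The core of a WGEN-type generator as described above can be illustrated with the wet-day occurrence part: precipitation occurrence follows a first-order Markov chain whose transition probabilities vary by month. The probabilities below are invented for illustration and are not GWGEN's fitted parameters.

    import numpy as np

    def simulate_wet_days(n_days, p_wet_given_dry, p_wet_given_wet, rng=None):
        """First-order Markov chain for daily precipitation occurrence."""
        rng = rng or np.random.default_rng()
        wet = np.zeros(n_days, dtype=bool)
        for day in range(1, n_days):
            p = p_wet_given_wet if wet[day - 1] else p_wet_given_dry
            wet[day] = rng.random() < p
        return wet

    wet = simulate_wet_days(31, p_wet_given_dry=0.25, p_wet_given_wet=0.6)
    print(int(wet.sum()), "wet days simulated for the month")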
The continuous growth of Earth System data, coupled with its inherent heterogeneity and the challenges associated with distributed data centers, necessitates a robust framework for efficient and secure data analysis. This abstract outlines the plans to implement an analysis framework within the marine data portal of the German Marine Research Alliance (Deutsche Allianz Meeresforschung, DAM) at https://marine-data.de. The proposed framework is based on the Data Analytics Software Framework (DASF), chosen for its decentralized, secure, and publisher-subscriber-based (pub-sub-based) architecture, which enables the execution of data analysis backends anywhere without exposing sensitive IT systems to the internet. The challenges of analyzing Earth System data on the web are multifaceted. Data heterogeneity arises from the diverse sources, formats, and structures of earth system data, making seamless integration and analysis a complex task. The sheer volume of data compounds the challenge, demanding scalable solutions to handle vast amounts of information efficiently. Additionally, the computational power required for meaningful analysis is often expensive and can become a bottleneck in traditional data processing pipelines. Moreover, the distributed nature of data across multiple centers poses logistical challenges in terms of accessibility, security, and coordination. To address these challenges, the integration of DASF into the marine data portal presents a comprehensive solution. DASF offers a secure and decentralized pub-sub-based remote procedure call framework, providing a flexible environment for executing data analysis backends. One of the key advantages of DASF is its ability to allow these backends to run anywhere without the need to expose sensitive IT systems to the internet, addressing the security concerns associated with data analysis. The decentralized nature of DASF also mitigates data heterogeneity challenges by offering a unified platform for data integration and analysis. With DASF, disparate data sources can seamlessly communicate, facilitating interoperability and enabling comprehensive analysis across diverse datasets. The pub-sub mechanism ensures efficient communication between components, streamlining the flow of data through the analysis pipeline. Security is a critical aspect of implementing a robust data analysis framework. DASF addresses this concern by incorporating an OAuth-based authentication mechanism at the message broker level. This ensures that only authorized users can access and interact with the data analysis functionalities. Additionally, the integration with the Helmholtz AAI empowers the sharing of analysis routines with users from other research centers or the general public. The cost-effectiveness of DASF further enhances its appeal, as it optimizes the utilization of computational resources. By enabling the deployment of analysis components on diverse hardware environments, organizations can leverage existing infrastructure without significant additional investments. In conclusion, the integration of DASF into the DAM portal marks a significant step toward overcoming the challenges inherent in analyzing Earth System data on the web. By addressing data heterogeneity, accommodating vast datasets, and providing a secure and decentralized architecture, DASF emerges as a key enabler for efficient and scalable data analysis. 
The adoption of DASF in the marine data portal promises to enhance the accessibility, security, and cost-effectiveness of data analytics, and ultimately facilitates open science in the research field Earth and Environment.
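The publish-subscribe remote-procedure-call pattern that DASF builds on can be illustrated with the following toy, in-process sketch. It is purely schematic: the class and topic names are invented, and the real framework relies on a central message broker with OAuth-based authentication rather than the in-memory stand-in used here.

    import json
    from collections import defaultdict

    class ToyBroker:
        """In-memory stand-in for the central message broker."""
        def __init__(self):
            self.subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self.subscribers[topic].append(callback)

        def publish(self, topic, message):
            for callback in self.subscribers[topic]:
                callback(message)

    broker = ToyBroker()

    # backend side: runs close to the data and only needs an outgoing
    # connection to the broker, so no service is exposed to the internet
    def analysis_backend(message):
        request = json.loads(message)
        result = {"region": request["region"], "mean_sst": 12.3}   # placeholder analysis
        broker.publish(request["reply_to"], json.dumps(result))

    broker.subscribe("analysis/requests", analysis_backend)

    # portal side: publishes a request and receives the reply on its own topic
    broker.subscribe("portal/replies", lambda msg: print("got result:", msg))
    broker.publish("analysis/requests",
                   json.dumps({"region": "North Sea", "reply_to": "portal/replies"}))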
Research software development is crucial for scientific advancements, yet the sustainability and maintainability of such software pose significant challenges. In this tutorial, we present a comprehensive demonstration on leveraging software templates to establish best-practice implementations for research software, aiming to enhance its longevity and usability. Our approach is grounded in the utilization of Cookiecutter, augmented with a fork-based modular Git strategy, and rigorously unit-tested methodologies. By harnessing the power of Cookiecutter, we streamline the creation process of research software, providing a standardized and efficient foundation. The fork-based modular Git approach enables flexibility in managing variations, facilitating collaborative development while maintaining version control and traceability. Central to our methodology is the incorporation of unit testing, ensuring code integrity and reliability of the templates. Moreover, we employ Cruft, a tool tailored to combat the proliferation of boilerplate code, often referred to as the "boilerplate-monster." By systematically managing and removing redundant code, Cruft significantly enhances the maintainability and comprehensibility of research software. This proactive approach mitigates the accumulation of technical debt and facilitates long-term maintenance. The open-source templates are available at https://codebase.helmholtz.cloud/hcdc/software-templates/. In the first 30 minutes of the tutorial, participants will gain insights into the structured organization of these software templates, enabling them to understand the framework’s architecture and application to their own software products. The subsequent 30 minutes will be dedicated to a hands-on tutorial, allowing participants to engage directly with the templates, guiding them through the process of implementing and customizing them for their specific research software projects. Maintaining research software presents distinct challenges compared to traditional software development. The diverse skill sets of researchers, time constraints, lack of standardized practices, and evolving requirements contribute to the complexity. Consequently, software often becomes obsolete, challenging to maintain, and prone to errors. Through our tutorial, we address these challenges by advocating for the adoption of software templates. These templates encapsulate best practices, enforce coding standards, and promote consistent structures, significantly reducing the cognitive load on developers. By providing a well-defined starting point, researchers can focus more on advancing their scientific endeavors rather than grappling with software complexities. Furthermore, the utilization of software templates fosters collaboration and knowledge sharing within research communities. It encourages the reuse of proven solutions, accelerates the onboarding process for new contributors, and facilitates better documentation practices. Ultimately, this approach leads to a more sustainable ecosystem for research software, fostering its evolution and ensuring its relevance over time. In summary, our tutorial offers a practical and comprehensive guide to creating and utilizing software templates for research software development. By harnessing Cookiecutter with Git-based modularity, unit testing, and the power of Cruft, we aim to empower researchers in building robust, maintainable, and sustainable software, thereby advancing scientific progress in an efficient and impactful manner.
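As a small illustration of the template workflow described above, a project can be generated from a Cookiecutter template via its Python API. The template URL and context keys below are assumptions made for the sake of the example (the group linked above hosts several templates); keeping a generated project in sync with later template changes is then the job of Cruft, typically invoked from the command line.

    from cookiecutter.main import cookiecutter

    cookiecutter(
        # hypothetical template repository within the group linked above
        "https://codebase.helmholtz.cloud/hcdc/software-templates/python-package-template.git",
        no_input=True,                   # take the template defaults instead of prompting
        extra_context={                  # the available keys depend on the chosen template
            "project_name": "my-analysis-tool",
            "author": "Jane Doe",
        },
    )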
Making Earth-System-Model (ESM) data accessible is challenging due to the large amount of data that we are facing in this realm. The upload is time-consuming, expensive, technically complex, and every institution has its own procedures. Non-ESM experts face a lot of problems, and pure data portals are hardly usable for inter- and trans-disciplinary communication of ESM data and findings, as this level of accessibility often requires specialized web or computing services. With the Model Data Explorer, we want to simplify the generation of web services from ESM data, and we provide a framework that allows us to make the raw model data accessible to non-ESM experts. Our decentralized framework implements the possibility for an efficient remote processing of distributed ESM data. Users interface with an intuitive map-based front-end to compute spatial or temporal aggregations, or select regions to download the data. The data generators (i.e. the scientists with access to the raw data) use a light-weight and secure Python library based on the Data Analytics Software Framework (DASF, https://digital-earth.pages.geomar.de/dasf/dasf-messaging-python) to create a back-end module. This back-end module runs close to the data, e.g. on the HPC resource where the data is stored. Upon request, the module generates and provides the required data for the users in the web front-end. Our approach is intended for scientists and scientific usage! We aim for a framework where web-based communication of model-driven data science can be maintained by the scientific community. The Model Data Explorer ensures fair reward for the scientific work and adherence to the FAIR principles without too much overhead and loss in scientific accuracy. The Model Data Explorer is currently under development at the Helmholtz-Zentrum Hereon, together with multiple scientific and data management partners in other German research centers. The full list of contributors is constantly updated and can be accessed at https://model-data-explorer.readthedocs.io.
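What such a back-end module computes on request can be sketched with xarray; the file, variable and coordinate names below are placeholders, and the snippet only illustrates the kind of spatial and temporal aggregation described above.

    import xarray as xr

    ds = xr.open_dataset("model_output.nc")                        # raw ESM output, close to the data
    region = ds["tas"].sel(lon=slice(0, 15), lat=slice(50, 60))    # user-selected region

    time_series = region.mean(dim=("lat", "lon"))                  # spatial aggregation -> time series
    climatology = region.mean(dim="time")                          # temporal aggregation -> map

    time_series.to_netcdf("tas_region_timeseries.nc")              # what the front-end delivers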
Collaboration platforms for harmonization and for building a shared understanding of communities are essential components in today’s academic environment. With the help of modern software tools and advancing digitization, our communities can improve collaboration via event, project and file management, and communication. The variety of tasks and tools needed in interdisciplinary communities, however, poses a considerable obstacle for community members. We see these obstacles in administration, and especially when onboarding new members with different levels of experience (from student to senior scientist). Here, user-friendly technical support is needed. We are involved in many communities, particularly in the Climate Limited-area Modelling Community (CLM-Community) and the Helmholtz Metadata Collaboration (HMC). With the input of these (and more) communities, we are currently working on the DJAC-Platform, an open-source, Python (Django)-based website. DJAC manages communities from a single institute to an (inter-)national community with hundreds or more participating research institutions. DJAC is available at codebase.helmholtz.cloud.
A common challenge for projects with multiple involved research institutes is a well-defined and productive collaboration. All parties measure and analyze different aspects, depend on each other, share common methods, and exchange the latest results, findings, and data. Today this exchange is often impeded by a lack of ready access to shared computing and storage resources. In our talk, we present a new and innovative remote procedure call (RPC) framework. We focus on a distributed setup, where project partners do not necessarily work at the same institute and do not have access to each other's resources. We present an application programming interface (API) developed in Python that enables scientists to collaboratively explore and analyze sets of distributed data. It offers the functionality to request remote data through a comfortable interface, and to share and invoke single computational methods or even entire analytical workflows and their results. The prototype enables researchers to make their methods accessible as a backend module running on their own infrastructure. Hence, researchers from other institutes may apply the available methods through a lightweight Python or JavaScript API. In the end, the overhead for both the backend developer and the remote user is very low. The effort of implementing the necessary workflow and API usage is comparable to writing the code in a non-distributed setup. Besides that, data do not have to be downloaded locally; the analysis can be executed "close to the data" while using the institutional infrastructure where the eligible data set is stored. With our prototype, we demonstrate distributed data access and analysis workflows across institutional borders to enable effective scientific collaboration. This framework has been developed in a joint effort of the DataHub and Digital Earth initiatives within the research centers of the Helmholtz Association of German Research Centres (HGF).
psyplot (https://psyplot.github.io) is an open-source data visualization framework that integrates rich computational and mathematical software packages (such as xarray and matplotlib) into a flexible framework for visualization. It differs from most visual analytics software in that it focuses on extensibility in order to flexibly tackle the different types of analysis questions that arise in pioneering research. The design of the high-level API of the framework enables a simple and standardized usage from the command line, Python scripts or Jupyter notebooks. A modular plugin framework allows the framework to develop flexibly in many different directions. The additional enhancement with a graphical user interface (GUI) makes it the only visualization framework that can be handled from the command line or scripts, as well as via point-and-click. It additionally allows building further desktop applications on top of the existing framework. In this presentation, I will show the main functionalities of psyplot, with a special focus on the visualization of unstructured grids (such as those of the ICON model by the German Weather Service (DWD)), and the usage of psyplot on the HPC facilities of the DKRZ (mistral, jupyterhub, remote desktop, etc.). My demonstration will cover the basic structure of the psyplot framework and how to use psyplot in Python scripts (and Jupyter notebooks). I will give a quick demo of the psyplot GUI and psy-view, an ncview-like interface built upon psyplot, and talk about different features such as reusing plot configurations and exporting figures.
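A minimal usage sketch of the kind shown in such a demo might look as follows, assuming the psy-maps plugin is installed; the file name, variable name and formatoption values are placeholders.

    import psyplot.project as psy

    # one line from an interactive session, a script or a Jupyter notebook
    maps = psy.plot.mapplot("icon_output.nc", name="t2m",
                            cmap="RdBu_r", projection="robin",
                            clabel="2m temperature [K]")

    maps.update(lonlatbox=[-10, 30, 40, 70])   # reuse and adjust the plot configuration
    maps.export("t2m_map.png")                 # export the figure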
The complexity of Earth system and regional climate models represents a considerable challenge for developers. Tuning or improving one aspect of a model can unexpectedly decrease the performance of others and introduce hidden errors. Reasons include, in particular, the multitude of output parameters and the shortage of reliable and complete observational datasets. One possibility to overcome these issues is a rigorous and continuous scientific evaluation of the model. This requires standardized model output and, most notably, standardized observational datasets. Additionally, in order to reduce the extra burden for the individual scientist, this evaluation has to be as close as possible to the standard workflow of the researcher, and it needs to be flexible enough to adapt to new scientific questions. We present the Free Evaluation System Framework (Freva) implementation within the Helmholtz Coastal Data Center (HCDC) at the Institute of Coastal Research in the Helmholtz-Zentrum Geesthacht (HZG). Various plugins to the Freva software, namely the HZG-EvaSuite, use observational data to perform a standardized evaluation of the model simulation. We present a comprehensive data management infrastructure that copes with the heterogeneity of observations and simulations. This web framework comprises a FAIR and standardized database of both large-scale and in-situ observations, exported to a format suitable for data-model intercomparisons (particularly netCDF following the CF conventions). Our pipeline links the raw data of the individual model simulations (i.e. the production of the results) to the finally published results (i.e. the released data). Another benefit of the Freva-based evaluation is the enhanced exchange between the different compartments of the institute, particularly between the model developers and the data collectors, as Freva contains built-in functionalities to share and discuss results with colleagues. We will furthermore use the tool to strengthen the active communication with the data and software managers of the institute to generate or adapt the evaluation plugins.
Established in 2011, the Eurasian Modern Pollen Database (EMPD) is a standardized, fully documented and quality-controlled dataset of over 8000 modern pollen samples which can be openly accessed, and to which scientists can also contribute and help maintain. The database has recently been upgraded to include an intuitive client-based JavaScript web interface hosted on the version control platform GitHub, allowing data and metadata to be accessed and viewed using a clickable map. We present how we address the FAIR principles, such as well-documented access and handling of data and metadata using the free GitHub services for open-source development, as well as other critical points for open research data, such as data accreditation and referencing. Our community-based framework allows automated and transparent quality checks through continuous integration, fast and intuitive access to the data, as well as transparency for data contributors and users concerning changes and bugs in the EMPD. Furthermore, it allows stable and long-lasting access to the web interface (and the data) without any funding requirements for servers or the risk of security holes.
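The continuous-integration quality checks mentioned above amount to small, automated tests over the metadata tables. The sketch below shows the general idea only; the file name and column names are assumptions and do not reflect the exact EMPD schema.

    import pandas as pd

    def check_metadata(path="meta.tsv"):
        """Return a list of problems found in a tab-separated metadata table."""
        meta = pd.read_csv(path, sep="\t")
        problems = []
        if meta["SampleName"].duplicated().any():
            problems.append("duplicated sample names")
        if not meta["Latitude"].between(-90, 90).all():
            problems.append("latitude out of range")
        if not meta["Longitude"].between(-180, 180).all():
            problems.append("longitude out of range")
        return problems

    # in continuous integration this would be wrapped in a failing test, e.g.
    # assert not check_metadata(), check_metadata()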
Pollen data remains one of the most widely geographically distributed, publicly accessible and most thoroughly documented sources of quantitative palaeoclimate data. It represents one of the primary terrestrial proxies in understanding the spatial pattern of past climate change at centennial to millennial timescales, and a great example of ’big data’ in the palaeoclimate sciences. The HORNET project is based on the synthesis and analysis of thousands of fossil and modern pollen samples to create a spatially and seasonally explicit record of climate change covering the whole Northern Hemisphere over the last 12,000 years, using a common reconstruction and error accounting methodology. This type of study has been made possible only through long-term community led efforts to advance the availability of ’open big data’, and represents a good example of what can now be achieved within this new paradigm. Primary pollen data for the HORNET project was collected not only from open public databases such as Neotoma, Pangaea and the European Pollen Database, but also by encouraging individual scientists and research groups to share their data for the purposes of the project and these open databases, and through the use of specifically developed digitisation tools which can bring previously inaccessible data into this open digital world. The resulting project database includes over 3000 fossil pollen sites, as well as 16000 modern pollen samples for use in the pollen-climate calibration transfer-function. Building and managing such a large database has been a considerable challenge that has been met primarily through the application and development of open source software, which provide important cost and resource effective tools for the analysis of open data. The HORNET database can be interfaced through a newly developed, simple, freely accessible, and intuitive clickable map based web interface. This interface, hosted on the version control system Github, has been used mainly for quality control, method development and sharing the results and source database. Additionally, it provides the opportunity for other applications such as the comparison with other reconstructions based on other proxies, which we have also included in the database. We present the challenges in building and sharing such a large open database within the typically limited resources and funding that most scientific projects operate.
The development, usage and analysis of climate models often requires the visualization of the data. This visualization should ideally be nice looking, simple in application, fast, easily reproducible and flexible. There exists a wide range of software tools to visualize model data, which, however, often are not (easily) scriptable, lack flexibility, or are simply far too complex for a quick look into the data. Therefore, we developed the open-source visualization framework psyplot, which aims to cover the visualization needs in the daily work of Earth system scientists working with data of the climate system. It is built (mainly) upon the Python packages matplotlib, cartopy and xarray and integrates the visualization process into data analysis. The data can be stored in NetCDF, GeoTIFF, or any other format that is handled by the xarray package. Due to its interactive nature, however, psyplot may also be used with data that is currently being processed and not yet stored on disk. Visualizations of rastered data on the globe are supported on rectangular grids (following or not following the CF conventions) and on triangular grids (following the CF conventions, like the Earth system model ICON, or the unstructured grid conventions (UGRID)). Furthermore, the package visualizes scalar and vector fields and makes it easy to manage and format multiple plots at the same time. Psyplot can be used with only a few lines of code from the command line in an interactive Python session, via Python scripts or through a graphical user interface (GUI). Finally, the framework developed in this package enables very flexible configuration and easy integration into other scripts that use matplotlib.
In an age of digital data analysis, gaining access to data from the pre-digital era - or any data that is only available as a figure on a page - remains a problem and an under-utilized scientific resource. Whilst there are numerous programs available that allow the digitization of scientific data in a simple x-y graph format, we know of no semi-automated program that can deal with data plotted with multiple horizontal axes that share the same vertical axis, such as pollen diagrams and other stratigraphic figures that are common in the Earth sciences. STRADITIZE (Stratigraphic Diagram Digitizer) is a new open-source program that allows stratigraphic figures to be digitized in a single semi-automated operation. It is designed to detect multiple plots of variables analyzed along the same vertical axis, whether this is a sediment core or any similar depth/time series. The program is written in Python and supports mixtures of many different diagram types, such as bar plots, line plots, as well as shaded, stacked, and filled area plots. The package provides an extensively documented graphical user interface for point-and-click handling of the semi-automatic process, but can also be scripted or used from the command line. Other features of STRADITIZE include text recognition to interpret the names of the different plotted variables, the automatic and semi-automatic recognition of picture artifacts, as well as an automatic measurement finder to exactly reproduce the data that has been used to create the diagram. Evaluation of the program has been undertaken by comparing the digitization of published figures with the original digital data. This generally shows very good results, although the outcome is inevitably reliant on the quality and resolution of the original figure.
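One building block of such a semi-automated digitization, locating the individual variable columns in a binarized diagram image, can be illustrated schematically as below. This is a simplification for illustration only and not STRADITIZE's actual algorithm.

    import numpy as np

    def find_data_columns(binary_image):
        """Return (start, stop) pixel ranges of image columns that contain ink.

        binary_image: 2D boolean array, True where the diagram has ink.
        """
        has_ink = binary_image.any(axis=0).astype(int)
        # transitions blank -> ink mark a column start, ink -> blank a stop
        diff = np.diff(np.r_[0, has_ink, 0])
        starts = np.flatnonzero(diff == 1)
        stops = np.flatnonzero(diff == -1)
        return [(int(a), int(b)) for a, b in zip(starts, stops)]

    # toy 10x12 "image" with two separate sub-diagrams
    image = np.zeros((10, 12), dtype=bool)
    image[:, 1:4] = True
    image[:, 7:11] = True
    print(find_data_columns(image))   # [(1, 4), (7, 11)]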
Straditize (Stratigraphic Diagram Digitizer) is a new open-source program that allows stratigraphic diagrams to be digitized in a single semi-automated operation. It is specifically designed for figures that have multiple horizontal axes plotted against a shared vertical axis (e.g. depth/age), such as pollen diagrams.
Accurate modelling of large-scale vegetation dynamics, hydrology, and other environmental processes requires meteorological forcing on daily timescales. While meteorological data with high temporal resolution is becoming increasingly available, simulations for the future or distant past are limited by lack of data and poor performance of climate models, e.g., in simulating daily precipitation. To overcome these limitations, we may temporally downscale monthly summary data to a daily time step using a weather generator. Parameterization of such statistical models has traditionally been based on a limited number of observations. Recent developments in the archiving, distribution, and analysis of "big data" datasets provide new opportunities for the parameterization of a temporal downscaling model that is applicable over a wide range of climates. Here we parameterize a WGEN-type weather generator using more than 50 million individual daily meteorological observations, from over 10’000 stations covering all continents, based on the Global Historical Climatology Network (GHCN) and Synoptic Cloud Reports (EECRA) databases. Using the resulting "universal" parameterization and driven by monthly summaries, we downscale mean temperature (minimum and maximum), cloud cover, and total precipitation, to daily estimates. We apply a hybrid gamma-generalized Pareto distribution to calculate daily precipitation amounts, which overcomes much of the inability of earlier weather generators to simulate high amounts of daily precipitation. Our globally parameterized weather generator has numerous applications, including vegetation and crop modelling for paleoenvironmental studies.
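The hybrid gamma-generalized Pareto idea mentioned above, a gamma distribution for the bulk of wet-day amounts and a heavier generalized Pareto tail above a threshold, can be sketched as follows. The parameter values are invented for illustration; in GWGEN they are fitted to the station observations.

    import numpy as np
    from scipy import stats

    def sample_wet_day_amounts(n, shape=0.8, scale=6.0, threshold=25.0,
                               gp_shape=0.2, gp_scale=8.0, rng=None):
        """Draw wet-day precipitation amounts (mm) from a hybrid distribution."""
        rng = rng or np.random.default_rng()
        amounts = stats.gamma.rvs(shape, scale=scale, size=n, random_state=rng)
        # redraw values beyond the threshold from the heavy-tailed Pareto part
        tail = amounts > threshold
        amounts[tail] = threshold + stats.genpareto.rvs(
            gp_shape, scale=gp_scale, size=tail.sum(), random_state=rng)
        return amounts

    print(sample_wet_day_amounts(10).round(1))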
The development and use of climate models often requires the visualization of geo-referenced data. Creating visualizations should be fast, attractive, flexible, easily applicable and easily reproducible. There is a wide range of software tools available for visualizing raster data, but they are often inaccessible to many users (e.g. because they are difficult to use in a script or have low flexibility). In order to facilitate easy visualization of geo-referenced data, we developed a new framework called "psyplot", which can aid Earth system scientists in their daily work. It is written purely in the programming language Python and primarily built upon the Python packages matplotlib, cartopy and xray. The package can visualize data stored on the hard disk (e.g. NetCDF, GeoTIFF, or any other file format supported by the xray package), or directly from memory or Climate Data Operators (CDOs). Furthermore, data can be visualized on a rectangular grid (following or not following the CF conventions) and on a triangular grid (following the CF or UGRID conventions). Psyplot visualizes 2D scalar and vector fields, enabling the user to easily manage and format multiple plots at the same time, and to export the plots into all common picture formats and movies covered by the matplotlib package. The package can currently be used in an interactive Python session or in Python scripts, and will soon be extended with a graphical user interface (GUI). Finally, the psyplot framework enables flexible configuration, allows easy integration into other scripts that use matplotlib, and provides a flexible foundation for further development.