Data engineer with a bioinformatics background.
Remote · Project-based · Europe-wide
Available
Worked across multiple languages depending on the context: Python for data and research work, Kotlin and Java for backend services and APIs.
Extracting data from APIs and databases, via web scraping, and by parsing unstructured text files. Transforming, cleaning, and combining it into new datasets or specific formats. Visualizing results with matplotlib, Jupyter notebooks, or dashboards in Grafana and Datadog.
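As a small illustration of this kind of extract/transform/combine step, a minimal pandas sketch (the records, column names, and sources here are made up, not from a real project):

```python
import pandas as pd

# Hypothetical extract: records as they might arrive from an API.
api_records = [
    {"id": 1, "name": "Alice ", "signup": "2023-01-05"},
    {"id": 2, "name": "bob", "signup": "2023-02-11"},
]
api_df = pd.DataFrame(api_records)

# Clean: normalize whitespace and casing, parse dates.
api_df["name"] = api_df["name"].str.strip().str.title()
api_df["signup"] = pd.to_datetime(api_df["signup"])

# Combine with a second source keyed on the same id.
orders = pd.DataFrame({"id": [1, 1, 2], "amount": [10.0, 5.5, 7.25]})
totals = orders.groupby("id", as_index=False)["amount"].sum()
combined = api_df.merge(totals, on="id")
print(combined)
```

The same pattern scales up: each source gets its own extraction and cleaning step, and the joins happen on shared keys at the end.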
Training various models and classifiers with scikit-learn and TensorFlow, with proper validation procedures and clean train/test splits. Running experiments on feature engineering and selection, and comparing multiple approaches against each other.
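The comparison workflow, sketched with scikit-learn on synthetic data (the candidate models and dataset are illustrative): hold out a test set first, then score candidates by cross-validation on the training portion only, so the test set stays untouched until the end.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.svm import SVC

# Synthetic data stands in for real features and labels.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Hold out a test set before any model selection happens.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

candidates = {
    "random_forest": RandomForestClassifier(random_state=0),
    "svm": SVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}
for name, model in candidates.items():
    # 5-fold cross-validation on the training portion only.
    scores = cross_val_score(model, X_train, y_train, cv=5)
    print(f"{name}: mean CV accuracy {scores.mean():.3f}")
```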
Setting up automated pipelines for testing and deployment with GitHub Actions and Jenkins. Experience covers containerized deployments with Docker and Kubernetes, and monitoring running systems including setting up alerts and responding to incidents.
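A minimal GitHub Actions workflow of the kind described, as a sketch (job layout, Python version, and commands are illustrative, not from a specific project):

```yaml
name: ci
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt
      - run: pytest
```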
Worked on research projects involving genetic sequence data, enzyme kinetics, and microbial phenotypes. Experience includes building phylogenetic trees, applying machine learning to biological data, and extracting structured biological data from messy sources.
Built and consumed REST APIs in different contexts: microservice backends in Kotlin/Java and data integrations in Python. Familiar with REST principles, OpenAPI/Swagger documentation, and common authentication patterns like bearer tokens.
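The bearer-token pattern on the consuming side, sketched in Python with requests (the base URL, token, and endpoint are placeholders):

```python
import requests

BASE_URL = "https://api.example.com"  # placeholder
TOKEN = "my-secret-token"  # placeholder; real tokens come from a secret store

def fetch_items(session: requests.Session) -> list[dict]:
    # The session already carries the Authorization header,
    # so every request is authenticated the same way.
    resp = session.get(f"{BASE_URL}/items", timeout=10)
    resp.raise_for_status()
    return resp.json()

session = requests.Session()
session.headers["Authorization"] = f"Bearer {TOKEN}"
```

Setting the header once on the session keeps authentication out of every individual call site.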
Also familiar with: Git · Docker · SQL · Linux · Kubernetes · Microservices · Kotlin · Java · Google Cloud Platform (GCP) · Flutter · AI-assisted development
Built a pipeline to extract structured bacterial phenotype data from Bergey's Manual of Systematic Bacteriology. The source material was messy and inconsistently formatted across volumes. The resulting dataset extended the validation of Traitar, an open-source ML tool for predicting microbial traits from genome sequences, by providing phenotype annotations for 296 additional sequenced bacterial species. Published as second author (papers appear under my maiden name). paper
Also contributed to two additional published studies during this period, primarily visualization work and analysis tasks as a research assistant. paper 1 paper 2
Built a Python program to train and systematically compare multiple ML algorithms for predicting substrate specificity from protein sequence data. Used scikit-learn and TensorFlow to implement random forests, SVMs, recurrent neural networks, Bayesian classifiers and linear regression, and analyzed the results across all approaches.
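For illustration, one common way to turn protein sequences into fixed-length inputs for classical ML models is k-mer counting; this is a minimal made-up sketch, not the thesis code:

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_features(sequence: str, k: int = 2) -> list[int]:
    # Enumerate all possible k-mers so every sequence maps to a
    # vector of the same length (20**k dimensions).
    kmers = ["".join(p) for p in product(AMINO_ACIDS, repeat=k)]
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return [counts[km] for km in kmers]

vec = kmer_features("ACDEAC", k=2)  # 400-dimensional vector for k=2
```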
Adapted the ML pipeline I built for my thesis to predict kinetic constants from protein data, using random forests as the classifier. Focused on feature selection and feature engineering to improve predictions, supplementing the dataset with additional features from protein databases. This was during a PhD candidacy; I left after 1.5 years of independent research when the topic turned out not to be the right fit.
Built a new customer master data system with a redesigned data model, onboarding countries one by one while keeping old and new systems in sync at all times. The work included data mapping for country-specific edge cases, handling Disaster Recovery Plans, and building pre-computed views. Spearheaded the GCP migration by evaluating technologies, setting up test environments and reworking the deployment pipelines. Monitoring and alerting in Datadog.
Python · HTML/CSS/JS · open source data · geodata
An interactive map of all German license plate districts. It is powered by a data pipeline written in Python that takes raw license plate data from Wikipedia, enriches it with geolocation data from OpenStreetMap, and simplifies the boundary files to keep them small enough to serve to visitors. Live, with daily users.
Sourcing the license plate data from Wikipedia meant dealing with plenty of inconsistencies: different formatting for entries with multiple areas, mixed use of area prefixes, and footnotes or remarks scattered throughout that needed to be cleaned out.
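A sketch of this kind of cleanup step (the input string and separator rules are illustrative examples, not the exact formats in the Wikipedia tables):

```python
import re

def clean_area(raw: str) -> list[str]:
    # Drop footnote markers like "[1]" and parenthesized remarks.
    cleaned = re.sub(r"\[\d+\]", "", raw)
    cleaned = re.sub(r"\(.*?\)", "", cleaned)
    # Entries listing multiple areas may use different separators;
    # this split is a simplified example.
    parts = re.split(r"\s*(?:,|und|/)\s*", cleaned)
    return [p.strip() for p in parts if p.strip()]

print(clean_area("Landkreis München[2] und Stadt München (teilweise)"))
```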
To find the right geographic area for each code I queried the Nominatim API. The challenge was that the same name could return multiple hits (a city and its surrounding district share a name), and the fields used to tell them apart were filled inconsistently across entries. Some codes also don't map to a place at all, like codes for nationwide services, which needed to be handled separately.
The raw boundary files were too large to serve to visitors. I had to simplify the GeoJSON enough to keep download sizes reasonable without making the district shapes look wrong.
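The standard technique for this trade-off is line simplification such as Douglas–Peucker: drop points that deviate less than a tolerance from the line between their neighbors. A minimal stdlib sketch (a real pipeline might use a geometry library instead; the tolerance is in coordinate units and needs tuning per dataset):

```python
import math

def simplify(points: list[tuple[float, float]],
             tolerance: float) -> list[tuple[float, float]]:
    """Douglas-Peucker: keep only points farther than `tolerance`
    from the line between the segment endpoints."""
    if len(points) < 3:
        return list(points)
    (x1, y1), (x2, y2) = points[0], points[-1]

    def dist(p: tuple[float, float]) -> float:
        # Perpendicular distance from p to the line (x1,y1)-(x2,y2).
        px, py = p
        den = math.hypot(x2 - x1, y2 - y1)
        if den == 0:
            return math.hypot(px - x1, py - y1)
        return abs((y2 - y1) * px - (x2 - x1) * py + x2 * y1 - y2 * x1) / den

    idx, far = max(enumerate(points[1:-1], start=1), key=lambda t: dist(t[1]))
    if dist(far) <= tolerance:
        # Everything in between is close enough to the straight line.
        return [points[0], points[-1]]
    # Keep the farthest point and recurse on both halves.
    return simplify(points[:idx + 1], tolerance)[:-1] + simplify(points[idx:], tolerance)
```

A near-collinear point gets dropped, while a real corner survives, which is exactly the "smaller files without wrong-looking shapes" trade-off.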
App Development · Flutter · local-only architecture
A minimal dream journal for Android. The app stores everything locally on the phone and does not track or transfer any data. Published on the Play Store.
Without any telemetry there is no visibility into what goes wrong for users. I needed a way for non-technical users to report issues without it being too complicated. The solution was internal logging in the app, an export logs feature, and a button that lets users send an email with the logs attached.
From picking up Flutter and learning UI/UX basics, through writing data privacy documentation, to navigating the Google Play release process. None of this was familiar territory before this project.
Looking for someone to take on a data or bioinformatics project? I'd love to hear about it.
hello@kyralux.de