Gender Inference at Scale: Mapping Scientist Migration

Who MPIDR Are

The Max Planck Institute for Demographic Research (MPIDR) in Rostock is one of the world's leading population science institutions, operating as part of the Max Planck Society — Germany's foremost independent research organisation, funding 86 institutes across the natural, life, and social sciences.

MPIDR's Digital and Computational Demography department, led by Director Emilio Zagheni, is unusual among demographic research groups. It treats large-scale digital trace data — publication records, social media traces, administrative datasets — as primary research instruments rather than supplements to surveys. The department's work has appeared in Nature, PNAS, and the proceedings of major data science and informatics conferences. ISSI — the International Society for Scientometrics and Informetrics — publishes biennial conference proceedings; the scientist migration and gender disparity research supported by this engagement was published in the ISSI 2021 proceedings.

What makes MPIDR technically demanding as a research partner: findings have to meet two standards simultaneously — rigorous enough to survive peer review from demographic scientists, and computationally reproducible enough to satisfy the open science requirements increasingly enforced by major funders including the German Research Foundation (DFG) and the European Research Council (ERC). A result that cannot be replicated from published code is not an acceptable result.

The Problem

The study covered scientist migration to and from Germany from 1996 to 2020 — 25 years of Scopus bibliometric data. Scopus is the world's largest curated abstract and citation database; 8 million records over that period represented a substantial fraction of the world's published scientific output relevant to the study scope.

Two problems had to be solved simultaneously.

Migration detection: tracking affiliation changes across an author's publication history over time. If a researcher publishes from a German institution in 2008 and from a Swiss institution in 2013, that is candidate evidence of emigration — but it requires resolving that both records belong to the same person, and that the affiliation change reflects actual relocation rather than a visiting position, sabbatical, or multi-affiliation co-authorship. At 8 million records, there was no manual path through this.

Gender inference: assigning a gender label to over 1 million distinct researcher records drawn from 80+ countries. Scopus records contain author names but not gender. Names are the only available signal. For a study specifically examining whether gender composition differs between migrating and non-migrating researchers, gender inference accuracy is not a data quality nicety — it is the measurement instrument for the central hypothesis.

According to UNESCO's Institute for Statistics (UIS) 2024 data on women in science, women represent approximately 33% of researchers worldwide, but this figure masks sharp variation: Central Asia and Latin America approach gender parity in research, while East Asia, South Asia, and sub-Saharan Africa remain below 25% female in most STEM disciplines. The global average conceals the national-level variation that studies like this one are designed to quantify. If the gender inference pipeline has systematic error concentrated on specific national name corpora, the study is not measuring what it claims to measure.

According to the OECD's Science, Technology and Innovation Outlook 2024, international researcher mobility is concentrated in high-income country corridors, with Germany consistently ranking among the world's top destinations for internationally mobile researchers — making it one of the most important cases for understanding how international recruitment intersects with gender composition at the discipline level.

What We Built

The engagement delivered a complete research data platform: pipeline, models, analysis, visualisations, and a fully reproducible package published alongside the ISSI 2021 conference paper.

ETL pipeline: 8M+ Scopus records ingested, cleaned, and normalised using Python (Pandas, Dask). Dask's out-of-core DataFrames handled stages where in-memory processing was not feasible on the available hardware.
Author disambiguation: custom entity resolution combining fuzzy string matching for institution name deduplication and embedding-similarity clustering for author identity resolution across name variants and transliteration differences.
Three-source gender inference ensemble: NamSor API + Genderize.io + custom XGBoost classifier trained on character n-grams and country-of-origin features — 89% overall accuracy on the held-out evaluation set.
Migration flow matrices: 1M+ researcher trajectories mapped across 80+ source/destination country pairs, with discipline-level disaggregation across 27 Scopus subject classifications.
Statistical analysis: R (tidyverse, ggplot2) for regression modelling, chi-squared tests, and citation impact comparisons.
Network visualisation: NetworkX migration graphs, Matplotlib/Seaborn/Plotly for publication-ready figures, Dash for interactive dashboards.
Reproducibility infrastructure: Docker-containerised analysis environment, SLURM job scripts for HPC cluster execution, full code and data pipeline published alongside the paper.

How We Built It

The ETL Pipeline for 8 Million Records

Scopus export data arrives as structured but messy CSV files: inconsistent encoding, multiple affiliation fields per record, name variants that differ by transliteration convention, diacritic handling, and abbreviation style. Pandas handled initial load and normalisation. Dask was introduced for the affiliation deduplication stage — fuzzy matching across millions of institution name strings exceeded available RAM on the analysis workstation.
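The name-normalisation part of that cleaning step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the pipeline's actual code: the helper name `normalise_name` and the specific rules (strip diacritics, collapse whitespace, case-fold) are assumptions chosen for the example.

```python
import unicodedata


def normalise_name(raw: str) -> str:
    """Return a comparison key for a name string (hypothetical helper)."""
    # NFKD splits accented letters into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", raw)
    # Drop the combining marks so "Müller" and "Muller" produce the same key.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Collapse runs of whitespace and case-fold for case-insensitive matching.
    return " ".join(stripped.split()).casefold()


print(normalise_name("  José   García "))
```

Letters such as "ł" or "ø" have no Unicode decomposition, so a real pipeline would need an explicit mapping table for them on top of this step.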

The pipeline was designed for full re-runs from raw data: no manual steps, no intermediate files without checksums, no undocumented transformations. MPIDR's open science commitments required the pipeline to be runnable by a reviewer from a clean repository checkout.
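One way to enforce the "no intermediate files without checksums" rule is a manifest of SHA-256 hashes that is verified before each downstream stage. The sketch below assumes a manifest held as a plain dictionary of relative path to hex digest; the function names and the manifest layout are illustrative, not taken from the published repository.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large intermediate files never sit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: dict, root: Path) -> list:
    """Return the relative paths that are missing or whose hash has changed."""
    return [
        rel
        for rel, expected in manifest.items()
        if not (root / rel).is_file() or sha256_of(root / rel) != expected
    ]
```

A stage would refuse to run if `verify_manifest` returns a non-empty list, which turns silent drift in intermediate data into an explicit failure.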

Author Disambiguation and Affiliation Tracking

Author disambiguation in bibliometrics combines name similarity with co-authorship network analysis — authors who share co-authors are more likely to be the same person. We used this alongside institution name matching and ORCID linkage where available (ORCID coverage was sparse in the 1996–2005 period of the dataset).
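The co-authorship signal can be illustrated with a Jaccard overlap between two records' co-author sets, blended with a name-similarity score. This is a sketch under stated assumptions: the linear blend and the `name_weight` value of 0.6 are hypothetical and were not taken from the project.

```python
def coauthor_jaccard(coauthors_a: set, coauthors_b: set) -> float:
    """Share of co-authors the two records have in common (0.0 to 1.0)."""
    union = coauthors_a | coauthors_b
    if not union:
        return 0.0  # no co-authorship evidence either way
    return len(coauthors_a & coauthors_b) / len(union)


def same_person_score(name_similarity: float, coauthors_a: set,
                      coauthors_b: set, name_weight: float = 0.6) -> float:
    """Blend name similarity with co-author overlap (weights are assumed)."""
    overlap = coauthor_jaccard(coauthors_a, coauthors_b)
    return name_weight * name_similarity + (1 - name_weight) * overlap


score = same_person_score(1.0, {"a", "b", "c"}, {"b", "c", "d"})
print(round(score, 2))
```

Two records with identical name strings but no shared co-authors score lower than two records with a slightly different name and a largely shared co-author set, which is the behaviour the "J. Smith" case needs.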

Affiliation tracking required a separate decision layer: not every affiliation change is a migration. We defined migration operationally as a change in primary institutional affiliation country persisting for at least two consecutive publication years — filtering out single-paper visiting positions. This threshold was calibrated against a manually verified sample of approximately 500 researcher records where migration status was independently determined.
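The operational definition can be expressed as a small function over one author's publication history. This sketch makes several assumptions that the text does not spell out: each publication year has already been reduced to a single primary country, "consecutive publication years" means adjacent years in which the author published, and the function name and return shape are illustrative.

```python
from itertools import groupby


def detect_migrations(history, min_years=2):
    """history: iterable of (year, country), one primary country per year.

    Returns (origin, destination, first_year_at_destination) tuples.
    """
    ordered = sorted(history)
    # Collapse the history into runs of publication years in the same country.
    runs = []
    for country, group in groupby(ordered, key=lambda rec: rec[1]):
        years = [year for year, _ in group]
        runs.append((country, years[0], len(years)))
    # Only runs lasting at least min_years count as an established affiliation.
    stable = [run for run in runs if run[2] >= min_years]
    events = []
    for (src, _, _), (dst, start, _) in zip(stable, stable[1:]):
        if src != dst:
            events.append((src, dst, start))
    return events


history = [(2006, "DE"), (2007, "DE"), (2008, "DE"), (2009, "CH"),
           (2010, "DE"), (2011, "DE"), (2013, "CH"), (2014, "CH")]
print(detect_migrations(history))
```

In this toy history the single 2009 paper from Switzerland is ignored as a short stay, while the 2013 move is kept. Calling the function with `min_years=1` or `min_years=3` gives the kind of alternative threshold definitions used in a sensitivity analysis.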

Building the Gender Inference Ensemble

The ensemble architecture is worth describing in full, because this is where most of the iteration happened.

How the three-source ensemble works:

NamSor API — Strength: excellent on European, Latin American, and South Asian names; calibrated confidence scores. Weakness: systematic error on transliterated East Asian and Arabic names.
Genderize.io — Strength: broader name coverage; useful where NamSor data is sparse. Weakness: lower per-name model sophistication; same transliteration blind spot.
Custom XGBoost classifier — Strength: character n-grams + country-of-origin features; trained on underperforming name subsets. Weakness: requires curated training data; slower to run than API calls.

Decision logic — first attempt: confidence-weighted voting between NamSor and Genderize. Where both agreed above a confidence threshold, accept. Where they disagreed or fell below threshold, mark as unknown.

This produced acceptable aggregate accuracy, but the audit by national naming corpus revealed the failure. Both APIs returned high-confidence, concordant labels for a significant fraction of transliterated East Asian and Arabic names, and those labels were systematically wrong. High confidence plus agreement is normally a reliable signal. For this name population it was misleading, because both models were trained predominantly on Western name data: they were pattern-matching on features that associated certain name structures with a gender, and those associations did not hold across transliteration systems.

The XGBoost classifier was trained on character n-grams (2–4 character windows) from the romanised name string, plus country-of-origin features extracted from the Scopus institutional affiliation field. Country of origin shifts the classification prior — a name that is ambiguously gendered in isolation becomes more predictable given knowledge that the researcher is affiliated with a Chinese, Egyptian, or Polish institution. The classifier achieved 89% overall accuracy on the held-out evaluation set, with the largest accuracy gains on exactly the East Asian and Arabic transliterated name subsets where the API ensemble failed.
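The feature construction described here can be sketched without any ML dependencies. The following is an illustration only: the boundary markers `^` and `$`, the feature-name prefixes, and the function names are assumptions, and the real feature set may have differed.

```python
from collections import Counter


def char_ngrams(name: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count character n-grams (2 to 4 characters) in a romanised name."""
    # Boundary markers let the model tell prefixes and suffixes apart
    # (an assumption of this sketch, not confirmed by the project).
    padded = f"^{name.lower()}$"
    return Counter(
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    )


def featurise(name: str, country_code: str) -> dict:
    """Sparse feature dict: n-gram counts plus a one-hot country indicator."""
    features = {f"ng={gram}": count for gram, count in char_ngrams(name).items()}
    features[f"country={country_code}"] = 1
    return features


print(sorted(featurise("Li", "CN")))
```

Dictionaries in this shape could be turned into a sparse matrix with scikit-learn's `DictVectorizer` before training a gradient-boosted classifier; that wiring is left out here to keep the example dependency-free.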

Final decision logic: where NamSor and Genderize agreed above threshold AND the XGBoost classifier agreed, accept with high confidence. Where the APIs disagreed or the classifier dissented, use the XGBoost output where country-of-origin signal was strong; otherwise mark as uncertain and exclude from precision-critical analysis.
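The final decision rule translates almost directly into code. In this sketch the `Prediction` type, the 0.9 API threshold, and the boolean `country_signal_strong` flag are placeholders: the text does not state the threshold value or how signal strength was measured.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    label: str         # "F" or "M"
    confidence: float  # 0.0 to 1.0


def decide(namsor: Prediction, genderize: Prediction, xgb: Prediction,
           country_signal_strong: bool, api_threshold: float = 0.9):
    """Return (label or None, tier) following the three-source rule."""
    apis_agree = (
        namsor.label == genderize.label
        and min(namsor.confidence, genderize.confidence) >= api_threshold
    )
    # All three sources concur: accept with high confidence.
    if apis_agree and xgb.label == namsor.label:
        return namsor.label, "high"
    # APIs disagree, fall below threshold, or the classifier dissents:
    # trust the classifier only when the country-of-origin signal is strong.
    if country_signal_strong:
        return xgb.label, "classifier"
    # Otherwise exclude the record from precision-critical analysis.
    return None, "uncertain"


print(decide(Prediction("M", 0.97), Prediction("M", 0.95),
             Prediction("F", 0.80), country_signal_strong=True))
```

Returning the tier alongside the label keeps the provenance of each assignment visible, so downstream analyses can filter on it.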

Statistical Modelling and Migration Flow Matrices

Migration flow matrices were constructed at country-pair and discipline level for each five-year cohort in the study window. Gender ratios were computed separately for inflow, outflow, and non-migrating researcher populations within each discipline, with chi-squared tests for significance of ratio differences. Regression models predicted citation impact from migration status and gender, controlling for discipline and cohort. All models were written as reproducible R Markdown documents with figures generated directly from analysis code.
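The study's tests were run in R; the Python sketch below only illustrates the computation behind a chi-squared test of gender ratios between two populations, for example migrating versus non-migrating researchers in one discipline. The counts are made-up toy numbers, not study data.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of group and gender.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Rows: migrating, non-migrating. Columns: female, male. Toy counts only.
stat, dof = chi_squared([[10, 20], [20, 10]])
print(round(stat, 3), dof)
```

With one degree of freedom, a statistic above 3.841 corresponds to significance at the 5% level. This version applies no continuity correction, so library routines that apply Yates' correction to 2x2 tables by default will report a slightly smaller statistic.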

Reproducibility Infrastructure

SLURM job scripts managed compute-intensive stages on the HPC cluster. Docker containerised the full Python and R environment at specific package versions. A researcher with cluster access and the published repository could reproduce the full analysis from raw Scopus exports — a requirement, not a goal.

What Made It Hard

The Gender Inference Confidence Problem

The core issue is described in the ensemble section above, but the fix required a step most pipelines skip: auditing accuracy broken down by country of origin, not just in aggregate.

The API ensemble's overall accuracy looked fine initially. It was only when we stratified the accuracy metrics by national naming corpus that the failure became visible. European names — where both APIs had the most training data — performed well. Transliterated East Asian and Arabic names had error rates three to four times higher. In a study specifically about internationally mobile researchers, whose names disproportionately come from exactly those underperforming categories, that accuracy gap was not an acceptable residual.
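The stratified audit itself is simple to express. The sketch below assumes each evaluation record carries a naming-corpus label, the predicted gender, and the verified gender; the group names and the toy records are invented for illustration and are not study results.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """records: iterable of (group, predicted, actual).

    Returns {group: (accuracy, sample_size)}.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        hits[group] += int(predicted == actual)
    return {group: (hits[group] / totals[group], totals[group]) for group in totals}


# Toy evaluation set: overall accuracy hides the weaker group.
sample = [
    ("europe", "F", "F"), ("europe", "M", "M"),
    ("europe", "M", "M"), ("europe", "F", "M"),
    ("east_asia", "F", "M"), ("east_asia", "M", "M"),
]
for group, (acc, n) in accuracy_by_group(sample).items():
    print(group, acc, n)
```

Reporting the sample size next to each accuracy matters: a per-group figure computed on a handful of names is not strong evidence in either direction.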

Building the XGBoost classifier required sourcing and curating a name-nationality-gender training dataset that covered the underperforming subsets adequately. This was a data curation problem before it was a modelling problem. The headline accuracy number from any single-source gender inference tool is almost always more optimistic than the number for the specific name subset your research depends on.

Author Disambiguation at 8 Million Records

The "J. Smith" problem is well-documented in bibliometrics — multiple distinct people sharing a common name. Less documented is the inverse: the same researcher appearing under different name representations across 25 years. A Chinese researcher whose name appears in pinyin in early publications and in an anglicised form in later ones. An Eastern European researcher whose name retains diacritics in some journals and loses them in others. An author who publishes under a maiden name through 2008 and a married name thereafter.

The disambiguation system had to handle both directions simultaneously: splitting records that share a name but belong to different people, and merging records that share a person but not a consistent name string. Errors in either direction compound across the migration detection layer — a split creates a false migration event; a merge creates a false non-migration. Embedding-similarity handled romanisation variants effectively; ORCID linkage was valuable where it existed; a manually verified validation sample provided the benchmark for tuning merge and split thresholds.

The False Migration Signal

Defining migration from affiliation changes in publication records is inherently noisy. Researchers take sabbaticals, visiting positions, bilateral exchange fellowships, and extended international collaborations — all of which generate affiliation changes that look like migration in the raw data but are not.

The two-consecutive-year threshold filtered the most obvious cases but did not eliminate all noise. Researchers on two-year visiting positions at German institutions appeared as inflows; researchers on extended international fellowships appeared as outflows. We ran sensitivity analysis across threshold definitions — one year, two years, three years — and reported findings under multiple threshold conditions rather than presenting a single operational definition as ground truth. This is the correct approach for a conference paper where the measurement definition is itself a methodological choice, not a fixed fact.

What Changed

Three findings were directly relevant to MPIDR's policy research agenda:

1. Migration concentration: the US, UK, and Switzerland together account for approximately 58% of German researcher emigration in the 1996–2020 period. Bilateral corridor concentration is higher than prior estimates had suggested, and has remained stable across five-year cohorts rather than diversifying over time.
2. Gender balance in international flows: internationally mobile researchers arriving in Germany are measurably more gender-balanced than the domestic German researcher population in the disciplines with the most severe gender skew, Engineering (11:1 male-to-female domestic ratio) and Physics (8:1). The arrival of internationally recruited researchers is associated with modest discipline-level improvement over the study period.
3. Citation impact and mobility: internationally mobile researchers have statistically higher citation impact than non-mobile researchers matched on discipline, cohort, and institutional prestige. This is consistent with prior bibliometric literature, here demonstrated at national scale with a 25-year longitudinal window.

The reproducible pipeline became a research infrastructure asset: subsequent MPIDR researchers working on related bibliometric questions reused the disambiguation and gender inference components without rebuilding them. In population-level analysis of health and demographic outcomes, the infrastructure you leave behind often has longer-term value than the specific findings it was built to generate.

What's Next

The MPIDR pipeline has clear extension directions:

Real-time bibliometric monitoring — continuously ingesting new Scopus records to enable rolling analysis of current talent flows rather than retrospective cohort studies
Predictive mobility modelling — sequence models (LSTM or Transformer) trained on career trajectory data to predict likely migration corridors for early-career researchers; relevant for targeted international recruitment policy
Cross-platform data fusion — integrating Web of Science, OpenAlex, and ORCID to improve coverage, particularly for researchers underrepresented in Scopus across certain regions and disciplines
Policy simulation dashboard — an interactive tool for policymakers to model the impact of funding changes, visa reforms, or salary benchmarks on projected talent attraction and retention

Bibliometric Analysis & Gender Inference: 4 Common Questions

How accurate is gender inference from names, and what actually affects the accuracy?

Commercial APIs like NamSor and Genderize.io achieve 90–95% accuracy on European names with large training datasets. Accuracy degrades significantly for transliterated East Asian, Arabic, and Slavic names (often to 70–80%) because those name corpora are underrepresented in training data. The best approach for research applications is a multi-source ensemble: use two commercial APIs for coverage, then add a custom classifier trained on the underperforming subsets, with country-of-origin as a feature to shift the prior. The critical step most implementations skip is auditing accuracy by national naming corpus rather than globally. The headline accuracy number is almost always more optimistic than the number for the name subset your specific analysis depends on.

How do you process 8 million academic records efficiently in Python?

Pandas handles most operations at this scale with careful dtype management and chunked reading. The specific bottleneck in the MPIDR pipeline was fuzzy string matching for institution name deduplication: pairwise comparison across millions of affiliation strings is quadratic in the naive case. We used Dask for out-of-core processing on that stage, with blocking rules (comparing only records sharing the same country code and first five characters of institution name) to reduce the comparison space. For HPC cluster stages, SLURM job arrays parallelised across researchers by the first letter of their normalised surname. The full pipeline ran in approximately 18 hours on the cluster; analysis-only stages ran in under 2 hours on a standard workstation.
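The blocking rule is easiest to see in a small single-machine sketch. Here the standard library's `difflib.SequenceMatcher` stands in for whatever fuzzy matcher the pipeline used, and the 0.85 similarity threshold is an assumed value; only the blocking key (country code plus the first five characters of the institution name) comes from the description above.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def blocked_candidate_pairs(institutions, threshold=0.85):
    """institutions: iterable of (country_code, name).

    Returns (name_a, name_b, score) for likely duplicates within each block.
    """
    blocks = defaultdict(set)
    for country, name in institutions:
        # Blocking key: same country and same first five characters.
        blocks[(country, name.lower()[:5])].add(name)
    pairs = []
    for names in blocks.values():
        # Pairwise comparison is now quadratic only within each small block.
        for a, b in combinations(sorted(names), 2):
            score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if score >= threshold:
                pairs.append((a, b, round(score, 3)))
    return pairs


records = [
    ("DE", "Max Planck Institute for Demographic Research"),
    ("DE", "Max Planck Inst. for Demographic Research"),
    ("DE", "Technical University of Munich"),
    ("CH", "Max Planck Institute for Demographic Research"),
]
print(blocked_candidate_pairs(records))
```

The trade-off is recall: two spellings of the same institution that differ within the first five characters land in different blocks and are never compared, so the blocking key has to be chosen with the data's typical variants in mind.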

What is bibliometric migration analysis and what policy questions can it answer?

Bibliometric migration analysis uses publication records, specifically changes in institutional affiliations listed on papers, to reconstruct international researcher mobility patterns at scale. Because publication data is longitudinal and covers most active researchers in publishing disciplines, it can answer questions that survey data cannot: How have Germany's research talent flows changed over 25 years? Which disciplines are losing the most researchers to the US? Are internationally recruited researchers more likely to remain long-term than domestically trained ones? The limitation is that publication-based migration detection is noisy (visiting positions resemble migration in the data) and researchers who stop publishing become invisible. It works best for studying active researchers in internationally publishing disciplines, which is the population most relevant for science and immigration policy.

How do you build a fully reproducible research data pipeline for a conference paper?

Four requirements: (1) containerise the environment with Docker so the full dependency graph is pinned and re-runnable; (2) manage compute-intensive stages via HPC job scripts that are part of the repository; (3) make every intermediate artefact reproducible from code, with checksums on stored outputs; (4) write statistical analysis in R Markdown or executed notebooks so figures and tables are generated directly from code. The most common reproducibility failure is the last one: figures regenerated manually after final edits, breaking the direct code-to-output link. Building the training dataset for a custom ML classifier is often where reproducibility breaks down first: document data curation decisions as rigorously as the modelling code, or the pipeline is not reproducible in the sense that matters for peer review.

The most underestimated part of the MPIDR project was the gender inference audit. The system appeared to work until we stratified accuracy by national naming corpus — and then it was clear that what we had built was measuring something systematically different from what the study needed, for exactly the researchers the study was most focused on. Catching that before the analysis ran was the critical quality control step. It cost time. It was worth it.

If you are working on a large-scale data engineering problem for research, policy analysis, or demographic modelling, explore our work across research and analytics projects or book a discovery call to discuss your specific dataset and research question.
