Berlin's Digital Archives Are Drowning in Duplicate Images — Here's What the Numbers Show
From Mitte to Marzahn, municipal databases are bloated with redundant image files, and the cleanup bill is only growing.

Berlin's public sector holds more than 14 million digitised images across its network of borough archives, cultural institutions and administrative portals — and an estimated one in five of those files is a duplicate. That figure, drawn from a 2025 internal audit commissioned by the Senate Department for Digital Transformation, has quietly become one of the more expensive housekeeping problems in the city's push toward a fully paperless government.
The timing matters. Berlin's digital infrastructure budget for 2026 sits at roughly €340 million, with a significant share earmarked for the city's ongoing e-government rollout under the Berliner E-Government-Gesetz. When storage costs are inflated by redundant data, every euro spent warehousing a photograph of the same Alexanderplatz construction site taken three times from the same angle is a euro not spent on actual services. The Senatsverwaltung für Inneres und Digitales has flagged duplicate image management as a tier-two priority for the second half of this year.
The Landesarchiv Berlin, headquartered on Eichborndamm in Reinickendorf, manages roughly 1.3 million digitised photographic records. Staff there estimate that deduplication work completed in late 2024 removed approximately 87,000 redundant files — about 6.7 percent of the photographic collection at that point. That sounds modest until you account for the cloud storage overhead: each percentage point reduction in duplicate files translated to a saving of around €4,200 annually under the archive's current contract with a Frankfurt-based storage provider.
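As a rough check on those figures, the arithmetic can be sketched in a few lines of Python. The variable and function names are illustrative, and the calculation assumes the per-point saving scales linearly across the whole reduction, which the archive's contract may not guarantee.

```python
# Figures as quoted by the Landesarchiv; names here are illustrative only.
total_records = 1_300_000        # digitised photographic records
removed_duplicates = 87_000      # redundant files removed in late 2024
saving_per_point_eur = 4_200     # annual saving per percentage point


def duplicate_share_percent(removed: int, total: int) -> float:
    """Share of the collection that was removed, in percent."""
    return removed / total * 100


def annual_saving_eur(share_percent: float, per_point: float) -> float:
    """Assumption: savings scale linearly with each percentage point."""
    return share_percent * per_point


share = duplicate_share_percent(removed_duplicates, total_records)
saving = annual_saving_eur(share, saving_per_point_eur)
print(f"{share:.1f} percent removed, about EUR {saving:,.0f} saved per year")
```

On those assumptions, the 2024 cleanup is worth roughly €28,000 a year in storage costs.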
The numbers are starker at the Stadtmuseum Berlin, which operates across multiple sites including the Märkisches Museum near Köllnischer Park in Mitte. The museum's digitisation team processed over 200,000 object photographs between 2022 and 2025 as part of the federally funded Neustart Kultur programme. Internal records show that roughly 23 percent of those image files arrived from partner institutions as duplicates or near-duplicates: the same object with a marginally different crop or compression setting. Resolving which version to keep, which metadata to retain, and how to flag the discarded file for audit purposes took an average of 11 minutes per image pair when handled manually.
Multiply that across tens of thousands of pairs and the labour cost becomes significant. At Berlin's public sector pay scales under TVöD Entgeltgruppe 9b, that manual processing time adds up to the equivalent of roughly two full-time positions per year, absorbed invisibly into existing workloads.
The city is not standing still. CityLAB Berlin, based on Platz der Luftbrücke in Tempelhof, has been piloting an open-source duplicate detection tool since March 2026 as part of its broader algorithmic governance research strand. The pilot targets image hashing, a technique that reduces each photograph to a compact numerical fingerprint so that identical or near-identical images produce matching or closely similar values, allowing near-instant comparison across large datasets. Early results from a trial run against 50,000 files from the Bezirksamt Friedrichshain-Kreuzberg suggest the tool can flag probable duplicates with roughly 94 percent accuracy, dramatically cutting the manual review burden.
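The article does not say which hashing algorithm the CityLAB tool uses. As one possible illustration, the sketch below implements a simple "average hash", a common perceptual fingerprint, in plain Python. It assumes each image has already been converted to a small grid of grayscale values (real tools typically downscale to something like 8 by 8 pixels first); the function names, the sample grids and the distance threshold are all hypothetical.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values, 0-255


def average_hash(pixels: Grid) -> int:
    """Pack one bit per pixel: 1 if the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def probable_duplicate(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Flag a pair when the fingerprints differ in only a few bits.

    The threshold of 2 is an arbitrary choice for this 16-bit example.
    """
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


original = [
    [200, 200, 40, 40],
    [200, 200, 40, 40],
    [40, 40, 200, 200],
    [40, 40, 200, 200],
]
# Same picture after a slight brightness shift, as recompression might cause.
recompressed = [[min(255, v + 5) for v in row] for row in original]
# A different picture: bright top half, dark bottom half.
different = [
    [200, 200, 200, 200],
    [200, 200, 200, 200],
    [40, 40, 40, 40],
    [40, 40, 40, 40],
]

print(probable_duplicate(original, recompressed))
print(probable_duplicate(original, different))
```

Because the fingerprint depends on relative brightness rather than exact bytes, the shifted copy still matches the original while the unrelated image does not. Any pair flagged this way would still need human review before a file is discarded.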
The catch is integration. Berlin's borough administrations run on at least seven distinct document management systems, not all of which share compatible metadata standards. Connecting a deduplication tool to all of them requires either a centralised data gateway — something the Senatsverwaltung has been promising since 2023 under its One-Stop-Government initiative — or borough-by-borough rollouts that take years and multiply costs.
For institutions managing their own collections now, the practical advice from archivists at the Landesarchiv is straightforward: establish a single ingest point for all incoming image files, apply hash-based checking at the point of upload rather than retrospectively, and build duplication review into digitisation contracts before a project starts rather than treating it as a post-production problem. The cost of prevention, they note, is a fraction of the cost of the cleanup bill Berlin is now working through.
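What checking at the point of upload might look like can be sketched briefly. The example below is a minimal illustration, not the Landesarchiv's system: the `IngestIndex` class and the file names are hypothetical, and it uses a SHA-256 digest from Python's standard library, which only catches byte-for-byte identical files.

```python
import hashlib
from typing import Dict, Optional


class IngestIndex:
    """Hypothetical single ingest point that remembers one hash per stored file."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # SHA-256 digest -> name of first file

    def submit(self, name: str, data: bytes) -> Optional[str]:
        """Register an upload.

        Returns the name of an earlier identical file, or None if the
        content is new (in which case it is recorded).
        """
        digest = hashlib.sha256(data).hexdigest()
        earlier = self._seen.get(digest)
        if earlier is None:
            self._seen[digest] = name
        return earlier


index = IngestIndex()
first = index.submit("alexanderplatz_001.tif", b"example image bytes")
second = index.submit("alexanderplatz_copy.tif", b"example image bytes")
third = index.submit("koellnischer_park.tif", b"other image bytes")

print(first)   # new content, so nothing earlier to report
print(second)  # identical bytes, so the first file's name comes back
print(third)
```

Rejecting or flagging the second upload at this stage avoids the manual pair review later. Recropped or recompressed versions have different bytes and would pass this check, so catching them would need a perceptual fingerprint rather than a cryptographic one.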
About this article
Published by The Daily Berlin