Berlin's Duplicate Image Problem: The Numbers Exposing a Hidden Cost in the City's Digital Archives
Thousands of redundant files are quietly draining server budgets and slowing public databases across Berlin's government and cultural institutions.

Berlin's public institutions are sitting on a data problem they can measure but have struggled to fix. Across the Senatsverwaltung für Kultur und Gesellschaftlichen Zusammenhalt, the Stadtbibliothek network, and the city's sprawling municipal archive system, duplicate image files account for an estimated 18 to 22 percent of total stored visual assets — a redundancy rate that IT procurement officers say translates directly into wasted licensing fees, bloated server contracts, and slower public-facing databases.
The issue has landed back on desks this summer because Berlin's Senate approved a new digital infrastructure framework in March 2026 that ties institutional funding renewals to measurable data hygiene benchmarks. Institutions that cannot demonstrate a duplicate-reduction plan by September 2026 risk seeing their cloud storage subsidies — currently calculated on a per-terabyte basis under the Berliner Digitalisierungsbudget — clawed back. That budget line ran to roughly €4.2 million in the 2025 fiscal year across all eligible public bodies, according to figures published in the Senate's annual Haushaltsplan.
The Stadtmuseum Berlin, whose collections span sites from the Ephraim-Palais in Mitte to the Märkisches Museum at Köllnischer Park, flagged the problem internally after a digitisation push that ran from 2022 through early 2025. That effort scanned more than 340,000 physical objects. Post-project audits found that nearly 61,000 image files had at least one exact or near-duplicate stored in a separate folder or under a different filename convention, a byproduct of multiple contractors using different cataloguing software simultaneously.
The Zentral- und Landesbibliothek Berlin on Breite Straße in Mitte faces a structurally similar issue. Its digital image repository, which includes historical Berlin street photography and press archive donations, grew by roughly 2.4 terabytes in 2024 alone. A cross-referencing audit completed in April 2026 identified around 14 percent of those new files as duplicates or near-duplicates of material already held, often ingested through separate departmental upload pipelines that lacked a common deduplication checkpoint.
Storage costs in Berlin's municipal cloud contracts, primarily held with European providers under GDPR-compliant frameworks, run at approximately €22 per terabyte per month for hot storage. That may sound modest, but when an institution holds 80 terabytes of image data and a fifth of it is redundant, the waste comes to roughly €350 a month, or more than €4,200 a year, before any staff-time costs are counted. Across a dozen mid-sized public cultural institutions, annual losses attributable purely to duplicate image storage are conservatively estimated in the low six figures in euros, an estimate based on standard contract rates, not on any single institution's disclosed figures.
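The single-institution example can be worked through directly. What follows is a minimal sketch that uses only the figures quoted in the paragraph above (80 terabytes held, a fifth of it redundant, €22 per terabyte per month); the function name is illustrative and not taken from any Senate document.

```python
def monthly_duplicate_cost(total_tb, duplicate_share, eur_per_tb_month):
    """Monthly hot-storage spend attributable to redundant files."""
    redundant_tb = total_tb * duplicate_share
    return redundant_tb * eur_per_tb_month


# Figures from the article's example: 80 TB held, one fifth redundant,
# approximately 22 EUR per terabyte per month.
monthly = monthly_duplicate_cost(total_tb=80, duplicate_share=0.20, eur_per_tb_month=22)
annual = monthly * 12

print(f"{monthly:.0f} EUR per month, {annual:.0f} EUR per year")
```

Under those inputs the redundant share is 16 terabytes, which works out to about €352 a month and about €4,224 a year for storage alone. Staff time and licensing fees are not included.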
Fixing the problem is less a technology question than an organisational one. Perceptual hashing tools — software that generates a fingerprint for each image and flags near-matches even when filenames differ — are mature, widely available, and in some cases open-source. The Fraunhofer FOKUS institute on Kaiserin-Augusta-Allee in Charlottenburg has worked with several Berlin public bodies on exactly this kind of content-fingerprinting pipeline. The technical lift is not large. The harder part is agreeing on which file to keep, which metadata record is authoritative, and which department owns the decision.
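To give a sense of how small the technical lift is, here is a minimal sketch of one common perceptual-hashing scheme, the average hash. It is an illustration only, not the pipeline used by Fraunhofer FOKUS or any Berlin institution. It assumes each image has already been converted to an 8x8 grayscale grid by an imaging library; the function names and the distance threshold of 5 are assumptions chosen for the example.

```python
def average_hash(pixels):
    """Return a 64-bit fingerprint for an 8x8 grid of grayscale values (0-255).

    Each bit records whether a pixel is brighter than the grid's mean, so the
    fingerprint survives renaming and mild brightness changes.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a, hash_b):
    """Count the bits that differ between two fingerprints."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a, hash_b, max_distance=5):
    """Flag a pair as near-duplicates; the threshold is an assumed value."""
    return hamming_distance(hash_a, hash_b) <= max_distance


# Synthetic stand-ins for downscaled scans (hypothetical data).
original = [[(r * 8 + c) * 4 for c in range(8)] for r in range(8)]
brighter = [[min(255, v + 10) for v in row] for row in original]
unrelated = [[255 if (r + c) % 2 == 0 else 0 for c in range(8)] for r in range(8)]

print(is_near_duplicate(average_hash(original), average_hash(brighter)))   # True
print(is_near_duplicate(average_hash(original), average_hash(unrelated)))  # False
```

The fingerprint only flags candidates. Deciding which copy to keep and which metadata record is authoritative remains a human decision.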
The Senate's September 2026 deadline is creating pressure that earlier voluntary guidelines did not. Institutions that submit a credible deduplication audit by that date — showing the volume of redundant files identified, the methodology used, and a deletion or merge schedule — will retain their per-terabyte subsidy at the current rate. Those that miss the deadline face a tiered reduction, starting at 15 percent in the first quarter of non-compliance.
For smaller Bezirk-level archives, the practical advice is to start with a free perceptual-hash scan of existing image directories before commissioning any consultancy work. The data on scope tends to be sobering enough to generate internal momentum on its own. Berlin's problem is not unique — comparable redundancy rates have been documented in Hamburg's Kulturbehörde and in Vienna's Wienbibliothek im Rathaus — but the city's new funding mechanism means the cost of inaction is, for once, quantified and enforced.
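As a zero-cost first pass before any perceptual scan, an archive can count byte-identical copies using nothing beyond the Python standard library. The sketch below is a swapped-in, simpler technique: it finds exact duplicates only, so files that were re-encoded or resized will not match. The suffix list and function names are assumptions for illustration.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Assumed set of image extensions; adjust to the archive's own formats.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file's contents, read in chunks to keep memory use low."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Group image files under root that have byte-identical contents.

    Returns a list of groups; each group is a list of two or more paths.
    Filenames and folders are ignored, so copies stored under different
    naming conventions still land in the same group.
    """
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in IMAGE_SUFFIXES:
            groups[file_digest(path)].append(path)
    return [paths for paths in groups.values() if len(paths) > 1]
```

Calling `find_exact_duplicates` on an image directory and totalling the sizes of every file after the first in each group gives a rough lower bound on reclaimable storage.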
About this article
Published by The Daily Berlin