Berlin's Duplicate Image Problem: The Numbers Revealing a Digital Archive Crisis
Thousands of redundant image files are clogging the servers of Berlin's public institutions, and the data shows the cleanup bill is climbing fast.
Berlin's public sector is sitting on several hundred terabytes, by current estimates, of duplicated digital image files spread across municipal departments. Internal audits at the Senate Department for Urban Development and Housing have flagged the problem as a growing cost and compliance concern heading into the second half of 2026. The issue is mundane on the surface (the same photograph saved twice, or twelve times), but the aggregate numbers tell a different story.
This matters now because the city's ongoing digitalisation drive, anchored in the Berlin Digital Strategy adopted by the SPD-led Senate coalition, is forcing agencies to consolidate legacy storage systems built up over more than two decades of uncoordinated file management. When different departments merge data centres or migrate to shared cloud infrastructure — as the Berliner Senatsverwaltung has been doing progressively since 2023 — duplicate files surface in industrial quantities. The cost of cloud storage is not trivial: commercial rates for enterprise-grade object storage in Frankfurt-based data centres, the nearest major hub serving Berlin institutions, run roughly €20 to €25 per terabyte per month for managed services.
Two institutions stand out in the current audit cycle. The Stadtbibliothek Berlin, which administers digital collections across its 71 branch libraries, acknowledged in its 2025 annual report that its digitisation programme had produced significant file duplication during scanning campaigns run between 2018 and 2022, when multiple vendors delivered overlapping image sets without a unified deduplication protocol in place. Separately, the Berliner Morgenpost archives — digitised under a partnership with the Zentral- und Landesbibliothek Berlin on Breite Straße in Mitte — reportedly contain tens of thousands of press photographs indexed under variant filenames pointing to identical image data.
The problem is not unique to public bodies. Startups in Mitte's tech corridor around Chausseestraße and in the Kreuzberg cluster near Oranienstraße have long dealt with duplicate media assets, because engineering teams scale quickly and storage governance lags behind headcount. For a Berlin-based e-commerce or media startup burning through Series A cash, paying for redundant image storage in AWS's Frankfurt region is a real line item that growth-stage CFOs are increasingly scrutinising.
Industry benchmarks from storage analytics firms suggest that between 20 and 40 percent of unmanaged digital image repositories contain duplicate or near-duplicate files — a range that, if applied conservatively to Berlin's municipal digital estate, implies substantial avoidable expenditure every quarter. Deduplication software licences for enterprise deployments typically run between €5,000 and €30,000 annually depending on repository size, a fraction of the ongoing storage overhead for large unmanaged archives.
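As a rough illustration of that arithmetic, the sketch below applies the 20 to 40 percent benchmark and the €20 to €25 per terabyte monthly rate quoted earlier to a hypothetical 1,000-terabyte image estate. The estate size is an assumption chosen for round numbers, not a figure from any audit, and the function names are purely illustrative.

```python
# Illustrative arithmetic only. TOTAL_TB is a hypothetical estate size,
# not an audited figure; the percentages and rates are the ranges
# quoted in the article.
TOTAL_TB = 1000


def duplicate_tb(total_tb, duplicate_percent):
    """Terabytes of duplicate data for a given duplicate share in percent."""
    return total_tb * duplicate_percent / 100


def annual_storage_cost_eur(tb, eur_per_tb_month):
    """Yearly storage cost in euros for a volume billed per terabyte per month."""
    return tb * eur_per_tb_month * 12


# Low end: 20 percent duplicates at EUR 20; high end: 40 percent at EUR 25.
low = annual_storage_cost_eur(duplicate_tb(TOTAL_TB, 20), 20)
high = annual_storage_cost_eur(duplicate_tb(TOTAL_TB, 40), 25)
print(f"Avoidable storage cost: EUR {low:,.0f} to EUR {high:,.0f} per year")
```

Under those assumptions the avoidable storage bill would fall between €48,000 and €120,000 a year, set against the €5,000 to €30,000 annual licence range cited above. A different estate size scales the result proportionally.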
Fixing the problem is less straightforward than deleting obvious copies. Near-duplicate detection, meaning the identification of images that look identical but were saved at different resolutions, colour profiles, or compression levels, requires perceptual hashing or machine-learning-based similarity tools, or else manual curatorial review. The Zuse Institute Berlin on Takustraße in Dahlem, which conducts applied research in scientific computing and data management, has worked on exactly this class of problem in research archive contexts, though its focus has been primarily on scientific imaging rather than administrative photography.
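To make the idea of perceptual hashing concrete, here is a minimal sketch of an average hash, one of the most basic perceptual hashes and one that involves no machine learning. It is not the method used by any institution named here. It assumes the image has already been decoded into a grid of grayscale values whose width and height are multiples of the hash size; the function names and the toy gradient image are hypothetical.

```python
def average_hash(pixels, hash_size=8):
    """Return a hash_size*hash_size-bit integer fingerprint of a grayscale image.

    pixels is a list of rows of grayscale values. Width and height are
    assumed to be multiples of hash_size (a simplification for this sketch).
    """
    block_h = len(pixels) // hash_size
    block_w = len(pixels[0]) // hash_size

    # Downscale to hash_size x hash_size by averaging each block of pixels.
    cells = []
    for by in range(hash_size):
        for bx in range(hash_size):
            total = 0
            for y in range(by * block_h, (by + 1) * block_h):
                for x in range(bx * block_w, (bx + 1) * block_w):
                    total += pixels[y][x]
            cells.append(total / (block_h * block_w))

    # One bit per cell: 1 if the cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of differing bits between two hashes; small means similar."""
    return bin(a ^ b).count("1")


# Toy data: a 16x16 horizontal gradient and the same picture at 32x32.
original = [[x * 16 for x in range(16)] for _ in range(16)]
upscaled = [[original[y // 2][x // 2] for x in range(32)] for y in range(32)]
print(hamming_distance(average_hash(original), average_hash(upscaled)))
```

The two toy images differ in every byte, so a plain file checksum would treat them as unrelated, yet their average hashes are identical and the distance printed is 0. Real tools use more robust variants and a tuned distance threshold, which is where curatorial judgement still comes in.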
For municipal departments, the practical path forward involves three steps that IT governance consultants working with German Länder have repeatedly outlined: first, a full inventory scan using hash-based deduplication tools; second, a retention policy that specifies which version of a duplicate survives; and third, integration of that policy into procurement contracts with any future digitisation vendors. Berlin's Senate IT authority, the ITDZ Berlin based in Mitte, has the mandate to coordinate exactly this kind of cross-departmental standardisation under the city's eGovernment law.
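The first two of those steps can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not a description of any tool ITDZ Berlin or the consultants use: it groups byte-identical files by SHA-256 digest and applies a hypothetical retention rule that keeps the copy with the oldest modification time. The function names and the rule itself are assumptions; a real policy might instead prefer the highest-quality master or a canonical folder.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def file_digest(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large scans fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root):
    """Step one: inventory scan. Map digest -> paths for byte-identical files."""
    groups = defaultdict(list)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            groups[file_digest(path)].append(path)
    return {d: paths for d, paths in groups.items() if len(paths) > 1}


def choose_survivor(paths):
    """Step two: hypothetical retention rule. Keep the oldest copy,
    breaking ties alphabetically by path."""
    return min(paths, key=lambda p: (p.stat().st_mtime, str(p)))


def plan_retention(root):
    """Return (survivor, redundant_copies) pairs. Nothing is deleted."""
    plan = []
    for paths in find_exact_duplicates(root).values():
        survivor = choose_survivor(paths)
        plan.append((survivor, [p for p in paths if p != survivor]))
    return plan
```

Calling `plan_retention` on an archive folder yields a dry-run report only. Separating the plan from any deletion is a deliberate choice in this sketch, since a retention decision in a public archive would normally be reviewed before files are removed. Exact hashing of this kind catches only byte-identical copies, not the near-duplicates described earlier.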
Departments that have not yet begun that process should expect the consolidation pressure to intensify. The Senate's digital budget for 2026 includes provisions for shared infrastructure migration, meaning the moment of reckoning for unaudited image repositories is approaching whether individual agencies are ready or not. Getting ahead of it now, by running a deduplication scan before a forced migration surfaces the duplicates chaotically, is the cheaper option by a significant margin.
Published by The Daily Berlin