An estimated 40 to 60 percent of the image files in Berlin's digitised public-sector archival collections are redundant, according to internal assessments circulating among IT procurement officers at the Senate Department for Culture and Social Cohesion. The problem is not abstract: duplicate images consume physical server capacity, inflate licensing costs for storage infrastructure, and degrade the reliability of databases that underpin everything from planning applications in Mitte to heritage records in Spandau.
The timing matters because Berlin is midway through a €14 million digitisation drive launched under the 2024–2027 Senate digital strategy. That programme was supposed to bring the city's fragmented archival holdings into a unified, searchable format. Instead, administrators are discovering that migrating collections from legacy systems has compounded the duplicate problem rather than solved it — each transfer cycle copies files without first running deduplication routines, leaving multiple versions of the same photograph or architectural drawing sitting on separate servers.
What the Data Actually Shows
The Landesarchiv Berlin, based on Eichborndamm in Reinickendorf, manages roughly 32 linear kilometres of physical records alongside a growing digital estate. Staff there have flagged that image ingestion workflows built on the Digitool platform, the system used before the migration to Rosetta began in 2023, left behind systematic file duplication wherever batch imports occurred without hash-based verification. A single photograph of the Kaiser Wilhelm Memorial Church, for instance, may exist as 14 separate TIFF files spread across three departmental repositories, each digitised independently.
The Stadtmuseum Berlin, whose main facility sits on Klosterstrasse in the Mitte district, faces a version of the same problem with its photographic holdings from the Cold War era. Deduplication software trials conducted on a sample of 200,000 image files identified a redundancy rate of roughly 38 percent. That means about 76,000 files were exact or near-identical copies, consuming storage that cost the institution money to maintain without adding research value.
Storage is not cheap. Enterprise-grade archival storage in Germany runs between €0.018 and €0.045 per gigabyte per month depending on redundancy tier. High-resolution TIFF files used in archival digitisation average around 80 to 120 megabytes each. Across a collection of one million images, modest by the standards of the Landesarchiv, duplicate bloat at a 40 percent rate translates to 400,000 unnecessary files, or roughly 40 terabytes of excess data at 100 megabytes per file. At the midpoint of that price range, that is a recurring annual cost of roughly €15,000 for storage alone, rising to nearly €22,000 on the most expensive tier, before accounting for backup, retrieval bandwidth, or staff time spent managing conflicting file versions.
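The storage estimate is easy to rerun with different assumptions. The short Python sketch below is illustrative only: the function name and parameters are hypothetical, and it assumes a 100 megabyte average file size and decimal units (1 gigabyte = 1,000 megabytes).

```python
def annual_duplicate_cost_eur(n_files, dup_rate, avg_file_mb, eur_per_gb_month):
    """Yearly storage cost of duplicate files, in euros.

    Assumes decimal units (1 GB = 1,000 MB) and a flat price per GB per month.
    """
    duplicate_files = n_files * dup_rate
    excess_gb = duplicate_files * avg_file_mb / 1000
    return excess_gb * eur_per_gb_month * 12


# Figures from the text: one million images, 40 percent duplicates.
# The 100 MB average file size is an assumed midpoint of 80 to 120 MB.
cheapest = annual_duplicate_cost_eur(1_000_000, 0.40, 100, 0.018)
priciest = annual_duplicate_cost_eur(1_000_000, 0.40, 100, 0.045)
midpoint = annual_duplicate_cost_eur(1_000_000, 0.40, 100, (0.018 + 0.045) / 2)

print(f"cheapest tier: EUR {cheapest:,.0f} per year")
print(f"midpoint:      EUR {midpoint:,.0f} per year")
print(f"priciest tier: EUR {priciest:,.0f} per year")
```

Changing the average file size or the duplicate rate shows how sensitive the result is to each assumption.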
Why Deduplication Has Lagged
The core issue is procedural, not technical. Tools capable of identifying duplicates through perceptual hashing — software that matches visually similar images even when file metadata differs — have been commercially available for years. The German Federal Archives in Koblenz began deploying such tools in 2021. Berlin's institutions have been slower, partly because procurement rules require individual departmental sign-off on software contracts, and partly because digitisation project budgets were allocated for scanning volume rather than data quality.
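To give a sense of what perceptual hashing does, here is a minimal Python sketch of one simple variant, the average hash. It is not the method used by any institution named here; the function names, the toy pixel grids, and the distance threshold are all assumptions for illustration. A real pipeline would first downscale each scan to a small grayscale grid with an imaging library.

```python
def average_hash(pixels):
    """Return a bit string with 1 wherever a pixel is brighter than the image mean.

    `pixels` is a small grayscale grid (list of rows of 0-255 integers),
    assumed to be already downscaled from the full scan.
    """
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value > mean else "0" for value in flat)


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return sum(a != b for a, b in zip(hash_a, hash_b))


def is_near_duplicate(pixels_a, pixels_b, max_distance=2):
    """Treat two images as near-duplicates if their hashes differ in few bits.

    The threshold is an illustrative assumption and would need tuning.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Toy 4x4 "scans": scan_b is scan_a uniformly brightened, scan_c is a different image.
scan_a = [[10, 20, 200, 210], [15, 25, 205, 215], [10, 20, 200, 210], [15, 25, 205, 215]]
scan_b = [[value + 10 for value in row] for row in scan_a]
scan_c = [[200, 210, 10, 20], [205, 215, 15, 25], [200, 210, 10, 20], [205, 215, 15, 25]]

print(is_near_duplicate(scan_a, scan_b))  # brightened copy of the same picture
print(is_near_duplicate(scan_a, scan_c))  # different picture
```

Because the hash depends on relative brightness rather than exact bytes, a re-scan with slightly different exposure or metadata can still match, which is what separates this approach from a plain file checksum.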
The BVG, the city's public transport authority, ran into a smaller but illustrative version of the same problem in 2024 when it migrated its internal engineering image libraries to a new asset management system ahead of the U-Bahn fleet modernisation programme. IT teams reportedly spent six weeks on manual deduplication work that automated tooling could have completed in days.
For the Senate's digital strategy to deliver what it promised by 2027, institutions will need to retrofit deduplication steps into existing ingestion pipelines before the next major migration cycle begins — currently scheduled for late 2026. The practical advice from archivists who have navigated this elsewhere is straightforward: run hash verification at the point of ingest, not after the fact. Every file added to a clean system without that check makes the eventual cleanup more expensive and more disruptive. Berlin's collections are large enough that waiting is no longer a neutral choice.
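Hash verification at the point of ingest can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any archive's actual pipeline: the `ingest` function, the in-memory `registry`, and the file names are hypothetical, and a production system would keep the digests in a database.

```python
import hashlib


def ingest(file_name, data, registry):
    """Accept a file only if its SHA-256 digest has not been seen before.

    `registry` maps digest -> name of the first file ingested with that content.
    Returns (accepted, name_of_existing_copy_or_None).
    """
    digest = hashlib.sha256(data).hexdigest()
    if digest in registry:
        # Byte-identical content is already stored: reject and point to it.
        return False, registry[digest]
    registry[digest] = file_name
    return True, None


registry = {}
first = ingest("mitte/church_0001.tif", b"example image bytes", registry)
second = ingest("spandau/church_copy.tif", b"example image bytes", registry)
third = ingest("spandau/other.tif", b"different bytes", registry)

print(first)
print(second)
print(third)
```

A cryptographic hash only catches byte-identical copies. Independently digitised versions of the same photograph would pass this check and need a perceptual comparison as a second step.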