Berlin's public digital infrastructure is drowning in copies of itself. A growing audit effort across several Senatsverwaltung departments has found that duplicate and near-duplicate images — scanned planning documents, infrastructure photographs, cultural archive assets — account for a disproportionate share of the city's ballooning data storage costs. The problem is not glamorous. It is, however, expensive.
The timing matters because Berlin is midway through a €47 million IT modernisation programme launched under the city's Digital Strategy 2025–2030, which targets a unified data infrastructure across all 12 boroughs. That ambition depends on clean, deduplicated datasets. Without systematic removal of redundant files, the programme risks importing the same mess into a newer, more expensive system.
What the Numbers Actually Show
Storage is not cheap at municipal scale. Enterprise-grade archive storage for public authorities in Germany typically runs between €0.08 and €0.22 per gigabyte per month, depending on contract terms and redundancy tier — figures that compound quickly when unmanaged image libraries swell unchecked. Internal assessments presented to the Abgeordnetenhaus committee on digitalisation in spring 2026 suggested that across the Senatsverwaltung für Stadtentwicklung alone, image duplication rates in planning document repositories were running at roughly 34 percent — meaning more than one in three stored images was a functional copy of another file already in the system.
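To make the compounding concrete, here is a back-of-the-envelope sketch in Python. The price range and the 34 percent duplication rate come from the figures above; the 50,000 GB library size is purely an assumption for illustration, since no repository size has been published, and the function name is hypothetical.

```python
def monthly_duplicate_cost(total_gb, duplicate_share, eur_per_gb_month):
    """Monthly cost, in euros, of storing the duplicated share of a library."""
    return total_gb * duplicate_share * eur_per_gb_month

# ASSUMPTION: hypothetical library size; the article does not state one.
LIBRARY_GB = 50_000
DUPLICATE_SHARE = 0.34            # rate reported for the planning repositories
LOW_PRICE, HIGH_PRICE = 0.08, 0.22  # EUR per GB per month, range quoted above

low = monthly_duplicate_cost(LIBRARY_GB, DUPLICATE_SHARE, LOW_PRICE)
high = monthly_duplicate_cost(LIBRARY_GB, DUPLICATE_SHARE, HIGH_PRICE)
print(f"{low:,.0f} to {high:,.0f} EUR per month spent on duplicates")
```

Under that assumed library size, duplicates alone would cost roughly €1,360 to €3,740 a month, or about €16,000 to €45,000 a year, for a single department's repository.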
The Stadtentwicklung office on Württembergische Straße in Wilmersdorf holds tens of thousands of scanned building permit records, aerial survey images and construction-phase photographs stretching back decades. Much of this material was digitised in multiple batches under different procurement cycles, which is precisely how duplicates accumulate: separate contractors, separate scanning sessions, no cross-referencing at point of ingestion. The same dynamic played out at the Landesarchiv Berlin on Eichborndamm in Reinickendorf, where digitisation of photographic collections from the 1950s through the 1980s produced overlapping image sets that curators have been working since 2023 to reconcile.
The Bezirk of Friedrichshain-Kreuzberg ran a pilot deduplication project in late 2025 using open-source perceptual hashing tools: software that identifies visually identical or near-identical images even when file names and metadata differ. The pilot covered approximately 180,000 image files held by the district's urban planning and building permits office on Frankfurter Allee. Results identified around 41,000 candidate duplicates, of which human review confirmed roughly 28,000 as safe to archive or delete, meaning about two in three flagged files survived review. That is a confirmed duplication rate of about 15.6 percent in a single district's holdings.
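For readers unfamiliar with perceptual hashing, the following is a minimal sketch of one common variant, the difference hash (dHash). It is not the pilot's actual tooling, which has not been named. The sketch assumes images have already been decoded into rows of grayscale values from 0 to 255; a real deployment would use an imaging library for decoding and resizing. The function names and the 10-bit threshold are illustrative assumptions.

```python
def downscale(pixels, width, height):
    """Shrink a grayscale image (list of rows of 0-255 values) by box averaging."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max(y0 + 1, (y + 1) * src_h // height)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max(x0 + 1, (x + 1) * src_w // width)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

def dhash(pixels, size=8):
    """Difference hash: one bit per horizontally adjacent pair in a tiny thumbnail."""
    small = downscale(pixels, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

# Synthetic test images: a horizontal gradient, a brighter copy, a mirrored copy.
original = [[x * 2 for x in range(90)] for _ in range(80)]
brighter = [[v + 10 for v in row] for row in original]  # same content, new exposure
mirrored = [row[::-1] for row in original]              # visually different image

THRESHOLD = 10  # ASSUMPTION: hypothetical cut-off in bits; needs tuning on real data
print(hamming(dhash(original), dhash(brighter)) <= THRESHOLD)  # True: candidate duplicate
print(hamming(dhash(original), dhash(mirrored)) <= THRESHOLD)  # False: different image
```

Because the hash records only whether each thumbnail cell is brighter than its neighbour, a uniform change in exposure leaves it untouched, which is why rescans of the same document tend to collide while different images do not. A small Hamming distance marks a candidate only; as the pilot's numbers show, many candidates do not survive human review.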
Why Deduplication Is Harder Than It Sounds
The challenge is not purely technical. Berlin's public records law, the Berliner Informationsfreiheitsgesetz, requires that original documents be preserved in their submitted form for defined retention periods. That creates a legal constraint on automated deletion: an image flagged as a duplicate by an algorithm may still be the legally authoritative version of a specific submission. Institutions cannot simply run a script and purge. They need human sign-off, audit trails and, in many cases, approval from the Landesbeauftragte für Datenschutz und Informationsfreiheit.
The BVG, Berlin's public transport operator, faces a parallel version of this problem in its engineering documentation. The network's infrastructure image library — covering everything from U-Bahn station surveys to overhead line inspection photographs — has grown sharply since the operator accelerated its maintenance digitalisation push after 2022. Deduplication there is further complicated by the fact that images are often near-duplicates rather than exact copies: the same section of track photographed 48 hours apart for inspection purposes, visually almost identical but legally distinct records.
The practical path forward being discussed in the Senatsverwaltung für Inneres und Digitales involves a phased approach: automated first-pass identification using perceptual hashing, followed by metadata cross-referencing, followed by a human review queue. Institutions running on the city's shared cloud infrastructure through ITDZ Berlin, the state's IT service provider, are expected to begin mandatory deduplication reporting by the first quarter of 2027. For departments not yet on shared infrastructure, still a significant number, reporting remains voluntary, which is to say the timeline is uncertain.
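The three phases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Senatsverwaltung's design: the record fields, the 10-bit threshold, the one-hour capture window and the rule for separating likely duplicates from legally distinct records are all hypothetical. The key property is that nothing is deleted automatically; the code only sorts candidate pairs into a human review queue or a "distinct records" list.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass(frozen=True)
class ImageRecord:
    file_id: str
    phash: int             # precomputed perceptual hash
    submission_id: str     # ASSUMPTION: hypothetical metadata field
    captured_at: datetime

def hamming(a, b):
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def triage(records, max_bits=10, same_capture=timedelta(hours=1)):
    """Sort visually similar pairs into (review_queue, distinct). Deletes nothing."""
    review_queue, distinct = [], []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            # Phase 1: perceptual hash comparison.
            if hamming(a.phash, b.phash) > max_bits:
                continue
            # Phase 2: metadata cross-reference (hypothetical rule).
            same_submission = a.submission_id == b.submission_id
            close_in_time = abs(a.captured_at - b.captured_at) <= same_capture
            if same_submission and close_in_time:
                # Phase 3: a human decides; the pair is only queued.
                review_queue.append((a.file_id, b.file_id))
            else:
                distinct.append((a.file_id, b.file_id))
    return review_queue, distinct

t0 = datetime(2026, 3, 1, 9, 0)
records = [
    ImageRecord("scan-a", 0b1111, "S1", t0),
    ImageRecord("scan-a-rescan", 0b1110, "S1", t0 + timedelta(minutes=10)),
    ImageRecord("track-48h-later", 0b1111, "S2", t0 + timedelta(hours=48)),
]
review_queue, distinct = triage(records)
print(review_queue)  # [('scan-a', 'scan-a-rescan')]
print(distinct)      # [('scan-a', 'track-48h-later'), ('scan-a-rescan', 'track-48h-later')]
```

The third record mirrors the BVG case: visually almost identical to the first, but taken 48 hours later under a different submission, so it is kept out of the deletion path. The pairwise loop is fine for a sketch but grows quadratically; a library of 180,000 files would need an index over the hashes.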