Berlin's public sector has a clutter problem, and it lives on hard drives. This week, the Senatsverwaltung für Kultur confirmed it is advancing a tender for automated duplicate-image detection software, a move that follows months of internal audits revealing that legacy digitisation projects had, in some cases, stored the same photograph three or four times across disconnected databases. The redundancy has ballooned storage costs and, more critically, complicated public access to the city's visual heritage.
The timing matters. Germany's federal Digital Strategy 2025 framework set a deadline for public institutions to consolidate open-data repositories by the end of this year. Berlin, which operates one of the more fragmented municipal archive systems in the country, is behind. The Landesarchiv Berlin, based on Eichborndamm in Reinickendorf, holds roughly 2.3 million digitised images. Staff have known for at least two years that a significant share of that catalogue contains near-identical duplicates created when departments transferred files from obsolete systems without any deduplication step.
What Happened This Week
Three developments converged in quick succession. On Tuesday, the Stadtmuseum Berlin, which manages collections across the Märkisches Museum near Köllnischer Park and the Ephraim-Palais in the Nikolaiviertel, published an internal review acknowledging that it had identified more than 40,000 candidate duplicate records in its own photographic archive since January, when it piloted a hash-based image matching tool. The review does not yet confirm how many of those are true duplicates rather than visually similar but distinct images, and the museum said a manual verification phase will run through September.
On Wednesday, the Berlin Senate's IT coordination unit, ITDZ Berlin, held a closed briefing for representatives from seven Bezirk-level archive offices. Participants — none of whom spoke publicly — were shown a prototype workflow using perceptual hashing combined with machine-learning classification, designed to flag duplicates while preserving deliberate near-copies that carry historical distinctions, such as cropped press photographs versus uncropped originals. The ITDZ declined to provide details of the briefing on the record.
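Because the ITDZ has not released details of its prototype, the following is only a generic sketch of how perceptual hashing can flag candidate duplicates, not a description of the workflow shown at the briefing. It uses a difference hash, one common perceptual-hashing variant, and assumes each image has already been reduced to a small grayscale grid. The function names, the sample grids and the distance threshold are all hypothetical.

```python
from typing import List

# Assumption: each image is already downscaled to a small grayscale grid
# (rows of brightness values, 0-255). Real pipelines resize with an imaging library.
Grid = List[List[int]]


def dhash(pixels: Grid) -> int:
    """Difference hash: one bit per horizontally adjacent pixel pair."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            # The bit is 1 when brightness falls from left to right.
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_candidate_duplicate(a: Grid, b: Grid, max_distance: int = 2) -> bool:
    """Flag a pair for review; max_distance is an illustrative threshold."""
    return hamming(dhash(a), dhash(b)) <= max_distance


# Hypothetical sample data: a scan, a uniformly brighter re-scan, and a different image.
original = [[10, 20, 30, 40], [40, 30, 20, 10], [10, 40, 10, 40]]
brighter = [[value + 5 for value in row] for row in original]
different = [[40, 30, 20, 10], [10, 20, 30, 40], [40, 10, 40, 10]]

print(is_candidate_duplicate(original, brighter))   # hashes are identical
print(is_candidate_duplicate(original, different))  # hashes differ in every bit
```

A uniform brightness shift leaves every left-to-right comparison unchanged, so the re-scan hashes identically to the original, whereas a byte-level checksum would treat the two files as unrelated. A small Hamming distance only marks a pair as a candidate. Deciding whether a cropped press photograph should be kept alongside its uncropped original is the part the briefing reportedly assigned to a classification step and, at the Stadtmuseum, to manual verification.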
Then on Thursday, the Zentral- und Landesbibliothek Berlin posted a procurement notice for a 24-month software licence covering automated metadata reconciliation, with duplicate-image filtering listed as a core requirement. The contract ceiling listed in the notice is €380,000, a figure that drew immediate attention from open-source advocates, who argue that freely available tools could handle a large portion of the task at no licensing cost.
Why the Numbers Are Alarming
Storage is not cheap at scale. According to the Senatsverwaltung's 2025 digital infrastructure report, the city's cultural sector collectively held around 14 petabytes of digitised material as of December 2025, with projected annual storage costs exceeding €2.1 million. If even ten percent of image files are genuine duplicates — a conservative estimate based on the Stadtmuseum's preliminary findings — eliminating them could free meaningful budget for actual digitisation of the estimated 60 percent of the Landesarchiv's physical holdings that have not yet been scanned at all.
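To put a rough figure on that, the arithmetic can be sketched as follows. It rests on assumptions the report does not confirm: that storage costs scale linearly with volume, and that the ten percent duplicate share applies to the whole €2.1 million, even though the 14 petabytes include material other than images. The €2.1 million is itself a lower bound, and the licence comparison uses the €380,000 contract ceiling from the procurement notice, averaged over its 24 months.

```python
# Figures from the article; the linear-scaling model is an assumption.
ANNUAL_STORAGE_COST_EUR = 2_100_000   # reported as "exceeding" this amount
DUPLICATE_SHARE_PERCENT = 10          # the article's conservative estimate
LICENCE_CEILING_EUR = 380_000         # contract ceiling in the procurement notice
LICENCE_TERM_YEARS = 2                # 24-month licence


def estimated_annual_saving(annual_cost_eur: int, duplicate_percent: int) -> int:
    """Storage cost attributable to duplicates, assuming cost scales with volume."""
    return annual_cost_eur * duplicate_percent // 100


saving = estimated_annual_saving(ANNUAL_STORAGE_COST_EUR, DUPLICATE_SHARE_PERCENT)
licence_per_year = LICENCE_CEILING_EUR // LICENCE_TERM_YEARS

print(f"Estimated annual saving: EUR {saving:,}")
print(f"Licence ceiling per year: EUR {licence_per_year:,}")
```

Under those assumptions the saving comes to about €210,000 a year, against a licence ceiling averaging €190,000 a year. Both numbers are upper-end illustrations rather than forecasts: if images account for only part of the stored volume, the real saving would be proportionally smaller.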
The problem is not unique to Berlin. Vienna's Wienbibliothek ran a comparable deduplication exercise in 2024 and reported removing roughly 180,000 redundant image records from its Wiener Bilder collection. But Berlin's archive system is considerably more decentralised, spread across Bezirk-level offices from Spandau to Lichtenberg, which makes a unified software solution harder to deploy consistently.
For researchers, photographers, and journalists who regularly access portals like the Berlin Picture Portal — which aggregates images from multiple institutions — the practical consequence is wasted search time and inconsistent licensing metadata. A photograph of Potsdamer Platz from 1989, for instance, might appear under four separate catalogue numbers with slightly different rights statements, none of them clearly flagged as the authoritative record.
The Senatsverwaltung has not set a public deadline for resolving the duplication backlog, but pressure from the federal Digital Strategy means institutions cannot defer indefinitely. Archive offices that fail to demonstrate consolidated, interoperable databases risk losing access to federal co-financing streams. The next formal review under that framework is scheduled for November 2026. Between now and then, anyone relying on Berlin's digital collections should expect search results to remain messier than they should be, and should cross-reference catalogue numbers carefully before licensing or republishing any historical image.