Berlin's Digital Archives Push Tackles Duplicate Image Crisis This Week
A surge in redundant digital files is costing city institutions storage budgets and slowing public access to historical records.

Berlin's network of public archives and municipal data repositories is confronting a concrete, unglamorous problem that has quietly ballooned over the past two years: tens of thousands of duplicate digital images clogging servers, inflating storage costs, and in some cases serving up the wrong photograph to citizens requesting historical documents online. This week, the Senatsverwaltung für Kultur und gesellschaftlichen Zusammenhalt confirmed it is rolling out a structured deduplication programme across three institutions before the end of the third quarter of 2026.
The timing matters because Berlin is mid-way through digitising its post-reunification municipal records — a project running under the broader Digitales Berlin 2025–2030 framework. As scanning throughput accelerates, so does the rate at which identical or near-identical image files land on different servers under different file names. The problem is not unique to Berlin, but the city's fragmented institutional landscape — dozens of Bezirksämter each running partially independent digital systems — makes it particularly acute here.
The Landesarchiv Berlin, located on Eichborndamm in Reinickendorf, has been dealing with the issue since at least early 2025, when an internal audit identified significant redundancy in its scanned photograph collections covering the Cold War-era divided city. The Stadtbibliothek's digital branch, operating under the Zentral- und Landesbibliothek Berlin umbrella near Breite Straße in Mitte, flagged a related difficulty: duplicate images attached to different catalogue entries were creating contradictory metadata, meaning searches returned the same image under multiple incorrect captions.
Tempelhof-Schöneberg's Bezirksamt — one of the first districts to run its own parallel digitisation push for local planning records — has been piloting a perceptual hashing tool since April 2026. The tool compares image fingerprints rather than raw file data, catching near-duplicates that differ only by compression or slight cropping. Results from the pilot have not yet been made public, but the programme is being watched by at least four other Bezirksämter considering similar contracts.
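How fingerprint comparison differs from comparing raw files can be shown with a minimal sketch. The article does not say which algorithm the Tempelhof-Schöneberg pilot uses; the example below assumes a simple "average hash", one common form of perceptual hashing, and all function names, the 8x8 grid size, and the synthetic test images are illustrative assumptions rather than details of the city's tool.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale pixel values (hypothetical input format)


def shrink(img: Gray, size: int = 8) -> Gray:
    """Downsample to size x size by averaging rectangular blocks of pixels.

    Assumes the image is at least size x size pixels.
    """
    h, w = len(img), len(img[0])
    out = []
    for r in range(size):
        r0 = r * h // size
        r1 = max((r + 1) * h // size, r0 + 1)
        row = []
        for c in range(size):
            c0 = c * w // size
            c1 = max((c + 1) * w // size, c0 + 1)
            block = [img[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def average_hash(img: Gray, size: int = 8) -> int:
    """Fingerprint: one bit per downsampled pixel, set if brighter than the mean."""
    pixels = [p for row in shrink(img, size) for p in row]
    mean = sum(pixels) / len(pixels)
    bits = 0
    for p in pixels:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Synthetic 32x32 images standing in for scans (illustrative data only).
original = [[x * 8 for x in range(32)] for _ in range(32)]
# Same picture with small pixel-level changes, as recompression might cause.
recompressed = [
    [max(0, min(255, original[y][x] + (x + y) % 3 - 1)) for x in range(32)]
    for y in range(32)
]
# A genuinely different picture: the gradient runs top to bottom instead.
different = [[y * 8 for _ in range(32)] for y in range(32)]

near = hamming(average_hash(original), average_hash(recompressed))
far = hamming(average_hash(original), average_hash(different))
print("original vs recompressed:", near)
print("original vs different:", far)
```

The two near-identical images are not byte-for-byte equal, so a conventional file checksum would treat them as unrelated, while their fingerprints differ by only a few bits at most. Where to set the cut-off between "near-duplicate" and "different" is a tuning decision the sketch leaves open.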
The broader context is a digital storage bill that is far from trivial. Municipal cloud and on-premises storage contracts for Berlin's cultural institutions are publicly tendered, and procurement documents from 2025 show the Senatsverwaltung budgeted approximately €4.2 million for data infrastructure across the archive sector for that fiscal year. Deduplication advocates inside the city administration argue that eliminating redundant files could reduce raw storage demand by a meaningful margin, though official estimates of the potential savings have not yet been published.
The three institutions entering the new programme this quarter are expected to complete an initial automated scan of their image libraries by September 2026. After that, human reviewers — archivists, not algorithms — will make final decisions on which files to retain, which to merge under corrected metadata, and which to delete. That human-in-the-loop requirement is deliberate: the Landesarchiv in particular holds irreplaceable photographic material where an automated false-positive deletion would be unrecoverable.
For Berliners who regularly use the online portals of these institutions — researchers at the Freie Universität pulling Weimar-era city maps, journalists accessing post-war construction records, or families tracing genealogy through the Bezirksamt systems — the practical effect should be cleaner search results and faster load times once deduplication is complete. The mess of duplicate results that currently clutters some catalogue searches should diminish substantially by late autumn.
There is a longer-term dimension too. The Digitales Berlin framework has set a target of having 70 percent of all municipal archival holdings accessible in digital form by 2030. Getting the image deduplication infrastructure right now, while the volume of newly scanned material is still manageable, is considerably cheaper than attempting a clean-up operation on a library several times its current size. City archivists have been saying that for two years. This week, it appears someone with a budget is finally listening.
Published by The Daily Berlin