More than 340,000 duplicate image files are sitting on the servers of Berlin's Landesarchiv, clogging storage infrastructure that already runs close to capacity, according to internal audit figures reviewed by The Daily Berlin. The problem is not confined to one building or one department: it runs across at least a dozen public bodies, from the Senatsverwaltung für Kultur und Gesellschaftlichen Zusammenhalt to district-level Bezirksämter managing neighbourhood planning records.
The timing matters. Berlin is midway through a city-wide digitisation push that began under a 2023 coalition agreement between the SPD, Grüne and the now-departed Linke. That programme committed roughly €47 million over four years to bring analogue municipal records into digital form. The catch is that scanning pipelines at multiple sites have been running without deduplication software, meaning that a single photograph of, say, a 1970s Kreuzberg street market may have been ingested four or five times under slightly different file names. Each copy eats storage. Each copy can surface in separate search results. Each copy confuses archivists and researchers trying to reconstruct an accurate historical picture.
Where the Bottlenecks Are Worst
The Zentral- und Landesbibliothek Berlin on Breite Straße in Mitte runs one of the city's largest public-facing digitised collections, with roughly 1.2 million image assets accessible through its online portal. Staff there have flagged that an estimated 8 to 12 percent of the photographic holdings may contain near-duplicate images — scans made on different days, at different resolutions, by different contractors, of the same physical source material. At current storage pricing on the city's preferred cloud-adjacent infrastructure, each unnecessary terabyte costs approximately €18 per month. Across a collection of this size, the bill adds up fast.
The Stadtmuseum Berlin, which manages collections spread across venues including the Ephraim-Palais in the Nikolaiviertel and the Märkisches Museum at Am Köllnischen Park, faces a similar audit backlog. A digitisation tender completed in early 2025 brought in a third-party contractor to scan roughly 60,000 objects from the Mitte collections. A post-project review flagged approximately 4,200 image pairs as likely duplicates using perceptual hashing, a technique that compares pixel patterns rather than file names. The contractor's agreement did not include a deduplication clause. The Stadtmuseum has since updated its tender templates for future rounds.
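For readers unfamiliar with perceptual hashing, the idea can be sketched in a few lines of Python. This is an illustration only, not the tool used in the Stadtmuseum review: it uses the simple "average hash" variant, assumes each scan has already been reduced to a small grid of greyscale values (0 to 255), and the function names and the threshold of 5 differing bits are hypothetical choices made for this example.

```python
def average_hash(pixels):
    """Return one bit per pixel: 1 if the pixel is brighter than the grid's mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    return [1 if value > mean else 0 for value in flat]


def hamming_distance(hash_a, hash_b):
    """Count the positions at which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(hash_a, hash_b))


def likely_duplicates(pixels_a, pixels_b, max_distance=5):
    """Flag a pair when their hashes differ in at most max_distance bits.

    max_distance=5 is an assumed threshold for this sketch, not a standard value.
    """
    distance = hamming_distance(average_hash(pixels_a), average_hash(pixels_b))
    return distance <= max_distance


# Made-up 8x8 greyscale grids standing in for downscaled scans.
scan_a = [[(row * 8 + col) * 4 for col in range(8)] for row in range(8)]
# The same picture scanned slightly brighter, as a second contractor might produce.
scan_b = [[value + 3 for value in row] for row in scan_a]
# A different picture: the tonal pattern is inverted.
other = [[252 - value for value in row] for row in scan_a]

print(likely_duplicates(scan_a, scan_b))  # True
print(likely_duplicates(scan_a, other))   # False
```

Because the hash depends on the pattern of light and dark rather than on the file name or exact pixel values, a rescan at a different brightness still matches, while a genuinely different image does not.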
What the Data Actually Shows
Deduplication is not a marginal technical footnote. A 2024 benchmarking exercise conducted by the Fraunhofer-Institut für Offene Kommunikationssysteme, based in Berlin's Charlottenburg district, found that cultural heritage institutions in German-speaking cities were storing between 15 and 22 percent redundant image data on average across their digitised collections. For Berlin's public archive network, applying that range to the estimated 4.8 million images currently held in institutional storage implies somewhere between 720,000 and just over one million files that are candidates for review or deletion.
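The arithmetic behind that range is easy to check. The short Python sketch below simply applies the 15 and 22 percent figures to the 4.8 million estimate quoted above; the function and variable names are illustrative only.

```python
TOTAL_IMAGES = 4_800_000          # estimated images in Berlin's institutional storage
LOW_PERCENT, HIGH_PERCENT = 15, 22  # redundancy range from the 2024 benchmarking exercise


def redundant_range(total, low_percent, high_percent):
    """Return the low and high counts of potentially redundant files."""
    # Integer arithmetic avoids floating-point rounding surprises.
    return total * low_percent // 100, total * high_percent // 100


low_count, high_count = redundant_range(TOTAL_IMAGES, LOW_PERCENT, HIGH_PERCENT)
print(low_count, high_count)  # 720000 1056000
```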
Storage is the visible cost. The hidden cost is human time. Archivists at the Landesarchiv on Eichborndamm in Reinickendorf report that catalogue integrity suffers when duplicate entries attach conflicting metadata — wrong dates, wrong neighbourhood attributions, wrong photographer credits. Correcting a single disputed record can take several hours. Multiply that across tens of thousands of flagged files and the labour cost dwarfs the server bill.
The Senatsverwaltung für Digitalisierung und Verwaltungsmodernisierung has been in contact with several of the affected institutions about piloting a shared deduplication service, potentially building on tools already in use by the Bundesarchiv in Koblenz. No contract has been signed. A formal procurement process, if launched before the autumn budget window, could see a working system deployed across at least three Berlin institutions by the second quarter of 2027. For researchers working in the reading rooms in Mitte or Reinickendorf right now, the practical advice is simple: always cross-reference catalogue entries against at least two separate search terms before assuming a record is unique. The numbers suggest your first result may well be one of several.