The Daily Berlin

Berlin news, every day


Thousands of Duplicate Images Are Clogging Berlin's Public Digital Archives — and the Numbers Tell a Costly Story

A quiet data crisis inside the city's cultural and administrative repositories is burning through storage budgets and slowing public access to historical records.

By Berlin News Desk · Published 4 July 2026, 8:44 pm


Berlin's digital archives are carrying a measurable dead weight. Across the city's publicly funded image repositories — from the Landesarchiv Berlin on Eichborndamm in Reinickendorf to the digital collections held by the Zentral- und Landesbibliothek on Blücherplatz in Kreuzberg — administrators have identified duplicate image files as one of the fastest-growing drains on storage infrastructure. Internal working documents circulated among Senatsverwaltung IT teams in early 2026 flagged the problem: in some departmental systems, duplicated image assets account for between 18 and 34 percent of total stored data volume.

That range matters because Berlin's Senate Department for Finance approved a digital infrastructure spending line of roughly €47 million for the 2025–2026 fiscal cycle. When as much as a third of an institution's stored data is redundant, the proportional cost becomes hard to justify, particularly when the city is simultaneously arguing in the Abgeordnetenhaus over where to find money for BVG rolling stock upgrades and rent-subsidy programmes in Neukölln and Marzahn-Hellersdorf.

How Duplicates Accumulate — and Why Berlin's Setup Makes It Worse

The mechanics are straightforward. A photograph of, say, the Rotes Rathaus gets scanned once by a heritage team, uploaded to a project folder, emailed to a communications office, re-uploaded to a public-facing portal, then ingested again when systems migrate. Nobody deletes the earlier versions. Multiply that across dozens of departments, two decades of digitisation drives, and multiple rounds of server consolidation since the early 2000s, and the redundancy stacks up fast.
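The simplest case in that chain, byte-identical copies saved under different names, can be illustrated with a content hash. The following is a minimal sketch, not a description of any tool Berlin's archives actually run; the folder names and file contents are hypothetical stand-ins held in memory.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group file names whose bytes share the same SHA-256 digest."""
    by_digest = defaultdict(list)
    for name, data in files.items():
        by_digest[hashlib.sha256(data).hexdigest()].append(name)
    # Keep only digests that occur more than once.
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


# Hypothetical stand-ins for the same scan copied between departments.
scan = b"placeholder bytes for one scanned photograph"
files = {
    "heritage/rotes_rathaus.png": scan,
    "comms/inbox/rathaus_final.png": scan,
    "portal/uploads/img_0042.png": scan,
    "heritage/other_building.png": b"bytes of a different photograph",
}

duplicates = group_exact_duplicates(files)
print(duplicates)
```

A check like this only catches copies that are identical byte for byte. A version that has been re-compressed or resized produces a different digest and would slip through.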

Berlin's structure amplifies the problem. The city-state's administration is spread across twelve Bezirke, each running semi-autonomous IT environments. Tempelhof-Schöneberg's cultural office does not automatically share a deduplication protocol with Pankow's, and neither is required to sync with the centralised systems managed by the IT-Dienstleistungszentrum Berlin, known as ITDZ, on Berliner Straße in Charlottenburg-Wilmersdorf. ITDZ manages the city's core digital infrastructure but does not have mandatory oversight over every Bezirk-level image repository.

The practical result: when researchers at institutions like the Humboldt-Universität zu Berlin request bulk access to digitised archival photographs for academic projects, they routinely receive datasets padded with near-identical versions of the same image — slightly different file names, marginally different compression levels, same content. The ZLB's digital team has reportedly been running manual deduplication checks on incoming datasets since at least 2024, a process staff describe internally as time-consuming and unsustainable at scale.

The Cost in Storage, Time, and Public Access

Storage is not free. Enterprise-grade archival storage of the type used by public institutions runs at roughly €0.02 to €0.05 per gigabyte per month at the volume tiers relevant to city-level archives. If a single major archive holds 200 terabytes of image data and 25 percent of that is duplicates, the unnecessary storage cost alone comes to roughly €1,000 to €2,500 a month, or €12,000 to €30,000 a year, before it is multiplied across a network of institutions. Those figures are conservative benchmarks drawn from publicly available cloud and on-premises pricing structures rather than Berlin-specific contract terms, which are not public, but they frame the scale of the inefficiency.
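The arithmetic behind that estimate can be reproduced in a few lines. This sketch uses only the benchmark figures quoted above and assumes decimal units (1 terabyte = 1,000 gigabytes), which the article's sources do not specify; the function name is illustrative.

```python
GB_PER_TB = 1_000  # assumption: decimal terabytes


def monthly_duplicate_cost(total_tb, duplicate_share, eur_per_gb_month):
    """Monthly cost in euros of storing the duplicated share of an archive."""
    duplicate_gb = total_tb * GB_PER_TB * duplicate_share
    return duplicate_gb * eur_per_gb_month


# 200 TB archive, 25 percent duplicates, at both ends of the price range.
low = monthly_duplicate_cost(200, 0.25, 0.02)
high = monthly_duplicate_cost(200, 0.25, 0.05)

print(f"Monthly: EUR {low:,.0f} to {high:,.0f}")
print(f"Yearly:  EUR {low * 12:,.0f} to {high * 12:,.0f}")
```

Under those assumptions, 50,000 gigabytes of redundant data cost about €1,000 to €2,500 a month for a single archive.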

Automated deduplication tools exist and are widely deployed in commercial contexts. Open-source options like dupeGuru and commercial platforms used by media organisations can process large image libraries and flag near-duplicates using perceptual hashing — a technique that catches copies even when file names or metadata differ. The question in Berlin's case is governance: who mandates adoption, who funds the initial processing runs, and who is responsible when a deduplication algorithm incorrectly flags two legitimately distinct historical photographs as copies of each other.
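To show what perceptual hashing means in practice, here is a minimal sketch of one common variant, a difference hash, applied to small grayscale grids. It is not the algorithm of dupeGuru or any specific product. The synthetic 8x9 grids stand in for thumbnails that a real pipeline would produce by decoding and downscaling each image with an imaging library, and the distance threshold of 5 bits is an arbitrary assumption.

```python
def dhash(pixels):
    """Difference hash: one bit per pixel pair, set when the left pixel
    is brighter than its right-hand neighbour."""
    bits = 0
    for row in pixels:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a, b, max_distance=5):
    """Treat two grids as copies when their hashes differ in few bits.
    The threshold is an assumption and would need tuning per collection."""
    return hamming(dhash(a), dhash(b)) <= max_distance


# Synthetic thumbnails: 8 rows of 9 grayscale values each.
original = [[200 - 20 * x + y for x in range(9)] for y in range(8)]
# The same picture saved slightly brighter, as a re-export might be.
resaved = [[value + 3 for value in row] for row in original]
# A picture with the opposite brightness pattern.
different = [[20 * x + y for x in range(9)] for y in range(8)]

print(is_near_duplicate(original, resaved))
print(is_near_duplicate(original, different))
```

The brighter copy differs in every pixel value yet yields the same hash, while the unrelated grid differs in all 64 bits. The threshold is where the governance question bites: set it too loosely and two distinct photographs of the same street can be flagged as copies.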

A working group under the Senatsverwaltung für Inneres und Digitales was scheduled to produce a framework recommendation on cross-Bezirk data hygiene standards by the second quarter of 2026. That deadline has passed. For institutions investing in public digitisation — and for researchers, journalists, and citizens who rely on those collections — the wait has a real cost attached to every redundant file sitting on a server rack in Charlottenburg or Reinickendorf.


About this article

Published by The Daily Berlin

This article was produced by The Daily Berlin editorial desk and covers news in Berlin. See our editorial standards for how we use AI.
