Berlin's public digital archives contain tens of thousands of duplicate photographs and scanned documents: redundant files that clog storage systems, confuse researchers, and cost the city money it has not budgeted for. That, at least, is the picture emerging from conversations across Berlin's library and archival community this summer, as the Senate Department for Culture and Social Cohesion quietly reviews its digitisation strategy ahead of a planned 2027 infrastructure overhaul.
The issue isn't new, but urgency around it is sharpening. Berlin's Abgeordnetenhaus approved a fresh tranche of funding in early 2026 for expanded public digitisation work, building on the Berliner Digitalisierungsstrategie framework adopted in 2023. That money has accelerated the scanning of physical collections — but faster intake without better deduplication tools means the underlying problem is compounding faster than it's being solved.
What Experts Are Saying
Archivists and data specialists working with Berlin institutions describe the duplicate image problem as a structural one, not a technical glitch. The Zentral- und Landesbibliothek Berlin, which operates reading rooms at both its Amerika-Gedenkbibliothek site on Blücherplatz in Kreuzberg and the Breite Straße location in Mitte, has been grappling with overlapping digital asset collections inherited from multiple predecessor institutions. When legacy systems are merged without a unified metadata standard, the same photograph — say, a 1960s shot of the Gedächtniskirche construction site — can end up filed under three different catalogue numbers with no automated flag to catch it.
Specialists in digital preservation point to perceptual hashing and AI-assisted image fingerprinting as the most practical near-term remedies. These technologies compare images at the pixel-structure level rather than relying on file names or metadata tags, which are frequently inconsistent in public sector databases. Pilot programs using similar tools have been run at institutions including the Deutsche Digitale Bibliothek, the Frankfurt-based national aggregation platform that pulls records from more than 40,000 German cultural institutions, including dozens of Berlin collections.
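For readers who want to see the idea in miniature, the sketch below shows one common perceptual-hashing variant, the difference hash (dHash). It is an illustration only: the tools used in the pilots are not named, and the function names, the 8x8 hash size, and the synthetic "scan" data are assumptions made for this example. The image is represented as a plain grid of grayscale values so the code runs without any imaging library.

```python
from typing import List

Grid = List[List[int]]  # rows of grayscale pixel values, 0-255


def shrink(pixels: Grid, width: int, height: int) -> Grid:
    """Downsample by averaging rectangular blocks of the source grid."""
    src_h, src_w = len(pixels), len(pixels[0])
    out = []
    for y in range(height):
        y0 = y * src_h // height
        y1 = max((y + 1) * src_h // height, y0 + 1)
        row = []
        for x in range(width):
            x0 = x * src_w // width
            x1 = max((x + 1) * src_w // width, x0 + 1)
            block = [pixels[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            row.append(sum(block) // len(block))
        out.append(row)
    return out


def dhash(pixels: Grid, size: int = 8) -> int:
    """Difference hash: one bit per left/right neighbour comparison."""
    small = shrink(pixels, size + 1, size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits; small values suggest near-duplicates."""
    return bin(a ^ b).count("1")


# Synthetic stand-ins for two scans of the same photograph: the second is
# uniformly brighter, so every stored pixel value differs from the first.
scan = [[(x * 7 + y * 13) % 200 for x in range(72)] for y in range(64)]
rescan = [[value + 20 for value in row] for row in scan]

print(scan == rescan)                       # False: the raw data differs
print(hamming(dhash(scan), dhash(rescan)))  # 0: the fingerprints match
```

Because the hash records only whether each region is brighter than its neighbour, a uniform brightness shift leaves it unchanged, which is why a file-name or checksum comparison would miss this pair while the fingerprint catches it. A production system would also need a tuned distance threshold and handling for crops and rotations, which this sketch does not attempt.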
The cost of inaction is quantifiable. Cloud storage for uncompressed archival image files runs to roughly €0.023 per gigabyte per month on standard public-sector procurement contracts, a figure that scales quickly when a single duplicated collection can generate hundreds of gigabytes of redundant data. Deduplication tools, by contrast, are available as open-source software and from commercial vendors at a fraction of that ongoing expense.
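The arithmetic behind that storage rate is simple enough to check by hand. The sketch below uses the €0.023 figure quoted above; the 500-gigabyte volume and the count of 40 duplicated collections are hypothetical inputs chosen purely for illustration, not reported numbers.

```python
from decimal import Decimal

# Rate quoted in the article, in euros per gigabyte per month.
RATE_EUR_PER_GB_MONTH = Decimal("0.023")


def annual_cost(redundant_gb: Decimal) -> Decimal:
    """Yearly storage cost in euros for a given volume of redundant data."""
    return redundant_gb * RATE_EUR_PER_GB_MONTH * 12


# Hypothetical inputs: one duplicated collection of 500 GB, then 40 of them.
one_event = annual_cost(Decimal(500))
forty_events = annual_cost(Decimal(500) * 40)

print(one_event)     # 138.000 euros per year
print(forty_events)  # 5520.000 euros per year
```

The point of the exercise is that the bill is recurring: redundant files cost the same amount again every year they remain in storage, while a one-off deduplication pass does not.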
City Programs and What Comes Next
The Stadtarchiv Berlin, which holds records stretching back centuries and operates out of its Breite Straße facility, is among the institutions expected to participate in a Senate-coordinated working group being assembled this autumn. The group's remit, according to publicly circulated planning documents from the Senatsverwaltung für Kultur, will include establishing a common deduplication protocol for institutions receiving public digitisation grants.
Technologists advising the process say the critical decision point is whether Berlin adopts a centralised deduplication layer — a shared service that all funded institutions pipe their uploads through — or pushes responsibility down to individual archive managers. The centralised model is faster and more consistent; the decentralised one is more politically palatable to institutions protective of their cataloguing autonomy. Neither approach has been formally endorsed yet.
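What a centralised layer would do can be sketched in a few lines. The example below is a toy model under stated assumptions, not a description of any system Berlin has specified: the class and field names, the catalogue numbers, and the four-bit matching threshold are all invented for illustration, and fingerprints are assumed to be 64-bit perceptual hashes.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    institution: str
    catalogue_id: str
    fingerprint: int  # assumed to be a 64-bit perceptual hash


class SharedDedupIndex:
    """One index that every funded institution submits uploads through."""

    def __init__(self, max_distance: int = 4) -> None:
        self.max_distance = max_distance
        self._records: List[Record] = []

    def submit(self, record: Record) -> List[Record]:
        """Store the record and return earlier records that look like it."""
        matches = [
            known
            for known in self._records
            if bin(known.fingerprint ^ record.fingerprint).count("1")
            <= self.max_distance
        ]
        self._records.append(record)
        return matches


index = SharedDedupIndex()
first = index.submit(Record("Archive A", "A-1960-0412", 0xF0F0F0F0F0F0F0F0))
second = index.submit(Record("Library B", "B-77", 0xF0F0F0F0F0F0F0F1))
third = index.submit(Record("Library B", "B-78", 0x0F0F0F0F0F0F0F0F))

print(first)   # no earlier records to match against
print(second)  # flags the Archive A record, one bit apart
print(third)   # nothing close enough to flag
```

The design choice the sketch makes visible is where the comparison happens. In the centralised model a match across two institutions is caught at upload time because both records pass through the same index; in the decentralised model each archive would run its own index and a cross-institution duplicate like the second record would go unnoticed unless the indexes were later reconciled.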
For researchers using facilities like the Staatsbibliothek zu Berlin on Potsdamer Straße in Tiergarten, the practical consequence of unresolved duplicates is wasted time: catalogue searches return multiple entries for identical images, provenance notes conflict between versions, and requests for high-resolution copies sometimes retrieve a lower-quality duplicate rather than the canonical original. Fixing that experience is, ultimately, what is driving the political pressure on administrators to move faster than archival bureaucracies typically do.
The Senate's digitisation review is expected to produce a formal recommendation by the end of the third quarter of 2026. If the working group's timeline holds, procurement for new deduplication tooling could begin before the end of the year — putting Berlin on track to have a functioning system in place before the 2027 infrastructure build-out locks in its data architecture for the next decade.