Berlin's major public digital archives contain hundreds of thousands of duplicate or near-duplicate image files — a sprawling redundancy problem that is costing storage budgets, muddying search results and, in some cases, causing copyrighted photographs to circulate without proper attribution. The issue has landed on the agenda of the Berliner Senatskanzlei's digital infrastructure working group, which has been examining remediation strategies since early 2026.
The timing is not accidental. Across Europe and beyond, the rapid expansion of AI-generated imagery has made the problem dramatically worse. When generative tools produce thousands of visual variations from a single source file, institutional repositories — museums, city archives, news organisations — find themselves ingesting near-identical images that strain deduplication software built for an earlier era. Berlin, as a self-styled tech hub with a concentration of startups around Mitte and Prenzlauer Berg, finds itself both contributing to the problem and positioned to help solve it.
What Berlin's Institutions Are Actually Doing
The Staatliche Museen zu Berlin, which oversees 17 collections including the Pergamonmuseum on Museumsinsel, launched an internal audit of its SMB-digital image repository in January 2026. The project's goal is to flag and consolidate duplicate records across its public-facing database, which hosts digitised artefacts for researchers and educators worldwide. The museum consortium has not published its findings yet, but the audit is understood to be running alongside a broader data-quality initiative tied to the EU's European Cultural Heritage Cloud framework.
Meanwhile, the Technologiestiftung Berlin, based in Glinkastraße in the government quarter, has been piloting perceptual hashing tools — software that generates a compact fingerprint for each image and compares it against a database to identify visual duplicates even when file names or metadata differ. The foundation has worked with several Bezirk-level administrations to test the approach on digitised planning documents and urban photography collections held by district offices in Friedrichshain-Kreuzberg and Tempelhof-Schöneberg.
Perceptual hashing is not new: it has been a standard tool in content moderation since at least 2010. Its application to civic and cultural archives at scale, however, is relatively recent. The challenge in Berlin, as in other large cities that devolve record-keeping to district administrations, is that image collections are fragmented across dozens of agencies with no unified deduplication mandate.
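For readers who want to see the fingerprint-and-compare idea in concrete terms, the following is a minimal sketch of one simple perceptual hash, the so-called average hash. It is not the tooling used by the Technologiestiftung or any Berlin institution, which has not been published. The function names, the 8x8 grid size and the distance threshold of 5 bits are illustrative assumptions, and images are represented as plain grids of grayscale values to keep the example free of external libraries.

```python
from typing import List

Grid = List[List[int]]  # grayscale pixel values in the range 0-255


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Return a size*size-bit fingerprint: one bit per cell, set when the
    cell's mean brightness exceeds the mean of all cells."""
    height, width = len(pixels), len(pixels[0])
    cells = []
    for r in range(size):
        r0 = r * height // size
        r1 = max((r + 1) * height // size, r0 + 1)
        for c in range(size):
            c0 = c * width // size
            c1 = max((c + 1) * width // size, c0 + 1)
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a: int, b: int) -> int:
    """Count the bits in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    # max_distance is an assumed threshold; real collections need tuning.
    return hamming_distance(average_hash(a), average_hash(b)) <= max_distance


# Toy data: a horizontal gradient, a slightly brightened copy, and its negative.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 10) for v in row] for row in gradient]
negative = [[255 - v for v in row] for row in gradient]

print(is_near_duplicate(gradient, brighter))  # True
print(is_near_duplicate(gradient, negative))  # False
```

The brightened copy produces the same fingerprint as the original even though every pixel value differs, which is why this family of techniques can catch duplicates that a byte-for-byte comparison would miss.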
How Other Cities Are Approaching the Same Problem
Amsterdam offers the sharpest contrast. The Gemeente Amsterdam's Stadsarchief, which holds more than 800,000 digitised images of the city dating back to the 19th century, completed a full deduplication pass in 2024 using open-source tooling developed in partnership with the Netherlands Institute for Sound and Vision in Hilversum. The archive reported a 12 percent reduction in its publicly indexed image count after the exercise, a figure that reflects genuine duplicates consolidated, not unique records deleted.
Vienna's Wienbibliothek im Rathaus adopted a different model, embedding deduplication checks directly into its ingest pipeline so that no new image enters the system without being compared against existing holdings. That approach, operational since mid-2025, prevents accumulation rather than requiring periodic cleanups. Seoul's metropolitan digital infrastructure bureau, operating under the city's Smart Seoul Data Campus initiative, went further still — deploying machine-learning classifiers that distinguish between true duplicates and legitimately similar images, such as two photographs taken seconds apart at a public event, which may both have archival value.
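The ingest-time model attributed to Vienna above can be sketched in a few lines. This is a hypothetical illustration, not the Wienbibliothek's actual pipeline: the class name `IngestIndex`, the method `check_and_add` and the 4-bit distance threshold are all assumptions, and the perceptual fingerprint is taken as an integer computed upstream. The sketch checks for byte-identical files first, then for near matches.

```python
import hashlib
from typing import Dict, Optional


class IngestIndex:
    """Hypothetical in-memory index consulted before a new image is stored."""

    def __init__(self, max_distance: int = 4) -> None:
        self.max_distance = max_distance       # assumed near-match threshold
        self._by_digest: Dict[str, str] = {}   # SHA-256 hex digest -> record id
        self._by_print: Dict[int, str] = {}    # perceptual fingerprint -> record id

    def check_and_add(self, record_id: str, data: bytes, fingerprint: int) -> Optional[str]:
        """Return the id of an existing matching record, or None after
        registering the new record when no match is found."""
        digest = hashlib.sha256(data).hexdigest()
        if digest in self._by_digest:          # byte-identical file
            return self._by_digest[digest]
        for known, existing_id in self._by_print.items():
            if bin(known ^ fingerprint).count("1") <= self.max_distance:
                return existing_id             # visually near-identical
        self._by_digest[digest] = record_id
        self._by_print[fingerprint] = record_id
        return None


index = IngestIndex()
print(index.check_and_add("rec-1", b"file-a", 0b1111))      # None
print(index.check_and_add("rec-2", b"file-a", 0b0000))      # rec-1
print(index.check_and_add("rec-3", b"file-b", 0b1110))      # rec-1
print(index.check_and_add("rec-4", b"file-c", 0xFFFF0000))  # None
```

The linear scan over fingerprints is acceptable for a sketch but would not scale to hundreds of thousands of images; a production system would likely need an index structure built for nearest-neighbour lookups. A flagged match would presumably go to a human or a classifier for review rather than being rejected outright, since similar images may both have archival value.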
Berlin's patchwork approach trails both Amsterdam and Vienna in maturity. A unified citywide image governance policy does not yet exist, and the CDU-led Senate coalition has not earmarked specific funding for one in the 2026 budget cycle. The Technologiestiftung's pilot work is promising but limited in scope.
For Berliners with a practical stake in this (researchers using the Landesarchiv Berlin on Eichborndamm in Reinickendorf, journalists pulling images from the Senatsverwaltung's press portal, developers building apps on open city data), the most useful step right now is to cross-reference any image sourced from a Berlin public repository against the SMB-digital catalogue and the Europeana aggregator, both of which flag known duplicates where metadata is complete. The longer-term fix will require the Senate to mandate common metadata standards across Bezirk archives, a reform that the digital working group is expected to formally recommend before the end of the third quarter of 2026.