The Daily Berlin

Berlin news, every day

Berlin's Digital Archives Are Drowning in Duplicate Images — Here's What the Numbers Show

From Mitte to Neukölln, public institutions are sitting on terabytes of redundant visual data, and the cost of ignoring the problem is mounting fast.

By Berlin News Desk · Published 4 July 2026, 8:51 pm

Photo: Simon Schlee on Pexels

Berlin's network of public cultural archives, municipal databases and housing authority portals collectively stores an estimated 40 percent of its image inventory in duplicate form, according to internal benchmarking work conducted by the Technologiestiftung Berlin in the first quarter of 2026. That single figure underpins a quiet but expensive crisis spreading across the city's digital infrastructure.

The timing matters. The Berlin Senate is pushing a broader digitisation drive under its 2025–2030 Verwaltungsdigitalisierung roadmap, committing roughly €180 million to modernise city-facing services over five years. Duplicated assets eat storage budgets, slow search retrieval and introduce version-control errors into public-facing platforms, the exact problems that roadmap is supposed to fix. Getting the image data clean before new systems go live is no longer optional.

Where the Redundancy Lives

The Stadtmuseum Berlin, which manages collections across sites including the Ephraim-Palais in Mitte and the Märkisches Museum at Köllnischer Park, disclosed in its 2025 annual operational review that its digitised photographic holdings had grown to over 1.2 million individual files. Curators estimate that between 25 and 35 percent of those files are near-identical variants: different scans of the same object, or images uploaded multiple times through successive content management system migrations. Staff time spent manually identifying and removing redundant files ran to several hundred hours across 2024 and 2025 combined, the review noted, without providing a precise euro figure for the labour cost.

The problem is not confined to heritage institutions. Wohnungsbaugesellschaft Berlin-Mitte, known as WBM, manages roughly 32,000 apartments across central districts and publishes property photography through its tenant-facing portal. A technical audit circulated internally in late 2025 found that the portal's image library contained duplicate entries for approximately 18 percent of listed units — a residue of platform changes in 2022 and 2023 when files were migrated without deduplication protocols in place.

BVG, the city's public transport operator, faces a comparable situation in its infrastructure photography archive, which documents everything from U-Bahn station conditions at Alexanderplatz to equipment at the Lichtenberg bus depot on Siegfriedstraße. BVG has publicly committed to a cloud migration project with a 2027 completion target; engineers involved in preparatory work have acknowledged that deduplication is a prerequisite step before any cloud transfer can proceed efficiently.

What Deduplication Actually Costs — and Saves

The commercial side of the problem is measurable. Enterprise storage costs in Germany average between €0.018 and €0.025 per gigabyte per month on managed infrastructure, depending on redundancy tier and contract terms, according to publicly available pricing from providers operating in the Berlin market. An archive sitting on 50 terabytes of image data with 40 percent duplication is effectively paying for 20 redundant terabytes (20,000 gigabytes) every month: somewhere between €360 and €500 monthly in pure storage fees, before accounting for backup replication, which typically multiplies costs by a factor of three.
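
The storage arithmetic above can be reproduced in a few lines. The following is a minimal sketch, not a pricing tool: it assumes decimal units (1 terabyte = 1,000 gigabytes), and the function name and its parameters are illustrative rather than taken from any provider's billing model. The replication factor of three is the figure quoted above.

```python
def redundant_storage_cost(total_tb, duplicate_share, eur_per_gb_month,
                           replication_factor=1):
    """Monthly cost in euros of storing the duplicated share of an archive.

    Assumes decimal units (1 TB = 1,000 GB); replication_factor models
    backup copies multiplying the bill (hypothetical parameter).
    """
    redundant_gb = total_tb * 1000 * duplicate_share
    return redundant_gb * eur_per_gb_month * replication_factor


# Figures from the article: 50 TB archive, 40 percent duplication.
low = redundant_storage_cost(50, 0.40, 0.018)
high = redundant_storage_cost(50, 0.40, 0.025)
print(f"Storage only: EUR {low:,.0f} to EUR {high:,.0f} per month")

# With backup replication multiplying costs by three.
low_rep = redundant_storage_cost(50, 0.40, 0.018, replication_factor=3)
high_rep = redundant_storage_cost(50, 0.40, 0.025, replication_factor=3)
print(f"With replication: EUR {low_rep:,.0f} to EUR {high_rep:,.0f} per month")
```

Under these assumptions the storage-only range matches the €360 to €500 quoted above, and the replicated figure is simply three times that.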

Automated deduplication software — tools that use perceptual hashing algorithms to flag visually identical or near-identical images — now processes libraries at speeds of roughly 10,000 images per hour on standard server hardware. For an archive the size of the Stadtmuseum Berlin's holdings, a full scan would require around five days of compute time, assuming no manual review of edge cases. The Technologiestiftung Berlin has been piloting one such pipeline in partnership with the Zentralinstitut für Kunstgeschichte, though that collaboration focuses primarily on fine art reproductions rather than municipal photography.
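
To illustrate the idea, here is a minimal sketch of one simple perceptual hashing variant, the average hash, together with the scan-time estimate. It does not describe which algorithm any of the tools or pilots mentioned here actually use. Images are represented as plain grids of grayscale values at least 8 by 8 pixels in size; a real pipeline would decode files with an image library first. All function names and the 5-bit distance threshold are assumptions chosen for illustration.

```python
from typing import List

Grid = List[List[float]]  # grayscale pixel values, row-major


def shrink(pixels: Grid, size: int = 8) -> Grid:
    """Downsample to size x size by averaging rectangular blocks.

    Assumes the image is at least size x size pixels.
    """
    h, w = len(pixels), len(pixels[0])
    out = []
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        row = []
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def average_hash(pixels: Grid, size: int = 8) -> int:
    """Hash with one bit per cell: 1 if the cell is brighter than the mean."""
    cells = [v for row in shrink(pixels, size) for v in row]
    mean = sum(cells) / len(cells)
    bits = 0
    for v in cells:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def is_near_duplicate(a: Grid, b: Grid, max_distance: int = 5) -> bool:
    """Flag two images whose hashes differ in at most max_distance bits."""
    return hamming(average_hash(a), average_hash(b)) <= max_distance


def scan_days(file_count: int, images_per_hour: int) -> float:
    """Compute time in days for a full scan at a constant throughput."""
    return file_count / images_per_hour / 24


# Synthetic test images: a horizontal gradient, a uniformly brightened copy
# of it, and an unrelated vertical gradient.
original = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[v + 10 for v in row] for row in original]
unrelated = [[y * 16 for _ in range(16)] for y in range(16)]

print(is_near_duplicate(original, brighter))   # same scene, different exposure
print(is_near_duplicate(original, unrelated))  # different content
print(scan_days(1_200_000, 10_000))            # Stadtmuseum-sized archive
```

Because each bit only records whether a region is brighter than the image's own average, a uniformly brightened copy hashes identically to the original, while structurally different images land far apart. The threshold controls how aggressive the matching is, which is where the manual review of edge cases comes in.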

The practical path forward involves three steps that Berlin's public institutions can act on without waiting for Senate-level policy. First, establish a baseline count: no deduplication strategy is credible without knowing how many files actually exist across all content management systems. Second, run a perceptual hash scan before any platform migration, not after. Third, enforce upload validation rules in new systems so that duplicate images are flagged at the point of ingestion rather than discovered years later during the next audit cycle. The €180 million digitisation commitment is substantial, but storage hygiene is where the returns compound — and the compounding starts with counting what you already have.
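
The three steps above can be sketched as follows. This is an illustrative outline under stated assumptions, not a description of any institution's system: the names `baseline_count`, `IngestionIndex` and `pre_migration_scan` are hypothetical, and for brevity the fingerprint is an exact SHA-256 content hash, which catches only byte-identical files. A production pipeline would swap in a perceptual hash to catch near-identical variants as well.

```python
import hashlib
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple


def baseline_count(inventories: Dict[str, Iterable[str]]) -> Counter:
    """Step 1: count files per content management system."""
    return Counter({name: sum(1 for _ in files)
                    for name, files in inventories.items()})


class IngestionIndex:
    """Step 3: flag duplicates at upload time using a content fingerprint."""

    def __init__(self) -> None:
        self._seen: Dict[str, str] = {}  # fingerprint -> id of first file

    @staticmethod
    def fingerprint(data: bytes) -> str:
        # Exact-match stand-in; a perceptual hash would go here instead.
        return hashlib.sha256(data).hexdigest()

    def check_and_register(self, file_id: str, data: bytes) -> Optional[str]:
        """Return the id of an existing identical file, or None if new."""
        fp = self.fingerprint(data)
        existing = self._seen.get(fp)
        if existing is None:
            self._seen[fp] = file_id
        return existing


def pre_migration_scan(files: Iterable[Tuple[str, bytes]]) -> List[Tuple[str, str]]:
    """Step 2: scan before migrating; return (duplicate_id, original_id) pairs."""
    index = IngestionIndex()
    duplicates = []
    for file_id, data in files:
        original = index.check_and_register(file_id, data)
        if original is not None:
            duplicates.append((file_id, original))
    return duplicates


# Hypothetical inventories and file contents for illustration only.
counts = baseline_count({"cms_a": ["a.jpg", "b.jpg"], "cms_b": ["c.jpg"]})
print(sum(counts.values()), "files across", len(counts), "systems")

dups = pre_migration_scan([("1", b"scan-x"), ("2", b"scan-y"), ("3", b"scan-x")])
print(dups)
```

The same index object serves both the one-off scan and the upload gate, which is the point of the third step: once the check exists, running it at ingestion costs almost nothing compared with a later audit.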

About this article

Published by The Daily Berlin

This article was produced by The Daily Berlin editorial desk and covers news in Berlin. See our editorial standards for how we use AI.
