The Daily Berlin

Berlin Removes Thousands of Duplicate Images From Public Databases

New figures reveal how thousands of redundant images are clogging Berlin's public databases, costing storage money and slowing civic tech projects across the capital.

By Berlin News Desk · Published 4 July 2026, 8:36 pm

Berlin's public digital infrastructure is carrying a measurable dead weight. Across municipal image repositories maintained by organisations including the Senatsverwaltung für Stadtentwicklung and the Berliner Morgenpost's archive partners, duplicate image files now account for an estimated 30 to 40 percent of total stored visual content, according to internal assessments circulated within the city's data governance working groups this spring. The problem is not abstract. Every redundant JPEG stored on a city server costs money and slows the tools that depend on it.

The issue has become urgent because Berlin is midway through a €14 million digitisation push tied to the Smart City Berlin strategy, a programme running through 2027 that aims to unify data flows across BVG transport infrastructure, housing registries and neighbourhood planning portals. Bloated image databases are a direct drag on that integration work. When the same photograph of, say, a Kreuzberg courtyard or a Mitte construction site appears seventeen times under different file names, automated systems struggle to cross-reference records accurately.

What the Data Actually Shows

The scale of duplication is easier to grasp in concrete terms. The Landesarchiv Berlin, housed on Eichborndamm in Reinickendorf, manages roughly 1.2 million digitised visual assets. Staff there say that deduplication audits conducted in late 2025 identified between 180,000 and 240,000 files, or 15 to 20 percent of the collection, that were functionally identical or near-identical copies, differing only in resolution, metadata timestamp or file format. Clearing those files would free an estimated 4.7 terabytes of primary storage.

At the Zentralbibliothek am Breite Straße branch of Stadtbibliothek Berlin, librarians piloting a new cataloguing system in early 2026 found that 22 percent of image records imported from legacy databases carried duplicate identifiers, requiring manual reconciliation before the system could go live. That reconciliation work consumed approximately 340 staff hours over six weeks, a cost that project managers had not budgeted for.
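The kind of check that surfaces duplicate identifiers in an import can be sketched in a few lines of Python. This is only an illustration, not the library's cataloguing system: the record layout, the `identifier` field name, the function names and the sample records are all assumptions made for the example.

```python
from collections import defaultdict


def find_duplicate_identifiers(records):
    """Group records by identifier and keep only identifiers used more than once."""
    by_id = defaultdict(list)
    for record in records:
        by_id[record["identifier"]].append(record)
    return {ident: recs for ident, recs in by_id.items() if len(recs) > 1}


def duplicate_share(records):
    """Fraction of records whose identifier is shared with at least one other record."""
    if not records:
        return 0.0
    duplicates = find_duplicate_identifiers(records)
    affected = sum(len(recs) for recs in duplicates.values())
    return affected / len(records)


# Hypothetical sample records, not data from the library pilot.
sample = [
    {"identifier": "IMG-0001", "title": "Courtyard, view north"},
    {"identifier": "IMG-0002", "title": "Construction site"},
    {"identifier": "IMG-0001", "title": "Courtyard (rescan)"},
    {"identifier": "IMG-0003", "title": "Facade"},
]

conflicts = find_duplicate_identifiers(sample)
share = duplicate_share(sample)
```

A report like `conflicts` only tells staff where the clashes are. Deciding which record is authoritative is the manual reconciliation step the librarians describe.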

The financial dimension is not trivial. Cloud storage for public-sector bodies in Berlin runs at roughly €0.023 per gigabyte per month under current procurement contracts. At that rate, four terabytes of redundant data, a round figure close to the Landesarchiv estimate, cost around €92 a month in direct charges. That is not enormous in isolation, but multiplied across a dozen agencies and compounded over a three-year programme cycle, the figure climbs past €30,000 before any labour costs are factored in.
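The arithmetic behind those figures can be reproduced with a short Python sketch. It assumes decimal units (1 terabyte = 1,000 gigabytes), a flat price with no tiering, and that each of the dozen agencies carries the same four terabytes of redundant data. The last assumption is a simplification for illustration, not something the city's figures confirm.

```python
PRICE_PER_GB_MONTH = 0.023  # euros per gigabyte per month, the rate quoted above
REDUNDANT_GB = 4_000        # four terabytes, decimal units (assumption)
AGENCIES = 12               # "a dozen agencies", assumed to carry equal loads
MONTHS = 36                 # three-year programme cycle


def storage_cost(gigabytes, price_per_gb_month, months=1, agencies=1):
    """Flat-rate storage cost in euros, with no tiering or discounts."""
    return gigabytes * price_per_gb_month * months * agencies


monthly_single = storage_cost(REDUNDANT_GB, PRICE_PER_GB_MONTH)
programme_total = storage_cost(REDUNDANT_GB, PRICE_PER_GB_MONTH, MONTHS, AGENCIES)

print(f"One agency, one month: EUR {monthly_single:.2f}")
print(f"Twelve agencies, 36 months: EUR {programme_total:.2f}")
```

Under these assumptions the three-year total comes to roughly €39,700. Using the Landesarchiv's 4.7 terabytes instead of the round four would put one agency's monthly cost at about €108.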

Why Duplicates Pile Up — and What Berlin Is Doing About It

The duplication is largely structural. Berlin's public sector migrated records through at least three separate content management systems between 2012 and 2021, and each migration copied files across without consistent deduplication checks. Photography commissioned for planning consultations in neighbourhoods like Neukölln and Lichtenberg was frequently submitted by several contractors at once, with no single intake system flagging overlaps at the point of upload.
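An intake check of the kind that was missing could, in its simplest form, fingerprint the bytes of each upload and compare against what has already been received. The Python sketch below assumes SHA-256 content hashes held in memory. The `IntakeRegistry` class, its `submit` method and the file names are hypothetical, and the approach only catches byte-identical files, not resized or re-encoded copies.

```python
import hashlib


class IntakeRegistry:
    """Hypothetical intake registry that remembers the content hash of every upload."""

    def __init__(self):
        # Maps a SHA-256 hex digest to the name of the first file with that content.
        self._seen = {}

    def submit(self, filename, data):
        """Return the name of an earlier identical upload, or None if the content is new."""
        digest = hashlib.sha256(data).hexdigest()
        earlier = self._seen.get(digest)
        if earlier is None:
            self._seen[digest] = filename
        return earlier


registry = IntakeRegistry()
first = registry.submit("contractor_a/hof_01.jpg", b"example image bytes")
second = registry.submit("contractor_b/IMG_4471.jpg", b"example image bytes")
third = registry.submit("contractor_b/IMG_4472.jpg", b"different image bytes")
```

Here `second` names the earlier upload from contractor A, so the overlap is flagged regardless of file name, while `first` and `third` come back as new content.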

The city's response is taking shape at CityLAB Berlin on Platz der Luftbrücke in Tempelhof, which since January 2026 has been piloting a perceptual hashing tool (software that compares images by visual fingerprint rather than file name) across a sample dataset of 50,000 urban planning photographs. Early results, presented at a CityLAB open session in April, showed the tool correctly flagging duplicate pairs with a 94.6 percent accuracy rate and a false-positive rate below two percent.
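The general idea of a visual fingerprint can be illustrated with an average hash, one of the simplest perceptual hashing schemes. Which algorithm the CityLAB tool uses has not been reported, so the Python sketch below is an assumption-laden illustration rather than a description of the pilot. It expects an image already decoded to a grid of grayscale values from 0 to 255, and the function names and the distance threshold of 5 bits are hypothetical.

```python
def downscale(pixels, size=8):
    """Shrink a grayscale grid to size x size by averaging blocks.

    Assumes the image is at least size pixels wide and high.
    """
    height, width = len(pixels), len(pixels[0])
    small = []
    for by in range(size):
        y0, y1 = by * height // size, (by + 1) * height // size
        row = []
        for bx in range(size):
            x0, x1 = bx * width // size, (bx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels, size=8):
    """Return a size*size-bit integer: one bit per cell, set if the cell is brighter than the mean."""
    flat = [value for row in downscale(pixels, size) for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def looks_duplicate(pixels_a, pixels_b, max_distance=5):
    """Flag two images as likely duplicates; the threshold is an illustrative assumption."""
    return hamming_distance(average_hash(pixels_a), average_hash(pixels_b)) <= max_distance


# Synthetic test images: a horizontal gradient, the same gradient at double
# resolution, and a vertical gradient standing in for a different photograph.
original = [[x * 16 for x in range(16)] for _ in range(16)]
resized = [[(x // 2) * 16 for x in range(32)] for _ in range(32)]
different = [[y * 16 for _ in range(16)] for y in range(16)]

same_pair = looks_duplicate(original, resized)
other_pair = looks_duplicate(original, different)
```

Because the hash depends on the coarse pattern of light and dark rather than on the exact bytes, the resized copy matches the original while the unrelated image does not. Real tools tune the threshold against labelled pairs, which is where figures such as accuracy and false-positive rate come from.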

The practical stakes extend beyond storage economics. Berlin's housing shortage debate, which has dominated SPD coalition discussions through the first half of 2026, depends partly on accurate photographic records of building conditions across districts. When the same image of a building façade in Marzahn appears tagged under three different addresses, housing inspectors relying on digital tools receive contradictory data. Getting the numbers clean is, in that sense, a prerequisite for getting policy right.

CityLAB's deduplication pilot is expected to expand to the full Senatsverwaltung image archive by October 2026. Agencies that want to connect their databases to the unified Smart City platform will need to complete their own deduplication audits first. The deadline for compliance is set at the end of the first quarter of 2027 — leaving less than nine months for some departments that have not yet begun the process.

About this article

Published by The Daily Berlin

This article was produced by The Daily Berlin editorial desk and covers news in Berlin. See our editorial standards for how we use AI.
