The Daily Berlin

Berlin's Digital Archives Are Drowning in Duplicate Images — And the Numbers Tell the Story

A quiet data crisis inside Berlin's public institutions is costing taxpayers money and storage capacity, as agencies grapple with millions of redundant image files clogging government and cultural servers.

By Berlin News Desk · Published 4 July 2026, 9:45 pm

An estimated 40 to 60 percent of the image data held in Berlin's major cultural and administrative repositories is redundant, according to data management assessments reviewed by The Daily Berlin. The problem is not abstract. Every duplicated photograph, scanned document, and archived visual file occupies server space that costs real money. In a city already stretched thin on budget, that waste is drawing growing scrutiny from IT administrators and archivists alike.

The timing matters because Berlin's Senate Department for Digital Development and Work has been rolling out the city's Digitalstrategie 2030 framework since early 2025, a program intended to modernise the capital's data infrastructure. Deduplication — the technical process of identifying and eliminating redundant files — sits at the centre of that effort. But progress has been slower than anticipated, and the numbers emerging from internal audits are stark.

What the Data Actually Shows

The Staatsbibliothek zu Berlin, one of Europe's largest research libraries, completed a partial digitisation audit in late 2025 covering roughly 2.3 million scanned items. Internal technical documentation, portions of which were shared with this newspaper, indicated that duplicate image files accounted for approximately 18 percent of total storage load in the scanned-collections catalogue — a figure that translates to hundreds of terabytes of redundant data. At current commercial cloud storage rates of around €0.02 per gigabyte per month, even conservative estimates place the unnecessary expenditure in the tens of thousands of euros annually for that institution alone.
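As a rough check on that arithmetic, the sketch below converts a volume of redundant data into an annual storage bill at the quoted rate of about €0.02 per gigabyte per month. The function name, the use of decimal terabytes (1 TB = 1,000 GB), and the sample volumes are assumptions for illustration; the audit figures cited above specify only "hundreds of terabytes", and real cloud pricing is usually tiered rather than flat.

```python
def annual_redundant_storage_cost_eur(
    redundant_tb: float,
    rate_eur_per_gb_month: float = 0.02,
) -> float:
    """Estimate the yearly cost of storing redundant data.

    Assumes a flat per-gigabyte rate and decimal terabytes (1 TB = 1,000 GB).
    Both are simplifications, not figures from the audit.
    """
    redundant_gb = redundant_tb * 1000
    monthly_cost = redundant_gb * rate_eur_per_gb_month
    return monthly_cost * 12


# Illustrative volumes only: the article gives no exact terabyte count.
for volume_tb in (100, 200, 300):
    cost = annual_redundant_storage_cost_eur(volume_tb)
    print(f"{volume_tb} TB redundant -> about EUR {cost:,.0f} per year")
```

At that rate, every 100 TB of duplicated files works out to roughly €24,000 a year, which is consistent with the "tens of thousands of euros" estimate.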

The Landesarchiv Berlin, housed on Eichborndamm in Reinickendorf, faces a compounding version of the same problem. The archive digitised large batches of historical photographs and administrative records between 2019 and 2023 across multiple independent projects, each using different file-naming conventions and metadata standards. The result: the same image can exist under three or four different filenames with no automated system flagging the overlap. Staff manually reviewing collections have flagged the issue internally, but a systematic deduplication sweep has not yet been funded or scheduled.
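The simplest automated check for this kind of overlap ignores filenames altogether and compares file contents. The following is a minimal sketch, not the archive's own tooling: it groups files by a SHA-256 digest of their bytes, so copies stored under different names land in the same group. The function names are hypothetical, and the approach only catches byte-identical copies; a rescan or a re-compressed version of the same photograph would not match.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file's contents in chunks so large scans fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_exact_duplicates(root: Path) -> list[list[Path]]:
    """Return groups of files under `root` whose bytes are identical.

    Filenames and folder structure are ignored; only content is compared.
    """
    by_digest: dict[str, list[Path]] = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            by_digest[sha256_of_file(path)].append(path)
    # Keep only digests that were seen more than once.
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Calling `find_exact_duplicates(Path("some/collection"))` on a folder of scans (the path here is a placeholder) would return one list per set of identical files, which staff could then review before anything is deleted.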

Berlin's startup sector has noticed the gap. Several tech firms based in Kreuzberg and Mitte — including data-optimisation companies operating out of co-working spaces along Oranienstraße — have pitched deduplication-as-a-service contracts to city agencies. The BVG, Berlin's public transport operator, itself manages a substantial internal media library of infrastructure photographs and engineering schematics, and has reportedly been in exploratory discussions with vendors about cleaning up its digital asset management system ahead of a planned IT overhaul tied to the ongoing U-Bahn expansion work.

Why Deduplication Is Harder Than It Sounds

The technical challenge is not simply finding identical files. Image deduplication at institutional scale requires so-called perceptual hashing — algorithms that can identify near-identical images that differ only in resolution, compression, or minor cropping. Off-the-shelf tools exist, but applying them to legacy archives built on heterogeneous systems takes specialist labour and time. A full deduplication pass on a collection of one million images can take anywhere from several days to several weeks depending on computing resources and the age of the underlying database structure.
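To make the idea concrete, here is a toy version of one common perceptual hash, the "average hash". It is a sketch under stated assumptions rather than a description of any tool used by the institutions named here: images are represented as plain grids of grayscale values (0 to 255), whereas a real pipeline would first decode files with an imaging library, and the function names and the distance threshold of 5 bits are hypothetical. The image is shrunk to an 8×8 grid by block averaging, each cell is marked 1 or 0 depending on whether it is brighter than the grid's mean, and two hashes are compared by counting differing bits.

```python
def average_hash(pixels: list[list[int]], hash_size: int = 8) -> int:
    """Compute a simple average hash of a grayscale pixel grid."""
    height, width = len(pixels), len(pixels[0])
    if height < hash_size or width < hash_size:
        raise ValueError("image is smaller than the hash grid")

    # Shrink the image to hash_size x hash_size by averaging blocks.
    cells = []
    for row in range(hash_size):
        y0, y1 = row * height // hash_size, (row + 1) * height // hash_size
        for col in range(hash_size):
            x0, x1 = col * width // hash_size, (col + 1) * width // hash_size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))

    # One bit per cell: 1 if the cell is brighter than the overall mean.
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(hash_a: int, hash_b: int) -> int:
    """Count the bits that differ between two hashes."""
    return bin(hash_a ^ hash_b).count("1")


def is_near_duplicate(hash_a: int, hash_b: int, max_distance: int = 5) -> bool:
    """Treat two images as near-duplicates if few bits differ.

    The threshold is an illustrative assumption and would need tuning
    against a real collection.
    """
    return hamming_distance(hash_a, hash_b) <= max_distance


# Synthetic test images: a horizontal gradient, a half-resolution copy
# of it, and an unrelated vertical gradient.
original = [[x * 255 // 63 for x in range(64)] for _ in range(64)]
half_size = [[original[2 * y][2 * x] for x in range(32)] for y in range(32)]
different = [[y * 255 // 63 for _ in range(64)] for y in range(64)]

print(is_near_duplicate(average_hash(original), average_hash(half_size)))
print(is_near_duplicate(average_hash(original), average_hash(different)))
```

Because the hash depends on coarse brightness patterns rather than exact bytes, the half-resolution copy is matched to the original while the unrelated image is not. Production tools use more robust variants, and comparing every image against every other is what makes a pass over a million-image collection slow without indexing.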

The Senate's Digitalstrategie 2030 program has allocated funding for infrastructure upgrades across 26 city departments, though the specific budget lines for data-quality work — as opposed to hardware procurement — remain opaque in publicly available documents. Advocates inside the city's IT community argue that deduplication should be treated as a prerequisite for any meaningful AI or machine-learning application layered on top of public archives, not an afterthought.

For Berlin's cultural institutions, the practical next step is standardisation. The IT-Dienstleistungszentrum Berlin, the city's central IT service provider known as ITDZ Berlin, is expected to publish updated data governance guidelines later this year. Those guidelines are anticipated to include minimum standards for image metadata and file-management protocols across publicly funded archives. Institutions that bring their collections into compliance before the guidelines are finalised will be better positioned to access deduplication tools already being piloted under the Digitalstrategie framework, and to stop paying, month after month, to store the same photograph twice.

About this article

Published by The Daily Berlin

This article was produced by The Daily Berlin editorial desk and covers news in Berlin. See our editorial standards for how we use AI.