Berlin's Digital Archive Removes Thousands of Duplicate Files From Open-Data Portal
A technical cleanup effort launched this week is forcing municipal departments to confront years of redundant uploads in Berlin's flagship public records system.

Berlin's Senate Department for Digital Transformation confirmed this week that a sweep to replace duplicate images is underway across the city's open-data portal, data.berlin.de, after an internal audit found that redundant or near-identical image files account for a significant share of the platform's storage load. The cleanup, which began Monday, June 30, affects everything from scanned planning documents stored by the Stadtentwicklungsamt to photo assets uploaded by the Berliner Senatsverwaltung für Kultur over the past four years.
The timing matters. Berlin is midway through its Smart City Strategy 2030, a program that commits the city to making its public data infrastructure genuinely usable, not just technically accessible. Redundant image files slow search indexing, inflate cloud hosting costs, and muddy the metadata that developers and researchers depend on when pulling data from the portal for projects ranging from housing analysis to transit planning. For a city that pitches itself as a tech hub, a bloated and poorly maintained archive is an embarrassment the Senate has been under pressure to address.
The internal review, carried out by the Kompetenzzentrum Open Data — the city body responsible for managing data.berlin.de — identified thousands of image entries where functionally identical files had been uploaded multiple times under different file names, often because separate departments had no shared naming convention or deduplication protocol. The problem compounded after 2022, when Berlin expanded portal upload permissions to more than 40 city agencies following a push for greater administrative transparency under the Informationsfreiheitsgesetz, Berlin's freedom-of-information statute.
The Kompetenzzentrum is now working department by department to replace or consolidate duplicate entries, assigning canonical file references and retiring redundant copies. The process is expected to run through the end of July. According to publicly available portal statistics, data.berlin.de hosts more than 3,700 active datasets; image-heavy collections from departments including the Senatsverwaltung für Stadtentwicklung, Bauen und Wohnen have seen the most acute duplication. Storage costs for the portal's hosting infrastructure are funded through the Berlin state IT framework contract with Dataport, the public-sector IT service provider shared by several German states.
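For developers curious what assigning canonical references might look like in practice, the sketch below groups byte-identical files by a SHA-256 content hash and picks one name per group as canonical. It is an illustration only, not the Kompetenzzentrum's actual method: the function name `group_duplicates`, the sample file names, and the rule of choosing the alphabetically first name are all assumptions. Exact hashing also catches only identical bytes, so the near-identical images mentioned in the audit would need a different technique.

```python
import hashlib
from collections import defaultdict


def group_duplicates(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each canonical file name to the redundant names sharing its content.

    `files` maps a file name to its raw bytes. Only byte-identical files
    are treated as duplicates.
    """
    by_digest: dict[str, list[str]] = defaultdict(list)
    for name, content in files.items():
        digest = hashlib.sha256(content).hexdigest()
        by_digest[digest].append(name)

    duplicates: dict[str, list[str]] = {}
    for names in by_digest.values():
        if len(names) > 1:
            # Assumption: the alphabetically first name becomes canonical.
            canonical, *redundant = sorted(names)
            duplicates[canonical] = redundant
    return duplicates


if __name__ == "__main__":
    # Hypothetical uploads: two departments stored the same scan
    # under different names.
    uploads = {
        "plan_mitte_v1.png": b"example-scan-a",
        "Plan-Mitte-final.png": b"example-scan-a",
        "kultur_foto.jpg": b"example-photo-b",
    }
    print(group_duplicates(uploads))
```

A real pipeline would more likely choose the canonical copy by upload date or owning department than by name, but the grouping step would look much the same.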
For residents and developers who rely on the portal daily, the practical disruption this week has been noticeable. Several dataset pages linked from the Berlin Open Data Handbook temporarily returned broken image previews as files were swapped out. The Technologiestiftung Berlin, which runs the CityLAB Berlin innovation lab at Platz der Luftbrücke 4 in Tempelhof, flagged the issue to its network of civic-tech developers after a Tuesday workshop session where participants encountered missing map-tile images in a neighbourhood data tool built on portal feeds.
The Senate Department has indicated that once the current sweep is complete, it plans to introduce automated deduplication checks as a standard step in the upload pipeline — a measure that open-data advocates have been recommending since at least 2023. The Wikimedia Deutschland office on Tempelhofer Ufer 23–24, which has its own working relationship with the portal through collaborative public-domain image projects, has previously called for better coordination between municipal data managers and the broader open-knowledge community, though no formal joint protocol is yet in place.
Developers and researchers who pull image assets from data.berlin.de are advised to re-check any cached file paths after July 31, when the replacement process is scheduled to conclude. The Kompetenzzentrum Open Data is maintaining a changelog on the portal's status page. Anyone building applications dependent on specific image URLs should map to the new canonical references once published, because the old redundant paths will be retired rather than redirected. For Berliners watching the city's broader digital modernisation effort, this week's unglamorous cleanup is as revealing as any headline initiative: good data infrastructure is maintenance work, and Berlin is finally doing some of it.
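Because retired paths will not redirect, applications will need to rewrite their own cached URLs. The snippet below is a minimal sketch of that step, assuming a developer has already built a mapping from old URLs to canonical ones. The function name `remap_cached_urls` and the `example.org` URLs are placeholders, and the format in which the portal will publish its canonical references is not yet known.

```python
def remap_cached_urls(
    cached: list[str], canonical_for: dict[str, str]
) -> tuple[list[str], list[str]]:
    """Replace retired URLs with their canonical equivalents.

    Returns the updated list and the list of URLs that were replaced.
    URLs with no entry in `canonical_for` are left unchanged.
    """
    updated = [canonical_for.get(url, url) for url in cached]
    replaced = [url for url in cached if url in canonical_for]
    return updated, replaced


if __name__ == "__main__":
    # Hypothetical mapping from a retired duplicate to its canonical file.
    mapping = {
        "https://example.org/img/map-copy.png": "https://example.org/img/map.png",
    }
    cache = [
        "https://example.org/img/map-copy.png",
        "https://example.org/img/park.png",
    ]
    updated, replaced = remap_cached_urls(cache, mapping)
    print(updated)
    print(replaced)
```

Logging the replaced URLs makes it easier to confirm afterwards that no retired path is still referenced elsewhere in an application.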
Published by The Daily Berlin