Berlin Moves to Remove Thousands of Duplicate Images From Public Databases
New figures reveal how thousands of redundant images are clogging Berlin's public databases, costing storage money and slowing civic tech projects across the capital.

Berlin's public digital infrastructure is carrying a measurable dead weight. Across municipal image repositories maintained by organisations including the Senatsverwaltung für Stadtentwicklung and the Berliner Morgenpost's archive partners, duplicate image files now account for an estimated 30 to 40 percent of total stored visual content, according to internal assessments circulated within the city's data governance working groups this spring. The problem is not abstract. Every redundant JPEG stored on a city server costs real money and slows real tools.
The issue has become urgent because Berlin is mid-way through a €14 million digitisation push tied to the Smart City Berlin strategy, a programme running through 2027 that aims to unify data flows across BVG transport infrastructure, housing registries and neighbourhood planning portals. Bloated image databases are a direct drag on that integration work. When the same photograph of, say, a Kreuzberg courtyard or a Mitte construction site appears seventeen times under different file names, automated systems struggle to cross-reference records accurately.
The scale of duplication is easier to grasp in concrete terms. The Landesarchiv Berlin, housed on Eichborndamm in Reinickendorf, manages roughly 1.2 million digitised visual assets. Staff there have flagged that deduplication audits conducted in late 2025 identified between 180,000 and 240,000 files that were functionally identical or near-identical copies, differing only in resolution, metadata timestamp or file format. Clearing those files would free an estimated 4.7 terabytes of primary storage.
At the Zentralbibliothek branch of Stadtbibliothek Berlin on Breite Straße, librarians piloting a new cataloguing system in early 2026 found that 22 percent of image records imported from legacy databases carried duplicate identifiers, requiring manual reconciliation before the system could go live. That reconciliation work consumed approximately 340 staff hours over six weeks, a cost that project managers had not budgeted for.
The financial dimension is not trivial. Cloud storage for public-sector bodies in Berlin runs at roughly €0.023 per gigabyte per month under current procurement contracts. Four terabytes of redundant data translates to around €92 a month in direct costs. That is not enormous in isolation, but multiplied across a dozen agencies and sustained over a three-year programme cycle, the figure climbs past €30,000 before any labour costs are factored in.
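The storage arithmetic above can be reproduced in a few lines. This is only an illustrative sketch: the price and the four-terabyte volume come from the figures quoted here, while decimal terabytes (1,000 GB each) and an identical load at each of twelve agencies are assumptions made for the example, not reported facts.

```python
# Figures quoted in the article.
PRICE_EUR_PER_GB_MONTH = 0.023
REDUNDANT_TB = 4

# Assumptions made only for this illustration.
GB_PER_TB = 1000   # decimal terabytes
AGENCIES = 12      # "a dozen agencies", each assumed to carry the same load
MONTHS = 36        # a three-year programme cycle


def monthly_cost_eur(terabytes, price=PRICE_EUR_PER_GB_MONTH):
    """Direct monthly storage cost for a given volume of redundant data."""
    return terabytes * GB_PER_TB * price


def programme_cost_eur(terabytes, agencies=AGENCIES, months=MONTHS):
    """Monthly cost multiplied across agencies and months (no interest, no labour)."""
    return monthly_cost_eur(terabytes) * agencies * months


per_month = monthly_cost_eur(REDUNDANT_TB)
total = programme_cost_eur(REDUNDANT_TB)
print(f"{per_month:.2f} EUR per month for one agency")
print(f"{total:.2f} EUR across all agencies over the programme")
```

Under those assumptions the sketch gives about €92 a month for a single agency and roughly €39,700 across the programme, which is consistent with the "past €30,000" figure.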
The duplication is largely structural. Berlin's public sector migrated records through at least three separate content management systems between 2012 and 2021, and each migration round-tripped files without consistent deduplication checks. Photography commissioned for planning consultations in neighbourhoods like Neukölln and Lichtenberg was frequently submitted by multiple contractors simultaneously, with no single intake system flagging overlaps at the point of upload.
The city's response is taking shape inside the CityLAB Berlin on Platz der Luftbrücke in Tempelhof, which since January 2026 has been piloting a perceptual hashing tool — software that compares images by visual fingerprint rather than file name — across a sample dataset of 50,000 urban planning photographs. Early results, presented at a CityLAB open session in April, showed the tool correctly flagging duplicate pairs with a 94.6 percent accuracy rate, with a false-positive rate below two percent.
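For readers curious how a visual fingerprint works, the sketch below implements a difference hash, one common perceptual hashing technique. It is not CityLAB's tool, and the article does not say which algorithm the pilot uses. The function names, the 8-by-8 hash size and the match threshold of 5 bits are all hypothetical choices for illustration, and images are represented as plain grids of grayscale values so the example runs without an imaging library.

```python
def shrink(pixels, width, height):
    """Nearest-neighbour resize of a 2D grayscale grid (a list of rows)."""
    src_h, src_w = len(pixels), len(pixels[0])
    return [
        [pixels[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def dhash(pixels, hash_size=8):
    """Difference hash: one bit per left/right neighbour comparison on a thumbnail."""
    small = shrink(pixels, hash_size + 1, hash_size)
    bits = 0
    for row in small:
        for left, right in zip(row, row[1:]):
            bits = (bits << 1) | (1 if left > right else 0)
    return bits


def hamming(a, b):
    """Number of bit positions in which two hashes differ."""
    return bin(a ^ b).count("1")


def looks_duplicate(a, b, threshold=5):
    """Treat two images as duplicates if their hashes differ in few bits.

    The threshold is an assumed value; a real deployment would tune it.
    """
    return hamming(dhash(a), dhash(b)) <= threshold


# Synthetic test images: a left-to-right gradient, a half-resolution copy
# of it, and a mirrored version that is visually different.
gradient = [[x * 3 for x in range(72)] for _ in range(64)]
half_res = [row[::2] for row in gradient[::2]]
mirrored = [row[::-1] for row in gradient]

print(looks_duplicate(gradient, half_res))  # same picture, lower resolution
print(looks_duplicate(gradient, mirrored))  # different picture
```

Because the hash is built from a tiny thumbnail rather than the file's bytes, a copy saved at a different resolution or in a different format can still match, which is the kind of near-identical file that a file-name or checksum comparison would miss.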
The practical stakes extend beyond storage economics. Berlin's housing shortage debate, which has dominated SPD coalition discussions through the first half of 2026, depends partly on accurate photographic records of building conditions across districts. When the same image of a building façade in Marzahn appears tagged under three different addresses, housing inspectors relying on digital tools receive contradictory data. Getting the numbers clean is, in that sense, a prerequisite for getting policy right.
CityLAB's deduplication pilot is expected to expand to the full Senatsverwaltung image archive by October 2026. Agencies that want to connect their databases to the unified Smart City platform will need to complete their own deduplication audits first. The deadline for compliance is set at the end of the first quarter of 2027 — leaving less than nine months for some departments that have not yet begun the process.
Published by The Daily Berlin