
Is it OK to Share Anonymized Data?

26/05/2020

The manager of HM Hospitales recently announced that he has made available to the scientific community 2,157 anonymized medical records of COVID-19 patients treated at the group’s hospitals.


It is a fine initiative that nonetheless raises a couple of questions:

  1. Does personal or confidential data anonymization really provide a privacy guarantee?
  2. Is publishing anonymized databases currently the best way of helping the scientific community to build accurate machine learning models and make headway in research, in this case biomedical research?

An anonymized database is vulnerable to so-called re-identification attacks: attempts to match ostensibly anonymous records against the records of another, related database or data source in order to extract confidential information. For example, two researchers at the University of Texas at Austin managed to de-anonymize the movie ratings of Netflix users in a dataset the company had published for a competition designed to improve its recommendation system. The technique rests on a simple idea: in a dataset with a huge number of fields (one per movie), very few users give the same ratings to the same set of films. Because a user’s ratings are unique, or almost unique, identifying that user takes only a little auxiliary information obtained from another source.

The researchers’ paper explains that a high-dimensional dataset like Netflix’s greatly raises the chances of de-anonymizing a record, while slashing the amount of auxiliary information required to do so. High dimensionality also makes de-anonymization algorithms robust to data perturbation and to incorrect auxiliary information. The researchers showed this by cross-checking the Netflix ratings against the Internet Movie Database (IMDb), where many Netflix subscribers had also entered ratings of the movies they had seen. They were thus able to link IMDb user profiles, which often carry users’ real names, to Netflix ratings that are theoretically private. This turned out to be possible even when the subscriber had posted very few IMDb ratings and those ratings bore only a rough resemblance to the same subscriber’s Netflix ratings.
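To make the idea concrete, here is a minimal sketch of this kind of matching. It is a deliberate simplification and not the researchers' actual algorithm: the function names, the tolerance of one star, and the tiny dataset are all assumptions invented for illustration. Each anonymized record is scored by how many of the attacker's publicly known ratings it roughly agrees with, and the best-scoring record is reported alongside the runner-up score.

```python
from typing import Dict, Tuple

Ratings = Dict[str, int]  # movie title -> star rating


def match_score(aux: Ratings, record: Ratings, tolerance: int = 1) -> float:
    """Fraction of the known (auxiliary) ratings that the record roughly agrees with."""
    if not aux:
        return 0.0
    hits = sum(
        1
        for movie, stars in aux.items()
        if movie in record and abs(record[movie] - stars) <= tolerance
    )
    return hits / len(aux)


def best_match(aux: Ratings, anonymized: Dict[str, Ratings]) -> Tuple[str, float, float]:
    """Return (record id, its score, runner-up score) for the closest record."""
    scored = sorted(
        ((match_score(aux, rec), rid) for rid, rec in anonymized.items()),
        reverse=True,
    )
    best_score, best_id = scored[0]
    runner_up = scored[1][0] if len(scored) > 1 else 0.0
    return best_id, best_score, runner_up


# Toy "anonymized" release (made-up data): pseudonymous ids, real ratings.
anonymized = {
    "user_17": {"Brazil": 5, "Heat": 2, "Alien": 4, "Big": 3},
    "user_42": {"Heat": 5, "Alien": 1, "Fargo": 4},
    "user_99": {"Brazil": 1, "Big": 5, "Fargo": 2},
}

# What the attacker saw on a public profile: few ratings, slightly different.
aux = {"Brazil": 4, "Heat": 2, "Alien": 5}

print(best_match(aux, anonymized))
```

A large gap between the best score and the runner-up is what signals a likely re-identification; when several records score similarly, the attacker cannot single anyone out.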


A well-known case in the medical sphere was the disclosure of the medical records of the Governor of Massachusetts, William Weld. An MIT student, Latanya Sweeney, cross-referenced an anonymized medical database with the voter registration rolls of Cambridge, MA. The voter records contained, among other things, the name, address, zip code, date of birth, and gender of each of Cambridge’s 54,000 voters at that time, who were spread across seven zip codes. Combining this information with the records of the anonymized database, she was able to find the governor’s medical records quite easily: only six people in Cambridge shared his date of birth, three of them were men, and only one of those, the governor himself, lived in his zip code. The article “The 'Re-Identification' of Governor William Weld's Medical Information” describes this case, while also pointing out that the re-identification was possible because the governor was a public figure who had a highly publicized hospitalization (he passed out at a public event and the footage was broadcast on all TV networks). Nonetheless, it is highly likely that the same procedure would also work for any well-known person, or for someone who has shared too much information on the Internet.
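The linkage step in this story can be sketched in a few lines. The example below is only an illustration under stated assumptions: every record, name, and field value is invented, and the choice of zip code, date of birth, and sex as the joining fields simply mirrors the description above. A medical record counts as re-identified when exactly one voter shares its combination of those fields.

```python
from typing import Dict, List, Tuple

QUASI_IDENTIFIERS = ("zip", "dob", "sex")  # fields present in both datasets


def key(row: Dict[str, str]) -> Tuple[str, ...]:
    """Combination of quasi-identifier values used to join the two datasets."""
    return tuple(row[q] for q in QUASI_IDENTIFIERS)


def link(voters: List[Dict[str, str]], medical: List[Dict[str, str]]) -> List[Tuple[str, str]]:
    """Return (name, diagnosis) for medical records that match exactly one voter."""
    names_by_key: Dict[Tuple[str, ...], List[str]] = {}
    for voter in voters:
        names_by_key.setdefault(key(voter), []).append(voter["name"])

    matches = []
    for record in medical:
        candidates = names_by_key.get(key(record), [])
        if len(candidates) == 1:  # unique match: the record is re-identified
            matches.append((candidates[0], record["diagnosis"]))
    return matches


# Made-up public voter roll (names included).
voters = [
    {"name": "Voter A", "zip": "Z1", "dob": "1950-01-01", "sex": "M"},
    {"name": "Voter B", "zip": "Z1", "dob": "1950-01-01", "sex": "F"},
    {"name": "Voter C", "zip": "Z2", "dob": "1950-01-01", "sex": "M"},
    {"name": "Voter D", "zip": "Z2", "dob": "1950-01-01", "sex": "M"},
]

# Made-up "anonymized" medical release (names removed, quasi-identifiers kept).
medical = [
    {"zip": "Z1", "dob": "1950-01-01", "sex": "M", "diagnosis": "condition X"},
    {"zip": "Z2", "dob": "1950-01-01", "sex": "M", "diagnosis": "condition Y"},
]

print(link(voters, medical))
```

In this toy data the first medical record points to a single voter, while the second is shared by two voters and therefore stays ambiguous. That difference is the intuition behind generalizing or suppressing quasi-identifiers before publication.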

 

Does this mean we should forgo the use of anonymized data for scientific research?

Probably not, or not yet. As things stand, re-identification does not appear to be feasible on a massive scale across all the records of any anonymized database. A host of studies now show re-identification under particular circumstances, but this risk still looks like a reasonable price to pay for the huge scientific advances that the exchange of anonymized medical datasets makes possible. It is food for thought, however: if we want to share our datasets in the interests of medical research, we should choose the anonymization technique carefully, and accept that privacy may still not be guaranteed or that a given database is simply not suitable for anonymized publication. It is also more than likely that new techniques will appear in the future that disclose all or part of the information we wanted to hide.


For that very reason, it may be time to consider alternatives to sharing data. Quite apart from anonymization, the idea gains traction if we ask the following question: wouldn’t it be better for hospitals, groups, and other organizations to cooperate in a federated learning network instead of each publishing its own anonymized database? Federated learning is a distributed computation model that preserves privacy and confidentiality: the machine learning models are taken to where the data is, rather than the data being gathered into a single, centralized dataset. A collaboration of this type not only sidesteps the shortcomings of database anonymization and eases the legal constraints on sharing medical records, but also brings far more data into play (not only the 2,157 medical records shared by HM Hospitales), allowing the organizations to obtain more accurate models.
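As a rough illustration of taking the model to the data, the sketch below uses federated averaging, one common federated learning scheme, which we assume here purely for the sake of example. The one-parameter model, the learning rate, and the three "hospital" datasets are all invented. Each site improves the shared model on its own data, and only the resulting model parameter, never a patient record, is sent back to be averaged.

```python
from typing import List, Tuple

Dataset = List[Tuple[float, float]]  # (x, y) pairs that never leave the site


def local_update(w: float, data: Dataset, lr: float = 0.01, epochs: int = 5) -> float:
    """A few gradient-descent steps on the model y = w * x, using one site's data only."""
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(w_global: float, sites: List[Dataset]) -> float:
    """Send the global model to every site, then average the returned models.

    The average is weighted by each site's number of records.
    """
    updates = [(local_update(w_global, data), len(data)) for data in sites]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Three hypothetical hospitals, each holding private data that follows y = 2x.
sites = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.5, 3.0), (2.5, 5.0), (4.0, 8.0)],
]

w = 0.0  # initial global model
for _ in range(50):
    w = federated_round(w, sites)

print(round(w, 4))
```

The coordinating server only ever sees model parameters and record counts. In practice those parameters can still leak information, which is one reason federated learning is often combined with additional protections.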

Cases like these, together with GMV’s own experience with its clients, are why the company has always considered data privacy a crucial factor. So much so that GMV is now taking part in the project MPC-Learning: Secure and Protected Machine Learning by Secret Sharing. Co-funded by the R&D department of GMV’s Secure e-Solutions sector and the Spanish Ministry of Economic Affairs and Digital Transformation, the project focuses on developing mathematical and computational techniques that can perform numerical calculations without the need to share data.
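To give a flavor of how a calculation can be carried out without sharing data, here is a minimal sketch of additive secret sharing, a textbook building block of secure multi-party computation. It is not a description of the MPC-Learning implementation: the modulus, the function names, and the hospital counts are assumptions made for illustration. Each party splits its private number into random-looking shares, and only sums of shares are ever revealed.

```python
import secrets
from typing import List

PRIME = 2**61 - 1  # assumed modulus; any prime larger than the true total would do


def share(value: int, n_parties: int) -> List[int]:
    """Split value into n random shares that add up to it modulo PRIME."""
    shares = [secrets.randbelow(PRIME) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % PRIME)
    return shares


def reconstruct(shares: List[int]) -> int:
    """Recombine shares; any subset smaller than the full set reveals nothing."""
    return sum(shares) % PRIME


def secure_sum(private_values: List[int]) -> int:
    """Compute the total of all private values without exposing any single one."""
    n = len(private_values)
    # Row i holds the shares created by party i; column j is what party j receives.
    distributed = [share(v, n) for v in private_values]
    # Each party adds up the shares it received and publishes only that partial sum.
    partial_sums = [sum(row[j] for row in distributed) % PRIME for j in range(n)]
    return reconstruct(partial_sums)


# Hypothetical per-hospital patient counts that must stay private.
hospital_counts = [120, 75, 310]
total = secure_sum(hospital_counts)
print(total)  # 505
```

Each individual share is a uniformly random number, so no hospital learns another's count, yet the published partial sums still add up to the exact total.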

Click here to find out more about MPC-Learning, GMV’s alternative solution.

Authors: Luis Porras Díaz and Juan Miguel Auñón

 
