Cloud Security

Microsoft AI Research Division Accidentally Exposes Tens of Terabytes of Sensitive Data on GitHub


Sven

September 19th, 2023

~ 2 min read

Microsoft's AI research division inadvertently exposed a vast amount of sensitive data, including private keys and passwords, while publishing a bucket of open-source training data on GitHub. Cloud security startup Wiz discovered the accidental exposure of cloud-hosted data while examining a GitHub repository belonging to Microsoft.

The GitHub repository was intended to provide open-source code and AI models for image recognition, and users were instructed to download the models from an Azure Storage URL. Wiz found, however, that the URL had been misconfigured: instead of granting access to the intended files alone, it granted permissions on the entire storage account, inadvertently exposing additional private data.
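The scope of such a link is visible in its query string: the token appended to the URL (a shared access signature, covered below) spells out its permissions, resource scope, and expiry in plain parameters. Here is a minimal sketch, using a hypothetical URL of the same general shape (the real token is not reproduced here), of checking those parameters for the kind of red flags Wiz describes:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical account-level SAS URL; names and signature are placeholders.
# "sp" = permissions, "srt" = resource types, "se" = expiry time.
url = (
    "https://examplestorage.blob.core.windows.net/models/model.ckpt"
    "?sv=2021-08-06&ss=b&srt=sco&sp=rwdlacup"
    "&se=2051-10-05T00:00:00Z&sig=REDACTED"
)

params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}

perms = params.get("sp", "")
if set("wd") & set(perms):
    print(f"red flag: write/delete permissions granted ({perms!r})")
if "c" in params.get("srt", ""):
    print("red flag: token covers whole containers, not a single blob")
if params.get("se", "") > "2030":
    print(f"red flag: expiry far in the future ({params['se']})")
```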

The exposed data amounted to approximately 38 terabytes of sensitive information, including disk backups of two Microsoft employees' personal computers. The data also contained passwords to Microsoft services, secret keys, and more than 30,000 internal Microsoft Teams messages from hundreds of Microsoft employees.

It is important to note that the storage account itself was not directly exposed. Rather, Microsoft AI developers had included an overly permissive shared access signature (SAS) token in the URL. SAS tokens are an Azure mechanism for creating shareable links that grant access to data in an Azure Storage account.
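For contrast, a least-privilege service SAS is straightforward to mint with the azure-storage-blob Python SDK: read-only, bound to a single blob, and expiring within hours rather than decades. A minimal sketch, where the account name, container, blob, and key are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.storage.blob import BlobSasPermissions, generate_blob_sas

# Placeholders; substitute real account details.
ACCOUNT = "examplestorage"
CONTAINER = "models"
BLOB = "image-recognition.ckpt"

token = generate_blob_sas(
    account_name=ACCOUNT,
    container_name=CONTAINER,
    blob_name=BLOB,
    account_key="<storage-account-key>",
    permission=BlobSasPermissions(read=True),  # read-only: no write/delete/list
    expiry=datetime.now(timezone.utc) + timedelta(hours=24),  # short-lived
)

url = f"https://{ACCOUNT}.blob.core.windows.net/{CONTAINER}/{BLOB}?{token}"
print(url)
```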

Upon discovering the issue, Wiz shared its findings with Microsoft on June 22. Microsoft revoked the SAS token two days later, on June 24, and completed its investigation into the potential organizational impact on August 16.

In response to this incident, Microsoft has expanded GitHub's secret scanning service, which now monitors all public open-source code changes for plaintext exposure of credentials and other secrets, including SAS tokens that may have overly permissive expirations or privileges.
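GitHub's scanner itself is proprietary, but the core idea can be sketched with a couple of regular expressions over committed files: a SAS query string always carries an "sv" service-version date and a long base64-style "sig" signature. A simplified, illustrative detector (not GitHub's actual implementation):

```python
import re
import sys

# Illustrative only; GitHub's secret scanning uses provider-defined patterns.
SAS_VERSION = re.compile(r"\bsv=\d{4}-\d{2}-\d{2}\b")      # "sv" service-version date
SAS_SIGNATURE = re.compile(r"\bsig=[A-Za-z0-9%/+=]{20,}")  # base64-ish signature

def scan(path: str) -> None:
    with open(path, encoding="utf-8", errors="ignore") as fh:
        for lineno, line in enumerate(fh, start=1):
            if SAS_VERSION.search(line) and SAS_SIGNATURE.search(line):
                print(f"{path}:{lineno}: possible SAS token committed in plaintext")

if __name__ == "__main__":
    for path in sys.argv[1:]:
        scan(path)
```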

While Microsoft assures that no customer data was exposed and no other internal services were compromised, the incident is a reminder that additional security checks and safeguards are needed when handling data at this scale, especially in AI, where development teams move vast amounts of training data and must stay vigilant against accidental exposure.

Links:
Wiz Twitter Post
Microsoft Blog Post