Data Discovery: The First Step to AI Readiness
Published Apr 22 2024 09:00 AM
Microsoft

Artificial intelligence (AI) is transforming the way businesses operate, innovate, and compete. AI can help enterprises improve efficiency, accuracy, and customer satisfaction, as well as create new products and services. However, AI also comes with challenges and risks, especially when it comes to data. Data is the fuel for AI, and without proper data management, enterprises may face legal, ethical, and operational issues that can undermine their AI initiatives. 

 

Before end users start using AI, enterprises need to ensure that their data is ready for AI. This means that data is accurate, complete, relevant, and secure. It also means that data is compliant with the applicable laws and regulations, as well as the ethical and social norms of the stakeholders. To achieve this, enterprises need to follow a series of steps that can help them assess, improve, and monitor their data quality and governance. In this blog series, we will explore these steps and provide recommendations for enterprises to prepare their data for AI. 

 

The first step is data discovery. Data discovery is the process of identifying, locating, and understanding the data sources and assets that are available for AI. Data discovery helps enterprises answer questions such as: What data do we have? Where is it stored? How is it structured? What does it contain? Who has access to it? How is it used? Data discovery is essential for enterprises to gain a comprehensive and accurate view of their data landscape and to identify the potential opportunities and risks for AI. 

 

Key Areas of Consideration for Data Discovery 

Data discovery is not a one-time activity, but a continuous and iterative process that requires collaboration and coordination across different teams and roles. Data discovery involves both technical and business aspects, and it should align with the enterprise's strategic goals and priorities for AI. To conduct effective data discovery, enterprises should consider the following key areas: 

 

  • Permissions: Enterprises should have a clear and consistent policy for data access and sharing, both internally and externally. Data permissions should be based on the principle of least privilege, meaning that authorized users and applications should have access only to the data they need for their specific purposes. Data permissions should also be documented and audited regularly to ensure compliance and accountability.
  • Sensitive data: Enterprises should identify and classify the data that is sensitive or confidential, such as personal data, financial data, health data, trade secrets, or intellectual property. Sensitive data should be protected with appropriate security measures, such as encryption, masking, or anonymization. Sensitive data should also be subject to stricter data governance rules, such as retention, deletion, or consent policies. 
  • Personally identifiable data: Enterprises should identify and classify the data that can be used to identify or link to an individual, such as name, email, phone number, address, or social security number. Personally identifiable data is a subset of sensitive data, and it may have specific legal and ethical implications, depending on the jurisdiction and the context. Enterprises should comply with the relevant data protection laws and regulations, such as the General Data Protection Regulation (GDPR) or the California Consumer Privacy Act (CCPA), and respect the rights and preferences of the data subjects.
  • Data overexposure: Enterprises should assess and monitor the level of exposure or visibility of their data, both internally and externally. Data overexposure occurs when data is accessible or visible to more users or applications than necessary, or when data is exposed to unauthorized or malicious parties. Data overexposure can lead to data breaches, data leaks, or data misuse, which can damage the enterprise's reputation, trust, and competitiveness. Enterprises should implement data security controls, such as firewalls, access control lists, or encryption keys, to prevent or limit data overexposure.
  • Data oversharing: Enterprises should evaluate and control the amount and frequency of data sharing, both internally and externally. Data oversharing occurs when data is shared or transferred more than necessary, or when data is shared or transferred without proper justification or consent. Data oversharing can result in data duplication, data inconsistency, or data loss, which can affect the data quality and integrity. Enterprises should establish data sharing agreements, protocols, or standards, to ensure that data sharing is done in a secure, compliant, and efficient manner. 
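
To make the sensitive data and personally identifiable data considerations above more concrete, here is a minimal Python sketch of pattern-based detection, one common building block of data discovery. The pattern names and regular expressions are our own illustrative assumptions, not the classifiers used by any Microsoft product. Production-grade detection relies on much more than simple patterns, for example checksums, surrounding keywords, and confidence scoring.

```python
import re

# Hypothetical, deliberately simple patterns for illustration only.
# They will produce false positives and false negatives on real data.
PII_PATTERNS = {
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "us_phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
}


def find_pii(text: str) -> dict[str, int]:
    """Return the number of matches per PII type found in the text.

    Types with no matches are left out of the result.
    """
    counts: dict[str, int] = {}
    for name, pattern in PII_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            counts[name] = len(hits)
    return counts


sample = "Contact jane.doe@example.com or 555-123-4567. SSN on file: 123-45-6789."
print(find_pii(sample))
```

A document that returns a non-empty result would be a candidate for a stricter classification and a closer look at who can access it. A document that returns an empty result is not necessarily free of personal data; it only means that none of these particular patterns matched.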

 

Microsoft Solutions for Data Discovery

Data discovery can be a challenging and time-consuming process, especially for large and complex enterprises with multiple data sources and platforms. Microsoft offers several solutions that can help enterprises discover, catalog, classify, and protect their data assets, such as: 

 

- Microsoft Information Protection: Microsoft Information Protection is a suite of solutions that helps enterprises protect their sensitive data across devices, apps, cloud services, and on-premises systems. Microsoft Information Protection allows enterprises to discover, classify, label, and encrypt their data, based on predefined or custom policies. Microsoft Information Protection also helps enterprises monitor and audit their data activities, and detect and respond to data breaches or leaks. Microsoft Information Protection integrates with Microsoft 365, Azure, and Windows, as well as third-party applications and platforms. Of the solutions listed here, this set of capabilities will have the largest impact on end users and the enterprise when security protections are deployed to data. To learn more about Microsoft Information Protection: Implement Information Protection in Microsoft 365 - Training

 

- Microsoft Purview (previously Azure Purview): Microsoft Purview is a unified data governance service that helps enterprises manage and govern their on-premises, multicloud, and software-as-a-service (SaaS) data. Microsoft Purview provides a holistic and up-to-date view of the enterprise's data landscape, enabling data discovery, data lineage, data cataloging, data classification, and data sensitivity analysis. Enterprises commonly use Microsoft Purview to help comply with data privacy and security regulations, such as GDPR and CCPA, by identifying and labeling sensitive data and applying fine-grained access policies. This solution is also available to the Defense Industrial Base within Azure Government and can help with the protection of Controlled Unclassified Information (CUI) or International Traffic in Arms Regulations (ITAR) data. To learn more, visit: Unified Data Governance with Microsoft Purview | Microsoft Azure

 

- Microsoft Purview Data Catalog: Data Catalog is a fully managed cloud service that helps enterprises discover, understand, and consume their data sources. Data Catalog allows data producers to register and annotate their data sources, and data consumers to search and browse the data catalog using natural language queries. Data Catalog also enables data collaboration and sharing by allowing users to rate, review, and tag data sources, and to request access to data sets. To learn more, visit: Data Catalog - Learn about Business Glossaries

 

Using Microsoft Information Protection for Data Discovery 

One of the benefits of Microsoft Information Protection is that it can help enterprises run data discovery across their heterogeneous and distributed data environments. Data discovery is the process of finding, cataloging, and classifying data sources, and understanding their content, structure, quality, and sensitivity. Data discovery can help enterprises prepare their data for AI, as well as comply with data regulations and policies. 

 

Microsoft Information Protection allows enterprises to use predefined or custom labels to classify and protect their data sources based on their sensitivity and business value. For example, an enterprise can use labels such as "Public", "Internal", "Confidential", or "Highly Confidential" to mark their data sources according to their access requirements and risk levels. Labels can also include sub-labels for specific data types, such as "Personal Data", "Financial Data", "Health Data", "CUI", or "Legal Data". Labels can be applied manually by users, automatically by policies, or suggested by machine learning algorithms.
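
As a rough illustration of how such a label scheme behaves, the following Python sketch models the four example labels as an ordered scale and derives a recommended label from the data types detected in a document. The mapping from data type to minimum label, the default of "Internal", and the rule that a recommendation never downgrades a manually applied label are all assumptions we made for this example. They are not a description of how Microsoft Information Protection is implemented or configured.

```python
from enum import IntEnum
from typing import Iterable, Optional


class Sensitivity(IntEnum):
    """The example labels from the text, ordered from least to most sensitive."""
    PUBLIC = 0
    INTERNAL = 1
    CONFIDENTIAL = 2
    HIGHLY_CONFIDENTIAL = 3


# Hypothetical policy: the minimum label each detected data type calls for.
MINIMUM_LABEL = {
    "Personal Data": Sensitivity.CONFIDENTIAL,
    "Financial Data": Sensitivity.CONFIDENTIAL,
    "Legal Data": Sensitivity.CONFIDENTIAL,
    "Health Data": Sensitivity.HIGHLY_CONFIDENTIAL,
    "CUI": Sensitivity.HIGHLY_CONFIDENTIAL,
}


def recommend_label(
    detected_types: Iterable[str],
    default: Sensitivity = Sensitivity.INTERNAL,
) -> Sensitivity:
    """Pick the strictest label required by any detected data type.

    Unknown data types are ignored; with no known types, the default is used.
    """
    levels = [MINIMUM_LABEL[t] for t in detected_types if t in MINIMUM_LABEL]
    return max(levels, default=default)


def effective_label(
    manual: Optional[Sensitivity], recommended: Sensitivity
) -> Sensitivity:
    """Combine a user's manual label with a recommendation without downgrading."""
    if manual is None:
        return recommended
    return max(manual, recommended)


print(recommend_label(["Personal Data", "Health Data"]).name)
print(effective_label(Sensitivity.CONFIDENTIAL, Sensitivity.INTERNAL).name)
```

Treating labels as an ordered scale makes "take the strictest applicable label" a one-line comparison, which is why the sketch uses an integer-backed enumeration rather than plain strings.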

 

By applying labels to data sources, enterprises can use Microsoft Information Protection to discover and inventory their data assets and liabilities, and to understand their data distribution and exposure. For example, an enterprise can use Microsoft Information Protection to answer questions such as: 

 

  • What types of data sources do we have, and where are they located? 
  • How much of our data is sensitive, and what kind of sensitivity does it have? 
  • Who has access to our data, and what level of protection does it have? 
  • How often is our data accessed, modified, or shared? 
  • Are there any anomalies or risks in our data activities or behaviors? 
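
Once data sources carry labels, several of the questions above reduce to simple aggregations over an inventory. The Python sketch below shows the idea using a small, made-up inventory. The `Asset` structure, the sample entries, the set of labels treated as sensitive, and the group names treated as "broad" are all hypothetical and exist only for this example; a real inventory would come from the scanning and reporting capabilities of the tooling in use.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Asset:
    """One labeled data source in a hypothetical inventory."""
    name: str
    location: str
    label: str
    principals: set[str] = field(default_factory=set)  # who has access


# Assumptions for this example only.
SENSITIVE_LABELS = {"Confidential", "Highly Confidential"}
BROAD_GROUPS = {"Everyone", "All Employees"}

inventory = [
    Asset("payroll.xlsx", "SharePoint", "Highly Confidential", {"hr-team", "Everyone"}),
    Asset("handbook.pdf", "SharePoint", "Internal", {"All Employees"}),
    Asset("pricing.docx", "OneDrive", "Confidential", {"sales-team"}),
    Asset("press-kit.zip", "File share", "Public", {"Everyone"}),
]


def assets_by_location(assets: list[Asset]) -> Counter:
    """Where is our data located? Count assets per location."""
    return Counter(a.location for a in assets)


def sensitive_share(assets: list[Asset]) -> float:
    """How much of our data is sensitive? Fraction of assets with a sensitive label."""
    if not assets:
        return 0.0
    return sum(a.label in SENSITIVE_LABELS for a in assets) / len(assets)


def overexposed(assets: list[Asset]) -> list[str]:
    """Which sensitive assets are accessible to a broad group?"""
    return [
        a.name
        for a in assets
        if a.label in SENSITIVE_LABELS and a.principals & BROAD_GROUPS
    ]


print(assets_by_location(inventory))
print(sensitive_share(inventory))
print(overexposed(inventory))
```

In this made-up inventory, half of the assets carry a sensitive label, and only `payroll.xlsx` is flagged, because it is both sensitive and shared with a broad group. The public press kit is shared just as widely but is not flagged, which reflects the point that exposure is only a problem relative to the sensitivity of the data.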

 

Microsoft Information Protection is part of Microsoft Purview, which can provide a comprehensive and unified data governance solution. Microsoft Purview can leverage the labels from Microsoft Information Protection to enrich its data catalog and lineage, and to enable granular and dynamic data access policies. Microsoft Purview can also use its own scanning and classification capabilities to complement and validate the labels from Microsoft Information Protection, and to discover additional data attributes and insights. Together, Microsoft Information Protection and Microsoft Purview can help enterprises achieve data discovery at scale and optimize their data quality and security for AI readiness.

 

Defense Industrial Base customers that handle sensitive data such as CUI, ITAR, or Export Control (EC) data should ensure they are using the correct version of information protection for that data set. For more information about our different clouds: Understanding Compliance Between Commercial, Government and DoD Offerings - September 2023 Update - ...

 

Conclusion 

Data discovery is the first step to AI readiness, and it can help enterprises understand their data assets and liabilities, as well as their data opportunities and risks. Data discovery can enable enterprises to select the most suitable and valuable data sources for AI and to ensure that their data is compliant, ethical, and secure. Data discovery can also help enterprises optimize their data management and governance practices and foster a data-driven and AI-enabled culture. In the next blog, we will discuss the second step to AI readiness: Data Access.

 

 

Resources: 

Data, Privacy, and Security for Microsoft Copilot for Microsoft 365 | Microsoft Learn 

Copilot for Microsoft 365 – Microsoft Adoption 

Microsoft Purview data security and compliance protections for Microsoft Copilot | Microsoft Learn 

Embrace responsible AI principles and practices - Training | Microsoft Learn 

Responsible AI Principles and Approach | Microsoft AI 

Microsoft Responsible AI Standard v2 General Requirements 

Guidelines for Human-AI Interaction - Microsoft HAX Toolkit 

Understanding Compliance Between Commercial, Government and DoD Offerings - September 2023 Update - ... 

Last update: Apr 22 2024 12:31 PM