Setting Up Microsoft Purview: A Hands-On Journey
To deepen my understanding of data governance in Azure, I set up my own Microsoft Purview environment.
My aim was to gain practical, hands-on experience managing a modern data catalogue and to explore the full range of capabilities, including implementation, configuration, monitoring, scanning, management, and cost control.
Learning Goals:
- Learn how to configure Microsoft Purview effectively.
- Understand Role-Based Access Control (RBAC) and administrative permissions.
- Experiment with different scanning options and learn what data source types can be scanned.
- Explore how Purview handles different data structures — structured, semi-structured, and unstructured data.
- Scan both real and synthetic data.
- Automate metadata management.
- Control costs throughout the experimentation process.
By working directly with Purview, I hope to gain a deep understanding of how to scan and classify a wide range of data sources, manage access securely, and apply governance policies efficiently — all while keeping costs in check.
🔧 Step-By-Step: What I Did
🔹 1️⃣ Preparation
- Confirmed Azure subscription and Owner permissions.
- Note: In larger or real-world environments, resources are often spread across multiple resource groups and sometimes even across Azure tenants.
  - For example, Storage Accounts and Purview Accounts might belong to different resource groups or even different teams.
  - This adds complexity when assigning permissions and managing access across services.
- For simplicity in my test setup, I created just one resource group: Purview-Test-RG.
  - This kept all my Purview-related resources (Purview Account, Storage Account, etc.) in one place for easier testing and role management.
🔍 Why this matters
✅ Simplifies RBAC: I avoided cross-resource-group or cross-tenant RBAC complexity during initial testing.
✅ Easier cost tracking: One resource group → one place to track costs.
✅ Quicker troubleshooting: No need to hop between resource groups when checking roles or diagnostic settings.
My Best Practice & Advice — Resource Interaction and Subscription Planning
When I’m setting up Azure resources — like creating a new Purview account — I always follow the best practice of keeping new resources within the same subscription as the existing, related ones. This makes life so much easier.
Why?
When resources sit in the same subscription, managing permissions, networking, billing, and monitoring becomes straightforward. I don’t need to worry about complex cross-subscription setups, which often lead to headaches.
By keeping everything together:
- I can assign permissions quickly using RBAC without dealing with cross-subscription role assignments.
- Networking is easier — I can connect services through VNets or Private Endpoints without complex peering.
- Purview can access data sources directly without complicated trust relationships.
- Billing stays clear — costs are grouped, making reporting and budgeting simple.
- Policies and compliance settings remain consistent across resources.
🚨 Problems I’ve Seen When This Best Practice Isn’t Followed
If resources are placed in different subscriptions:
- Time-consuming role assignments — I’ve had to manually set cross-subscription permissions, which delayed projects.
- Networking complications — setting up VNet peering or Private Link across subscriptions can be complex and costly.
- Data access failures — Purview sometimes can’t scan or classify data properly due to access restrictions.
- Confusing billing — costs get spread across subscriptions, making it hard to track spending.
- Policy mismatches — security or compliance policies might block certain actions between subscriptions.
✅ My Personal Checklist Before Creating Azure Resources
When I choose a subscription for new resources, I always check:
- Have I identified the subscription hosting the related resources?
- Do I have the right RBAC permissions in that subscription?
- Are existing VNets or Private Endpoints available there?
- Are any policies or compliance rules applied that I should know about?
- Will adding resources to this subscription align with my billing and budget plans?
- Is there a real reason to use a different subscription — or am I overcomplicating things?
My Best Practice Advice — Extra Steps to Improve Your Purview Configuration
By Sasha
When I set up Microsoft Purview, I always take a few extra steps at the start to make sure the environment is secure, easy to manage, and ready to grow without causing headaches later.
Here’s what I recommend:
1️⃣ Networking — Always Plan for Private Endpoints (even if you don’t need them yet)
Even if I deploy Purview with public access at first, I plan ahead in case we need to secure it later with Private Endpoints (Azure Private Link).
I always check:
- That the Resource Group’s region and VNets will support Private Endpoints later.
- Which VNet or subnet I’d use, and I apply a tag to note this — it saves time and avoids surprises down the line.
2️⃣ Tags — Set Them Up from Day One
I apply clear, consistent tags to every resource:
- Environment (Dev / Test / Prod)
- Owner (my name or team)
- CostCenter or BillingCode
- Compliance (if it’s required)
These tags make it much easier to handle billing, monitoring, access reviews, and reporting as the environment scales.
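This tag baseline is easy to enforce with a small pre-flight check before creating a resource. A minimal sketch, where the tag names and sample values are my own convention rather than an Azure requirement:

```python
# Pre-flight check for the tag baseline described above.
# Tag names and sample values are illustrative, not an Azure requirement.
REQUIRED_TAGS = ("Environment", "Owner", "CostCenter")  # add "Compliance" where needed

def missing_tags(tags: dict) -> set:
    """Return required tag keys that are absent or empty."""
    return {key for key in REQUIRED_TAGS if not tags.get(key)}

proposed = {"Environment": "Test", "Owner": "Sasha", "CostCenter": "DG-001"}
assert not missing_tags(proposed)  # safe to create the resource
```

Running this kind of check in a deployment script catches untagged resources before they ever reach the subscription.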
3️⃣ Access Control — Assign Roles Carefully
Straight away, I assign myself or my team the right roles:
- Purview Data Curator
- Purview Reader
If other users will be onboarded soon, I plan out an RBAC model early (e.g., Admins, Curators, Readers). This avoids confusion and last-minute access requests.
4️⃣ Diagnostic Settings — Prepare for Monitoring and Auditing
I always decide early where to send diagnostic logs:
- Log Analytics (for easy querying)
- A Storage Account (for archiving)
- Or Event Hub (if integrating with other monitoring tools)
Setting this up from the start makes auditing and troubleshooting much simpler later on.
5️⃣ Naming Standards — Stay Consistent
I use clear, predictable names not only for the Purview Account but also for:
- Collections
- Scans
- Glossary terms
This prevents confusion and keeps the data estate organised as it grows.
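One way to keep names consistent is to generate them from a single helper instead of typing them by hand. A sketch, where the `<kind>-<env>-<seq>` pattern is my own choice, not a Purview rule:

```python
# Generate predictable names for accounts, collections, and scans.
# The "<kind>-<env>-<seq>" pattern is my own convention, not a Purview requirement.
def make_name(kind: str, env: str, seq: int) -> str:
    return f"{kind}-{env}-{seq:02d}".lower()

print(make_name("scan", "Test", 1))   # scan-test-01
print(make_name("coll", "Prod", 12))  # coll-prod-12
```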
By following these steps early, I’ve saved myself (and my teams) from lots of potential problems — especially with access issues, networking challenges, billing surprises, and compliance audits later on.
🔧 Step-By-Step: What I Did (continued)
🔹 2️⃣ Purview Account Setup
Microsoft Purview Account Name: PurviewTest01
Region: UK South
Key actions:
- Created the Purview account through the Azure Portal, selecting Locally Redundant Storage (LRS) to keep costs low.
- Assigned myself multiple roles to ensure I had both management-plane and data-plane access:
  - Owner role at the Azure subscription level (management plane) → This allowed me to create and configure the Purview account itself and assign permissions.
  - Inside Purview Studio, I assigned myself:
    - Collection Admin → Full permissions to manage collections and delegate permissions to others.
    - Data Source Admin → Authority to register and configure data sources for scanning.
    - Data Curator → Ability to curate metadata, classifications, glossary terms, and business metadata.
- Verified access in both Azure IAM and Purview Studio:
  - Checked that Azure IAM permissions were correctly applied at the subscription and resource group levels.
  - Confirmed that Purview roles were assigned properly within the Purview Governance Portal (Purview Studio).
Why this was important:
✅ Without the correct mix of Azure RBAC and Purview roles, scans can fail or metadata tasks can become impossible.
✅ This also allowed me to test role-based access control (RBAC) behaviour between Azure and Purview — a common source of confusion for new users.
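The split between tasks and the Purview Studio roles that unlock them can be made explicit in a tiny lookup. A sketch, with the role-to-task mapping taken from the assignments above (the task names are mine):

```python
# Which Purview Studio role unlocks which task, per the assignments above.
# Task keys are my own labels; the role names are the real Purview Studio roles.
NEEDED_ROLE = {
    "manage_collections": "Collection Admin",
    "register_data_source": "Data Source Admin",
    "curate_metadata": "Data Curator",
}

def can_perform(assigned: set, task: str) -> bool:
    """True if the principal holds the Purview role the task needs."""
    return NEEDED_ROLE[task] in assigned

my_roles = {"Collection Admin", "Data Source Admin", "Data Curator"}
assert can_perform(my_roles, "register_data_source")
```

A check like this is handy in onboarding scripts: it surfaces a missing role before a scan fails with a cryptic permission error.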
🔹 3️⃣ Storage for Data Scanning
Storage Account Name: purviewteststorage01blob
Actions:
- Created the Storage Account in the same Azure region (UK South) to avoid latency or billing issues.
- Created a Blob container named jsonfiles to store data to be scanned.
- Chose Locally Redundant Storage (LRS) and Standard performance for minimal cost.
- Set container access level:
  - Allowed public access from all networks during the testing phase.
  - This avoided the complexity of configuring Private Endpoints or Virtual Network rules at the early stage.
Why this was important:
✅ Testing with public access simplified the initial connection between Purview and the Storage Account, especially when using the system-assigned managed identity for scanning.
✅ Placing all scan-related data into a single container (jsonfiles) made it easy to monitor what was being scanned and to manage future updates or additional test files.
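Uploading test files into the container can also be scripted rather than done through the portal. A sketch, assuming the azure-storage-blob package and a valid Storage Account connection string (the helper names are mine):

```python
from pathlib import Path

def blob_name_for(local_path: str) -> str:
    # Keep blob names flat so every file sits directly in the jsonfiles container.
    return Path(local_path).name

def upload_test_file(conn_str: str, local_path: str = "catalogue.json") -> None:
    # Assumes the azure-storage-blob package is installed and conn_str is valid.
    from azure.storage.blob import BlobServiceClient
    service = BlobServiceClient.from_connection_string(conn_str)
    container = service.get_container_client("jsonfiles")
    with open(local_path, "rb") as f:
        container.upload_blob(name=blob_name_for(local_path), data=f, overwrite=True)
```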
🔹 4️⃣ Uploading and Scanning Real Data
Data source:
- The National Gas Transmission operational data catalogue JSON, downloaded from data.nationalgas.com.
File uploaded: catalogue.json
File contents:
- Schema (technical metadata):
  - Described the structure of multiple data sets, including fields like name, frequency, unit of measure, time frame, and publication ID.
- Business metadata:
  - Included owner information, descriptions, security classifications, and other contextual details about the data and its use.
Scan configuration in Purview:
- Level 3 scan selected:
  - Extracted schema (field names and data types).
  - Pulled in sample data (when available).
  - Applied automatic classifications to detected data types (for example, security classification or PII flags).
- System Default scan rules:
  - Ensured compatibility with JSON format and enabled Purview to automatically detect and classify common metadata patterns.
Results:
- 3 assets discovered:
  - The JSON file itself.
  - The schema extracted from within the file.
  - Possibly a container-level or folder-level asset.
- 1 asset classified:
  - One of the detected fields was automatically tagged with a classification based on Purview’s built-in rule set.
- Scan duration: ~5 minutes.
Why this was important:
✅ Testing with real-world data (not synthetic data) highlighted both the strengths and the current limitations of Purview’s scanning and metadata extraction.
✅ Helped confirm that technical schema extraction works well, but business metadata requires post-scan enrichment — leading to my next step: automating metadata flattening and upload.
📚 What I Learned
🔎 RBAC is layered and critical
Understanding that Purview permissions require a combination of Azure RBAC roles (for resource management and data source connection) and Purview-specific roles (for cataloguing, curation, and scanning) was essential. Without both, certain actions simply won’t work — and this complexity increases when data and Purview accounts are spread across different resource groups or even tenants.
🔎 Resource Group simplicity helped — but real-world setups are harder
For my tests, keeping everything inside one Resource Group (Purview-Test-RG) simplified permissions, access, and billing.
👉 In real-world Azure environments, it’s common for Storage and Purview to be in separate resource groups (or even separate subscriptions/tenants), which adds complexity.
👉 Testing in a single Resource Group removed a lot of potential headaches while I learned the basics.
🔎 Costs are manageable for small tests but require planning
Scanning small files like my 10.2 MB JSON catalogue costs only pennies per scan. However, I learned that:
- Purview cannot be paused → The Data Map incurs ongoing charges.
- Recurring scans or scaling up to thousands of files can increase costs quickly.
- Creating cost alerts and tagging resources is a must.
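Because the Data Map bills continuously, a quick back-of-envelope estimate is worth doing before scaling up. A sketch, where the hourly rate is a deliberate placeholder to be replaced with current Azure pricing for your region:

```python
def data_map_monthly_cost(capacity_units: int, hourly_rate: float, hours: float = 730) -> float:
    # The Data Map bills per capacity unit per hour, even when no scans run.
    # hourly_rate is a placeholder -- look up current Azure pricing for your region.
    return capacity_units * hourly_rate * hours

# Example: 1 capacity unit at a hypothetical £0.30/hour over a 730-hour month.
print(round(data_map_monthly_cost(1, 0.30), 2))
```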
🔎 Purview scanning extracts schema well but not business metadata
My Level 3 scan successfully detected the JSON schema, sample data, and even applied a classification.
But — as expected — it did not extract or apply the rich business metadata contained in the JSON’s meta fields or nested catalogue entries.
👉 This confirmed that post-scan metadata enrichment is a key step in most real-world Purview projects.
Next Steps
✅ Finalise and enhance the Python script
I have built a working Python script to:
- Flatten the complex nested JSON structure.
- Create a CSV linking each data item and its business metadata.
- Prepare the metadata for Purview Bulk Upload or REST API ingestion.
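The flattening step works roughly as in this minimal sketch. The sample record and field names are made up for illustration; the real National Gas catalogue is far more deeply nested:

```python
import csv
import io

def flatten(obj, prefix=""):
    """Flatten nested dicts/lists into one level of dotted keys."""
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix[:-1]] = obj
    return flat

# Illustrative record only -- not the real National Gas catalogue schema.
record = {"name": "Gas Flow", "meta": {"owner": "Ops", "classification": "Official"}}
row = flatten(record)
# row == {"name": "Gas Flow", "meta.owner": "Ops", "meta.classification": "Official"}

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=sorted(row))
writer.writeheader()
writer.writerow(row)
print(buffer.getvalue())
```

Each flattened row then becomes one line of the CSV that pairs a data item with its business metadata.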
Next, I will:
- Refine the script for scalability and flexibility.
- Test it with additional JSON metadata catalogues.
- Explore automating the CSV upload directly into Purview using the Atlas REST API or PowerShell.
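For the REST API route, the upload would look roughly like this. The endpoint path is the standard Purview Atlas v2 entity route; token acquisition is omitted and the helper names are mine:

```python
def atlas_entity_payload(type_name: str, qualified_name: str, business_attrs: dict) -> dict:
    """Build a minimal Atlas v2 entity body from one flattened CSV row."""
    return {
        "entity": {
            "typeName": type_name,
            "attributes": {
                "qualifiedName": qualified_name,
                "name": qualified_name.rsplit("/", 1)[-1],
                **business_attrs,
            },
        }
    }

def push_entity(account_name: str, token: str, payload: dict):
    # Assumes the requests package and a bearer token with Data Curator rights.
    import requests
    url = f"https://{account_name}.purview.azure.com/catalog/api/atlas/v2/entity"
    return requests.post(url, json=payload, headers={"Authorization": f"Bearer {token}"})
```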
✅ Practice glossary tagging and classification rules
I plan to align some of the business metadata fields with a Purview glossary, to see how glossary terms can be auto-assigned or manually enriched post-scan.
✅ Expand testing to structured data
I’ll set up a basic Azure SQL Database and test Purview’s structured data scanning and metadata extraction.
✅ Establish cost alerts and monitoring
To ensure future tests remain cost-effective and to avoid unexpected charges as I scale up scans and metadata enrichment.
✅ Summary and Conclusions
This hands-on project provided a complete learning journey — from setting up Microsoft Purview to scanning real-world data and overcoming common metadata management challenges.
Key takeaways:
- RBAC and permissions are critical and must be configured both at the Azure resource level and within Purview Studio.
- Resource group planning matters. While I kept all resources in one group for simplicity, real-world scenarios often require managing permissions across multiple groups or even tenants, which adds complexity.
- Purview’s scanning effectively extracts technical metadata (schemas and classifications) but does not automatically bring in business metadata.
- Costs are predictable and manageable at small scale but need monitoring as data volume and scan frequency increase.
- Business metadata enrichment requires post-scan automation. I built a Python script to flatten the National Gas Transmission JSON metadata into a CSV ready for bulk upload or API ingestion.
- Diagnostics and troubleshooting tools in Purview are essential for resolving permission and scanning issues quickly.
- Automation bridges the gap between technical metadata scanning and full data governance — turning a scanned data asset into a rich, searchable, and business-friendly resource.
Overall lesson:
Governance in Microsoft Purview isn’t just about technology — it’s about smart planning, clear role control, and building the right automation to connect business and technical metadata at scale.