
Accelerate Your Records Management, Artificial Intelligence, and Microsoft Purview Initiatives With Unstructured Data Cleanup

Unstructured Data Is Everywhere in Your Organization — And Just Waiting To Cause Trouble

Unstructured Data Cleanup is an information governance practice focused on the permanent deletion of information that is no longer of use to the business. Using a scanning and classification tool, organizations can better understand the information they manage and determine whether it can be defensibly disposed of, or whether there is a regulatory, legal, or business reason why it must be retained.

In this four-part series, I will share several approaches to getting started with unstructured data cleanup as part of your information governance program.

Part 2: We will consider the capabilities of leading vendor solutions that are routinely implemented for the scanning and classification of unstructured data.

Part 3: We will explore best practices for planning and implementation, and how they can accelerate your Records Management, Generative AI, and Microsoft Purview Initiatives. 

Part 4: We will share a communication and change management strategy that can encourage organization-wide approval and participation. 

The Long Road to Now

For more than two decades, I have been helping Fortune 500 clients find solutions for managing their rapidly expanding volumes of unstructured data. This includes electronic documents, scanned paper, digital media, and any other content that appears as a file on a storage volume, in a shared network drive, or in a document management application.

Until about 15 years ago, the collective IT mindset was, “Storage is cheap. We will just buy more disks.”

From that point forward, but before co-authoring and sharing links to files were common practices, file duplication was rampant and unchecked.

I can back this up with actual statistics from client engagements: 

The average amount of duplication of unstructured data in an organization of any size is consistently 35 to 40%.

Most of our clients at Ankura measure their volumes of unstructured data in petabytes. 

Many are burdened with dozens of petabytes of information, while larger or older organizations have amassed hundreds of petabytes — and the volume continues to increase.  

Not all of it is stuck in legacy network shared drives. I am betting you have unstructured data trapped in one or more of these applications, somewhere in your organization:

  • Late 1990s-era document management systems (e.g., FileNet, OpenText, Stellent)
  • 2000s-era on-premises collaboration applications (e.g., Jive, Groove, Lotus Notes)
  • 2010s-era cloud-based file-sharing applications (e.g., Dropbox)
  • That HTML intranet that you are still maintaining for some reason

All That Stuff Is Probably Still There

That old unstructured data is hanging out, waiting for the right moment to cause great harm to your organization. For decades no one had a good business reason or the authority to delete it, until now.

Unstructured Data Cleanup is an information governance practice with a single objective:

 Eliminate all information that has no value to the business while preserving information that is essential for business operations and for the requirements of legal and regulatory compliance.

With only these vague requirements as a guide, a data cleanup effort can be fraught with risk and delays—mostly because of the difficulty in discerning between “no value” and “essential” information—all mixed together in millions of folders, sprinkled across hundreds of petabytes of storage with origins that predate the current millennium. 

Meanwhile, the demands for data privacy and security from both regulators and customers are evolving and intensifying—and rightfully so.

Established regulations such as the General Data Protection Regulation (GDPR), the Health Insurance Portability and Accountability Act (HIPAA), the Sarbanes-Oxley Act, and the rules issued by the New York State Department of Financial Services (NYDFS) have created a maze that even the most forward-thinking companies struggle to navigate. At best, it is like walking a tightrope without a balance pole. And the rope is rapidly growing longer.

How Does Unstructured Data Cleanup Work?

The process of Unstructured Data Cleanup is akin to renovating an old, musty, dilapidated house: 

  1. Inspect the property and create a plan
  2. Review building codes, get permits and approvals
  3. Safely dispose of hazardous junk and abandoned personal belongings
  4. Fortify the foundation and structure 
  5. Update the floorplan so it is easier to get around

Unstructured data cleanup is the process of identifying (inspecting) your current and legacy repositories, gaining permission from your Legal or Compliance team, and then proceeding with the search for, and defensible destruction of, risky or harmful unstructured data.

With Less, You Will Have More

You will more quickly be able to find the information you need and be more confident that it is the best version. Depending on your approach to implementing AI in your organization, unstructured data cleanup may reduce the effort required to select samples for your training sets. From a records and information management perspective, Microsoft Purview is growing in popularity for automated classification and disposition of unstructured data — and information natively stored in Microsoft 365 is the best use case.

The cleanup workflow has four stages:

  1. A data discovery and classification application is used to scan the contents of an unstructured data repository (a storage volume, shared drive, etc.) 
  2. The application’s classification engine is used to discover and tag unstructured data that matches your unstructured data cleanup objectives, such as:
    • “Display files last accessed more than 10 years ago”
    • “Display files that have no purpose (temp files, log files)”
    • “Display files that contain predefined sensitive or privacy-related content”
    • Define other profiles as needed: “Find all zip files larger than 5 GB,” or “Find documents that have no owner.”
  3. Depending on your organization’s policies around information destruction, human review may be necessary for certain information types
  4. Act: Dispose, preserve, or do nothing
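To make these stages concrete, here is a minimal Python sketch of how metadata-based profiles and a disposition decision might be expressed. It is an illustration under stated assumptions, not how any particular vendor tool works. The names `FileFacts`, `classify`, `scan`, and `recommend_action`, the tag names, the suffix list, and the review policy are all hypothetical. Content-based profiles (sensitive or privacy-related data) and ownership checks are left out because they depend on the tool and the environment.

```python
import time
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, Iterator, List, Optional

SECONDS_PER_YEAR = 365 * 24 * 60 * 60
GB = 1024 ** 3

# Hypothetical thresholds taken from the example profiles in the list above.
STALE_YEARS = 10                         # "last accessed more than 10 years ago"
NO_PURPOSE_SUFFIXES = {".tmp", ".log"}   # "temp files, log files" (assumed suffixes)
LARGE_ZIP_BYTES = 5 * GB                 # "zip files larger than 5 GB"

# Hypothetical policy: tags that route a file to a person before any action.
REVIEW_TAGS = {"stale", "large_zip"}


@dataclass
class FileFacts:
    """The file metadata this sketch relies on."""
    path: str
    size_bytes: int
    last_access_epoch: float


def scan(root: str) -> Iterator[FileFacts]:
    """Stage 1: walk a repository and collect metadata for each file."""
    for p in Path(root).rglob("*"):
        if p.is_file():
            st = p.stat()
            # Access times can be unreliable on volumes that do not update them.
            yield FileFacts(str(p), st.st_size, st.st_atime)


def classify(facts: FileFacts, now: Optional[float] = None) -> List[str]:
    """Stage 2: tag a file with every profile it matches."""
    now = time.time() if now is None else now
    suffix = Path(facts.path).suffix.lower()
    tags = []
    if now - facts.last_access_epoch > STALE_YEARS * SECONDS_PER_YEAR:
        tags.append("stale")
    if suffix in NO_PURPOSE_SUFFIXES:
        tags.append("no_purpose")
    if suffix == ".zip" and facts.size_bytes > LARGE_ZIP_BYTES:
        tags.append("large_zip")
    return tags


def recommend_action(tags: Iterable[str], must_retain: bool = False) -> str:
    """Stages 3 and 4: recommend an outcome; nothing is deleted here."""
    tags = list(tags)
    if must_retain:            # a regulatory, legal, or business reason applies
        return "preserve"
    if not tags:
        return "do nothing"
    if REVIEW_TAGS.intersection(tags):
        return "human review"
    return "dispose"
```

A pilot against a single share could loop over `scan(root)` for a directory of your choosing and record `recommend_action(classify(facts))` for each file in a report. Keeping the sketch to recommendations, with no deletion, reflects the point that destruction should follow your organization's approval process.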

Like most IT initiatives, creating a limited pilot for a single filesystem or repository is straightforward. 

An enterprise-wide unstructured data cleanup effort requires more planning; coordination with resources from information governance, legal, compliance, and IT; and, most importantly, participation and buy-in from business leaders and from system and data owners.

The Challenges of Unstructured Data Cleanup 

It Is a Balancing Act. The challenge of differentiating the “good” data from the “bad” data must be weighed against the need for data to drive business decisions and innovation.

Risks of Over-Deletion. Cleanup creates new risks of its own, specifically the inadvertent deletion of data that may be important for future business needs or regulatory requirements.

Implementation Issues. The complexities involved in implementing unstructured data cleanup technology and practices across large, distributed organizations.

A Lack of Mature Tools. Many vendors sell data discovery and classification software that helps you explore your unstructured information, regardless of where it is stored. However, there is no viable, purpose-built solution that can scan a repository, classify its information, and then take action on that information (route for human review, preserve, dispose), all in a single, integrated application.

Next Up

In the second installment of our four-part series, we'll explore the capabilities of scanning and classification applications, and common use cases for unstructured data cleanup. We'll share some hard-learned lessons, as well. 

© Copyright 2024. The views expressed herein are those of the author(s) and not necessarily the views of Ankura Consulting Group, LLC., its management, its subsidiaries, its affiliates, or its other professionals. Ankura is not a law firm and cannot provide legal advice.

