Experts estimate that more data was created in the last two years than throughout all of human history. By the end of 2019, an estimated 246 billion emails will be sent each day. With the adoption of the internet of things (IoT), telematics, and machine-to-machine communication, this data explosion is poised to accelerate at an even greater pace. Against the backdrop of this massive data growth, businesses facing litigation, regulatory inquiries, or enforcement agency investigations find it difficult, if not impossible, to manually review the millions of documents that fall within the scope of these matters on reasonable budgets and in time to meet stringent deadlines.
In addition, industry-specific data types present unique challenges in adversarial legal proceedings, and the construction industry is no exception. Each industry relies on different data sources, and legal matters can draw on any source or record type containing potentially relevant material: structured timekeeping, shipping, and computer-aided design (CAD) or building information modeling (BIM) files; unstructured email; document management systems; and handwritten reporting (e.g., daily construction reports). Costs quickly become significant when clients and their outside counsel must review these materials in a traditional linear manner. Keyword searches can begin to cull the data requiring legal review, but they risk reducing defensibility and make it harder to answer the question: did we find everything?
To address these challenges, corporations and legal teams are utilizing advanced technology, including predictive modeling, near-duplicate detection, and concept clustering, to minimize review, limit costs, and maintain defensibility. As the data evolves, new tools such as sentiment analysis are being explored to categorize and gain insight into these evolving data sets.
Predictive modeling, also known as predictive coding, is a form of supervised machine learning technology that seeks to identify targeted documents in a population. The software makes predictions about the overall data population by utilizing training data drawn from the population at issue, and it can be used to identify relevant material, perform quality control of production or privileged document populations, and assist with categorizing data by matter-specific issues. This process allows for significant time savings, as well as potential cost savings approaching 50 percent, on document research and review when large volumes of data are at play.
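To make the mechanics concrete, the sketch below implements a toy naive Bayes text classifier as a simple stand-in for commercial predictive-coding software. The training documents, labels, and tokenization are hypothetical and chosen only to illustrate how predictions about unreviewed documents are derived from a small reviewed training set.

```python
# Illustrative sketch only: a tiny naive Bayes classifier standing in for
# commercial predictive-coding software. Documents and labels are hypothetical.
import math
from collections import Counter, defaultdict

def tokenize(text):
    return text.lower().split()

def train(docs):
    """docs: list of (text, label). Returns per-class word counts and class totals."""
    counts = defaultdict(Counter)
    class_totals = Counter()
    for text, label in docs:
        counts[label].update(tokenize(text))
        class_totals[label] += 1
    return counts, class_totals

def predict(counts, class_totals, text):
    """Score each class as log prior + log likelihood (add-one smoothing)."""
    vocab = {w for c in counts.values() for w in c}
    scores = {}
    for label, words in counts.items():
        score = math.log(class_totals[label] / sum(class_totals.values()))
        total = sum(words.values())
        for w in tokenize(text):
            score += math.log((words[w] + 1) / (total + len(vocab)))
        scores[label] = score
    return max(scores, key=scores.get)

# Hypothetical SME-reviewed training set.
training = [
    ("change order delay claim concrete pour", "relevant"),
    ("schedule slippage caused by late steel delivery", "relevant"),
    ("holiday party rsvp cafeteria menu", "irrelevant"),
    ("parking pass renewal reminder", "irrelevant"),
]
model = train(training)
print(predict(*model, "delay in concrete delivery"))  # → relevant
```

Real predictive-coding platforms use far richer features and models, but the workflow is the same: reviewed examples in, predictions over the unreviewed corpus out.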
These processes rely on the text present in the data, not the data type or source, thus eliminating the hurdle of conforming data types to a single standard and empowering a legal team to analyze not only client data but the adversarial party’s production data as well. When no text is present (e.g., in certain CAD or BIM files), modeling cannot be performed successfully, and those files will require legal team review.
Predictive modeling requires limited human interaction by the legal team’s subject matter experts (SMEs). The SMEs review a subset of the population for relevance (the training set), and this subset is used to teach the program, which then makes predictions about each document in the corpus based on that teaching. Once the process is complete, the SMEs validate the model by reviewing a statistically valid sample (the validation set) to determine its accuracy. The SMEs’ interaction with only the training and validation sets contrasts with the traditional linear approach of reviewing each document in the corpus at great cost in both time and expense. This frees the legal team to focus on the merits of the matter rather than being bogged down reviewing large volumes of irrelevant documents.
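The statistical side of validation can be illustrated with a short calculation. The sketch below is illustrative only (defensibility in a real matter involves more than a single formula): it computes a commonly used sample size under the normal approximation and the recall of a model's predictions against SME judgments on a validation sample.

```python
# Illustrative sketch: validation-set arithmetic for a predictive model.
import math

def sample_size(z=1.96, margin=0.05, p=0.5):
    """Sample size for a given confidence level and margin of error
    (normal approximation, large population, worst-case p=0.5)."""
    return math.ceil((z ** 2 * p * (1 - p)) / margin ** 2)

def recall(model_predictions, sme_labels):
    """Share of documents the SMEs judged relevant that the model also
    flagged; True = relevant in both lists."""
    tp = sum(1 for m, s in zip(model_predictions, sme_labels) if m and s)
    fn = sum(1 for m, s in zip(model_predictions, sme_labels) if not m and s)
    return tp / (tp + fn)

print(sample_size())  # → 385 documents for 95% confidence, ±5% margin
print(recall([True, True, False, False], [True, False, True, False]))  # → 0.5
```

A 385-document validation sample is a familiar figure in eDiscovery workflows precisely because it corresponds to 95 percent confidence with a ±5 percent margin of error.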
As an example of this technology in use, a prominent utility provider had hired a company to build its new flagship manufacturing facility. Upon taking possession of the factory, the provider identified a long list of deficiencies with its construction that hampered its ability to operate efficiently. The client entered arbitration seeking monetary relief for the cost of correcting the deficiencies and for the burden of operating at reduced output. An expert was engaged to assist with the identification of delays, completeness of project turnover, damages from the deficient facility, and provide expertise regarding the discovery process.
The client had a repository of over 1.3 million documents drawn from multiple sources (e.g., email, schedules, CAD files, reports) and was required to produce relevant documents from the repository within a 10-week time frame. With limited legal team resources (four or five attorney reviewers) to conduct a full review of the repository, the review was projected to take over eight months to complete.
The expert deployed predictive modeling to analyze the text of the documents and create multiple predictive models to identify relevant documents and potentially privileged documents. The models classified the material and assisted the legal team’s review by identifying relevant material, facilitating the removal of irrelevant material from the review workflow, as well as providing quality control of the ultimate production population to verify that potentially privileged material would not be inadvertently produced. This resulted in the legal team meeting the arbitration panel’s deadline without the need to add large numbers of additional legal staffing.
Predictive models were also created to categorize the opposition’s production. These models were set up to supplement traditional keyword searching to categorize the produced data and identified key documents related to the various deficiencies at issue. In very short order, the team had a deep understanding of the material produced by the opposition, which allowed for greater time to formulate their legal arguments.
Having faced the prospect of hiring additional legal staff to complete review of all material by the arbitrator’s deadlines, the legal team instead leveraged predictive modeling, sparing the client those staffing costs and saving hundreds of thousands of dollars in review costs.
Near-duplicate detection technology is a form of machine learning that identifies near-duplicate or duplicate documents based on the textual content of the documents. By evaluating and comparing the text present in each document in a population against one another, near-duplicate detection groups documents based on textual similarity. Contrary to predictive modeling’s supervised learning, near-duplicate detection utilizes unsupervised learning, which does not require human intervention or training. This process allows for analysis as soon as data is available.
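As an illustration of the unsupervised grouping idea, the following sketch groups documents by Jaccard similarity over word shingles. The sample documents and the 0.5 similarity threshold are hypothetical, and production tools use more scalable techniques such as MinHash, but the principle of grouping purely by textual similarity, with no training input, is the same.

```python
# Illustrative sketch: unsupervised near-duplicate grouping by Jaccard
# similarity of word shingles. Documents and threshold are hypothetical.

def shingles(text, k=3):
    """Set of overlapping k-word phrases from the document text."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def group_near_duplicates(docs, threshold=0.5):
    """Greedy single-pass grouping: each document joins the first group
    whose representative it resembles above the threshold."""
    groups = []  # list of (representative shingle set, [doc indices])
    for i, text in enumerate(docs):
        s = shingles(text)
        for rep, members in groups:
            if jaccard(s, rep) >= threshold:
                members.append(i)
                break
        else:
            groups.append((s, [i]))
    return [members for _, members in groups]

docs = [
    "the contractor shall complete the work by june first",
    "the contractor shall complete the work by july first",
    "daily report weather clear no incidents on site",
]
print(group_near_duplicates(docs))  # → [[0, 1], [2]]
```

The two contract sentences differ by a single word and land in one group, while the unrelated daily report stands alone; no human labeling was needed at any point.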
The technology enables greater quality control over a legal team’s review of documents. In very little time, a team can determine if the relevance or privilege of a document is in line with similar documents. Issue research is improved with the ability to quickly jump to documents similar in content to investigate related information. The technology enables legal teams to validate privilege determinations across groups, find documents related to specific issues, and perform first-pass review based on similarity to speed up workflows.
The Radicati Group, Inc., “Email Statistics Report, 2015–2019” (2015), https://www.radicati.com/wp/wp-content/uploads/2015/02/Email-Statistics-Report-2015-2019-Executive-Summary.pdf.
© Copyright 2019. The views expressed herein are those of the author(s) and not necessarily the views of Ankura Consulting Group, LLC., its management, its subsidiaries, its affiliates, or its other professionals.