Manual Classification of Data is Still a Best Practice when Developing a Repeatable and Defensible Process
Smart managers put a heavy emphasis on automating whatever processes they can within their organizations, and for good reason: processes become more predictable, the chance of human error drops and, ultimately, the business is more successful. In the realm of eDiscovery, there are even efforts to automate data classification so that businesses have repeatable and defensible processes in place should they be subjected to an eDiscovery request. However, businesses are finding out that automating data classification may not be in their best interest and that, for now, courts still prefer people to computers when it comes to performing this particular task.
My interest in this topic was piqued by a recent conversation I had with Shannon Smith, an attorney who is an eDiscovery and Archiving Specialist with CommVault. We were discussing the pros and cons of the retention policy that many businesses inadvertently adopt: keeping all of the data that they create. The reason they take this approach is simple: it is easy to implement (many of them do this by default anyway) and it saves them the hassle of trying to classify records according to the various internal and external policies that exist.
She explained that in some verticals this model makes sense and is probably even desirable, such as among architectural and engineering firms, video production houses and other lines of business where the data that they store does not contain sensitive personal information. But financial services firms, health care providers, and bio-med and pharmaceutical companies need to develop retention policies that both make sense for their business and satisfy federal regulations.
The trick in these circumstances is to first properly classify the data so the appropriate retention policy or policies can be applied to it. My initial reaction when she said this was that organizations should look to identify a software tool that could automate the classification and retention of this data.
Shannon disagreed. She said that even though CommVault® Simpana® does have a tool built into it called "Content Director" that can automatically classify data, the advice that CommVault generally gives to its clients is that they classify their data by dragging and dropping data into folders that have specific archiving and retention policies associated with them.
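The folder-based approach Smith describes can be pictured with a minimal Python sketch. It is not Simpana's API: the folder names, the `RetentionPolicy` type and the retention periods are all hypothetical and chosen only for illustration (they are not legal guidance). The point it shows is that the user makes the judgment call by choosing a folder, and the policy attached to that folder does the rest.

```python
from dataclasses import dataclass
from datetime import date
from pathlib import PurePosixPath
from typing import Optional


@dataclass(frozen=True)
class RetentionPolicy:
    name: str
    retain_years: int


# Hypothetical mapping of top-level folders to retention policies.
# The periods are placeholders, not recommendations.
FOLDER_POLICIES = {
    "contracts": RetentionPolicy("contracts", 7),
    "hr-records": RetentionPolicy("hr-records", 5),
    "transient": RetentionPolicy("transient", 1),
}


def policy_for(path: str) -> Optional[RetentionPolicy]:
    """Return the policy of the top-level folder a user filed the item in.

    Expects a relative path such as "contracts/acme/msa.pdf".
    Returns None when the folder has no policy attached.
    """
    parts = PurePosixPath(path).parts
    if not parts:
        return None
    return FOLDER_POLICIES.get(parts[0])


def dispose_after(path: str, filed_on: date) -> Optional[date]:
    """Earliest date the item may be disposed of, or None if unclassified."""
    policy = policy_for(path)
    if policy is None:
        return None
    target_year = filed_on.year + policy.retain_years
    try:
        return filed_on.replace(year=target_year)
    except ValueError:
        # Filed on Feb 29 and the target year is not a leap year.
        return filed_on.replace(year=target_year, day=28)


if __name__ == "__main__":
    print(dispose_after("contracts/acme/msa.pdf", date(2020, 3, 15)))
    print(dispose_after("misc/notes.txt", date(2020, 3, 15)))
```

An item filed outside any policy folder comes back as `None`, which is the case a trained user is expected to resolve by filing it properly.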
This confused me, as I assumed it would be the other way around. To me it made more sense to use a software tool to do the data classification than to turn a task as subjective as data classification over to an end-user or group of end-users.
However, it is precisely because classifying data is so subjective that it is, for now, still best left in the hands of users, albeit intelligent ones. In addition, most organizations do not have the processes in place to leverage automated classification with confidence, such as published guidelines for how and where to save data or consistent file naming schemas.
Smith explained that the key here is "intelligent users" since they can understand the context of how data is kept. Organizations probably do not want to save employee emails regarding their lunch plans though they may want to keep those emails where the lunch plans reference discussions about negotiating the terms of a contract. This requires a certain amount of intelligence on the part of the user to properly classify the data.
To use a tool that automates the classification and retention of a specific email solely based upon a single word such as "lunch" or "contract" would be inappropriate since the context of how these words are used is critical. She says, "The idea is to move away from retaining data based upon file types to retaining data based upon the actual content. However for now most customers agree that using people to classify data and make that determination is still the best way to properly classify data. We won't likely see a change in this area until organizations recognize the value in developing and enforcing data retention guidelines that make automation much more reliable."
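The single-keyword problem is easy to reproduce. The sketch below is a deliberately naive rule written for illustration, not how any particular product classifies data; the `mentions` helper and the sample emails are invented. It shows the word "lunch" matching both a social email and a contract negotiation, and the word "contract" matching a personal email about a phone plan.

```python
import re


def mentions(text: str, keyword: str) -> bool:
    """True if the keyword appears as a whole word, ignoring case."""
    pattern = rf"\b{re.escape(keyword)}\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


# Invented sample emails; only the second has business value.
EMAILS = {
    "social": "Want to grab lunch at noon? The taco place is open.",
    "negotiation": "Over lunch we agreed to revise the indemnity clause in the contract.",
    "personal": "My phone contract expires next week, so I am switching carriers.",
}

if __name__ == "__main__":
    for label, body in EMAILS.items():
        # A "discard anything about lunch" rule would drop the negotiation;
        # a "retain anything about a contract" rule would keep the phone plan.
        print(label, mentions(body, "lunch"), mentions(body, "contract"))
```

Neither rule separates the negotiation email from the other two on its own, which is the gap a user who understands the context fills.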
However, for a business to successfully use people to classify its data and have that process hold up in court, it must train its users. She adds, "If a business has never trained its users on how to manage their data, it is unlikely that the organization can claim with any confidence that it has developed a repeatable and defensible process around data management and ediscovery efforts."
Smith recommends that businesses first develop data retention policies and then instruct their employees on how to classify data. This way, when the organization gets hit with a subpoena or lawsuit, there will be some semblance of order during the collections process. Businesses do not have to make the process perfect on day one, as they can improve it over time. However, courts will look for these processes because they give the court more confidence that the business is following a process, as opposed to flying by the seat of its pants every time it gets hit with a lawsuit and has to perform eDiscovery.
In this respect, CommVault Simpana supports full content indexing so users can classify the data regardless of where it resides. The metadata and contents can then be searched regardless of where the data resides in the business, whether in archives, backups or online data that is still sitting on production servers. This capability is valuable because any business user can use it to search across all corporate data (SharePoint data, online or offline email, file servers, etc.).
Today's world is all about process and automation, but data classification does not yet appear to be a process that can be automated in such a way that organizations can rely on the outputs. Leveraging automation requires organizations to develop and enforce detailed classification and retention procedures, and most companies have not yet dedicated the resources to such an effort.
While progress is being made in data classification, and companies like CommVault are working towards delivering tools that enable automated and reliable processes, businesses must for now still rely on documented procedures, educated users and sophisticated search engines such as the one CommVault provides to deliver the repeatable and defensible processes that courts are looking for.