Most enterprises today do business across the globe, with databases in multiple countries and DBAs or users in different regions who have access to those databases. With GDPR mandating privacy requirements for personal data of European Union (EU) residents and visitors, it is important for an organization to know and control who accesses that data and what those with access authority can do with it.
Chapter 5 of the GDPR addresses “transfers of personal data to third countries or international organizations”, and Article 44 of Chapter 5 specifically sets out the “general principle for transfers”, which outlines the requirement for preventing unauthorized data transfers outside of EU member states. Organizations can satisfy this requirement in one of two ways:
Blocking transfer of personal data outside the EU; or
Ensuring adequate data protection
In both cases, the starting point for compliance with the GDPR is data discovery and data classification followed by implementation of strong security policies, audit policies and reporting.
Imperva SecureSphere can help organizations comply with the GDPR by blocking the transfer of personal data outside the EU and ensuring adequate data protection. In this post, I’ll review how the SecureSphere database security solution can not only classify sensitive data and prevent it from crossing a specific geographic location to meet the Article 44 requirement, but also generate audit logs and reports that can assist with investigations, reporting mandates and data forensics (Figure 1).
Figure 1: Imperva SecureSphere helps enforce cross-border data transfers by mapping to GDPR requirements
Many organizations are not aware of all the databases that exist in their network. Oftentimes a DBA may create a database, for example to test an upgrade, then forget to take it down, leaving a database containing potentially sensitive data unsecured and unmonitored. SecureSphere Database Discovery scans and reports on all the databases that exist in the network, providing you with detailed information on each, including IP address, port number, OS type and version (Figure 2).
Figure 2: Database Discovery scan results
After database discovery, it is important to understand what kind of data exists in your databases. The goal here is to look for any sensitive or privileged information. SecureSphere can identify sensitive data using column names or a content-based search with regular expressions, making it highly accurate (Figure 3).
Figure 3: Data classification scan results
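A content-based search of this kind boils down to pattern matching over column values. Below is a minimal sketch in Python; the patterns and category names are invented for illustration and are not SecureSphere's actual classification rules:

```python
import re

# Illustrative patterns only -- not SecureSphere's actual classification rules.
PATTERNS = {
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
}

def classify_value(value):
    """Return the sensitive-data categories a column value matches."""
    return [name for name, rx in PATTERNS.items() if rx.search(value)]
```

Real classification engines combine many such rules with column-name heuristics, which is what makes a product-grade scan more accurate than a single regular expression.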
Security policies play a key role in protecting against known/unknown attacks and threats and complying with regulations and organization guidelines. Let’s say for example you have two DBAs in different countries trying to access a database in Germany. You would need to define and enforce security policies that ensure the DBAs are accessing only the data they are authorized to access based on their location (Figure 4).
You can set up a security policy in SecureSphere that allows Mark, a DBA in Germany, to access the database in Germany, but block access by Franc, a DBA in Singapore, as Franc should not be allowed access due to his geo location (Figure 5).
Figure 4: User role and location mapping
In our example, SecureSphere’s security policy is tracking and blocking based on:
User first name, last name and role
The country from which they are accessing the data
The query they are trying to run
The database they are trying to access, and whether that database contains any sensitive information
Figure 5: SecureSphere security policy blocks a DBA in Singapore from accessing a German database
Auditing is necessary as it records all user activities, provides visibility into transactions, and creates an audit trail that can assist in analyzing data theft and sensitive data exposure.
In the snapshot below, you see a response size of “0” for the DBA in Singapore, confirming he was not able to access and query the database in Germany, whereas the DBA from Germany has a response size of “178”, indicating he was able to execute the query and access the database (Figure 6).
SecureSphere can also create detailed reports with charts using multiple parameters such as user, database, schema, query, operation, response size, sensitive data access, affected rows and more (Figure 7). This information can be used to report on activity that assists in maintaining compliance with various regulations.
Figure 7: Create and manage reports on database activity
Gartner has published their 2017 Magic Quadrant for Web Application Firewalls (WAF) and Imperva has again been named a WAF leader—now for four consecutive years.
Attacks remain the same, but infrastructure is changing
According to the 2017 Verizon Data Breach Investigations Report, web app attacks remain the number one cause of data breaches and denial of service (DoS) is still the most common security incident. The strategic planning assumption in the report is that, “By 2020, more than 50% of public-facing web applications will be protected by cloud-based WAF service platforms, combining CDN, DDoS protection, bot mitigation and WAF, up from less than 20% today.”1
As enterprises move applications to private and public cloud infrastructures, it becomes more important to adopt solutions that can be adapted to any cloud provider or any on-premises deployment. The Imperva WAF product line does exactly that.
Helping customers successfully move to the cloud
Gartner recognized Imperva as a leader for our completeness of vision and ability to execute (Figure 1). We stayed ahead of the competition by offering security solutions for a changing deployment and infrastructure model while securing applications from both web app attacks and denial of service attacks.
Figure 1: 2017 Gartner Magic Quadrant for Web Application Firewalls (WAF)
Here are a few highlights that we believe make Imperva a four-time, consecutive leader.
Deploy where you need, when you need, for just one price.
FlexProtect licensing lets Imperva customers move applications between Imperva on-prem and cloud solutions without incurring additional costs. This licensing program gives you predictable costs even as you move back and forth between cloud and on-prem deployments.
Scale On Demand
Imperva offers one of the most feature-rich cloud WAF solutions that carries the same capabilities as our robust on-prem WAF, plus a CDN, DDoS protection and load balancing. This allows customers to safely scale out security operations as they move toward a hybrid WAF strategy, without compromising performance and protection.
Laser-focused on Security
Advanced security features such as dynamic application profiling and virtual patching set Imperva WAF apart from the competition, with Imperva consistently scoring higher when security is the main focus in purchasing decisions.
Imperva is recognized for going beyond standard threat intelligence by providing crowd-sourced threat intelligence and emergency feeds to protect customers against brand new attack campaigns.
Database security best practices are also applicable for big data environments. The question is how to achieve security and compliance for big data environments given the challenges they present. Issues of volume, scale, and multiple layers/technologies/instances make for a uniquely complex environment. Not to mention that some of the data being stored and processed can be sensitive. Who has access to that data within your big data environment? Are the environment and the data vulnerable to cyber threats? Do your big data deployments meet compliance mandates (e.g., GDPR, HIPAA, PCI, and SOX)?
Drew Schuil, Vice President of Global Product Strategy, returns to talk about big data security in today’s Whiteboard Wednesday. Learn about the challenges associated with securing big data and requirements for protecting it as you build out your plan.
Hi, welcome to Whiteboard Wednesday. My name is Drew Schuil, Vice President of Global Product Strategy at Imperva, and today’s topic is Challenges of Securing Big Data.
I meet with a lot of customers and chief security officers and we talk about protecting databases and file systems, so structured and unstructured data. And when I bring up big data, it’s often something that’s an afterthought or really hasn’t been looked at yet. So, we want to talk about some of the issues and things to get in front of this problem—this opportunity—as it arises.
The Big Data Trend
Let’s look at some of the trends. The biggest thing to note here is that big data is growing and it’s coming fast. IDC is predicting double digit growth in big data lakes within large enterprises and part of the reason that data collection is exploding is we’re seeing a proliferation of IoT, or Internet of Things, devices. Whether it’s the consumer market or the business environment, these devices are collecting metadata that’s very valuable to organizations for data analytics, for market trends, for consumer activity. More and more of the data being collected is being thrown into these big data lakes.
That leads us to our next trend here, which is sensitive data. Most organizations I talk to say, “Look, we’re not storing credit card numbers. We’re fairly/100% certain about that.” However, when we start looking at some of the newer regulations that have teeth, like Europe’s GDPR, now the scope is potentially wider when we talk about personally identifiable information (PII). Things like first name, last name, email address, address, some of these little pieces of information that perhaps were benign before are coming into compliance, into scope, for data protection and [it’s important to make] sure that we’ve got a security strategy in mind.
Big Data Security Requirements
Let’s look at the framework. As you can see, access control, threat filtering, etc.—really the same kind of concepts that we had [for relational database security], but there’s some spin. There are some new things when we talk about big data.
Access Control and Threat Filtering: Specifically, with the first one, access control. When we talk about database environments as an example, they are fairly locked down. You’ve got DBAs, you’ve got least permission, auditing and entitlements reviews if you’re in financial services. However, within big data environments, because of the nature of big data and the analytics and the people that need to have access to it, a lot of times permissions are granted on a very wide basis. It’s a little different when we’re thinking about production databases versus production big data environments, because more and more people have access. With that, it increases the landscape for threats. Whether it’s endpoint threats and malware infections and account takeover, whether it’s malicious insider use cases—someone gaining access to data that they shouldn’t have. Or a DDoS attack, someone that says, “Hey, this is a big data environment that’s critical to the business, I’m going to extort you by threatening to DDoS that environment.” The same types of threats that we see with other business applications.
Activity Monitoring and Alerts: That leads to activity monitoring. I mentioned GDPR, Europe’s data privacy regulation, and activity monitoring and auditing. Being able to understand who is accessing what. Is that appropriate? Does that violate some regulation or data security standard within the organization? And then being able to get this information to the team that’s responsible for securing it. A lot of times that means feeding it from the monitoring tools into a SIEM or into a SOC or some other monitoring mechanism.
[We’ve got the] trends, it’s taking off. Big data is not something to ignore. We’ve got the same requirements. In the next section we’ll talk about some of the challenges that are introduced inherently by big data.
Big Data Security Challenge #1 – the Data Itself:
So, we have the same security requirements, but some very different challenges when looking at how to secure big data and it starts with the three V’s: volume, velocity, and variety.
Volume: One of the benefits of a big data environment is that it can handle massive amounts of data and actually make sense of it and crunch it in a lot of different ways to produce valuable results for the business.
Velocity: The other challenge is velocity. Particularly within high tech environments or retail or banking, where decisions need to be made very quickly on this data, having a security solution that is real time not only for alerting and monitoring, but also blocking—to keep up with that becomes a challenge when we’re balancing cost versus risk.
Variety: The third issue is variety. Because of the amount of, I’d say, the relaxed permissions that we have within big data, the number of people from different departments and access points coming in and doing different things to the data, it really becomes a challenge when we start talking about data discovery and classification. Which data is sensitive, so that I can have some focus and scope? And then how do I apply policies against that if I’m having a challenge classifying the data …and I’m also having a challenge in terms of classifying the users and permissions and whether they should or shouldn’t be accessing the data? It really compounds the problem when we look at big data in the context of these three V’s.
Big Data Security Challenge #2 – the Environment:
The second challenge here is the environment.
Multiple Layers: When we look at big data environments, it’s not as simple as our traditional, let’s say, database environment, where we’ve got an application talking to an Oracle database and a pretty clear, crisp understanding of where we need to put in controls and blocking points. If we look at our diagram over here, we’ve got multiple different layers from distributed storage and querying layers to different management applications. Look at this environment and just the complexity of it. It should look much more difficult from a security perspective than something, again, like an Oracle or DB2 or SQL server stand-alone type of an application to protect.
Different Technologies: We’ve also got different technology mixed in to each big data environment, so you may have NoSQL, NewSQL, data warehouses, BI tools. You’ve got all these different types of technologies within the environment, so again it’s not as cookie cutter as we’re used to in the past.
Multiple Instances/Dispersed Data Stores: You may have different instances and it may be dispersed over a wide geography, particularly if we’re dealing with a large multi-national, like a retailer, and we’re crunching data across multiple different regions. Now, you’ve got to not only look at the complexity of a single environment, but replicate that, be able to have the security environment talk to other security environments across a wide geography. You start to see some of the challenges when we talk about securing big data environments.
Big Data Security Challenge #3 – People:
All right. The third challenge is people and like we’ve talked about before in other sessions, people can often be the weakest link when we’re talking about security, especially in a complex environment like big data. If we look at the people that are most adept at administering and dealing with a big data environment, we’re talking about computer scientists, PhD types. We’re talking about people that are going to be really focused on anything but security. They’re going to be focused on making the system work fast, getting accurate results. Really the last thing on their mind is going to be security and compounding that, again, is the privileged access problem.
The nature of these environments is very different from what we’re used to with a traditional database environment, where, let’s say, production is very locked down and you’re doing some data masking (a best practice) for pre-production and test. Within a big data environment, a lot of times it’s much more open: you’ve got developers with very unrestricted access to potentially sensitive data, and security, again, is an afterthought when we talk about people accessing these systems.
Where to Start?
We’ve talked about big data trends. We’ve talked about some of the challenges in securing big data. Let’s talk about what you can do next. Where you can start. This section is a little bit of motherhood and apple pie, but some interesting tidbits that we’ve heard from customers that we’ve talked to.
Raise Awareness: We’re actually seeing financial services and retail, some of the early adopters, implement solutions like Imperva to address the security and compliance requirements for big data. And what they’ve told us is they’ve started with raising awareness within the organization. Basically saying, “Hey, we’ve got databases, we’ve got file systems, we’ve got cloud that we need to deal with. Big data also falls into our data security strategy.” Just raising awareness within the organization, so it doesn’t come as a surprise.
Proactively Interview Business Units: Then proactively interviewing the business units. So, talking to marketing, talking to the CRM teams, the customer support teams, talking to any of the business units that may be early adopters of big data so that you can get in front of and be aware of those projects and not be reacting later on to surprise projects.
Develop a Strategy/Build a Plan: Developing a strategy and building a plan. It’s much easier to be able to respond very quickly to the executives and say, “Hey, I knew this was coming. Here’s our plan. I’ve had this plan in place for the last six months, 18 months.” Really to just get ahead of these issues.
Additional Security Requirements for Big Data
Some of the requirements that we’ve heard from early adopters that are rolling out Imperva to protect big data…back to the complexities that it’s got to be able to address the three V’s. It’s got to be scalable, it’s got to be able to address very high performance environments, it’s got to be able to be deployed in a distributed environment across multiple different geographies. We’ve got to be able to integrate with other pieces of the security ecosystem, much like we see in protecting our structured and unstructured data. We’ve got to be able to integrate with SIEM. We’ve got to be able to pull information in about the risk profiles of users who are interacting with data, profiles that may contain information not seen by Imperva, but by some of your other security tools.
And they want to be able to leverage existing solutions. So, if they’ve already deployed a solution like Imperva to audit their databases, to audit their file systems in SharePoint, to audit their cloud systems, to secure their web applications…why not be able to take that same console, the same policy engines, the same framework they’ve already developed and apply that to the next data type? To apply that to big data. So, that’s something that a lot of organizations are looking for, not yet another vendor but to be able to consolidate their vendor portfolio.
Then, finally, actionable alerts. This really goes back to being able to provide context. In a database environment, we’re talking about millions and millions of events, let’s say in Oracle or DB2. Now, when we shift to talking about big data, it could be billions or trillions of events. So, it becomes even more important that we have things like machine learning that can understand and make sense of good versus bad and inappropriate behavior so that we can send actionable alerts, that we can send single digit alerts to the rest of the ecosystem, the SIEM and the SOC and so forth.
So, that’s our big data talk. I hope you found it helpful and please tune in for additional whiteboard sessions. Thanks.
These days we hear about machine learning and artificial intelligence (AI) in all aspects of life. We see machines that learn and imitate the human brain in order to automate human processes. There are autonomous cars that learn the road conditions to drive, personal assistants we can converse with and machines that can predict what stock markets will do. In some respects, it can appear as “magic.”
However, it’s not. Behind machine learning there are some fundamental, well-studied and understood techniques. To the extent that there is any magic, it is in knowing how to apply these techniques to solve a certain problem. We’ll take a look at some of these techniques, and then illustrate how some of them are applied to solve a specific problem for identifying improper access to unstructured data.
Machine Learning, Defined.
Machine learning is a type of artificial intelligence that enables computers to detect patterns and establish baseline behavior using algorithms that learn through training or observation. It can process and analyze vast amounts of data that are simply impractical for humans.
Supervised learning – the machine is presented with a set of inputs and expected outputs; later, given a new input, it predicts the output.
Unsupervised learning – the machine aims to find patterns within a dataset without explicit input from a human as to what those patterns might look like.
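As a minimal illustration of the supervised case, here is a toy classifier in scikit-learn; the data and labels are invented for this sketch:

```python
from sklearn.neighbors import KNeighborsClassifier

# Supervised learning: inputs paired with expected outputs.
X = [[0], [1], [2], [10], [11], [12]]
y = ["low", "low", "low", "high", "high", "high"]

# Train on the labelled examples, then predict the output for a new input.
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
prediction = clf.predict([[1.5]])  # nearest neighbours are all labelled "low"
```

In the unsupervised case there would be no `y` at all; the algorithm would have to discover the low/high grouping on its own, which is exactly what the clustering techniques below do.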
More important, however, is that within unsupervised machine learning there are several different techniques that can be used to identify patterns and ultimately yield valuable analysis. Understanding the problem domain is key to correctly choosing which of these techniques to use; a data scientist who doesn’t understand the problem domain cannot choose the right approach.
Clustering is the assignment of objects to homogeneous groups (called clusters) while making sure that objects in different groups are not similar. Clustering is considered an unsupervised task as it aims to describe the hidden structure of the objects.
Each object is described by a set of characteristics called features. The first step in dividing objects into clusters is to define the distance between the different objects. Defining an adequate distance measure is crucial for the success of the clustering process.
There are many clustering algorithms, each with its advantages and disadvantages. A popular algorithm for clustering is k-means, which aims to identify the best k cluster centers in an iterative manner. Cluster centers serve as “representatives” of the objects associated with the cluster. k-means’ key features are also its drawbacks:
The number of clusters (k) must be given explicitly. In some cases, the number of different groups is unknown.
k-means’ iterative nature might lead to an incorrect result due to convergence to a local minimum.
The clusters are assumed to be spherical.
Despite these drawbacks, k-means remains the right and popular choice in many cases. An example of clustering using k-means on spherical data can be seen in Figure 1.
Figure 1: k-means clustering on spherical data
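A run like the one in Figure 1 can be reproduced in a few lines with scikit-learn. This is a sketch on synthetic spherical data, not the data behind the figure:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Three spherical clusters -- the setting where k-means works well.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.60, random_state=42)

# Note that k (the number of clusters) must be given explicitly.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
```

Each of the three cluster centers in `km.cluster_centers_` acts as the "representative" of its cluster, as described above.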
A different clustering algorithm is OPTICS, which is a density-based clustering algorithm. Density-based clustering, unlike centroid-based clustering, works by identifying “dense” clusters of points, allowing it to learn clusters of arbitrary shape and densities. OPTICS can also identify outliers (noise) in the data by identifying scattered objects.
Figure 2: k-means versus OPTICS on moon-like data
An example of running k-means versus OPTICS on moon-like data is presented in Figure 2. The OPTICS approach yields a very different grouping of data points than k-means; it classifies outliers and more accurately represents clusters that are by nature not spherical.
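The moon-like comparison can be sketched with scikit-learn as well; the parameter values here are illustrative choices, not tuned recommendations:

```python
from sklearn.cluster import OPTICS
from sklearn.datasets import make_moons

# Two interleaving half-circles: non-spherical clusters where k-means struggles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# No cluster count is needed; points labelled -1 are treated as outliers (noise).
labels = OPTICS(min_samples=10, xi=0.05).fit_predict(X)
```

Unlike k-means, OPTICS follows the density of the points, so each half-moon can be recovered as its own arbitrarily shaped cluster.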
In the field of machine learning, it is useful to apply a process called dimensionality reduction to highly dimensional data. The purpose of this process is to reduce the number of features under consideration, where each feature is a dimension that partly represents the objects.
Why is dimensionality reduction important? As more features are added, the data becomes very sparse and analysis suffers from the curse of dimensionality. Additionally, it is easier to process smaller data sets.
Dimensionality reduction can be executed using two different methods:
Selecting from the existing features (feature selection)
Extracting new features by combining the existing features (feature extraction)
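Feature selection can be as simple as dropping features that carry no information, for example zero-variance columns. A sketch using scikit-learn, with a made-up matrix:

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold

# Three features; the middle one is constant and carries no information.
X = np.array([[1.0, 7.0, 3.0],
              [2.0, 7.0, 1.0],
              [3.0, 7.0, 2.0]])

# Keep only features whose variance exceeds the threshold (0 by default),
# i.e. drop the constant middle column.
X_selected = VarianceThreshold().fit_transform(X)
```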
The main technique for feature extraction is Principal Component Analysis (PCA). PCA guarantees finding the best linear transformation that reduces the number of dimensions with a minimum loss of information. Sometimes the information that was lost is regarded as noise: information that does not represent the phenomena we are trying to model, but is rather a side effect of some usually unknown processes. The PCA process can be visualized as follows (Figure 3):
Figure 3: PCA process visualized
Following the process in the example, we might be content with just PC1 – one feature instead of originally two.
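The same reduction can be shown numerically: two strongly correlated features collapse onto a single principal component with little information loss. The data below is synthetic, invented for this sketch:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Two correlated features: the second is essentially a scaled copy of the first.
x = rng.normal(size=200)
X = np.column_stack([x, 2.0 * x + rng.normal(scale=0.1, size=200)])

# Keep only PC1 -- one feature instead of the original two.
pca = PCA(n_components=1)
X_reduced = pca.fit_transform(X)
```

Because the second feature is almost a linear function of the first, PC1 retains nearly all of the variance, which is exactly the situation where being "content with just PC1" is justified.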
There is a wide choice of dimensionality reduction techniques: some are linear like PCA, some are nonlinear, and lately methods using deep learning, such as word embeddings, are gaining popularity.
Applying the Techniques to Dynamically Learn True Peer Groups
A recent Hacker Intelligence Initiative (HII) research report from the Imperva Defense Center describes a new innovative approach to file security. This approach uses unsupervised machine learning to dynamically learn peer groups. Once these peer groups are learned, they are then used to determine appropriate virtual permissions for which users should/should not be allowed to access files within an organization. This dynamic peer group functionality is available in Imperva’s breach prevention solution, CounterBreach.
Figure 4 illustrates how machine learning is used to detect suspicious file access activity based upon dynamic peer group analysis.
Figure 4: The process of detecting suspicious activity using dynamic peer group analysis
After collecting and preparing the data, machine learning is applied to build the virtual peer groups. To build these dynamic peer groups, Imperva uses the machine learning techniques mentioned above: PCA and density-based clustering. First, Imperva performs dimensionality reduction on the data. In the previous step the audit data was transformed into a matrix with users as rows and folders as columns; the values in the matrix cells are the amount of activity performed by the given user in a folder. The first reason we use PCA is the sparseness of the matrix: more than 99% of its cells are empty. Second, many of the folders’ access patterns are correlated, which causes multicollinearity in our matrix; in practice, correlation can occur when groups of users working on a similar project access a similar group of folders. Lastly, after using PCA the matrix is 90% smaller and hence easier to process.
Second, Imperva chose OPTICS as the clustering algorithm, clustering users based on densities. The number of peer groups is unknown, and since k-means requires knowing the number of clusters (peer groups, in this case) it cannot be used; OPTICS overcomes this limitation. OPTICS also allows us to handle noisy users with special care: each noisy user resides in a cluster alone. In addition, a great deal of trial and error confirmed that OPTICS is the right choice for this dataset.
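The two steps can be sketched end to end on toy data. Everything below is invented for illustration: the user/folder matrix, the two simulated teams, and the parameter choices are assumptions, not the CounterBreach implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import OPTICS

rng = np.random.default_rng(1)

# Toy audit matrix: rows = users, columns = folders, cells = access counts.
# Two simulated teams each work only on their own half of the folders.
n_folders = 40
mask_a = (np.arange(n_folders) < n_folders // 2).astype(int)
team_a = rng.poisson(5, size=(15, n_folders)) * mask_a
team_b = rng.poisson(5, size=(15, n_folders)) * (1 - mask_a)
X = np.vstack([team_a, team_b]).astype(float)

# Step 1: PCA shrinks the sparse, correlated matrix to a few components.
X_reduced = PCA(n_components=5).fit_transform(X)

# Step 2: density-based clustering recovers the peer groups;
# a label of -1 would mark a "noisy" user that fits no group.
labels = OPTICS(min_samples=5).fit_predict(X_reduced)
```

A user whose cluster membership does not match the folders they access would then be a candidate for an alert, which is the idea behind the dynamic peer group analysis in Figure 4.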
The strength of a successful algorithm based on data analysis lies in the combination of three building blocks. The first is the data itself, the second is data preparation—cleaning and choosing the exact features that represent the data’s characteristics—and the third is using the right machine learning methods in order to correctly profile the data.
In this case, PCA and OPTICS proved particularly useful for learning peer working groups. However, the “machine” didn’t magically determine this itself. Rather, it was a person (well, really a team) that understood the problem, and the data being analyzed, that performed the “magic” of selecting the right machine learning building blocks.
Last month, I met James (name changed) while at AWS Summit in London. As I was managing Imperva’s booth, he walked over to me with a query about what we do. A conversation ensued and James described his company for me. They were into financial-legal intermediation between underwriters, insurance brokers, and customers.
As it turned out, he was at the event to deep dive into the Amazon API Gateway plus AWS Lambda service model. He explained that use of Lambda should allow them to cover additional real-time intermediation cases that are very common in the financial-legal industry. In addition, the AWS Lambda cost model made a lot of financial sense to him.
So, I asked…how did he plan to secure his public presence?
To my mild surprise, he had not considered security—most financial customers in a public setting would require provable security controls by default. He also had not put much thought toward the “public” part of the API gateway. The fact that, as of today, AWS offers API Gateway as a managed service makes things very simple for application developers, as AWS takes responsibility for deploying and managing the application stack for them. On the other hand, it creates challenges for enterprise security teams, as API gateways cannot be contained inside a VPC (the de facto model for most customers).
We ended with me sharing how our application security offerings (SecureSphere and Incapsula) could help secure these new applications. As it turns out, I’d have several similar conversations like this one throughout the day.
Microservices, Developers, and a Say in Security
As the day progressed, I had the chance to chat with multiple DevOps and security folks. I observed two trends:
First, most developers or DevOps folks at the AWS Summit were either currently using or notably interested in using microservices architectures. Containers and API gateways were the tools of choice for such architectures.
Second, development teams had a stronger say in the choice and deployment model of application security technologies. At the very least, they wanted an integrated app+security deployment approach.
Both of these trends have a strong business context. In certain cases, both container and API gateway (Lambda-based) approaches are cheaper to run versus EC2-based application instances. And the microservices approach emphasizes independence between multiple business units while allowing them to scale independently to the needs of the business. (Note: Azure also offers API gateways powered by Azure Functions, similar to Lambda.)
How Imperva Customers Secure API Gateways and Containers
Imperva customers are using our products today to secure containers and API gateway deployments (with or without microservices):
API gateways can front end a virtual application instance/container deployment or, in the case of public clouds, a Lambda/function-based deployment. In both cases, customers are using either our instance/appliance-based (SecureSphere) application firewall or our cloud-based application security solution (Incapsula) to secure the gateway.
Meanwhile, securing containers is similar to securing virtual instance deployments. Many Imperva customers have successfully secured container deployments with SecureSphere or Incapsula.
With our focus on ease of deployment, API coverage, and a wide range of virtual/physical appliance capacity, Imperva ensures customers are able to accommodate developers’ interests and requirements while implementing the necessary security controls for the business.
In my role at Imperva, I continually listen to our customers, observe security trends and share what we learn as a team. Perhaps we’ll have a conversation ourselves one day soon. Until then, look for more from us here on best practices for securing microservices and ways Imperva can help.
Understanding what sensitive data resides in your enterprise database is a critical step in securing your data. Imperva offers Classifier, a free data classification tool that allows you to quickly uncover sensitive data in your database.
Classifier contains over 250 search rules for popular enterprise databases such as Oracle, Microsoft SQL Server, SAP Sybase, IBM DB2 and MySQL, and supports multiple platforms like Windows, Mac and Linux. Once you download and install Classifier, you can start discovering sensitive data in your database, such as credit card numbers, person IDs (which include ID-type elements associated with a person, like user name, user ID, and employee ID), access codes and more. The tool also jumpstarts you on your road to compliance with the General Data Protection Regulation (GDPR) as well as data security. This post will walk you through the steps of using the tool.
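To make the idea of a search rule concrete, here is a minimal, hypothetical sketch (not Classifier's actual rule set) of how a pattern-based rule for credit card numbers might pair a regular expression with the Luhn checksum to cut down on false positives:

```python
import re

# 13-16 digits, optionally separated by single spaces or hyphens
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(value):
    """Luhn checksum: double every second digit from the right,
    subtract 9 from doubles above 9, and require sum % 10 == 0."""
    digits = [int(c) for c in value if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def classify_value(text):
    """Return a category label if the text matches a search rule, else None."""
    match = CARD_RE.search(text)
    if match and luhn_valid(match.group()):
        return "Credit Card Number"
    return None
```

A real classification engine would scan column metadata and sampled rows with many such rules, but the regex-plus-checksum pattern shows why a rule can match card-like strings without flagging every 16-digit number.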
First, you need to meet the prerequisites listed in the Classifier User Guide. Then you can begin your scan, view the results, and evaluate corrective action options. Let’s get started.
Running a Scan
Running a Classifier scan is a simple, four-step process.
Select your database type from the drop down list. (Options include Oracle, Microsoft SQL Server, SAP Sybase, IBM DB2, and MySQL.)
Enter details for the selected database, as follows (see Figure 1):
Port (or use default Port)
Schema – a collection of database objects (e.g., tables) associated with one particular database user name
Database Name / Instance / SID
NOTE: Microsoft SQL Server supports Windows Authentication, which is enabled by default. To disable and manually enter a User Name and Password, click the Authentication button next to the User Name field. Enter the appropriate User Name and Password (see Figure 2).
Click Go to start the scan. The scan will run without the database experiencing any downtime or performance degradation.
Figure 1: Set scan parameters in Classifier
Figure 2: Disable Windows Authentication
Review the Results
The results of the scan are presented on an easy-to-read dashboard (see Figure 3).
Figure 3: Classifier executive summary dashboard [click to enlarge]
The dashboard is organized into three panes:
Top Pane — Displays an executive summary of the sensitive data contained within your database as well as an indication of the amount of sensitive data present.
Number of sensitive data categories detected
Total amount of sensitive data found
Time to complete the scan
Middle Pane — Displays summary statistics that include:
Ratio of sensitive/non-sensitive database columns
Data Classification Results — Different categories of sensitive data found, such as personal identification number, mailing address, access codes, etc.
Ratio of each sensitive data category
Bottom Pane — Displays Classification Details, organized into a sortable table with the following fields (see Figure 4 for a larger view):
Category — Displays type of sensitive data
In the example above, there are a total of 30 columns of sensitive data, which account for 11% of the scanned database. Among the sensitive data found, 7% are access codes, 20% are free text, 10% are person IDs, and 49% are person names. When you look at the classification details, you can find the actual count under each category.
To better understand which schema, tables, and columns are contained within each category, you can click on a category row under the Classification Details section to expand the content. You can drill down into details of a specific category, including row counts associated with each schema, table and column identified by the scan (see Figure 4).
Figure 4: Category detail example showing a total of 2 tables that contain 17 rows of person ID data [click to enlarge].
Now that you’ve identified what sensitive data resides in your database, you can then take appropriate actions, such as data monitoring or data masking, to further secure your data. It’s easy to use Classifier to quickly uncover sensitive data that may be at risk within your organization. While this free tool searches database metadata, our enterprise data security products provide additional capabilities, such as database content searching, reporting and export functionality.
It’s been a busy year thus far in the cybercrime world, with the stakes seeming to grow higher every month. Just last month, insider threats were making headlines with a news report that Reality Winner, a contractor for the National Security Agency with a top-secret security clearance, leaked sensitive documents related to an FBI investigation into Russian hacking to the media.
Threats from insiders, while not as common as breaches from external actors, are still very significant and damaging. According to the 2017 Verizon Data Breach Investigations Report (DBIR), insider threats account for 25 percent of breaches and often take months or years to detect. Insider threats are among the costliest breaches because they often take longer to discover which gives the perpetrators ample time to damage systems and steal valuable data. The longer the attack is active and undetected, the higher the cost to the organization.
Insider threats can be classified into three types:
Malicious insiders – Trusted insiders who intentionally steal data for their own purposes.
Careless and negligent insiders – People within or directly associated with the organization who expose sensitive data through careless behavior, despite having no malicious intent.
Compromised insiders – Typically this is an account takeover scenario, where credentials have been guessed or captured as part of a targeted attack. Although the actor behind the account is not an employee, the use of legitimate credentials makes the activity appear to be an employee’s.
While malicious and compromised insiders are frightening, it’s worth noting that security professionals often identify employees as the weak link in the cyber security chain since they often don’t follow corporate guidelines regarding safe computer and application use.
With the lurking danger associated with careless users, it isn’t surprising then that in a survey of 310 IT security professionals at Infosecurity Europe more than half (59 percent) were deeply concerned not primarily about malicious users, but about careless users who unwittingly put their organization’s data at risk.
While there are specific strategies and tools to help manage and investigate insider threats, 14 percent of respondents revealed they don’t have a technology solution in place to detect insider threats.
And those that did use tools found them to be labor intensive. For example, fifty-five percent of respondents said that managing too many security alerts was the most time-consuming element of investigating insider threats.
And forty-four percent of security professionals admitted they do not have enough staff resources to analyze data permissions correctly.
While detecting insider threats can often include labor-intensive analysis of alerts and reports, this is just the type of work where machine learning solutions excel. Respondents recognized this: 65 percent estimated that machine learning-based solutions applied to identifying insider threats would free up more than 12 staff hours a week, which equates to more than 600 hours a year.

Employees, contractors, business associates or partners pose the biggest risk to enterprise data since they are by definition granted trusted access to sensitive data. To mitigate the risk, corporations should ask themselves where their sensitive data lies, and invest in solutions that directly monitor who accesses it and how.
Ransomware has loomed large in the news of late. It seems to be around every turn, and it’s not going anywhere. The untraceability of Bitcoin payments, coupled with new blackhat tools available to anyone at little (if any) cost, means extortion attempts will continue to grab headlines worldwide.
But is ransomware the only form of cybercrime extortion? The short answer is no. People commonly refer to any form of online extortion as ransomware, but it may have nothing to do with ransomware in the strictest sense of the word. Specifically, ransomware is a form of malware that encrypts files and decrypts them once a ransom is paid. But illicit demands for payment—by definition, a ransom—can be associated with other types of digital extortion requests.
This may seem like semantics. But it matters when it comes to mitigating extortionary attacks; just because a solution may detect ransomware doesn’t mean it protects against other extortionary attacks. And we expect extortionary attacks to increase. To a certain extent, the dark web is saturated with stolen PII for sale, which drives down cybercriminal profits. As that happens, many cybercriminals will likely add extortion-based attacks as they attempt to optimize their profits.
In this post, we clear the ransomware air. We’ll explain exactly what “traditional” ransomware is and how it works, and review other common digital ransom-related attack types that are often, mistakenly, labeled as ransomware.
The name ransomware is derived from ransom and software. It’s a type of malware attack in which the attacker locks and encrypts the victim’s data and then demands a payment to unlock and decrypt the data (see Figure 1). Ransomware attacks take advantage of human, system, network, and/or software vulnerabilities to infect a victim’s device—which can be a computer, printer, smartphone, wearable, point-of-sale (POS) terminal, or other endpoint. Ransomware can target either endpoints or file servers. It doesn’t need to be “local” to infect; ransomware that infects an endpoint can encrypt a remote file share without having to run locally on that remote server.
WannaCry (a.k.a., WCry or WanaCryptor) is one recent, highly-publicized ransomware example. It takes advantage of systems running older, unpatched versions of Microsoft Windows. A key difference is that, like a worm, this ransomware propagates itself to connected systems by way of a Server Message Block (SMB) protocol vulnerability.
There are several kinds of ransomware distribution techniques, but perhaps the most common is email. An attacker sends an email—ostensibly from a trusted source—that tricks the recipient into clicking a link which unleashes the payload. When the victim clicks the link, visits a web page, or installs a file, application, or program that includes the malicious code, the ransomware is covertly downloaded and installed.
Figure 1: Example of a ransomware ransom note demanding payment.
In particular, so-called email phishing attempts have become increasingly more sophisticated. TechTarget says such “messages usually appear to come from a large and well-known company or website with a broad membership base, such as Google or PayPal. In the case of spear phishing, however, the apparent source of the email is likely to be an individual within the recipient’s own company—generally someone in a position of authority—or from someone the target knows personally.”
DDoS Ransom Notes
A 2015 FBI public service announcement says it all: “The Internet Crime Complaint Center (IC3)… received an increasing number of complaints from businesses reporting extortion campaigns via e-mail… the victim business receives an e-mail threatening a distributed denial of service (DDoS) attack to its website unless it pays a ransom. Ransoms vary in price and are usually demanded in Bitcoin.”
An Imperva Incapsula survey found that “46% of DDoS victims received a ransom note from their attacker—often prior to the assault.” Figure 2 below shows a ransom note from a hacker group calling themselves the Armada Collective that was blackmailing hosting providers in Switzerland:
Figure 2: A ransom note from Armada Collective announcing an impending DDoS attack
Today DDoS-for-hire is readily available at very inexpensive prices, making it easy for anyone to launch an attack on the scale of those unleashed by the infamous Lizard Squad. In wrapping up 2016, the Imperva Global DDoS Threat Landscape report stated, “…the higher number of persistent events can be interpreted as a sign of professional offenders upping their game.” On the other hand, the preponderance of short attack bursts can be attributed to the growing popularity of cheap botnet-for-hire services preferred by non-professionals.
Another security survey revealed that “80 percent of IT security professionals believe that their organization will be threatened with a DDoS ransom attack in the next 12 months.”
Data Theft and Extortion
Dubbed extortionware (a.k.a., doxware), another common threat involves the theft of personal or sensitive data coupled with a threat to openly release it—perhaps to the internet at large—unless a ransom is paid. Author and enterprise threats expert Nick Lewis describes extortionware as “…when a cybercriminal threatens a person or organization with some sort of harm by exposing personal or sensitive information. For example, a criminal could compromise a database with sensitive data and then tell the enterprise [they] will post the sensitive data on the internet if [their] demands aren’t met.”
Another type of ransom-related attack is akin to the threat above, but in this case the enterprise doesn’t retain access to its data. A recent widely known example of this is when an entity calling itself The Dark Overlord, earlier connected to a health care breach, claimed to have stolen several new episodes of Netflix’s popular Orange Is the New Black show and demanded an unspecified ransom in exchange for their return.
Like a similar theft involving the BBC, Netflix confirmed that one of its production vendors—also used by other studios—had been breached. The Guardian suggested that, “Pirated copies of the show could dent Netflix’s subscriber growth and the company’s stock price.”
What You Can Do
For any of these threats, it’s back to basics: protect your systems and data. The ransom/ransomware trend is expected to continue as incentives increase and it becomes easier for cybercriminals to execute shakedowns armed with new ransomware-as-a-service (RaaS) tools, BYOD user vulnerabilities, improved encryption methods and untraceable Bitcoin payoffs.
A good defense begins with running regular backups and always using accounts having the fewest permissions. The ability to dynamically assign and, more importantly, retract user permissions through machine learning and granular data inspection is a solid best practice.
Ideally, you want to immediately detect ransomware behaviors and quarantine impacted users before ransomware can spread to network file servers. One approach is deception-based ransomware detection, which consists of using strategically planted, hidden (decoy) files to identify ransomware at the earliest stage of the attack. The decoy files are planted at carefully planned file system locations in order to identify ransomware encryption behaviors before they can touch legitimate files. Having monitoring and blocking measures in place—in addition to admin alerts and granular activity logging—would also help minimize the disruption to your core business processes were a ransomware attack to occur.
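As a rough illustration of the decoy-file idea (a simplified sketch, not Imperva's implementation; the decoy file name and hash-polling approach are invented for the example), a monitor can plant decoy files, baseline their content hashes, and flag any change as possible encryption activity:

```python
import hashlib
import os

class DecoyMonitor:
    """Plants decoy files and detects when any are modified -- a sign
    that an encryption process may be sweeping the file system."""

    def __init__(self, directories):
        self.baseline = {}  # decoy path -> content hash
        for d in directories:
            # Hypothetical decoy name, chosen to look attractive to ransomware
            path = os.path.join(d, "~finance_backup.xlsx")
            with open(path, "wb") as f:
                f.write(b"decoy content - never touched by legitimate users")
            self.baseline[path] = self._digest(path)

    @staticmethod
    def _digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def check(self):
        """Return decoys whose content changed or vanished (possible ransomware)."""
        tampered = []
        for path, original_hash in self.baseline.items():
            if not os.path.exists(path) or self._digest(path) != original_hash:
                tampered.append(path)
        return tampered
```

In practice a product would watch for changes in real time and correlate which account touched the decoy, so that the user session can be quarantined before file servers are reached.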
When it comes to preventing DDoS attacks, organizations can also invest in always-on DDoS protection that automatically detects and mitigates attacks targeting websites and web applications, as well as protects against DDoS attacks that directly target your network infrastructure.
Along with these measures, other basic defenses such as business continuity and disaster recovery planning should be part of any comprehensive information security program.
Detecting and containing insider threats requires an expert understanding of both users and how they use and access enterprise data.
In our first Whiteboard Wednesday, Drew Schuil, Vice President of Global Product Strategy at Imperva, talks about the challenges of insider threat detection and approaches to protect sensitive data and large repositories of data from careless, compromised, and malicious users.
Welcome to Whiteboard Wednesday. My name is Drew Schuil, Vice President of Global Product Strategy with Imperva. Today’s topic is Challenges of Detecting Insider Threats, particularly when we’re talking about sensitive data and large repositories of data, like databases, big data systems, and file repositories.
Insider Threat Profiles
We’re going to start off with insider threat profiles. I’ve got three already [on the board], compromised, careless, and malicious.
Let’s look at compromised. This is where most of the security industry is focused. When we think about compromised users, think about users that have clicked on a phishing link, users that have gotten their endpoint infected somehow through malware, and now the attacker is inside the network. They’re perhaps moving laterally through the organization, doing reconnaissance, trying to find where sensitive data is and compromise additional credentials to get access to it. If you look at the security solutions that organizations are implementing today—endpoint security, sandboxing, anti-phishing—a lot of the security solutions are really designed to look for this use case and try to stop it as soon as possible, to quarantine the compromised user.
Two more, and I should say overlooked, user profiles are careless or negligent users. Think of a DBA who’s got legitimate access to the network but is using shortcuts to get a job done. Maybe they don’t want to go through the change control process and they’re using an application service account to connect to the database instead of their named account; now they’re basically eliminating any visibility into who that user is by borrowing another account. A lot of times organizations are basically blind to this type of behavior because the DBA has access to everything. It’s an area where security doesn’t necessarily understand what’s going on or what should be going on, and in general no alarm bells are going to go off from compromise detection when it’s careless behavior.
Similarly, for malicious users, these are users that have legitimate credentials, they’re able to log in to do their job, but maybe they’re being extorted. Maybe they’re taking information to their next job. Ponemon reports 69% of exiting employees admit to taking data with them. It’s not necessarily someone that’s an Edward Snowden, but maybe someone who’s just taking data with them to their next job because they think they’re entitled to it.
When we look at these last two categories [careless and malicious], I think this is an area for improvement within the security industry, and something that’s going to require looking at new technology and new approaches to solve all three use cases, not just the compromised threat profile.
Why is Detection so Difficult?
So why is detection so difficult…why haven’t we solved this problem? Why do we continue to see these very, very large breaches…60, 80, 100 million records at a time? That’s coming from a database, by the way, not from a spreadsheet on someone’s laptop that got left at an airport. That’s coming from a huge data repository within the enterprise.
Part of the problem is these users have legitimate access. They’re on the network, they work there. When we look at this, it’s not necessarily about IAM [identity and access management], it’s not about access control. Really what it’s about is post-login detection. I need to see what the user did after they logged in, and whether that behavior is normal or not. That’s one of the biggest challenges, understanding good versus bad behavior. We’re looking at millions and millions of transactions against a database or a big data environment. How do you determine the good versus bad?
So what are some of the approaches people are taking? Today, in some cases they’re sending the information to the SOC. Maybe through raw logs that they’re writing correlation rules against. Maybe they have other security layers within the environment that they’re trying to piece together to understand this picture. But in most cases they don’t have a very good picture of this post-login behavior to be able to understand good versus bad, and the result is alert overload. In the case of [the Target breach], they had information sent to their SIEM, within the SOC environment, but they weren’t able to find it. They weren’t able to get to the actionable data.
The last problem, and I think this is one of the biggest ones, is these large enterprises have dozens or hundreds or thousands of applications within the environment, and they’re all serving different business units and business requirements. You’ve got one team, the security team, that’s responsible for deciphering all this good versus bad across the users, applications and others within the organization accessing data. This lack of context is really something that is not going to be solved through predefined static policies, or through just communicating with the business units. You really have to have something more advanced to be able to understand the good versus bad, to be able to sort through the alerts, and to provide some context to that team so they can actually go quarantine, follow up, and deal with an insider threat once it’s detected.
Identifying Breaches Requires Understanding of Users and Data
We talked about user profiles, let’s next talk about data. And I think what this comes down to is when we’re looking at the challenges of detecting insider threats, it’s really at the intersection of users and data. This is essentially where the data breaches are happening when we talk about insider threats. The Verizon Data Breach Investigations Report that comes out every year has indicated that in a lot of the cases where we see very, very large amounts of data through the forensics and the analysis, it was an insider, someone already within the organization—again, a compromised, careless or malicious user.
Data and User Attributes
When we start talking about big data, databases, file systems, especially databases, there’s a lot more here than just the IP address and the user name. We want to understand more about that user, where they came from, what type of application they were using, which department they’re part of, really the context of that user as they’re interacting with data. When we start getting into some of these other things—the database table, the schema, the SQL operation being performed against that database, for example—we start to get further and further away from the comfort zone of the security team that’s responsible for protecting this data.
Again, this is where we have that issue with context. It’s not only that we have hundreds or thousands of applications, but now we’ve got sort of a different language that’s not very familiar or comfortable for this security team. As we start looking at this amount of data, the key thing is really the type of data: the deep understanding we need in order to fill in and address some of these challenges we talked about earlier.
Machine Learning is Not Magic
What is everyone doing within the industry? If you’ve been to the RSA Conference, if you’ve been to any security show recently or talked to a vendor, chances are they’re telling you about machine learning, and how machine learning or user behavioral analytics (UBA) is going to help solve this problem. In some cases, in very narrow-focused use cases, it’s doing a great job and is really bringing security to the next level. But the key thing to note is that machine learning is not magic. There’s no magic potion where we can just apply machine learning or artificial intelligence against the data set and expect to get good results. You’ve got to start with a very laser-like focus of which problems you’re trying to solve. That really leads us to the next section here, which is key indicators of abuse.
Key Indicators of Abuse
One of the things that we’ve done here at Imperva is really taken a laser-like approach to identify things like service account abuse and machine takeover, excessive database or file access, and done that in the context of a deep understanding of how users interact with data. This is really the key to getting some value out of machine learning, but also solving this insider threat profile problem, having a deep understanding of this intersection between users and data.
Insider Threats: Factor in the Unknown
We’ve talked about machine learning and user behavioral analytics, and I briefly mentioned predefined static policies in an earlier part of the talk. The challenge with this approach, even if I’ve got a very granular ability to set policies to do real-time alerting and blocking, is factoring in the unknown. If we go back and look at the previous approach to insider threats, it was mainly about compliance. For example, PCI compliance had a very tight, narrow scope in what it was looking for, and in fact the environment was usually very controlled and often set aside from the rest of the environment. As we start to look at insider threats across the broader environment and the broader data set, think of Europe’s GDPR, where we’re looking for personally identifiable information (PII) that could be all over the enterprise. The problem is now much more challenging. We have to think about, and anticipate, every single mutation of a policy and every variation of how that policy would need to be created. The challenge here is now you’re creating hundreds or thousands of policies, you’re having to maintain those, and underneath it the application environment is changing constantly. It becomes an operational challenge for an organization to sustain.
Static Policies Don’t Scale
The other issue is a lot of times the insider threats are unanticipated. We’re not thinking about all the potential variables of a policy that would need to be created in order to find them. When we look at the database example, and why static policies don’t scale, we have to understand who’s connecting to the database and how they connect. Are they using SQL*Plus, Toad, Aqua Data Studio, or some other type of tool to connect to the database? What data are they accessing? Have I done data classification before? Do I even know the context of that data to be able to write a policy against it? What do their peers do? Is this person doing something that no one else within the DBA group, the IT group, finance, or whatever that group is, is doing? Is that something we can use as part of the correlation? How much data do they normally query? Unless I have a baseline and a deep understanding of SQL in the first place (to understand what the amount of data is, what a query is, or how many rows are coming back from a database), this can be difficult to quantify.
When do they normally work? That seems like a pretty basic one, but if I look at what I need to do to detect insider threats, I need to be able to correlate across these six examples [Who is connecting to the database? How do they connect? What data are they accessing? Do their peers access the same data? How much data do they query? When do they usually work?], as well as many others, and then all the possible mutations of those. It becomes a real challenge, and as we get into the next section, we’re going to be talking about very focused machine learning, so that I don’t have to worry about setting and maintaining hundreds or thousands of predefined static policies over time.
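As a toy illustration of one of those dimensions (how much data a user normally queries), a per-user baseline with a simple z-score test might look like the following; the threshold and minimum-history values are arbitrary assumptions, not product behavior:

```python
import statistics

class QueryVolumeBaseline:
    """Learns each user's typical rows-returned-per-query and flags
    queries far above that baseline (simple z-score heuristic)."""

    def __init__(self, threshold=3.0):
        self.history = {}       # user -> list of observed row counts
        self.threshold = threshold

    def observe(self, user, rows_returned):
        """Record one query's result size for this user."""
        self.history.setdefault(user, []).append(rows_returned)

    def is_anomalous(self, user, rows_returned):
        """True if the query returns far more rows than this user's norm."""
        past = self.history.get(user, [])
        if len(past) < 10:      # not enough data to judge yet
            return False
        mean = statistics.mean(past)
        stdev = statistics.pstdev(past) or 1.0
        return (rows_returned - mean) / stdev > self.threshold
```

A real system would correlate this with the other dimensions (connection tool, peer behavior, time of day) rather than alerting on volume alone, which is exactly why a single static policy falls short.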
Detect Insider Threats with Imperva CounterBreach
I talked about the intersection of users and data, and having a deep understanding of those users interacting with data to solve this problem. What CounterBreach does is essentially use machine learning to automate the understanding of all the different variables, both the user variables and the data variables, in such a way that we can make sense of all this and address the context issue, address the false positives, and eliminate the need to create static predefined policies.
One of the first things we do is identify user and connection types. What does that mean? In the database world, one of the biggest challenges our customers have is just differentiating application service accounts connecting to the database versus interactive users or privileged users, like DBAs, connecting to the database, because they perform different functions and have different responsibilities. Just understanding the different users is, in some organizations, a huge win. I worked with a large payment processing company that literally had a rat’s nest of legacy connections to the database. They didn’t know who was what, and just by going in and automatically differentiating based on behavioral statistics and algorithms…this connection is an application based on velocity, based on what it does, how it connects to the database…we can automatically detect and say, “Hey, this is a service account.” Based on the same differentiation, we can also say this is a DBA that’s connecting to the database.
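A simplified sketch of that velocity-based differentiation (the cutoffs are invented for illustration and this is not Imperva's algorithm): service accounts tend to issue requests at high, steady rates, while interactive users are slower and burstier:

```python
import statistics

def classify_connection(request_timestamps, rate_cutoff=1.0, jitter_cutoff=0.5):
    """Heuristic account classification from request timing.

    request_timestamps: sorted epoch seconds of requests from one account.
    Service accounts: small, regular gaps. Interactive users: large,
    irregular gaps. Cutoffs are illustrative assumptions.
    """
    if len(request_timestamps) < 3:
        return "unknown"
    gaps = [b - a for a, b in zip(request_timestamps, request_timestamps[1:])]
    mean_gap = statistics.mean(gaps)
    # Relative variability of the gaps (coefficient of variation)
    jitter = statistics.pstdev(gaps) / mean_gap if mean_gap else 0.0
    if mean_gap <= rate_cutoff and jitter <= jitter_cutoff:
        return "service account"
    return "interactive user"
```

A production system would combine timing with many other behavioral signals (SQL operations used, tables touched, client tool), but even this crude split shows how behavior alone can separate the two populations.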
Once we’ve understood the connection types, and oftentimes that alone is a huge win for the organization, the second thing we want to understand is the typical purpose of the account in terms of how it accesses data. Say we see an application account, we’ve profiled it, and we understand it’s an application account; we’re going to see it access sensitive data. Typically it’s acting on behalf of users, let’s say on a healthcare portal, who are interacting with the application and updating sensitive PII. We’re going to see certain database operations, certain SQL calls, against certain tables within the database, and we’ll be able to classify that at that granular level. When we talk about a deep understanding of data, that’s what I’m talking about: not just the connection but the operations, basically the SQL operations being performed against which data. This is dynamic data classification.
By the same token, we see DBAs also interacting with the database. They need to maintain, again, performance, uptime and availability, and what we typically see is they’re accessing metadata. I shouldn’t see the same operations against the same tables that the application performs coming from a DBA user. One of the common things that we see with CounterBreach is a DBA who normally accesses certain data, observed over a period of time in which we’ve built that good behavior profile, all of a sudden accessing sensitive data, because we’ve profiled that this is sensitive data from the application.
In a previous post, we shared three primary reasons why the traditional, static approach to file security no longer works for today’s modern enterprises. Working groups are formed organically and are cross-functional by nature, making a black and white approach to file access control outdated—it can’t keep pace with a constantly changing environment and creates security gaps. Files can be lost, stolen or misused by malicious, careless, or compromised users.
We also introduced a new file security approach—one that leverages machine learning to build dynamic peer groups within an organization based on how users actually access files. By automatically identifying groups based on behavior, file access permissions can be accurately defined for each user and dynamically removed based on changes in user interaction with enterprise files over time.
In this post, we’ll review the algorithms used to create dynamic peer groups that identify suspicious file access activity and help solve the traditional access control problem.
Building Dynamic Peer Groups to Detect Suspicious File Access
Several steps are required to dynamically place users in virtual peer groups according to how they access data (see Figure 1).
First, granular file access data is collected and processed. Next, a behavioral baseline is established that accounts for every file and folder accessed by each user. Based on how they access enterprise files, the dynamic peer group algorithm assigns users who may belong to different Active Directory (AD) groups into virtual peer groups. If the algorithm does not have enough information to associate a user with a specific peer group, the user is placed in a new peer group in which they are the sole member. Once virtual peer groups are established, access to resources by unrelated users can be flagged; this enables IT personnel to immediately follow up on such incidents.
Figure 1 – Overview of suspicious file access detection process
Granular data inputs
Algorithm input comes from Imperva SecureSphere audit logs. These contain access activity that provides full visibility regarding which files users access over time. Each event contains the following fields:
Date and time of the file request
Username used to identify requesting user
Department to which user belongs (as registered in Active Directory)
Domain in which the user is a member
IP that initiated the file request
IP to which the file request was sent
Path of requested file
Requested file name
Requested file extension
Requested file operation (e.g., create, delete)
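The event fields above can be sketched as a small Python record. This is only an illustrative shape for one audit event; the field names are ours, not SecureSphere's actual log schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class FileAccessEvent:
    """One audited file request (field names are illustrative,
    not SecureSphere's actual log schema)."""
    timestamp: datetime   # date and time of the file request
    username: str         # requesting user
    department: str       # user's department, as registered in Active Directory
    domain: str           # domain the user is a member of
    source_ip: str        # IP that initiated the file request
    dest_ip: str          # IP to which the file request was sent
    path: str             # path of the requested file
    filename: str         # requested file name
    extension: str        # requested file extension
    operation: str        # requested file operation, e.g. "create", "delete"

event = FileAccessEvent(datetime(2017, 6, 1, 9, 30), "jdoe", "Finance",
                        "CORP", "10.0.0.5", "10.0.0.20",
                        "/finance/q2", "budget.xlsx", "xlsx", "read")
```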
The behavioral models are rebuilt daily using a sliding window over the audit data. This lets the profile learn new behavioral patterns and age out old, irrelevant ones. Additionally, the audit files are periodically transferred to a behavior analytics engine, which improves the existing behavioral models and reports suspicious incidents.
The behavior analytics engine is divided into two components:
Learning process (profilers) – Initially run over a baseline period, profilers are algorithms that profile the objects and activity in the file ecosystem and relate them to normal user behavior. These objects include users, peer groups, and folders, as well as the correlations between them. Profilers run daily thereafter, both to enhance the profile as more data becomes available and to keep pace with environmental changes (e.g., when new users are introduced).
Detection (detectors) – Audit data is usually aggregated over a short period (less than one day) before being processed by the detector. Activated when new data is received, detectors pass file access data from the profiler through predefined rules to identify anomalies. They then classify suspicious requests, reporting each as an incident.
Create peer groups using machine learning algorithms
To build peer groups, data must first be cleansed of irrelevant information—including files accessed by automatic processes, those that are accessed by a single user, and popular files frequently opened by many users in the organization.
Now with clean data, Imperva builds a matrix of the different users (rows) and folders accessed over time (columns). Each entry contains the number of times a user has accessed a given folder in the input data time frame.
The matrix is very sparse because most users never access most folders; dimensionality reduction is therefore performed on the matrix to reduce both the sparsity and the noise in the data. What remains are the meaningful data access patterns that become the input to the clustering algorithm.
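The matrix construction and reduction steps can be sketched as follows. This is a toy example with a tiny dense matrix (real matrices are large and sparse), and the use of a truncated SVD is our assumption; the post only says dimensionality reduction is applied.

```python
import numpy as np

# Toy user x folder access-count matrix (rows: users, cols: folders).
# Each entry counts how often a user accessed a folder in the window.
users = ["alice", "bob", "carol", "dave"]
folders = ["/finance", "/hr", "/eng/src", "/eng/docs"]
counts = np.array([
    [12,  0,  0,  0],   # alice mostly accesses /finance
    [ 9,  1,  0,  0],   # bob looks like alice
    [ 0,  0, 20,  7],   # carol works in the engineering folders
    [ 0,  0, 15,  9],   # dave looks like carol
])

# Reduce dimensionality by keeping only the top-k SVD components;
# this suppresses sparsity and noise, leaving the dominant access
# patterns that feed the clustering step.
k = 2
U, s, Vt = np.linalg.svd(counts, full_matrices=False)
reduced = U[:, :k] * s[:k]   # one k-dimensional vector per user
print(reduced.shape)         # (4, 2)
```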
A density-based clustering algorithm then divides the users in the organization into homogeneous groups called clusters. Members of a given cluster have all accessed similar folders, with a typical cluster containing about four to nine users. The process also ensures that each user is assigned to exactly one cluster.
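The clustering step might look like the following. The post does not name the specific density-based algorithm, so DBSCAN is used here as a representative choice, with made-up user vectors and parameters.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Each row is one user's reduced access-pattern vector (e.g., the SVD
# output of the previous step); the values here are invented.
vectors = np.array([
    [5.0, 0.2], [4.8, 0.1], [5.1, 0.3],   # three users with similar access
    [0.1, 6.0], [0.2, 5.9], [0.0, 6.2],   # a second, distinct group
    [9.0, 9.0],                            # a user resembling no one else
])

# Density-based clustering groups users whose vectors lie close
# together; a user with no neighbors is labeled -1 (no peer group),
# mirroring the single-member peer group described in the post.
labels = DBSCAN(eps=1.0, min_samples=2).fit_predict(vectors)
print(labels)   # [ 0  0  0  1  1  1 -1]
```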
Define virtual permissions to enterprise files
The notions of “close” and “far” clusters are used to define each user’s virtual permissions model. For every cluster, the algorithm determines which peer groups are close and which are far based on its similarity to the other clusters. The distances are partitioned into two groups using a k-means algorithm; a smaller distance designates a closer cluster.
Each user is permitted access to folders accessed by others within their own cluster, or by users belonging to close clusters.
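The close/far split described above can be sketched with a tiny 2-means loop over the one-dimensional cluster distances. The distances and the simple iteration are illustrative assumptions; the post only says k-means partitions distances into two groups.

```python
import numpy as np

def split_close_far(distances, iters=20):
    """Partition 1-D cluster distances into 'close' and 'far' groups
    with a minimal 2-means loop (a sketch of the k-means step)."""
    d = np.asarray(distances, dtype=float)
    lo, hi = d.min(), d.max()                     # initial centers
    for _ in range(iters):
        close = d[np.abs(d - lo) <= np.abs(d - hi)]
        far = d[np.abs(d - lo) > np.abs(d - hi)]
        if close.size == 0 or far.size == 0:      # degenerate split
            break
        lo, hi = close.mean(), far.mean()         # update the two centers
    threshold = (lo + hi) / 2
    return d <= threshold                         # True -> close cluster

# Invented distances from one cluster to the other clusters in the org.
dists = [0.4, 0.5, 3.1, 2.8, 0.6]
print(split_close_far(dists))   # [ True  True False False  True]
```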
Detect suspicious file access
The detector aspect of the algorithm identifies suspicious folder access. Within a profiling period, for example, user John’s access to a given folder is considered suspicious if the folder is only accessed by users belonging to clusters far from his.
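The detection rule reduces to a small set-membership check. The function below is our sketch of that rule, using invented cluster identifiers: an access is flagged only when every cluster that touched the folder is far from the requesting user's cluster.

```python
def is_suspicious(user_cluster, folder_accessor_clusters, close_clusters):
    """Sketch of the detector rule: access is suspicious when no cluster
    that accessed the folder is the user's own or close to it."""
    allowed = close_clusters | {user_cluster}
    return not (folder_accessor_clusters & allowed)

# John belongs to cluster 0; clusters 1 and 4 are close to it.
print(is_suspicious(0, {3, 5}, {1, 4}))   # True: only far clusters accessed it
print(is_suspicious(0, {1, 3}, {1, 4}))   # False: a close cluster accessed it
```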
Imperva CounterBreach automatically determines the “true” peer groups in the organization and then detects access by unauthorized users.
Incident severity (high, medium, or low) is a function of the number of users and the number of clusters that accessed the folder during the learning period. The ratio of users to clusters drives severity: higher values (many users grouped into few clusters) indicate higher severity, while values close to 1 (roughly as many clusters as users) indicate lower confidence. Personal folders and files are given careful consideration when ranking severity.
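The severity heuristic can be sketched as a users-to-clusters ratio. The thresholds below are invented for illustration; the post describes only the direction of the relationship, not specific cutoffs.

```python
def severity(num_users, num_clusters):
    """Severity heuristic sketched from the users/clusters ratio;
    the 3.0 and 1.5 cutoffs are assumptions, not documented values."""
    ratio = num_users / num_clusters
    if ratio >= 3.0:
        return "high"    # many users concentrated in few clusters
    if ratio >= 1.5:
        return "medium"
    return "low"         # about as many clusters as users: low confidence

print(severity(12, 2))   # high
print(severity(5, 4))    # low
```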
Adding context to accessed files with dynamic labels
With the goal of providing sufficient context to security teams so they can understand and validate each incident, Imperva presents typical behavior of the user who performed the suspicious file access activity. In addition, a label is applied to each folder accessed during the incident; this helps SOC teams evaluate the content or relevance of the files in question.
In assigning a label to a folder, the algorithm assesses the users who accessed it during the profiling period, as well as those from their peer groups. It then looks for the group (or groups) in Active Directory (AD) that best fits this set of users. The fit has two aspects: precision, the fraction of users in the set who are also in the AD group; and recall, the fraction of the AD group’s members who are contained in the user set. The best-fitting AD group (or groups) becomes the folder label, for example Finance-Users, EnterpriseManagementTeam, or G&A-Administration. The label provides security teams with more context about the nature of the files pertaining to an incident.
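The label selection can be sketched by scoring each candidate AD group on precision and recall against the accessing users. Combining the two with an F1 score is our assumption; the post only says the best-fitting group is chosen. All names below are invented.

```python
def label_score(accessors, ad_group):
    """Score one candidate AD group against the set of users who
    accessed a folder; F1 is an assumed way to combine the two."""
    accessors, ad_group = set(accessors), set(ad_group)
    overlap = len(accessors & ad_group)
    precision = overlap / len(accessors)   # accessors covered by the group
    recall = overlap / len(ad_group)       # group members among accessors
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

accessors = {"alice", "bob", "carol"}
candidates = {
    "Finance-Users": {"alice", "bob", "carol", "dan"},
    "EnterpriseManagementTeam": {"alice", "eve", "mallory", "trent"},
}
best = max(candidates, key=lambda g: label_score(accessors, candidates[g]))
print(best)   # Finance-Users
```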
Up Next: Examples from Customer Data
To validate the algorithms explained above, several Imperva customers allowed us to leverage production data from their SecureSphere audit logs. Containing highly granular data access activity, the log data provided full visibility into which files users accessed over a given duration—we saw the algorithms identify some very interesting real-life file access examples.
In our next post in this series we’ll review those examples and demonstrate the effectiveness of this automated approach to file access security.
For additional information on detecting suspicious file access with dynamic peer groups read the full Imperva Hacker Intelligence Initiative (HII) report: Today’s File Security is So ‘80s.