We start from an environment of approximately 300 AWS accounts, each with more than one production region, several pre-existing environments, many service users and personal users accessing through the CLI and the web console, and multiple development teams applying DevOps methodologies and Infrastructure as Code in every account. On top of this, add the speed at which AWS releases new features and products.
From the point of view of a cloud security team (which is our case), it is a very difficult situation if we want visibility of all the infrastructure resources that the different teams are building and deploying. The task becomes even more complicated if we want to enforce controls tied to security policies that are intrinsic to the company's business or to a particular cloud model, or to see how the security risks associated with an account evolve. In short, it is not an easy task.
Next, we will explain how we achieved this milestone: monitoring all of this information with minimal latency and remarkable performance given the amount of data we handle. We have achieved great flexibility in building and managing these “custom” control panels, with the AWS resource data model as the only dependency.
As a result, we will show how we managed to build control panels that verify, for example, that instances are bastioned or that every VPC is correctly sending its traces to a specific security S3 bucket.
AWS provides a service called AWS Config, which records every configuration change in every resource. You only need a simple way to use this information to your advantage, and here is where AWS Athena enters the stage. Athena lets you query information from S3 objects using good old SQL, allowing you to run queries over the configuration information that AWS Config provides.
So you have the information and a way to query it; so far so good, isn't it? Wait a moment, little padawan, that's too easy. You must structure your information in a way that lets you query it with Athena, and here comes the third leg of this system: AWS Glue. It makes the information from AWS Config more digestible for Athena. With Glue, we store the config information in a proper format for Athena in terms of speed, space and cost-effectiveness. Later, we will talk about control panels with QuickSight.
How AWS Config provides the data
AWS Config gives us data for every change in the configuration, and fortunately it gives us the whole new configuration after the change, together with a list of related objects (for example, if an instance uses a new security group, there will be a relation between both resources, and Config will give us both updated resources).
AWS Config can deliver the data in four ways:
- Via API
- Sending messages to an SNS topic
- Sending to CloudWatch
- Storing it as JSON objects in an S3 bucket
For our approach, the last option is the most suitable; just remember that Athena reads data from S3.
Config puts S3 objects into the bucket in JSON format. Each S3 object contains a single JSON document, but that document can contain several configuration changes from different resources. This is not the best layout for Athena, but we can handle it in our ETL.
With this data we can track every change, but there is a big problem: in order to get all the resources deployed at a given time, we would have to compute it from the beginning of time (there may be an EC2 instance created 2 months ago and still running), and the old configuration is not always available. Fortunately, you can configure AWS Config to deliver a daily snapshot, so you can get all the resources at a given time without having to go back to the Pleistocene: you only have to apply all the changes since the snapshot closest to the time you want to track. But there is a drawback: the snapshot is delivered at an indeterminate time, and you can't request it at your preferred time. In our case the snapshot arrives at about 19:00 UTC every day, so until then our data for the current date is incomplete. We will solve this issue later.
In order to make data easily available in Athena, we transform the data from Config with Glue jobs. We also enrich it with data from DynamoDB, and we could use many other sources.
1. S-CMDB Copy Raw Job
So, at this point, AWS Config is configured to deliver every change and a daily snapshot into the centralized security logs account. In this account we store other relevant security logs, like VPC flow logs, ELB logs, and so on. Our solution must take this information from the security logs account and put it into our account for processing purposes. In our first approach we processed the data straight from the logging account, but Glue began to throw exceptions because the S3 bucket throttled our script's requests. That's why we developed this job.
This is a simple Python job: it just copies new data from the security logging account to our raw bucket. It gets a copy of every new Config JSON for the current date. This date is shifted back 4 hours to also copy the last JSONs of the previous date, because sometimes the data is delivered after the date change. So, at 3:00 UTC the job will still try to copy yesterday's files, but at 4:00 UTC it begins to copy today's files. It also adds a field containing the delivery date to any snapshot file.
To ensure that the data doesn't grow ad infinitum, we have enabled a lifecycle policy on the destination bucket. This job is triggered every hour in order to keep the CMDB up to date.
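The 4-hour shift is the only subtle part of the job. A minimal sketch of that logic, assuming Config's non-zero-padded `YYYY/M/D` key layout (the function name and constant are our own illustration, not the actual job):

```python
from datetime import datetime, timedelta

# Assumption: for the first 4 hours of a day we keep copying the
# previous day's prefix, since late deliveries may still arrive.
PADDING_HOURS = 4

def prefix_date(now: datetime) -> str:
    """Return the YYYY/M/D date prefix whose objects the copy job
    should fetch, shifting the clock back PADDING_HOURS."""
    padded = now - timedelta(hours=PADDING_HOURS)
    return f"{padded.year}/{padded.month}/{padded.day}"

# At 03:00 UTC the job still copies yesterday's files...
print(prefix_date(datetime(2019, 5, 16, 3, 0)))  # → 2019/5/15
# ...but at 04:00 UTC it switches to today's.
print(prefix_date(datetime(2019, 5, 16, 4, 0)))  # → 2019/5/16
```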
2. S-CMDB Ingest Job
This job is triggered once the data is in our account. It writes the JSON data from Config into Parquet files. This data format is more suitable for Athena and reduces the amount of data that needs to be read, thus reducing costs.
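Before writing Parquet, the job has to break each delivery file into one record per configuration item. A simplified sketch of that step in plain Python (the real job does this with Glue/Spark; the `configurationItems` field follows the Config delivery format):

```python
import json

def explode_delivery(s3_object_body: str) -> list:
    """Split one Config delivery file (a single JSON document) into
    one record per configuration item, which is the granularity the
    Athena table wants. The real job then writes these as Parquet."""
    delivery = json.loads(s3_object_body)
    return delivery.get("configurationItems", [])

# A minimal, made-up delivery with two configuration items:
body = json.dumps({
    "fileVersion": "1.0",
    "configurationItems": [
        {"resourceType": "AWS::EC2::Instance", "resourceId": "i-0abc"},
        {"resourceType": "AWS::EC2::SecurityGroup", "resourceId": "sg-0def"},
    ],
})
records = explode_delivery(body)
print(len(records))  # → 2
```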
3. S-CMDB Normalise Snapshots
This job generates a snapshot normalised at 00:00 UTC. It takes the Config snapshots and the changes for the previous day, collapses them, and keeps the newest change per resource. By doing this, we generate a new snapshot with the state of all the resources at the start of the next day. This makes it much easier for queries to track all the configurations at an exact moment.
The process is quite simple:
- Select the last configuration for every resource in that day
- If the configuration change is a deletion, we delete that resource from the snapshot
This job is triggered once a day and makes our data complete from the beginning of the day.
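The collapse can be sketched as folding the day's ordered changes over the previous snapshot (a simplified illustration in plain Python; the real job works on the Parquet data with Spark, and `ResourceDeleted` is the Config status for removed resources):

```python
def normalise_snapshot(snapshot: dict, changes: list) -> dict:
    """Collapse a snapshot (resourceId -> item) with the day's
    configuration items, sorted by capture time, into a new
    snapshot valid at 00:00 UTC."""
    result = dict(snapshot)
    for item in changes:
        rid = item["resourceId"]
        if item.get("configurationItemStatus") == "ResourceDeleted":
            result.pop(rid, None)   # deletions drop the resource
        else:
            result[rid] = item      # later changes win per resource
    return result

snapshot = {"i-0abc": {"resourceId": "i-0abc", "state": "running"}}
changes = [
    {"resourceId": "i-0def", "configurationItemStatus": "ResourceDiscovered"},
    {"resourceId": "i-0abc", "configurationItemStatus": "ResourceDeleted"},
]
print(sorted(normalise_snapshot(snapshot, changes)))  # → ['i-0def']
```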
Athena is the access layer to our information. Here you can take the data retrieved from the Config service and turn it into something useful.
1. Table definition
Athena needs a table definition that maps onto the S3 objects. We define the fields as strings and use JSON functions in SQL to access the data inside those strings: trying to model the JSON structure beyond the Config service's own envelope is impossible, since every resource type has its own configuration layout.
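A sketch of such a table definition; the database name, bucket and exact column set are illustrative, but the key point is that `configuration` and `relationships` stay plain strings:

```sql
-- Illustrative DDL: names and columns are our own sketch.
CREATE EXTERNAL TABLE IF NOT EXISTS scmdb.config_items (
  resourcetype                 string,
  resourceid                   string,
  configurationitemcapturetime string,
  configurationitemstatus      string,
  configuration                string,  -- raw JSON, queried with JSON functions
  relationships                string   -- raw JSON array
)
PARTITIONED BY (accountid string, dt string)
STORED AS PARQUET
LOCATION 's3://my-scmdb-processed-bucket/config/';
```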
The queries in Athena can be very simple if you only try to count the number of resources of some type in a day, for example.
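Assuming a table with `resourcetype`, `resourceid` and a `dt` date partition (names are our illustration, not the production schema), such a query could look like:

```sql
-- Count the distinct instances captured on a given day.
SELECT count(DISTINCT resourceid) AS instances
FROM scmdb.config_items
WHERE resourcetype = 'AWS::EC2::Instance'
  AND dt = '2019-05-15';
```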
This returns the number of instances on the 15th of May. Using the configuration to filter isn’t too hard:
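For example, reaching inside the `configuration` string with a JSON function to keep only running instances (table and field names are illustrative):

```sql
SELECT resourceid
FROM scmdb.config_items
WHERE resourcetype = 'AWS::EC2::Instance'
  AND dt = '2019-05-15'
  AND json_extract_scalar(configuration, '$.state.name') = 'running';
```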
Deeper queries, however, are a nesting hell: every nested array in the JSON must be unnested and cross joined with its parent until you reach the data you want.
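As an illustration (ours, not an actual dashboard query), finding the instances that use a particular security group already requires unnesting the `securityGroups` array:

```sql
SELECT c.resourceid
FROM scmdb.config_items AS c
CROSS JOIN UNNEST(
  CAST(json_extract(c.configuration, '$.securityGroups') AS array(json))
) AS t (sg)
WHERE c.resourcetype = 'AWS::EC2::Instance'
  AND json_extract_scalar(sg, '$.groupId') = 'sg-0123456789abcdef0';
```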
Athena queries are very powerful, but can get very complex when you try to use nested data in JSON fields:
- You should first know the JSON structure before querying by data inside the JSON.
- Querying by nested data inside arrays is hard; avoid checking all the array data if you can. If you can get what you want by checking only the first item of the array, use JSONPath to access the first element: it won't fail on empty arrays.
- Use it at your own risk.
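The first-item trick from the second point looks like this (illustrative names; `json_extract_scalar` simply returns NULL when the array is empty):

```sql
SELECT resourceid,
       json_extract_scalar(configuration,
                           '$.blockDeviceMappings[0].deviceName') AS first_device
FROM scmdb.config_items
WHERE resourcetype = 'AWS::EC2::Instance';
```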
QuickSight: adding visualisation capabilities to the CMDB
With all this data we must have a quick way to visualise how secure our infrastructure is. We have chosen QuickSight because it’s a managed service and it’s well integrated with Athena.
We use QuickSight to show some security insights at a glance. For example, imagine that you want to manage different configurations of some resources and verify that this configuration stays correct over time. As the Cloud Security Team, we are going to show a few illustrative examples:
1. Endpoint Exposure Security Risk
In this case, we produce intelligence about resources that could involve a security risk related to information leakage, access control, or simply a service that is not being monitored by a structural service publication layer.
2. Security Infrastructure Control
We manage different resources that we want to keep secure: for example, the percentage of hardened AMIs previously shared with the projects, where we need to track who is using this regulatory control, or whether the VPCs are logging properly.
3. Secure Regions allowed
We can get a fast overview of our deployments across the allowed regions.
QuickSight: the good, the bad and the uglies
In QuickSight you can work in two ways: querying the Athena database directly, or importing data into QuickSight and querying it with its own engine (SPICE). Athena is slooow, very slow, and thus it's not a choice. That's bad. On the other hand, SPICE is very fast while graphing. And that's good. But SPICE needs to load the data, which is not so fast and freezes your visualisations while importing, so you can't update the data too often. It's also quite expensive when you load large datasets. And that's ugly.
But the really ugly thing about QuickSight is that you can't version your developments. If you load the data and add some calculated fields, those calculations live only inside QuickSight. If you accidentally delete the analysis, or you change the way you calculate a field, you can't go back to the previous calculation.
To avoid these problems we use Athena views, reducing the amount of data loaded into SPICE. This reduces costs and, most importantly, the time required to update the dataset: we're talking about less than a minute against more than 40 minutes of downtime. Now we can update during office hours, which makes the data more responsive to actual changes and far more convenient. We can also enrich the data with data stored in other Athena tables using plain old SQL, well known to all our coworkers.
Athena views make development easier, more powerful and versionable. They also reduce the amount of data, so they are faster and cheaper. This is the way to go.
We have identified two main types of views:
- Queries against current resources
- Queries that show evolution along time
Getting the current resources of a given type can be hard, so we have created a view that returns all the resources at the current time. We use this view as the base for any other view that needs the current state. This simplifies every derived query, and if the base query has to change, the change automatically propagates to all the derived views.
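A sketch of what such a base view can look like (names and columns are illustrative): keep the newest configuration item per resource and drop resources whose last known state is a deletion.

```sql
CREATE OR REPLACE VIEW scmdb.current_resources AS
WITH ranked AS (
  SELECT *,
         row_number() OVER (
           PARTITION BY resourceid
           ORDER BY configurationitemcapturetime DESC
         ) AS rn
  FROM scmdb.config_items
)
SELECT resourcetype, resourceid, accountid, configuration
FROM ranked
WHERE rn = 1
  AND configurationitemstatus <> 'ResourceDeleted';
```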
For any query along time, try not to aggregate the data in the view. For example, say you want to count how many instances were deployed per date. If you do the count in the view, you can't filter by account; so add every field you may want to filter by, and that's it. Now you want to group by month, and here come the problems. Let's say there are 3 instances on day 1 and 3 on day 2, two of them the same as on day 1. If your data is already grouped by date and you then group it by month, 3 from day 1 plus 3 from day 2 makes 6 instances in the month, but there are actually only 4 distinct instances. If you don't group the resources in the view, you can count distinct instances in QuickSight and it will give you the correct number.
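In other words, the view should stay at row granularity and leave the aggregation to QuickSight. An illustrative sketch:

```sql
-- One row per instance per day; no counts in the view itself.
CREATE OR REPLACE VIEW scmdb.instances_by_date AS
SELECT dt, accountid, resourceid
FROM scmdb.config_items
WHERE resourcetype = 'AWS::EC2::Instance';
-- QuickSight can now do count(distinct resourceid) at day or month
-- level and still report 4 distinct instances in the example above.
```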
Use views to enrich the dataset: just left join the tables you need to the main table and select the desired fields. Don't do the enrichment by importing extra datasets into QuickSight; although it technically works, you can't version it and it's less maintainable.
Using Athena views in QuickSight
QuickSight doesn't discriminate between tables and views, so it's as easy as creating a view and importing it into a dataset. From there you can use it as usual.
We have limited ourselves in the way we work with views:
- They must only select data that would be useful in a visualisation. If it’s not used, it’s not selected.
- If the view represents a time series, it must be limited to a time window. For convenience we have chosen 60 days in order to limit the dataset size. QuickSight can show up to 100 items on the X-axis, but by then the labels become unreadable.
- Calculate all fields in the view. It’s more powerful and you can version it in git.
- Don’t aggregate in the view. This may sound against the previous statement, but if you aggregate in the view you lose drill-down capabilities.
- You can add more fields to a view at any time, but you must edit the dataset in QuickSight in order to update the dataset schema. Until you edit the dataset, the new fields won't be available.
The life cycle of data
Now we’ll explain the life cycle of data in the CMDB from the change in a resource to the visualisation at QuickSight.
- Everything begins with a user changing resources in an AWS account, for example creating an instance. The Config service discovers the new instance, grabs its configuration and sends it through the delivery channel. We have configured the channel to deliver to an S3 bucket in the logging account, so Config puts the resource configuration into a new S3 object. Every change in any resource is stored in the S3 bucket in JSON format.
- Once a day, every region in every account dumps a snapshot into the same bucket in the logging account.
- Every hour the CloudSec-CMDB-Copy-new-raw-job is triggered. It copies files from the logging account to the security account raw bucket.
- When the copy job finishes, the ingestion job begins. It only takes new objects as input because we have enabled the Job Bookmark for it. It puts the processed configurations into the processed bucket in the security account, partitioned by date and account.
- The ingestion job also adds the new partitions to the Athena table, and now the data is available for analysis. Any authorised user can query the CMDB using SQL.
- QuickSight datasets are updated automatically every hour, so the dashboards always have the latest changes available.
In the end, we have a CMDB with some interesting features:
- Complete: It contains every change made in the monitored resources. It doesn’t miss a change between updates.
- Up-to-date: It refreshes data hourly, and you can always trigger an on-demand update.
- Integrated into our account provisioning pipeline: You only need to configure AWS Config when an account is created. This is configured in the organisation account bootstrapping.
- Easily exploitable: You can query with good old SQL. Or you can go deeper with a Spark analysis.
- Visual: You can use your controls and visualise them in QuickSight. Just create a view and import the data.
- Fast security insights: You can quickly get a security overview through visualisation and deep analysis.
- Versioned: Everything can be built as code, so you have a versioned security dataset and a normal SDLC for your security visualisations.