Writing a Data Provider for MSTICPy

MSTIC
Mar 15, 2022

In our recent hack month, the most popular type of contribution was a data provider. We’ve never really documented how to create one of these, so we thought that it would be a good idea to do that now and (hopefully) avoid too much future head-scratching and poring over existing code.

Query Provider query browser

What is a MSTICPy Data Provider?

A data provider lets you query data from a notebook in a standardized way.

Before reading further, you should familiarize yourself with how the data providers work by reading the Querying and Importing Data section of the MSTICPy documentation.

The term provider describes a concept rather than a single piece of code. A provider has several components, the most important of which is the driver. The driver class encapsulates the following functionality:

  • Authentication to the service (usually from configuration values but can also support specifying parameters such as usernames and passwords at run time).
  • Querying data from the service — queries can be either ad hoc queries as strings or templated queries allowing substitutable parameters for common items such as time range, account and host names, etc.
  • Returning the data as a pandas DataFrame. The driver is responsible for converting data types if needed. This is particularly important for datetime data that is usually returned as a string. Most MSTICPy functionality expects datetime in a timezone-aware pandas Timestamp.

Implementing a data provider

To implement a data provider, you need to do the following:

  1. Write the driver
  2. Customize the driver (optional)
  3. Register the driver
  4. Add queries
  5. Add a settings definition
  6. Create documentation
  7. Create unit tests

1. Write the driver class

The driver class must be derived from DriverBase (see DriverBase source). You should implement the following methods:

  • __init__
  • connect
  • query
  • query_with_results (optional)

Also see the “2. Customize the driver” section later.

__init__

This initializes your driver with anything it needs to load. It should call super().__init__(**kwargs). Keyword arguments are passed from the QueryProvider class when it is initialized with your provider name. These kwargs will always include data_environment — the name of your provider (see DataEnvironment) and may include the bool debug, which you can use to output optional debug information. Any other kwargs from QueryProvider are passed to your driver class.

At minimum you should set the instance attribute self._loaded to True when your driver __init__ completes successfully.
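As a minimal sketch of the __init__ pattern described above: the StubDriverBase class below is a stand-in (an assumption, not MSTICPy code) so that the example runs without MSTICPy installed. In a real driver you would derive from DriverBase instead. The NewDriver name and the service attribute are also hypothetical.

```python
class StubDriverBase:
    """Stand-in for MSTICPy's DriverBase so this sketch runs on its own."""

    def __init__(self, **kwargs):
        self._kwargs = kwargs
        self._loaded = False
        self._connected = False


class NewDriver(StubDriverBase):
    """Hypothetical driver for an imaginary data service."""

    def __init__(self, **kwargs):
        """Initialize new instance."""
        super().__init__(**kwargs)
        # "debug" is an optional bool passed through from QueryProvider.
        self._debug = kwargs.get("debug", False)
        # Load anything the driver needs here (client objects, defaults, ...).
        self.service = None
        # Signal that initialization completed successfully.
        self._loaded = True


driver = NewDriver(data_environment="NewProvider", debug=True)
print(driver._loaded, driver._debug)
```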

connect

This method is called from QueryProvider.connect() and is used to authenticate to the data service. It takes an optional connect_str parameter and a kwargs keyword argument dictionary.

Any per-connection configuration settings can be read in (from msticpyconfig.yaml) here using the DriverBase._get_config_settings(ProviderName) method. This returns the args section of your configuration settings from msticpyconfig.yaml.

Some drivers use an API key to authenticate, some use a name and password, and others use Azure Active Directory (AAD). See KqlDriver source for an example of the latter.

On successful authentication, set self._connected to True. On failure, you can raise a MsticpyConnectionError that tells the user why the connection failed. See SplunkDriver source for an example.
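The connect flow could look roughly like the sketch below. It is self-contained, so several pieces are stand-ins and should be read as assumptions: StubConnectionError takes the place of MsticpyConnectionError, the settings dictionary takes the place of the values read from msticpyconfig.yaml, and the ApiKey setting and api_key keyword are hypothetical names.

```python
from typing import Any, Dict, Optional


class StubConnectionError(Exception):
    """Stand-in for MsticpyConnectionError."""


class NewDriver:
    """Hypothetical driver; a real one would derive from DriverBase."""

    def __init__(self, settings: Optional[Dict[str, Any]] = None):
        # Stand-in for the Args settings read from msticpyconfig.yaml.
        self._settings = settings or {}
        self._connected = False
        self._api_key: Optional[str] = None

    def connect(self, connect_str: Optional[str] = None, **kwargs):
        """Authenticate, preferring run-time kwargs over configured settings."""
        api_key = kwargs.get("api_key") or self._settings.get("ApiKey")
        if not api_key:
            raise StubConnectionError(
                "No API key supplied.",
                "Pass api_key=... to connect() or add ApiKey to the settings.",
            )
        # A real driver would call the service's login API at this point.
        self._api_key = api_key
        self._connected = True


driver = NewDriver(settings={"ApiKey": "example-key"})
driver.connect()
print(driver._connected)

# A driver with no credentials reports a helpful error instead of connecting.
failure_message = None
try:
    NewDriver().connect()
except StubConnectionError as err:
    failure_message = err.args[0]
print(failure_message)
```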

query

This takes the following parameters:

  • query - string of query text
  • query_source - this is populated if the query is a MSTICPy template query read from a query yaml file (see Creating new queries) and is an instance of QuerySource. This is a representation of the yaml query with extracted parameters and metadata available as explicit attributes
  • kwargs - any other keyword arguments passed when running the query that are not consumed as query parameters, etc.

This method should submit the query to the service and handle the returned data. The data should be returned as a pandas DataFrame.

Note: you should convert data types to their expected format. For example, dates and numeric values are often returned as strings. It is particularly important to convert datetime values. MSTICPy expects datetimes to be in pandas Timestamp format and timezone-aware (usually UTC but this is not mandatory).

In case of a query failure, it can return the failure response in place of a DataFrame.
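To illustrate the type conversion mentioned in the note, here is a small sketch using pandas. The rows and column names are invented for the example and do not come from any particular service.

```python
import pandas as pd

# Hypothetical raw rows, as a service might return them: every value is a string.
raw_rows = [
    {"TimeGenerated": "2022-03-15T10:30:00Z", "Computer": "host1", "EventCount": "3"},
    {"TimeGenerated": "2022-03-15T11:45:10Z", "Computer": "host2", "EventCount": "12"},
]

df = pd.DataFrame(raw_rows)
# utc=True produces timezone-aware (UTC) pandas Timestamps.
df["TimeGenerated"] = pd.to_datetime(df["TimeGenerated"], utc=True)
# Convert numeric strings to a numeric dtype.
df["EventCount"] = pd.to_numeric(df["EventCount"])

print(df.dtypes)
```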

query_with_results

Implementing this method is optional. Use it if you need to return the raw response as well as the data in DataFrame format. However, this method isn’t exposed in the data provider framework, so it is mostly useful for experimentation and debugging. The query method can call this method to avoid duplicating code.
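One way to arrange the two methods so that query reuses query_with_results is sketched below. The _submit helper and its canned response are hypothetical stand-ins for a real service call, and the class does not derive from DriverBase so that the example runs on its own.

```python
from typing import Any, Dict, Tuple

import pandas as pd


class NewDriver:
    """Hypothetical driver; a real one would derive from DriverBase."""

    def _submit(self, query: str) -> Dict[str, Any]:
        """Stand-in for the call to the data service (returns a canned response)."""
        return {
            "status": "ok",
            "query": query,
            "rows": [{"Computer": "host1"}, {"Computer": "host2"}],
        }

    def query_with_results(self, query: str, **kwargs) -> Tuple[pd.DataFrame, Any]:
        """Return the results as a DataFrame together with the raw response."""
        response = self._submit(query)
        return pd.DataFrame(response["rows"]), response

    def query(self, query: str, query_source: Any = None, **kwargs) -> pd.DataFrame:
        """Return only the DataFrame, reusing query_with_results."""
        data, _ = self.query_with_results(query, **kwargs)
        return data


driver = NewDriver()
data_only = driver.query("search *")
data, raw = driver.query_with_results("search *")
print(data_only)
print(raw["status"])
```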

2. Customize the driver

This section is optional but is needed for many providers.

Exposing attributes via the QueryProvider

QueryProvider is a facade class for the driver classes. The user interacts with the former but not directly with the latter.

If you want to expose an attribute from the driver class as an attribute of QueryProvider you can do the following:

  • implement the attribute in the driver (this can be a method or other type)
  • set self.public_attribs (of your driver class) to a Python dictionary of {name: value} where name is the public name of the attribute (i.e. what you will see as an attribute of QueryProvider) and value is the value of the attribute that you want to expose (e.g. a method or attribute of the driver). This is shown in the example below.

    def __init__(self, **kwargs):
        """Initialize new instance."""
        ...
        self.public_attribs = {
            "client": self.service,
            "saved_searches": self._saved_searches,
            "fired_alerts": self._fired_alerts,
        }
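
To show the idea behind public_attribs, the sketch below uses a stand-in facade class that copies each entry of the dictionary onto itself. StubQueryProvider is an illustration only and is not MSTICPy's QueryProvider; the driver and its values are also hypothetical.

```python
class NewDriver:
    """Hypothetical driver exposing two attributes publicly."""

    def __init__(self, **kwargs):
        self.service = "service-client-object"
        self.public_attribs = {
            "client": self.service,
            "saved_searches": self._saved_searches,
        }

    def _saved_searches(self):
        return ["search_1", "search_2"]


class StubQueryProvider:
    """Stand-in facade, illustrating the idea only."""

    def __init__(self, driver):
        self._driver = driver
        # Surface each public attribute of the driver on the facade.
        for name, value in driver.public_attribs.items():
            setattr(self, name, value)


provider = StubQueryProvider(NewDriver())
print(provider.client)
print(provider.saved_searches())
```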

Custom parameter formatting

The formats used for dates and lists differ between query languages. The driver can implement custom formatters to render datetime or list parameters into the correct format before they are substituted into the query string.

Datetime formatter functions should take a Python datetime and return a string. List formatter functions should take a Python Iterable and return a string.

    # Parameter Formatting methods
    @staticmethod
    def _format_datetime(date_time: datetime) -> str:
        """Return datetime-formatted string."""
        return f'"{date_time.isoformat(sep=" ")}"'

    @staticmethod
    def _format_list(param_list: Iterable[Any]) -> str:
        """Return formatted list parameter."""
        fmt_list = [f'"{item}"' for item in param_list]
        return ",".join(fmt_list)

You must register these functions in the driver __init__ method as follows:

    def __init__(self, **kwargs):
        """Initialize new instance."""
        ...
        self.formatters = {
            Formatters.DATETIME: self._format_datetime,
            Formatters.LIST: self._format_list,
        }

See SplunkDriver source for an example.
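
To see what formatters like these produce, the following sketch runs the same logic as plain functions (outside a driver class) on sample values. The sample date and host names are made up for the example.

```python
from datetime import datetime
from typing import Any, Iterable


def format_datetime(date_time: datetime) -> str:
    """Return the datetime as a double-quoted string with a space separator."""
    return f'"{date_time.isoformat(sep=" ")}"'


def format_list(param_list: Iterable[Any]) -> str:
    """Return the items double-quoted and comma-separated."""
    fmt_list = [f'"{item}"' for item in param_list]
    return ",".join(fmt_list)


start = format_datetime(datetime(2022, 3, 15, 10, 30, 0))
hosts = format_list(["host1", "host2"])
print(start)  # "2022-03-15 10:30:00"
print(hosts)  # "host1","host2"
```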

Customizing the query parameter substitution

MSTICPy uses the Python str.format method to substitute named parameters. Here is an example query taken from a query yaml file:

    sources:
      list_files:
        description: Lists all file events by filename
        metadata:
        args:
          query: '
            {table}
            | where Timestamp >= datetime({start})
            | where Timestamp <= datetime({end})
            | where FileName has "{file_name}"
            {add_query_items}'

Each value surrounded by braces is considered to be a substitutable parameter name. If you need to include literal brace characters in the string, you can escape them using double-brace sequences: {{ and }}. These are converted to single braces by str.format().
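
The following short sketch shows both behaviors of str.format on a query string: named parameters are substituted, and doubled braces come out as single literal braces. The table name, parameter values, and the extend line are illustrative only.

```python
query_template = """{table}
| where Timestamp >= datetime({start})
| where FileName has "{file_name}"
| extend Info = dynamic({{"source": "example"}})"""

query = query_template.format(
    table="DeviceFileEvents",
    start="2022-03-15 00:00:00",
    file_name="notepad.exe",
)
print(query)
```

The last line of the printed query contains dynamic({"source": "example"}), with the doubled braces reduced to single ones.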

While this works well for most query languages, in some cases (like queries expressed as JSON strings), replacing all braces with escaped double braces is onerous. In this case, you can opt to do the parameter substitution yourself in the driver. To do this, implement a method that takes two parameters:

  • query - the raw query string from the yaml file
  • param_dict - a dictionary mapping parameter names to parameter values

The param_dict values will already have been formatted into a suitable string format using any methods you specified in Custom parameter formatting. Use your own logic to substitute the parameter values into the raw query string and return the query string. The query string will then be passed to your driver’s query method.

You need to register the parameter substitution function in your driver’s __init__ method as shown here.

    def __init__(self, **kwargs):
        """Initialize new instance."""
        ...
        self.formatters = {
            Formatters.PARAM_HANDLER: self._custom_param_handler,
            Formatters.DATETIME: self._format_datetime,
            Formatters.LIST: self._format_list,
        }
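
As a rough sketch of what such a handler might do for a JSON-style query, the function below replaces only the {name} tokens that match known parameter names and leaves every other brace alone. It is written as a plain function so it runs on its own; in a driver it would be a method such as _custom_param_handler. The query text and parameter names are hypothetical.

```python
import json
from typing import Any, Dict


def custom_param_handler(query: str, param_dict: Dict[str, Any]) -> str:
    """Replace known {name} tokens, leaving all other braces untouched."""
    for name, value in param_dict.items():
        query = query.replace("{" + name + "}", str(value))
    return query


# A JSON query where structural braces do not need to be doubled.
raw_query = '{"index": "{index_name}", "filter": {"range": {"gte": "{start}"}}}'
params = {"index_name": "logs", "start": "2022-03-15T00:00:00"}

final_query = custom_param_handler(raw_query, params)
print(final_query)
# The result is still valid JSON.
parsed = json.loads(final_query)
```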

3. Register the driver

There are two updates to classes that you need to make to register your driver.

Add the provider as a DataEnvironment

In the enum DataEnvironment (see DataEnvironments source) add an entry for your provider using the next available enum value.

    @export
    class DataEnvironment(Enum):
        """
        Enumeration of data environments.

        Used to identify which queries are relevant for which
        data sources.
        """

        Unknown = 0
        AzureSentinel = 1  # alias of LogAnalytics
        LogAnalytics = 1
        MSSentinel = 1
        Kusto = 2
        ...
        ResourceGraph = 9
        Sumologic = 10
        M365D = 11
        Cybereason = 12
        Elastic = 14
        NewProvider = 15  # << Your provider entry

You can also add aliases by re-using the same numeric value (see the MSSentinel, AzureSentinel and LogAnalytics entries).
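
The alias behavior comes from Python's standard Enum: names that share a value all refer to the same member, and the first name defined is the canonical one. The cut-down enum below is a stand-in for the real DataEnvironment, used only to demonstrate this.

```python
from enum import Enum


class DataEnvironment(Enum):
    """Cut-down stand-in for the real enum."""

    Unknown = 0
    AzureSentinel = 1
    LogAnalytics = 1
    MSSentinel = 1
    Kusto = 2


# All three names resolve to the same member.
print(DataEnvironment.MSSentinel is DataEnvironment.LogAnalytics)  # True
# The first name defined for a value is the canonical one.
print(DataEnvironment(1).name)  # AzureSentinel
# Iteration skips aliases.
print([env.name for env in DataEnvironment])  # ['Unknown', 'AzureSentinel', 'Kusto']
```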

Add an entry to the driver dynamic load table

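In the __init__.py module of the data drivers sub-package (see drivers sub-package __init__ source), add an entry for your provider to the _ENVIRONMENT_DRIVERS dictionary. Each entry maps a DataEnvironment value to a tuple of the driver module name and the driver class name.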

    _ENVIRONMENT_DRIVERS = {
        DataEnvironment.LogAnalytics: ("kql_driver", "KqlDriver"),
        ...
        DataEnvironment.Elastic: ("elastic_driver", "ElasticDriver"),
        DataEnvironment.NewProvider: ("new_driver", "NewDriverClass"),
    }
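
A table of (module name, class name) pairs like this lends itself to loading classes on demand. The sketch below shows the general technique with importlib; it is not MSTICPy's actual loader, and it uses standard-library modules as stand-in "drivers" so that it runs without MSTICPy installed.

```python
import importlib

# Stand-in table mapping a key to (module name, class name).
_DRIVERS = {
    "decoder": ("json", "JSONDecoder"),
    "ordered": ("collections", "OrderedDict"),
}


def import_driver(key: str) -> type:
    """Import the module registered for `key` and return the named class."""
    module_name, class_name = _DRIVERS[key]
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


driver_class = import_driver("decoder")
print(driver_class.__name__)  # JSONDecoder
```

Loading on demand means a driver's module (and any third-party packages it imports) is only imported when that provider is actually requested.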

4. Add queries

Create a folder in msticpy/data/queries with the name of your provider and add queries. For more details on creating queries, see Creating new queries.

Query parameter names

While you can choose whatever parameter names you like for your queries, certain functionality in MSTICPy (e.g. Pivot functions) uses standardized names to add additional functionality. For example, all queries with the host_name parameter are automatically added as enrichment functions to the Host entity.

This is a list of commonly used parameter names:

  • start: Query start time
  • end: Query end time
  • account_name: User account name
  • commandline: Process command line
  • domain: DNS domain name
  • file_hash: File hash string
  • host_name: Host name (FQDN or simple)
  • ip_address: Dotted IP address string
  • logon_session_id: User logon session
  • process_id: Process ID
  • process_name: Process or file name
  • resource_id: Azure resource ID
  • url: URL

5. Add settings definition

MSTICPy’s settings editor uses configuration from a YAML file to create UI settings. This allows users to set settings interactively.

Define whatever settings you need as sub-keys of the Args key.

    DataProviders:
      ...
      MicrosoftDefender:
        Args:
          ClientId: str(format=uuid)
          TenantId: str(format=uuid)
          # [SuppressMessage("Microsoft.Security", ...
          ClientSecret: *cred_key
Use the examples and in-line documentation in mpconfig_defaults.yaml to specify your settings.

The special value *cred_key is a YAML macro (an alias to a definition elsewhere in the file) and is used where you need to store a secret of some kind. Items of this type allow the user to store the value in an environment variable or as an Azure Key Vault secret rather than in the msticpyconfig file.

6. Add provider documentation

A data provider should have documentation describing its configuration and use. This should be in reStructuredText (RST) for generating document pages in Sphinx.

See the examples Splunk Provider and Sumologic Provider.

7. Create driver unit tests

Please add a unit test using mocks to simulate the service responses. Code coverage should be at least 80%.

Do not add unit tests that call the live service as part of the normal test run. You can include tests that do this, but you must mark them to be skipped during normal unit test runs.

See the test examples in MSTICPy data drivers unit tests.
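
As a minimal sketch of this approach, the example below tests a hypothetical driver against a mocked service client and includes a live test that is skipped unless an environment variable is set. It uses the standard-library unittest module so that it is self-contained. The NewDriver class, the client's login and search methods, and the NEWPROVIDER_LIVE_TEST variable are all assumptions made for illustration.

```python
import os
import unittest
from unittest.mock import MagicMock


class NewDriver:
    """Hypothetical driver that wraps a service client."""

    def __init__(self, client):
        self._client = client
        self._connected = False

    def connect(self, **kwargs):
        self._client.login(**kwargs)
        self._connected = True

    def query(self, query: str):
        return self._client.search(query)


class TestNewDriver(unittest.TestCase):
    def test_query_uses_mocked_service(self):
        # The mock simulates the service, so no network call is made.
        client = MagicMock()
        client.search.return_value = [{"Computer": "host1"}]
        driver = NewDriver(client)
        driver.connect(api_key="dummy")
        result = driver.query("search *")
        client.login.assert_called_once_with(api_key="dummy")
        client.search.assert_called_once_with("search *")
        self.assertEqual(result, [{"Computer": "host1"}])

    @unittest.skipUnless(
        os.environ.get("NEWPROVIDER_LIVE_TEST"), "live service test"
    )
    def test_live_query(self):
        self.fail("Replace with a test against the real service.")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNewDriver)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```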

Conclusion

A version of this documentation will shortly be published on the MSTICPy ReadTheDocs site.

If you need any help with this, please reach out to msticpy@microsoft.com or leave an issue or message at https://github.com/microsoft/msticpy.

MSTIC

This is the account of the Microsoft Threat Intelligence Center (MSTIC).