CDP organizes digital information about the physical world. Assets are digital representations of physical objects or groups of objects, and assets are organized into an asset hierarchy. For example, an asset can represent a water pump which is part of a subsystem on an oil platform.
Assets connect related data, even when the data comes from different sources. Time series of data points, events, and files are all connected to one or more assets. The pump asset can be connected to a time series measuring pressure within the pump, as well as events recording maintenance operations, and a file with a 3D diagram of the pump.
At the top of an asset hierarchy is a root asset (e.g., the oil platform). Each project can have multiple root assets. Every asset has a name, and every asset except a root asset has a parent asset. No two assets with the same parent can have the same name.
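The naming rule can be illustrated with a small in-memory sketch. The `Asset` class and its `add_child` and `path` methods are hypothetical names used only for illustration; they are not part of the CDP API.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass(eq=False)
class Asset:
    """Hypothetical in-memory asset for illustration; not the CDP API object."""

    name: str
    parent: Optional["Asset"] = None
    children: Dict[str, "Asset"] = field(default_factory=dict)

    def add_child(self, name: str) -> "Asset":
        # Assets that share a parent must have distinct names.
        if name in self.children:
            raise ValueError(f"{self.name!r} already has a child named {name!r}")
        child = Asset(name=name, parent=self)
        self.children[name] = child
        return child

    def path(self) -> str:
        # Walk up to the root asset to build a readable path.
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return " / ".join(reversed(parts))


platform = Asset("Oil platform")  # root asset: no parent
subsystem = platform.add_child("Water subsystem")
pump = subsystem.add_child("Water pump")
print(pump.path())  # Oil platform / Water subsystem / Water pump
```

Adding a second "Water pump" under the same subsystem raises an error, while the same name under a different parent is allowed.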
A time series consists of a sequence of data points connected to a single asset.
For example: A water pump asset can have a temperature time series that records a data point in units of °C every second.
A single asset can have several time series. The water pump could have additional time series measuring pressure within the pump, rpm, flow volume, power consumption, and more.
Time series store data points as either numbers or strings. This is controlled by the is_string flag on the time series object. Numerical data points can be aggregated before they are returned from a query (e.g., to find the average temperature for a day). String data points, on the other hand, cannot be aggregated by CDP, but can store arbitrary information like states (e.g., “open”/“closed”) or more complex information (JSON).
Cognite stores discrete data points, but the underlying process measured by the data points can vary continuously. When interpolating between data points, we can either assume that each value stays the same until the next measurement, or that it changes linearly between the two measurements. This is controlled by the is_step flag on the time series object. For example, if we estimate the average over a time period containing two data points, the average will either be close to the first value (is_step is true) or close to the mean of the two values (is_step is false).
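The effect of the two interpolation assumptions on an average can be sketched as follows. This is a plain time-weighted average written for illustration, not necessarily the exact computation CDP performs; the function name `time_weighted_average` is hypothetical.

```python
from typing import List, Tuple


def time_weighted_average(points: List[Tuple[int, float]], is_step: bool) -> float:
    """Average over the span covered by (timestamp_ms, value) points sorted by time."""
    if len(points) < 2:
        raise ValueError("need at least two data points")
    total = 0.0
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        # Step: the value stays at v0 until the next measurement.
        # Linear: the value moves from v0 to v1, so the segment mean is the midpoint.
        segment_mean = v0 if is_step else (v0 + v1) / 2
        total += segment_mean * (t1 - t0)
    return total / (points[-1][0] - points[0][0])


points = [(0, 10.0), (1000, 20.0)]
print(time_weighted_average(points, is_step=True))   # 10.0
print(time_weighted_average(points, is_step=False))  # 15.0
```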
Deprecation warning: In the future, CDP will phase out name as a unique identifier for time series, and instead use a primary key of externalId. Time series names must currently be unique across all time series in the same project. In version 0.6 of CDP, time series names will no longer be unique.
A data point stores a single piece of information, a number or a string, associated with a specific time. Data points are identified by their timestamps, measured in milliseconds since the Unix epoch (00:00 UTC, January 1, 1970). Milliseconds are the finest time resolution supported by CDP, i.e., fractional milliseconds are not supported. Leap seconds are not counted.
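As a minimal sketch, the conversion between a calendar time and a millisecond timestamp can be done with Python's standard library. The helper names `to_epoch_ms` and `from_epoch_ms` are our own and are not part of any Cognite SDK.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(dt: datetime) -> int:
    """Convert a timezone-aware datetime to whole milliseconds since the Unix epoch."""
    if dt.tzinfo is None:
        raise ValueError("expected a timezone-aware datetime")
    delta = dt - EPOCH
    # Integer arithmetic avoids float rounding; sub-millisecond precision is dropped.
    return delta.days * 86_400_000 + delta.seconds * 1000 + delta.microseconds // 1000


def from_epoch_ms(ms: int) -> datetime:
    """Convert milliseconds since the Unix epoch back to a UTC datetime."""
    return EPOCH + timedelta(milliseconds=ms)


ts = to_epoch_ms(datetime(2018, 1, 1, tzinfo=timezone.utc))
print(ts)                 # 1514764800000
print(from_epoch_ms(ts))  # 2018-01-01 00:00:00+00:00
```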
Numerical data points can be aggregated before they are retrieved from CDP. This allows for faster queries by reducing the amount of data transferred. You can aggregate data points by specifying one or more aggregates (e.g. average, minimum, maximum) as well as the time granularity over which the aggregates should be applied (e.g. “1h” for one hour).
Aggregates are aligned to the start time modulo the granularity unit. For example, if you ask for daily average temperatures since Monday afternoon last week, the first aggregated data point will contain averages for Monday, the second for Tuesday, etc. Determining aggregate alignment without considering data point timestamps allows CDP to pre-calculate aggregates (e.g., to quickly return daily average temperatures for a year). As a consequence, aggregating over 60 minutes can return a different result than aggregating over 1 hour, because the two queries will be aligned differently.
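The alignment rule can be sketched as rounding the start time down to a whole multiple of the granularity unit. This is an illustration of the rule as described here, not CDP's implementation; the unit letters other than "h" and the function name `aligned_start` are assumptions.

```python
# Assumed unit letters; only "1h" appears in the text above.
UNIT_MS = {"s": 1_000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}


def aligned_start(start_ms: int, granularity: str) -> int:
    """Round start_ms down to a boundary of the granularity's unit (not its multiple)."""
    unit = granularity[-1]
    if unit not in UNIT_MS or not granularity[:-1].isdigit():
        raise ValueError(f"unsupported granularity: {granularity!r}")
    unit_ms = UNIT_MS[unit]
    return start_ms - start_ms % unit_ms


# 13:37:20 on the first day of the epoch, in milliseconds.
start = (13 * 3600 + 37 * 60 + 20) * 1000
print(aligned_start(start, "60m"))  # 49020000 -> 13:37:00, aligned to the minute
print(aligned_start(start, "1h"))   # 46800000 -> 13:00:00, aligned to the hour
print(aligned_start(start, "1d"))   # 0        -> 00:00:00, aligned to the day
```

Because "60m" is aligned to a minute boundary and "1h" to an hour boundary, the two queries cover different windows even though each window is equally long.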
Event objects store complex information about multiple assets over a time period. For example, an event can describe two hours of maintenance on a water pump and some associated pipes, or a future time window where the pump is scheduled for inspection. This is in contrast with data points in time series that store single pieces of information about one asset at specific points in time (e.g., temperature measurements).
An event’s time period is defined by a start time and an end time, both millisecond timestamps since the Unix epoch. The timestamps can be in the future. Events can also be categorized by a type (e.g., “fault”) and a subtype (e.g., “electrical”), both arbitrary strings defined when creating the event. In addition, events can have a text description as well as arbitrary metadata and properties.
A file stores a sequence of bytes connected to one or more assets. For example, a file can contain a piping and instrumentation diagram (P&ID) showing how multiple assets are connected.
Each file is identified by a unique ID that is generated when it is created, as well as a name and a directory path. File names are limited to 256 bytes, and directory paths to 512. The combination of file name and directory path must be unique within a project.
Directories in CDP differ from those in normal file systems: they exist only as string attributes on individual file objects. This means that directories themselves cannot be created, deleted, or moved. There is no particular path separator, and no notion of a directory hierarchy.
Files are created in two steps: first the metadata is stored in a file object, and then the file contents are uploaded. This means that files can exist in a non-uploaded state.
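The two-step creation and the uniqueness rule for name and directory can be sketched with an in-memory stand-in. `FileStore` and its methods are hypothetical and only mimic the behavior described here; they are not the CDP files API.

```python
class FileStore:
    """Hypothetical in-memory stand-in for file objects and their contents."""

    def __init__(self) -> None:
        self._next_id = 1
        self._paths = {}     # (directory, name) -> file ID
        self._contents = {}  # file ID -> bytes, present only once uploaded

    def create(self, directory: str, name: str) -> int:
        # Step 1: store metadata. The directory is just a string attribute.
        key = (directory, name)
        if key in self._paths:
            raise ValueError(f"{name!r} already exists in {directory!r}")
        file_id = self._next_id
        self._next_id += 1
        self._paths[key] = file_id
        return file_id

    def upload(self, file_id: int, data: bytes) -> None:
        # Step 2: upload the contents for an existing file object.
        if file_id not in self._paths.values():
            raise KeyError(file_id)
        self._contents[file_id] = data

    def is_uploaded(self, file_id: int) -> bool:
        return file_id in self._contents


store = FileStore()
pid = store.create("/diagrams", "pump.pdf")
print(store.is_uploaded(pid))  # False: metadata exists, contents do not
store.upload(pid, b"%PDF-...")
print(store.is_uploaded(pid))  # True
```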
Cursors and pagination
When fetching data from the Cognite API, the results are wrapped in one of two data types, one of which is DataWithCursor.
The two types differ only in that DataWithCursor also has cursors that you can use to navigate through pages of results. A cursor is a random string that can be copied and sent with a subsequent request to move to another page.
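A cursor-following loop can be sketched as below. The `fetch_page` function and the `items` and `nextCursor` field names are assumptions made for this example (the pages are faked in memory); check the API reference for the actual response shape.

```python
from typing import Any, Callable, Dict, Iterator, Optional

# Fake paged responses keyed by the cursor that requests them (illustrative only).
PAGES: Dict[Optional[str], Dict[str, Any]] = {
    None: {"items": [1, 2], "nextCursor": "abc"},
    "abc": {"items": [3, 4], "nextCursor": "def"},
    "def": {"items": [5]},  # last page: no cursor
}


def fetch_page(cursor: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical stand-in for one API request."""
    return PAGES[cursor]


def iterate_all(fetch: Callable[[Optional[str]], Dict[str, Any]]) -> Iterator[Any]:
    """Yield every item by passing each returned cursor to the next request."""
    cursor = None
    while True:
        page = fetch(cursor)
        yield from page["items"]
        cursor = page.get("nextCursor")
        if cursor is None:
            break


print(list(iterate_all(fetch_page)))  # [1, 2, 3, 4, 5]
```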
To access the API of CDP the requests must be authenticated. Users and services authenticate differently.
Users authenticate by presenting a token obtained from the identity provider configured for the project. This enables users to authenticate using their existing identities, which are managed by the user's organization.
Services authenticate by presenting an API key. The API key is a secret string that grants access to a project when making requests to the API. Each API key connects exactly one service to one project. A single service can have multiple API keys for the same project.
3D models and revisions
The Cognite platform uses 3D models of physical assets to give data a visual and geometrical context. For example, we can connect a pump asset with a 3D model of the plant floor where it is placed. Seeing asset data rendered in 3D enables you to quickly find the sensor data you are interested in.
3D data is organized into models and revisions. A model is just a placeholder for a set of revisions, or versions. Revisions contain the actual 3D data. For example, you can have a model named Compressor and upload a revision under that model. When you create a revision, you need to attach a 3D file. For each new version of the 3D model, you upload a new revision under the same model. A revision can have the status published or unpublished, which applications use to decide whether or not to list the revision. Multiple revisions can be published at the same time, since they do not necessarily represent time evolution of the 3D model, but rather different versions (e.g., high detail vs. low detail).
When you upload a new revision, Cognite needs to process the 3D data to optimize it for our renderer. Depending on the complexity of the 3D file, this can take some time. A revision also has a processing status, such as Failed, which can be tracked during processing.
3D data is typically built up as a hierarchical structure. This is very similar to how we organize our internal asset hierarchy. Each 3D node is assigned a random ID, nodeId (uint64). If a user clicks on an object on the screen, the application can get a callback containing the nodeId of the clicked object. We support endpoints to extract the full 3D node hierarchy, and endpoints to create mappings between 3D nodes and nodes in Cognite's asset hierarchy. You can then use the nodeId to connect the 3D data to asset information such as metadata and time series.
We also deliver a web-based 3D viewer to embed the 3D model in your own web page.
Projects are used to separate customers from one another, and all objects in CDP are linked to a project. A customer usually has only one project. The project object contains configuration for how to authenticate users.
Automatically assigned object IDs are unique only within each project.
This section describes a set of services that help streamline data science work on top of the Cognite Data Platform (CDP). This section introduces:
- Specifications and services for defining input/output data
- Model Hosting
- Jobs and asynchronous services, e.g., PatternSearch
Specifying input data
A data spec (short for data specification) specifies which data to pass to a given analytics job. It can refer to one or more data sources, such as Time Series or Files. A data transfer service is shipped along with our Python SDK and allows the user to easily build data specs and fetch the data they describe.
Cognite Model Hosting
Our Model Hosting API was released November 1st, and is still experimental. We may change it without notice, and you should not use it in production.
The Cognite Data Platform offers a hosting environment for user defined machine learning models, enabling anyone working on top of CDP to seamlessly operationalize their smart ideas!
These models can perform specific, user-defined tasks on both streaming and batched data. Models uploaded to our hosting environment are automatically given an HTTP endpoint so they can be made available instantly to anyone with access.
We introduce the different types of resources available in Model Hosting below. More information can be found at the Model Hosting page.
Models & Versions
A model is an entity that is capable of performing predictions. A prediction is simply a routine that takes some input and gives some output, and can in principle represent any kind of calculation. A model may have any number of versions, which are actual implementations of the prediction routine.
Conceptually, a model is an abstract solution to a specific problem, while model versions are different specific manifestations of that abstract solution. A model version may have been trained on training data and use persisted state to do predictions, or it may perform simple stateless calculations that don't need training.
Predictions performed using a model will use the active version. Which version is active can be changed at any time, but only one version can be active. One can test a version that is not active by performing predictions directly with the version instead of the parent model.
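The relationship between a model, its versions, and the active version can be sketched as follows. The `Model` class and its method names are hypothetical and exist only to illustrate the dispatch rule; they are not the Model Hosting API.

```python
from typing import Callable, Dict, Optional


class Model:
    """Hypothetical model that routes predictions to one of its versions."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: Dict[str, Callable[[float], float]] = {}
        self._active: Optional[str] = None

    def add_version(self, version_id: str, predict: Callable[[float], float]) -> None:
        self._versions[version_id] = predict

    def set_active(self, version_id: str) -> None:
        # Only one version is active at a time; setting a new one replaces the old.
        if version_id not in self._versions:
            raise KeyError(version_id)
        self._active = version_id

    def predict(self, x: float, version_id: Optional[str] = None) -> float:
        # Without an explicit version, the active version is used.
        chosen = version_id if version_id is not None else self._active
        if chosen is None:
            raise RuntimeError("no active version")
        return self._versions[chosen](x)


model = Model("pump-anomaly")
model.add_version("v1", lambda x: x * 2)
model.add_version("v2", lambda x: x * 3)
model.set_active("v1")
print(model.predict(10))        # 20, served by the active version v1
print(model.predict(10, "v2"))  # 30, testing v2 directly without activating it
```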
A Schedule performs predictions on data from CDP regularly using a model and writes the output back to CDP. This is useful, for example, if you would like to monitor some equipment using sensor time series. A Schedule is defined by which model will do the prediction, which data from CDP to feed into the model, and where in CDP to write the model output. An arbitrary number of Schedules can be set up to use the same model.
A Source Package is a Python package containing the user-defined code that describes the behavior of a model version. Like a normal Python package, it contains additional metadata and specifies its dependencies. It exposes routines for training and prediction, and when combined with training data (preferably defined using a data spec) it can be used to create a model version. Note that source packages can be reused across models and model versions, and can be used with different training data.
Lifecycle of resources
Resources in Model Hosting might consume significant compute and storage, and it's therefore important to be able to manage the lifecycle of these resources. But resources in Model Hosting depend on each other, which restricts which resources can be deleted without breaking other resources or losing important metadata. That's why certain resources can be deprecated. When something is deprecated, it is blocked from new usages, and its compute and storage resources are freed as soon as it is no longer in use. All metadata is kept for transparency and for keeping track of history, but deprecation cannot be reversed.
For example, you cannot delete a Source Package as long as it's used by one or more model versions, but you can deprecate it and effectively stop new model versions from using it.
Jobs and asynchronous services
Machine learning often requires time for computation, creating a need for asynchronous services. When you use such services, the sequence of steps is:
- You send a request to launch a job, e.g. pattern search
- You receive a job ID which you can use to retrieve the current status of the job
- You see when the job is complete
- You send a new request to retrieve the result of the job
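The steps above can be sketched as a polling loop. `FakeJobService`, its status strings, and the result shape are invented for this example so that it runs without network access; they do not describe the actual job endpoints.

```python
import time


class FakeJobService:
    """Hypothetical service whose jobs complete after a fixed number of status polls."""

    def __init__(self, polls_until_done: int = 3) -> None:
        self._polls_until_done = polls_until_done
        self._remaining = {}  # job ID -> polls left before completion
        self._next_id = 1

    def launch(self, job_type: str) -> int:
        job_id = self._next_id
        self._next_id += 1
        self._remaining[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id: int) -> str:
        if self._remaining[job_id] > 0:
            self._remaining[job_id] -= 1
            return "RUNNING"
        return "COMPLETED"

    def result(self, job_id: int) -> dict:
        if self._remaining[job_id] > 0:
            raise RuntimeError("job not complete")
        return {"jobId": job_id, "matches": []}


def run_job(service, job_type: str, poll_interval: float = 0.01, timeout: float = 5.0) -> dict:
    job_id = service.launch(job_type)              # launch the job, receive a job ID
    deadline = time.monotonic() + timeout
    while service.status(job_id) != "COMPLETED":   # poll the status until complete
        if time.monotonic() > deadline:
            raise TimeoutError(f"job {job_id} did not complete in {timeout}s")
        time.sleep(poll_interval)
    return service.result(job_id)                  # retrieve the result


print(run_job(FakeJobService(), "pattern_search"))  # {'jobId': 1, 'matches': []}
```

A fixed poll interval with a timeout keeps the sketch short; a real client might prefer backing off between polls.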