How to generate a synthetic dataset

You can generate and store synthetic data for your use case using our AI synthetic data generation tool.

Datasets are stored securely on the platform and are available to all users within your organization.

Info

Every dataset is associated with a specific use case. The example file and the custom dataset description you provide must therefore conform to the input data format of the selected use case; otherwise the dataset(s) cannot be saved.
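Because the example dataset is uploaded as a JSON file, a quick local syntax check can catch a malformed file before upload. The following is a minimal sketch, not part of the platform: the function name `check_json_text` and the sample record are hypothetical, and the check covers JSON syntax only. It does not verify the use-case-specific input data format, which depends on the use case you select.

```python
import json


def check_json_text(text):
    """Return (True, parsed_data) if text is well-formed JSON,
    otherwise (False, human-readable error message)."""
    try:
        return True, json.loads(text)
    except json.JSONDecodeError as exc:
        return False, f"line {exc.lineno}, column {exc.colno}: {exc.msg}"


# Hypothetical example content; the real structure depends on your use case.
sample = '[{"text": "example record"}]'
ok, result = check_json_text(sample)
print("valid JSON" if ok else f"invalid JSON: {result}")

# To check a file on disk instead (the file name is an assumption):
# from pathlib import Path
# ok, result = check_json_text(Path("example.json").read_text(encoding="utf-8"))
```

A file that passes this check can still be rejected if its fields do not match the input data format of the selected use case.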

To generate dataset(s) in the platform:

  1. Open the Datasets section from the left-side menu.
  2. Click the Synthetic dataset: custom description button in the top-right corner.
  3. On the Generate synthetic dataset page that opens, complete the following fields:
    • Use case: Select the use case associated with the dataset you want to generate.
    • File: Upload an example dataset in JSON format to serve as a reference for the synthetic dataset. The example must adhere to the input data format of the selected use case.
    • Dataset custom description: Describe the characteristics you want the dataset to have. Tip: You can generate multiple datasets; see the examples provided under the text area.
    • AI advanced settings: Toggle this section to adjust the AI model used to generate the synthetic dataset. This section is optional, and the default settings are sufficient for most use cases. The advanced settings include:
      • LLM Model: Select the model used to generate the synthetic dataset. The default model is gpt-4o-mini.
      • Temperature: Controls the randomness of the generated text. A higher value (e.g., 1.0) produces more random and diverse outputs, while a lower value (e.g., 0.2) makes the outputs more deterministic and focused on the most likely completions. The default value is 0.7.
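To build intuition for the Temperature setting, the sketch below shows how temperature is commonly applied in language models: raw token scores are divided by the temperature before being converted to probabilities. This illustrates the general technique only; the platform's internal implementation is not described on this page, and the function name and scores are hypothetical.

```python
import math


def softmax_with_temperature(scores, temperature):
    """Convert raw scores to probabilities after dividing by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for three candidate tokens.
scores = [2.0, 1.0, 0.1]
for temperature in (0.2, 0.7, 1.0):
    probs = softmax_with_temperature(scores, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

At the lower temperature, almost all probability concentrates on the highest-scoring candidate, so outputs vary less between runs. At the higher temperature, the probabilities are spread more evenly, so outputs are more diverse.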

To save the dataset(s) in the platform:

  1. Review the generated dataset(s) in the Synthetic data preview section. The name and description of each dataset are generated automatically, but you can edit them if needed.
  2. If you are satisfied with the generated dataset(s), click the Save dataset button in the bottom-right corner.

The new dataset(s) will now be listed in the Datasets section and available as input for your jobs.

What’s next

Run a job