Parse API

TonicTextual Parse Class

Pipeline Class

class tonic_textual.classes.pipeline.Pipeline(name: str, id: str, client: HttpClient)

Class to represent and provide access to a Tonic Textual pipeline.

Parameters:

name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.

describe() → str: Returns the name and id of the pipeline.

enumerate_files(lazy_load_content=True) → PipelineFileEnumerator

Enumerate the files in the pipeline.

Parameters: lazy_load_content (bool) – Whether to lazily load the content of the files. Default is True.
Returns: An enumerator for the files in the pipeline.
Return type: PipelineFileEnumerator

get_delta(pipeline_run1: PipelineRun, pipeline_run2: PipelineRun) → FileParseResultsDiffEnumerator

Enumerates the files in the diff between two pipeline runs.

Parameters:

pipeline_run1 (PipelineRun) – The first pipeline run.
pipeline_run2 (PipelineRun) – The second pipeline run.

Returns:

An enumerator for the files in the diff between the two runs.

Return type:

FileParseResultsDiffEnumerator
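
The following sketch shows one way the result of get_delta could be consumed: it tallies the files in a diff by their action. It assumes that the diff enumerator is iterable and that each item exposes deconstruct() as documented for FileParseResultsDiff. count_delta_actions and the underscore-prefixed classes are hypothetical stand-ins, not part of the library.

```python
from collections import Counter
from enum import Enum


def count_delta_actions(pipeline, run_a, run_b) -> Counter:
    """Tally the files in the diff between two runs by action name.

    Assumption: the enumerator returned by get_delta() is iterable.
    """
    counts = Counter()
    for diff in pipeline.get_delta(run_a, run_b):
        status, _file = diff.deconstruct()
        counts[status.name] += 1
    return counts


# Hypothetical stand-ins so the sketch runs offline.
class _Action(Enum):
    Added = 1
    Modified = 3


class _FakeDiff:
    def __init__(self, status, file):
        self._status, self._file = status, file

    def deconstruct(self):
        return self._status, self._file


class _FakePipeline:
    def get_delta(self, pipeline_run1, pipeline_run2):
        return iter([
            _FakeDiff(_Action.Added, "a.pdf"),
            _FakeDiff(_Action.Added, "b.pdf"),
            _FakeDiff(_Action.Modified, "c.pdf"),
        ])


counts = count_delta_actions(_FakePipeline(), "run-1", "run-2")
print(dict(counts))
```

In real use, the two run arguments would be PipelineRun objects, for example two entries taken from get_runs().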

get_runs() → List[PipelineRun]

Get the runs for the pipeline.

Returns: A list of PipelineRun objects.
Return type: List[PipelineRun]

run() → str

Run the pipeline.

Returns: The ID of the job.
Return type: str

upload_file(file: IOBase, file_name: str, csv_config: SolarCsvConfig | None = None) → str

Upload a file to the pipeline.

Parameters:

file (io.IOBase) – The file to upload.
file_name (str) – The name of the file.
csv_config (SolarCsvConfig) – The configuration for the CSV file. This is optional.

Returns:

This function does not return any value.

Return type:

None
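
A minimal sketch of combining upload_file and run is shown below. It wraps in-memory text in an io.BytesIO object (a subclass of io.IOBase), uploads it, and then starts a run. The helper upload_text_and_run and the _FakePipeline class are hypothetical. The fake only records calls so that the snippet runs offline, and its job ID is a made-up value.

```python
import io


def upload_text_and_run(pipeline, file_name: str, text: str) -> str:
    """Upload in-memory text as a file, then start a pipeline run."""
    buffer = io.BytesIO(text.encode("utf-8"))
    pipeline.upload_file(buffer, file_name)
    return pipeline.run()


# Hypothetical stand-in that records uploads instead of calling the API.
class _FakePipeline:
    def __init__(self):
        self.uploaded = []

    def upload_file(self, file, file_name, csv_config=None):
        self.uploaded.append((file_name, file.read()))

    def run(self):
        return "job-1"  # made-up job ID for illustration


fake = _FakePipeline()
job_id = upload_text_and_run(fake, "notes.txt", "hello")
print(job_id)
```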

File Enumerators

class tonic_textual.classes.pipeline_file_enumerator.PipelineFileEnumerator(job_id: str, client: HttpClient, lazy_load_content=True)

Enumerates the files in a pipeline.

Parameters:

job_id (str) – The job identifier.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazy load the content of the files. Default is True.

next() → FileParseResult

class tonic_textual.classes.file_parse_result_diff_enumerator.FileParseResultsDiffEnumerator(job_id1: str, job_id2: str, client: HttpClient)

Enumerates the files in a diff between two jobs.

Parameters:

job_id1 (str) – The first job identifier.
job_id2 (str) – The second job identifier.
client (HttpClient) – The HTTP client to use.

next() → FileParseResultsDiff

Pipeline File Results

class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(response: Dict, client: HttpClient, lazy_load_content=False, document: Dict | None = None)

A class representing the result of a parsed file.

Parameters:

response (Dict) – The response from the API.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazy load the content of the file. Default is False.

describe() → str: Returns the parsed file path.

download_results() → str

Downloads the results file.

Returns: The results file.
Return type: str

get_all_entities() → List[SingleDetectionResult]

Returns a list of all the detected entities in the file.

Returns:: A list of detected entities in the file.
Return type:: List[SingleDetectionResult]

get_chunks(max_chars=15000, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off, metadata_entities: List[str] = [], include_metadata=True) → List

Returns a list of chunks of text from the document. The chunks are filtered according to generator_config and generator_default.

Parameters:

max_chars (int = 15_000) – The maximum number of characters in each chunk.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.
include_metadata (bool = True) – If True, the metadata is included in the chunk.

Returns:

List[str]: A list of strings containing the chunks of text.
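
To illustrate the shape of generator_config, here is a small sketch that builds the dictionary from two lists of entity types. It assumes that the plain strings “Redaction” and “Synthesis” are accepted in place of PiiState members, and the entity type names are examples only. build_generator_config is a hypothetical helper, not a library function.

```python
from typing import Dict, Iterable


def build_generator_config(
    redact: Iterable[str] = (), synthesize: Iterable[str] = ()
) -> Dict[str, str]:
    """Map entity types to a handling state.

    Assumption: string values are accepted where PiiState is expected.
    If an entity type appears in both lists, "Synthesis" wins.
    """
    config = {entity: "Redaction" for entity in redact}
    config.update({entity: "Synthesis" for entity in synthesize})
    return config


config = build_generator_config(
    redact=["EMAIL_ADDRESS"],  # example entity type name
    synthesize=["NAME_GIVEN", "NAME_FAMILY"],  # example entity type names
)
print(config)
```

The resulting dictionary could then be passed as generator_config to get_chunks, with any entity type that is not listed falling back to generator_default.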

get_entities(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, allow_overlap: bool = False) → List[SingleDetectionResult]

Returns a list of entities in the document. The entities are filtered according to generator_config and generator_default.

Parameters:

generator_default (PiiState) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, the entity text, and replacement text.

Return type:

List[SingleDetectionResult]
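
As a sketch of working with the returned list, the helper below counts detected entities by type. How the entity type is read from a SingleDetectionResult is not specified here, so the accessor is passed in as label_of. Its default, item access with the key "label", is an assumption. count_entities_by_type and _FakeFile are hypothetical names.

```python
from collections import Counter
from typing import Callable


def count_entities_by_type(
    file, label_of: Callable = lambda entity: entity["label"]
) -> Counter:
    """Count the entities returned by get_entities(), grouped by type.

    Assumption: each result exposes its entity type under the key "label".
    Pass a different label_of if the real accessor differs.
    """
    return Counter(label_of(entity) for entity in file.get_entities())


# Hypothetical stand-in returning dictionaries in place of
# SingleDetectionResult objects, so the sketch runs offline.
class _FakeFile:
    def get_entities(self):
        return [
            {"label": "NAME_GIVEN", "text": "Ada"},
            {"label": "NAME_GIVEN", "text": "Alan"},
            {"label": "EMAIL_ADDRESS", "text": "ada@example.com"},
        ]


entity_counts = count_entities_by_type(_FakeFile())
print(dict(entity_counts))
```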

get_markdown(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off) → str

Returns the file in Markdown format, redacted or synthesized based on the configuration.

Parameters:

generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default redaction used for all types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.

Returns:

The file in markdown format, redacted or synthesized based on generator_config and generator_default.

Return type:

str

get_tables() → List[Table]

Returns a list of the tables found in the document. This applies to CSV, XLSX, PDF, and image files.

Returns:

A list of the tables found in the document.

Return type:

List[Table]

is_sensitive(sensitive_entity_types: List[str], start: int = 0, end: int = -1) → bool

Returns True if the element contains sensitive data, False otherwise.

Parameters:

sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.
start (int = 0) – The start index to check for sensitive data.
end (int = -1) – The end index to check for sensitive data.

Returns:

True if the element contains sensitive data, False otherwise.

Return type:

bool
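
One possible use of is_sensitive is to pick out the files that contain any of a given set of entity types. The sketch below does this for any collection of objects that expose is_sensitive with the signature shown above. files_with_sensitive_data and _FakeFile are hypothetical, and the entity type name is an example only.

```python
from typing import Iterable, List


def files_with_sensitive_data(files: Iterable, entity_types: List[str]) -> list:
    """Return the files for which is_sensitive() reports a match."""
    return [f for f in files if f.is_sensitive(entity_types)]


# Hypothetical stand-in for FileParseResult so the sketch runs offline.
class _FakeFile:
    def __init__(self, name, labels):
        self.name, self._labels = name, set(labels)

    def is_sensitive(self, sensitive_entity_types, start=0, end=-1):
        return any(t in self._labels for t in sensitive_entity_types)


files = [_FakeFile("a.txt", ["EMAIL_ADDRESS"]), _FakeFile("b.txt", [])]
flagged = [f.name for f in files_with_sensitive_data(files, ["EMAIL_ADDRESS"])]
print(flagged)
```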

class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseDiffAction(value)

Enum that stores the possible states of a file parse result diff.

Added = 1: The file was added, so it is new.

Deleted = 2: The file was deleted.

Modified = 3: The file was modified.

NonModified = 4: The file was not modified.

class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseResultsDiff(status: FileParseDiffAction, file: FileParseResult)

Stores a file parse result and its diff action.

Parameters:

status (FileParseDiffAction) – The action of the file parse result.
file (FileParseResult) – The file parse result.

deconstruct() → Tuple[FileParseDiffAction, FileParseResult]: Returns the status and the file parse result of the diff.

describe() → str: Returns the status and the file path of the diff as a string.
