Parse API
TonicTextual Parse Class
Pipeline Class
- class tonic_textual.classes.pipeline.Pipeline(name: str, id: str, client: HttpClient)
Class to represent and provide access to a Tonic Textual pipeline.
- Parameters:
name (str) – Pipeline name.
id (str) – Pipeline identifier.
client (HttpClient) – The HTTP client to use.
- enumerate_files(lazy_load_content=True) PipelineFileEnumerator
Enumerate the files in the pipeline.
- Parameters:
lazy_load_content (bool) – Whether to lazily load the content of the files. Default is True.
- Returns:
An enumerator for the files in the pipeline.
- Return type:
PipelineFileEnumerator
- get_delta(pipeline_run1: PipelineRun, pipeline_run2: PipelineRun) FileParseResultsDiffEnumerator
Enumerates the files in the diff between two pipeline runs.
- Parameters:
pipeline_run1 (PipelineRun) – The first pipeline run.
pipeline_run2 (PipelineRun) – The second pipeline run.
- Returns:
An enumerator for the files in the diff between the two runs.
- Return type:
FileParseResultsDiffEnumerator
- get_runs() List[PipelineRun]
Get the runs for the pipeline.
- Returns:
A list of PipelineRun objects.
- Return type:
List[PipelineRun]
- upload_file(file: IOBase, file_name: str, csv_config: SolarCsvConfig | None = None) str
Upload a file to the pipeline.
- Parameters:
file (io.IOBase) – The file to upload.
file_name (str) – The name of the file.
csv_config (SolarCsvConfig | None) – Optional configuration for a CSV file. Default is None.
- Returns:
This function does not return any value.
- Return type:
None
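As a rough illustration of how the Pipeline methods above fit together, the sketch below uploads an in-memory document and then requests the diff between the two most recent runs. The helper names `upload_text` and `latest_delta` are hypothetical. The sketch also assumes that `get_runs()` returns runs in chronological order, which this reference does not state.

```python
import io


def upload_text(pipeline, text, file_name):
    """Upload an in-memory string to a pipeline via Pipeline.upload_file."""
    # upload_file takes a file-like object (io.IOBase) plus a file name.
    buffer = io.BytesIO(text.encode("utf-8"))
    return pipeline.upload_file(buffer, file_name)


def latest_delta(pipeline):
    """Return a diff enumerator for the two most recent runs, or None."""
    runs = pipeline.get_runs()
    if len(runs) < 2:
        return None  # A diff needs two runs to compare.
    # Assumption: the list is ordered oldest to newest.
    return pipeline.get_delta(runs[-2], runs[-1])
```

Here `pipeline` stands for an existing Pipeline instance. If the run order turns out to be different, only the indexing in `latest_delta` needs to change.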
File Enumerators
- class tonic_textual.classes.pipeline_file_enumerator.PipelineFileEnumerator(job_id: str, client: HttpClient, lazy_load_content=True)
Enumerates the files in a pipeline.
- Parameters:
job_id (str) – The job identifier.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazy load the content of the files. Default is True.
- next() FileParseResult
- class tonic_textual.classes.file_parse_result_diff_enumerator.FileParseResultsDiffEnumerator(job_id1: str, job_id2: str, client: HttpClient)
Enumerates the files in a diff between two jobs.
- Parameters:
job_id1 (str) – The first job identifier.
job_id2 (str) – The second job identifier.
client (HttpClient) – The HTTP client to use.
- next() FileParseResultsDiff
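The enumerators are typically consumed in a loop. The sketch below assumes that an enumerator follows Python's iterator protocol. The reference above only lists a `next()` method, so treat that as an assumption. `first_files` is a hypothetical helper name.

```python
from itertools import islice


def first_files(pipeline, limit=10, lazy_load_content=True):
    """Collect up to `limit` FileParseResult objects from a pipeline."""
    enumerator = pipeline.enumerate_files(lazy_load_content=lazy_load_content)
    # Assumption: the enumerator is iterable. islice stops after `limit`
    # items, so a large pipeline is not read in full.
    return list(islice(enumerator, limit))
```

Capping the number of items is a convenient way to inspect a pipeline without pulling every file.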
Pipeline File Results
- class tonic_textual.classes.parse_api_responses.file_parse_result.FileParseResult(response: Dict, client: HttpClient, lazy_load_content=False, document: Dict | None = None)
A class representing the result of a parsed file.
- Parameters:
response (Dict) – The response from the API.
client (HttpClient) – The HTTP client to use.
lazy_load_content (bool) – Whether to lazy load the content of the file. Default is False.
- get_all_entities() List[SingleDetectionResult]
Returns a list of all the detected entities in the file.
- Returns:
A list of detected entities in the file.
- Return type:
List[SingleDetectionResult]
- get_chunks(max_chars=15000, generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off, metadata_entities: List[str] = [], include_metadata=True) List
Returns a list of text chunks from the document. Sensitive values in the chunks are redacted, synthesized, or left as-is according to generator_config and generator_default.
- Parameters:
max_chars (int = 15_000) – The maximum number of characters in each chunk.
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default handling for all entity types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.
include_metadata (bool = True) – If True, the metadata is included in the chunk.
- Returns:
A list of strings containing the chunks of text.
- Return type:
List[str]
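For example, the chunks from several parsed files can be gathered into one list before further processing. This is a minimal sketch: `collect_chunks` is a hypothetical helper, and `files` is assumed to be any iterable of FileParseResult objects.

```python
def collect_chunks(files, max_chars=15000, **chunk_options):
    """Flatten the text chunks of several parsed files into one list."""
    chunks = []
    for parsed_file in files:
        # Extra keyword arguments (generator_config, generator_default,
        # include_metadata) are passed through to get_chunks unchanged.
        chunks.extend(parsed_file.get_chunks(max_chars=max_chars, **chunk_options))
    return chunks
```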
- get_entities(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Redaction, allow_overlap: bool = False) List[SingleDetectionResult]
Returns a list of entities in the document. The entities are filtered according to generator_config and generator_default.
- Parameters:
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Redaction) – The default handling for all entity types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.
- Returns:
A list of the detected entities. Each item in the list contains the entity type, source start index, source end index, entity text, and replacement text.
- Return type:
List[SingleDetectionResult]
- get_markdown(generator_config: Dict[str, PiiState] = {}, generator_default: PiiState = PiiState.Off) str
Returns the file in Markdown format, redacted or synthesized based on the configuration.
- Parameters:
generator_config (Dict[str, PiiState]) – A dictionary of sensitive data entities. For each entity, indicates whether to redact, synthesize, or ignore it. Values must be one of “Redaction”, “Synthesis”, or “Off”.
generator_default (PiiState = PiiState.Off) – The default handling for all entity types not specified in generator_config. Values must be one of “Redaction”, “Synthesis”, or “Off”.
- Returns:
The file in markdown format, redacted or synthesized based on generator_config and generator_default.
- Return type:
str
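One way to build the generator_config argument is from a list of entity type names. The sketch below is illustrative only: `build_generator_config` and `markdown_with_overrides` are hypothetical helpers, and `override_state` and `default_state` are expected to be PiiState members supplied by the caller.

```python
def build_generator_config(entity_types, state):
    """Map each entity type name to the same PiiState value."""
    return {entity_type: state for entity_type in entity_types}


def markdown_with_overrides(parsed_file, entity_types, override_state, default_state):
    """Render a parsed file as Markdown with per-entity overrides."""
    config = build_generator_config(entity_types, override_state)
    # Entity types missing from `config` fall back to `default_state`.
    return parsed_file.get_markdown(
        generator_config=config,
        generator_default=default_state,
    )
```

For instance, passing the Synthesis state as `override_state` and the Redaction state as `default_state` would request synthesis for the listed entity types and redaction for everything else.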
- get_tables() List[Table]
Returns a list of tables found in the document. This applies to CSV, XLSX, PDF, and image files.
- Returns:
A list of the tables found in the document.
- Return type:
List[Table]
- is_sensitive(sensitive_entity_types: List[str], start: int = 0, end: int = -1) bool
Returns True if the element contains sensitive data, False otherwise.
- Parameters:
sensitive_entity_types (List[str]) – A list of sensitive entity types to check for.
start (int = 0) – The start index to check for sensitive data.
end (int = -1) – The end index to check for sensitive data.
- Returns:
True if the element contains sensitive data, False otherwise.
- Return type:
bool
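A common use of is_sensitive is to select only the files that need attention. The sketch below assumes `files` is an iterable of FileParseResult objects; `sensitive_files` is a hypothetical helper name.

```python
def sensitive_files(files, sensitive_entity_types):
    """Return the parsed files that contain any of the given entity types."""
    # Relies on the default start=0 and end=-1 from the signature above,
    # assumed here to cover the whole file.
    return [f for f in files if f.is_sensitive(sensitive_entity_types)]
```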
- class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseDiffAction(value)
Enum that stores the possible states of a file parse result diff.
- class tonic_textual.classes.parse_api_responses.file_parse_results_diff.FileParseResultsDiff(status: FileParseDiffAction, file: FileParseResult)
Stores a file parse result together with its diff action.
- Parameters:
status (FileParseDiffAction) – The action of the file parse result.
file (FileParseResult) – The file parse result.
- deconstruct() Tuple[FileParseDiffAction, FileParseResult]
Returns the status and the file parse result of the diff.
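To show how deconstruct() might be used, the sketch below groups the files in a diff by their action. It assumes that `diffs` is an iterable of FileParseResultsDiff objects, for example the enumerator returned by get_delta if it supports iteration, and that FileParseDiffAction values are hashable. `group_by_action` is a hypothetical helper name.

```python
from collections import defaultdict


def group_by_action(diffs):
    """Group parsed files by the diff action reported for each of them."""
    grouped = defaultdict(list)
    for diff in diffs:
        # deconstruct() returns a (FileParseDiffAction, FileParseResult) tuple.
        status, parsed_file = diff.deconstruct()
        grouped[status].append(parsed_file)
    return dict(grouped)
```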