Overview
Textual’s Pipeline API allows you to extract text and entity metadata from Textual pipelines.
Creating and deleting a pipeline
To create a pipeline, use the create_pipeline method.
To delete a pipeline, use the delete_pipeline method.
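from tonic_textual.parse_api import TonicTextualParse

# Use the class that was imported above (TonicTextualParse)
textual = TonicTextualParse("<TONIC-TEXTUAL-URL>")
# Create a pipeline, then delete it by its ID
pipeline = textual.create_pipeline("pipeline name")
textual.delete_pipeline(pipeline.id)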
Getting pipelines
The Pipeline class represents a pipeline in Textual. A pipeline is a collection of jobs that process files and extract text and entities from them.
To get the list of all available pipelines, use the get_pipelines method.
from tonic_textual.parse_api import TonicTextualParse
textual = TonicTextualParse("<TONIC-TEXTUAL-URL>")
pipelines = textual.get_pipelines()
latest_pipeline = pipelines[-1]
print(latest_pipeline.describe())
This produces results similar to the following:
--------------------------------------------------------
Name: pipeline demo
ID: 056e6cc7-0a1d-3ab4-5e61-919fb5475b31
--------------------------------------------------------
Alternatively, use the get_pipeline_by_id method to get a specific pipeline.
pipeline_id = '056e6cc7-0a1d-3ab4-5e61-919fb5475b31'
textual.get_pipeline_by_id(pipeline_id)
Uploading files
To upload a file to a pipeline, use the upload_file method.
pipeline = textual.create_pipeline(pipeline_name)
with open(file_path, "rb") as file_content:
    file_bytes = file_content.read()
pipeline.upload_file(file_bytes, file_name)
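To upload several files at once, the single-file call above can be wrapped in a loop. The following is a minimal sketch, not part of the SDK: upload_directory is a hypothetical helper, and it assumes only that the pipeline object exposes upload_file(file_bytes, file_name) as shown above.

```python
from pathlib import Path


def upload_directory(pipeline, directory):
    """Upload every regular file directly inside `directory` to `pipeline`.

    Hypothetical helper: `pipeline` is any object that exposes
    upload_file(file_bytes, file_name). Returns the uploaded file
    names in sorted order.
    """
    uploaded = []
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue  # skip subdirectories and other non-file entries
        # read_bytes() opens the file in binary mode, like open(path, "rb")
        pipeline.upload_file(path.read_bytes(), path.name)
        uploaded.append(path.name)
    return uploaded
```

With a real pipeline, this would be called as upload_directory(pipeline, "<path to directory>"). The helper does not descend into subdirectories.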
Enumerating files in a pipeline
A pipeline’s enumerate_files method returns a pipeline enumerator of all of the files that the pipeline processed.
By default, this enumerates the files from the pipeline’s most recent job run. To enumerate the files from a different job run, pass that job run’s ID as an argument.
for file in pipeline.enumerate_files():
    print(file.describe())
Parsing documents
You can also parse files on a one-off, on-demand basis without using pipelines. In this approach, you send a file to the Textual service and receive a parsed result in return.
Note that files should be read using the ‘rb’ access mode, which opens the file for reading in binary format. To stop waiting for the parsed result after a given number of seconds, you can optionally set a timeout in the parse_file call. You can also set the TONIC_TEXTUAL_PARSE_TIMEOUT_IN_SECONDS environment variable to enforce an SDK-wide timeout.
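As an illustration, parsing a local file might look like the sketch below. It rests on some assumptions: parse_local_file is a hypothetical wrapper, client stands for a TonicTextualParse instance, and parse_file is assumed to take the file bytes, the file name, and an optional timeout keyword argument (the text above names the method and the timeout option, but not the exact signature).

```python
from pathlib import Path


def parse_local_file(client, file_path, timeout=None):
    """Read a local file in binary mode and send it to client.parse_file.

    Hypothetical wrapper. Assumes parse_file accepts
    (file_bytes, file_name) plus an optional `timeout` keyword in seconds.
    """
    path = Path(file_path)
    # "rb" opens the file for reading in binary format
    with open(path, "rb") as f:
        file_bytes = f.read()
    if timeout is None:
        # No explicit timeout: leave the choice to the SDK
        return client.parse_file(file_bytes, path.name)
    return client.parse_file(file_bytes, path.name, timeout=timeout)
```

For example, parse_local_file(textual, "<path to file>", timeout=60) would give up waiting after 60 seconds if the keyword name assumed here matches the SDK.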
In addition to reading files from your local system, you can pass in a bucket and key pair to parse files stored in Amazon S3. This uses the boto3 library to fetch the file from S3, and therefore requires that the correct AWS credentials are set up. Usage is otherwise similar to parsing a local file.
Extracting text and entities
You can use a FileParseResult object to extract text and entities from a file.
for file in pipeline.enumerate_files():
    file.get_markdown()  # returns the markdown of the file
    file.get_all_entities()  # returns all entities in the file
    file.get_chunks()  # chunks the file into paragraphs and returns them
    file.download_results()  # downloads the results file as a UTF-8 string
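The loop above discards the return values. To keep them, you could collect the results into a list, as in this minimal sketch. collect_markdown is a hypothetical helper, and it assumes only the enumerate_files and get_markdown methods described above.

```python
def collect_markdown(pipeline):
    """Return one markdown string per file that the pipeline processed.

    Hypothetical helper: relies only on pipeline.enumerate_files()
    and on each file's get_markdown() method.
    """
    return [file.get_markdown() for file in pipeline.enumerate_files()]
```

The same pattern applies to get_all_entities or get_chunks if you need entities or paragraph chunks instead of markdown.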