Enterprises across the globe want to draw on multiple data sources to give their employees and end customers a unified search experience. Given the large volume of data that must be examined and indexed, retrieval speed, solution scalability, and search performance become key factors when choosing an enterprise intelligent search solution. Additionally, these data sources comprise structured and unstructured content repositories, including various file types, which may cause compatibility issues.

Amazon Kendra is a highly accurate and intelligent search service that enables users to search for answers to their questions from your unstructured and structured data using natural language processing and advanced search algorithms. It returns specific answers to questions, giving users an experience that’s close to interacting with a human expert.

Today, Amazon Kendra launched support for seven additional data formats. This lets you integrate your existing data sources as is and perform intelligent search across multiple content repositories.

In this post, we discuss the new supported data formats and how to use them.

New supported data formats

Previously, Amazon Kendra supported documents that included structured text in the form of frequently asked questions and answers, as well as unstructured text in the form of HTML files, Microsoft PowerPoint presentations, Microsoft Word documents, plain text documents, and PDFs.

With this launch, Amazon Kendra now offers support for seven additional data formats:

  • Rich Text Format (RTF)
  • JavaScript Object Notation (JSON)
  • Markdown (MD)
  • Comma-separated values (CSV)
  • Microsoft Excel (MS Excel)
  • Extensible Markup Language (XML)
  • Extensible Stylesheet Language Transformations (XSLT)

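Amazon Kendra users can ingest documents in these formats into their index in two ways: by adding them directly to the index with the BatchPutDocument API, or by syncing them from a data source such as an Amazon S3 bucket through a connector. This post walks through the S3 connector route.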
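As a rough sketch of adding documents directly through the API, the following Python builds entries in the shape that the boto3 `batch_put_document` call expects. The extension-to-`ContentType` mapping and the helper names `build_document` and `ingest` are illustrative assumptions rather than part of this post, and the `kendra` argument is assumed to be a client created with `boto3.client("kendra")`.

```python
from pathlib import Path

# Assumed mapping from file extension to the Kendra ContentType value
# for the seven newly supported formats.
CONTENT_TYPES = {
    ".rtf": "RTF",
    ".json": "JSON",
    ".md": "MD",
    ".csv": "CSV",
    ".xls": "MS_EXCEL",
    ".xlsx": "MS_EXCEL",
    ".xml": "XML",
    ".xslt": "XSLT",
}


def build_document(doc_id: str, filename: str, data: bytes) -> dict:
    """Return one entry for the Documents list of BatchPutDocument."""
    suffix = Path(filename).suffix.lower()
    if suffix not in CONTENT_TYPES:
        raise ValueError(f"No content type mapped for {filename!r}")
    return {
        "Id": doc_id,
        "Title": Path(filename).name,
        "Blob": data,
        "ContentType": CONTENT_TYPES[suffix],
    }


def ingest(kendra, index_id: str, documents: list) -> list:
    """Send documents in batches of 10 (the per-call limit); return failures."""
    failed = []
    for start in range(0, len(documents), 10):
        response = kendra.batch_put_document(
            IndexId=index_id, Documents=documents[start:start + 10]
        )
        failed.extend(response.get("FailedDocuments", []))
    return failed
```

Passing the client in as a parameter keeps the helpers easy to try out with a stand-in object before pointing them at a real index.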

Solution overview

In the following sections, we walk through the steps for adding documents from a data source and performing a search on those documents.

The following diagram shows our solution architecture.

To test this solution with any of the supported formats, you need to use your own data. You can test by uploading documents of the same or different formats to the S3 bucket.
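If you don't have test files at hand, a small script can generate placeholders. The sketch below writes one tiny file for each of the text-based formats (CSV, JSON, Markdown, RTF, XML) using only the Python standard library, and includes an optional upload helper. The file names, the made-up sample content, and the `sample-data/` key prefix are assumptions for illustration; Excel is left out because writing it requires a third-party library.

```python
import csv
import json
import xml.etree.ElementTree as ET
from pathlib import Path


def write_sample_documents(folder: Path) -> list:
    """Write one small placeholder file per text-based format; return the paths."""
    folder.mkdir(parents=True, exist_ok=True)
    record = {"product": "Widget", "region": "EMEA"}  # made-up sample content

    csv_path = folder / "sample.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(record.keys())
        writer.writerow(record.values())

    json_path = folder / "sample.json"
    json_path.write_text(json.dumps(record), encoding="utf-8")

    md_path = folder / "sample.md"
    md_path.write_text("# Widget\n\nSold in EMEA.\n", encoding="utf-8")

    rtf_path = folder / "sample.rtf"
    rtf_path.write_text(r"{\rtf1\ansi Widget is sold in EMEA.}", encoding="ascii")

    xml_path = folder / "sample.xml"
    root = ET.Element("catalog")
    item = ET.SubElement(root, "item", region=record["region"])
    item.text = record["product"]
    xml_path.write_bytes(ET.tostring(root, encoding="utf-8"))

    return [csv_path, json_path, md_path, rtf_path, xml_path]


def upload_folder(s3, bucket: str, folder: Path, prefix: str = "sample-data/") -> list:
    """Upload every file in folder under prefix; s3 is assumed to be a boto3 S3 client."""
    keys = []
    for path in sorted(folder.iterdir()):
        key = prefix + path.name
        s3.upload_file(str(path), bucket, key)
        keys.append(key)
    return keys
```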

Create an Amazon Kendra index

For instructions on creating your Amazon Kendra index, refer to Creating an index.

You can skip this step if you have a pre-existing index to use for this demo.
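If you prefer to script this step, the following is a minimal sketch using the boto3 Kendra client. It assumes an IAM role for the index already exists (the role ARN is yours to supply), and the function name, default edition, and polling interval are illustrative choices rather than recommendations from this post.

```python
import time


def create_index_and_wait(kendra, name: str, role_arn: str,
                          edition: str = "DEVELOPER_EDITION",
                          poll_seconds: float = 30) -> str:
    """Create an index and poll until it is ACTIVE; return the index ID.

    kendra is assumed to be boto3.client("kendra").
    """
    index_id = kendra.create_index(Name=name, RoleArn=role_arn, Edition=edition)["Id"]
    while True:
        status = kendra.describe_index(Id=index_id)["Status"]
        if status == "ACTIVE":
            return index_id
        if status == "FAILED":
            raise RuntimeError(f"Index {index_id} failed to create")
        time.sleep(poll_seconds)  # index creation can take a while
```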

Upload documents to an S3 bucket and ingest to the index using the S3 connector

Complete the following steps to connect an S3 bucket to your index:

  1. Create an S3 bucket to store your documents.
  2. Create a folder named sample-data.
  3. Upload the documents that you want to test to the folder.
  4. On the Amazon Kendra console, go to your index and choose Data sources.
  5. Choose Add data source.
  6. Under Available data sources, select S3 and choose Add Connector.
  7. Enter a name for your connector (such as Demo_S3_connector) and choose Next.
  8. Choose Browse S3 and choose the S3 bucket where you uploaded the documents.
  9. For IAM Role, create a new role.
  10. For Set sync run schedule, select Run on demand.
  11. Choose Next.
  12. On the Review and create page, choose Add data source.
  13. After the creation process is complete, choose Sync Now.
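The console steps above can also be approximated in code. The sketch below builds a `create_data_source` request for an S3 connector and then starts a sync job. It assumes an existing IAM role ARN (the console creates one for you, but the API needs you to supply it) and limits the connector to the `sample-data/` prefix; the helper names are hypothetical. Leaving out the `Schedule` parameter corresponds to choosing Run on demand.

```python
def s3_data_source_request(index_id: str, name: str, bucket: str,
                           role_arn: str, prefix: str = "sample-data/") -> dict:
    """Build keyword arguments for the Kendra create_data_source call."""
    return {
        "IndexId": index_id,
        "Name": name,
        "Type": "S3",
        "RoleArn": role_arn,
        "Configuration": {
            "S3Configuration": {
                "BucketName": bucket,
                "InclusionPrefixes": [prefix],
            }
        },
        # No "Schedule" key: the data source only syncs when asked to.
    }


def create_and_sync(kendra, request: dict) -> tuple:
    """Create the data source, start one sync job, return (data source ID, execution ID).

    kendra is assumed to be boto3.client("kendra").
    """
    data_source_id = kendra.create_data_source(**request)["Id"]
    execution_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=request["IndexId"]
    )["ExecutionId"]
    return data_source_id, execution_id
```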

Now that you have ingested some documents, you can navigate to the built-in search console to test queries.

Search your documents with the Amazon Kendra search console

On the Amazon Kendra console, choose Search indexed content in the navigation pane.
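The same queries can be issued programmatically. As a hedged sketch, the following wraps the boto3 `query` call and flattens each result item into a small dictionary; the helper names and the choice of fields to keep are assumptions for illustration.

```python
def summarize_results(response: dict, limit: int = 5) -> list:
    """Reduce a Kendra Query response to type, title, excerpt, and URI per result."""
    rows = []
    for item in response.get("ResultItems", [])[:limit]:
        rows.append({
            "type": item.get("Type"),
            "title": item.get("DocumentTitle", {}).get("Text", ""),
            "excerpt": item.get("DocumentExcerpt", {}).get("Text", ""),
            "uri": item.get("DocumentURI", ""),
        })
    return rows


def search(kendra, index_id: str, text: str, limit: int = 5) -> list:
    """Run a natural language query; kendra is assumed to be boto3.client("kendra")."""
    response = kendra.query(IndexId=index_id, QueryText=text)
    return summarize_results(response, limit)
```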

The following are examples of the results from the search for different document types:

  • RTF – Input data in RTF format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • JSON – Input data in JSON format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • Markdown – Input data in MD format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • CSV – Input data in CSV format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • Excel – Input data in Excel format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • XML – Input data in XML format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

  • XSLT – Input data in XSLT format is uploaded to the S3 bucket, and the data source is synced.

The following screenshot shows the search results.

Clean up

To avoid incurring future costs, clean up the resources you created as part of this solution using the following steps:

  1. On the Amazon Kendra console, choose Indexes in the navigation pane.
  2. Choose the index that contains the data source to delete.
  3. In the navigation pane, choose Data sources.
  4. Choose the data source to remove, then choose Delete.

When you delete a data source, Amazon Kendra removes all the stored information about the data source. Amazon Kendra removes all the document data stored in the index, and all run histories and metrics associated with the data source. Deleting a data source does not remove the original documents from your storage.

  1. On the Amazon Kendra console, choose Indexes in the navigation pane.
  2. Choose the index to delete, then choose Delete.

Refer to Deleting an index and data source for more details.

  1. On the Amazon S3 console, choose Buckets in the navigation pane.
  2. Select the bucket you want to delete, then choose Delete.
  3. Enter the name of the bucket to confirm deletion, then choose Delete bucket.

If the bucket contains any objects, you’ll receive an error alert. Empty the bucket before deleting it by choosing the link in the error message and following the instructions on the Empty bucket page. Then return to the Delete bucket page and delete the bucket.

  1. To verify that you’ve deleted the bucket, open the Buckets page and enter the name of the bucket that you deleted. If the bucket can’t be found, your deletion was successful.

Refer to Deleting a bucket for more details.
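For a scripted teardown, the cleanup steps can be sketched as follows. The function names are hypothetical, `kendra` and `s3` are assumed to be boto3 clients, and the bucket is assumed to be unversioned. Kendra deletions are asynchronous, so a real script may need to wait or retry between calls.

```python
def delete_kendra_resources(kendra, index_id: str, data_source_id: str) -> None:
    """Delete the data source, then the index that contained it."""
    # Both deletions run asynchronously on the service side.
    kendra.delete_data_source(Id=data_source_id, IndexId=index_id)
    kendra.delete_index(Id=index_id)


def empty_and_delete_bucket(s3, bucket: str) -> int:
    """Remove every object from an unversioned bucket, delete it, return the object count."""
    deleted = 0
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        objects = [{"Key": obj["Key"]} for obj in page.get("Contents", [])]
        if objects:
            # Each page holds at most 1,000 keys, which fits one delete_objects call.
            s3.delete_objects(Bucket=bucket, Delete={"Objects": objects})
            deleted += len(objects)
    s3.delete_bucket(Bucket=bucket)
    return deleted
```

Deleting the data source and index does not touch the original files, which is why the bucket is emptied and removed as a separate step.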

Conclusion

In this post, we discussed the new data formats that Amazon Kendra now supports. In addition, we discussed how to use Amazon Kendra to ingest and perform a search on these new document types stored in an S3 bucket. To learn more about the different data formats supported, refer to Types of documents.

We introduced you to the basics, but there are many additional features that we didn’t cover in this post, such as the following:

  • You can enable user-based access control for your Amazon Kendra index and restrict access to users and groups that you configure.
  • You can map additional fields to Amazon Kendra index attributes and enable them for faceting, search, and display in the search results.
  • You can integrate different third-party data source connectors like ServiceNow and Salesforce with the Custom Document Enrichment (CDE) capability in Amazon Kendra to perform additional attribute mapping logic and even custom content transformation during ingestion. For the complete list of supported connectors, refer to Connectors.

To learn about these possibilities and more, refer to the Amazon Kendra Developer Guide.


About the authors

Rishabh Yadav is a Partner Solutions Architect at AWS with an extensive background in DevOps and Security offerings at AWS. He works with ASEAN partners to provide guidance on enterprise cloud adoption and architecture reviews, along with building AWS practice through the implementation of the Well-Architected Framework. Outside of work, he likes to spend his time on the sports field and playing FPS games.

Kruthi Jayasimha Rao is a Partner Solutions Architect with a focus on AI and ML. She provides technical guidance to AWS Partners in following best practices to build secure, resilient, and highly available solutions in the AWS Cloud.

Keerthi Kumar Kallur is a Software Development Engineer at AWS. He has been with the Amazon Kendra team for the past 2 years and has worked on various features as well as with customers. In his spare time, he likes outdoor activities such as hiking and sports such as volleyball.

Source: https://aws.amazon.com/blogs/machine-learning/new-expanded-data-format-support-in-amazon-kendra/