Introducing DocArray → A type-agnostic data structure!




Overview

For data scientists and engineers, speed is important along with accuracy. For accuracy, we built Finetuner, which lets you finetune neural networks to achieve top performance on downstream tasks. Concerning speed, Jina was already fast, but now it’s even faster. That’s all down to DocArray – the perfect scalable data structure for deep learning. DocArray has been created to remove all the shortcomings in existing data structures, especially for ML and data science-related tasks. Here is a comparison of DocArray with other data structures.

Installation

DocArray is included with Jina 3.0 and above, and it can be accessed after installation. You can install it:

  1. Via pip (no extra dependency installed): pip install docarray
  2. Via conda (no extra dependency installed): conda install -c conda-forge docarray
  3. Full install: pip install "docarray[full]"

For more details on installation, please refer to this documentation.

Features

DocArray is the first step towards standardizing a data structure for machine learning and data science. The aim of DocArray is to provide a single, robust, efficient, and powerful data structure for all your data processing needs. It comes packed with the following features:

  • Easy to use: DocArray is completely independent of Jina. It can be used to handle unstructured data without any prior knowledge of Jina’s framework.

  • Support for multiple data types: DocArray supports all kinds of data including text, image, audio, video, and data types like ndarray, JSON, or Pandas dataframe.

  • Easy operations: Sharing data over a network is easier with DocArray because Documents can be serialized and deserialized, which keeps what travels over the wire compact and fast to transmit. It is also easier to create, vectorize, and embed Documents without external processing.

  • Ultra-fast: DocArray is much faster than the previous DocumentArray that was built into Jina core. We conducted tests on 100,000 Documents/DocumentArrays averaged over 5 repetitions, and the results speak for themselves:

[Image: benchmark results comparing DocArray with the previous DocumentArray]

  • Neural Search Ready: DocArray is a one-stop solution for all your data pre-processing needs. It comes pre-loaded with basic functionalities like .embed(), .match(), etc., which keep it compatible with Jina core and let you build neural search solutions in no time.

  • Less overhead for networking: Now being independent, DocArray has fewer layers and allows better access. DocArray lets you interoperate with other frameworks very easily.

Operations on DocArray

DocArray has two components: Document and DocumentArray. Document is the basic data type in Jina, and every piece of data, be it text, audio, video, etc., is converted into a Document for further processing. A DocumentArray is a group of Documents. DocArray allows users to manipulate and work with the data stored in Documents and DocumentArrays. Let’s look at them in detail:

Document

  • Construction

    1. Without any attributes:
    from docarray import Document
    doc = Document()
    2. With attributes (for example, text):
    from docarray import Document
    doc = Document(text='hello, world')
    3. Nested data: This is all about Documents inside Documents. When we talk about nesting, the important things to understand are granularity and adjacency:

      a. Granularity: nesting Documents vertically. This is achieved by the
      .chunks attribute.

      b. Adjacency: nesting Documents horizontally. This is achieved by the
      .matches attribute.

      To see these in action and know more about nesting data in DocArray, please see here.

      from docarray import Document
      doc = Document(id="one", chunks=[Document(id="two")])
  • Serialization: DocArray makes it really easy to send and receive Documents and it is designed to be "ready-to-wire". Serialization is supported for JSON, bytes, dict, and protobuf, and the code for all of them can be found here.

  • Embeddings: Embeddings are multi-dimensional representations of Documents. The Document’s .embedding attribute contains its vector embeddings. Examples and sample usage can be found here.

  • Visualization: Visualization is very important with image and video data. DocArray helps you achieve this using the .plot() method. If you want to see the organization of your nested data, you can use the .summary() method.

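
To make granularity and adjacency concrete, here is a minimal toy model in plain Python. It is not DocArray’s implementation: the class name ToyDoc and its helper methods are hypothetical, and only the field names id, text, chunks, and matches mirror the attributes described above. The sketch also shows a nested structure surviving a round trip through JSON, which is the idea behind "ready-to-wire" serialization.

```python
import json
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class ToyDoc:
    """Hypothetical stand-in for a Document; not the docarray class."""
    id: str
    text: str = ""
    chunks: List["ToyDoc"] = field(default_factory=list)   # vertical nesting (granularity)
    matches: List["ToyDoc"] = field(default_factory=list)  # horizontal nesting (adjacency)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so chunks and matches are included.
        return json.dumps(asdict(self))

    @classmethod
    def from_dict(cls, data: dict) -> "ToyDoc":
        # Rebuild the nested structure recursively from plain dictionaries.
        return cls(
            id=data["id"],
            text=data.get("text", ""),
            chunks=[cls.from_dict(c) for c in data.get("chunks", [])],
            matches=[cls.from_dict(m) for m in data.get("matches", [])],
        )

    @classmethod
    def from_json(cls, payload: str) -> "ToyDoc":
        return cls.from_dict(json.loads(payload))


# A paragraph split into two sentence-level chunks, plus one related document.
doc = ToyDoc(
    id="one",
    text="First sentence. Second sentence.",
    chunks=[
        ToyDoc(id="two", text="First sentence."),
        ToyDoc(id="three", text="Second sentence."),
    ],
    matches=[ToyDoc(id="four", text="A similar paragraph.")],
)

# Serialize to a JSON string and rebuild an equal object from it.
restored = ToyDoc.from_json(doc.to_json())
```

Chunks describe the parts of a Document, while matches point to other Documents that are related to it; both are ordinary lists of Documents, so they can be nested to any depth.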

DocumentArray

  • Constructing a DocumentArray: You can construct a DocumentArray using a single file, multiple local files, a list of Documents, and even from empty Documents. A simple DocumentArray would look something like this:
from docarray import DocumentArray
docs = DocumentArray()

That’s not all. You can perform many other operations while creating a DocumentArray:

  • Serialization: Serialization of DocumentArray is similar to that of a Document. You can also serialize a DocumentArray to be sent across a network, in bytes, JSON, cloud, base64, Protobuf, list, or dataframe. An in-depth overview of all the processes is listed here.

  • Access elements: DocumentArray elements can be accessed as easily as items in a Python list. DocumentArray also lets you access deeply nested Documents by id, Ellipsis, multiple ids, boolean masks, or nested structure, and the usage can be seen here.

  • Nearest neighbors: DocArray also makes it easy to find nearest neighbors among the Documents in a DocumentArray. If the .embeddings attribute is set for a DocumentArray, we can use the .match() method to find the nearest neighbor Documents.

  • Evaluate matches: You can evaluate the result using the .evaluate() method of DocArray. The results are stored in the .evaluations field of the Documents. You can get more information about performance metrics such as average precision, reciprocal rank, f1 score, etc., in this document.
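
Two of the metrics named above are easy to compute by hand for a single ranked result list. The following is a minimal sketch in plain Python, assuming binary relevance (1 for a relevant match, 0 otherwise). The function names are illustrative and are not DocArray’s API, and this version of average precision averages over the relevant results that were actually retrieved.

```python
from typing import Sequence


def reciprocal_rank(relevance: Sequence[int]) -> float:
    """Return 1 / rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / rank
    return 0.0


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant result."""
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


# Ranked matches for one query: the 2nd and 4th results are relevant.
ranked = [0, 1, 0, 1]
rr = reciprocal_rank(ranked)    # 1 / 2
ap = average_precision(ranked)  # (1/2 + 2/4) / 2
```

Both scores reward relevant matches that appear early in the list, which is why they are common choices for judging nearest-neighbor results.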

Examples

Now that we have covered the features of DocArray and what makes it well suited to deep learning, it’s time to see its capabilities in action. Let’s look at examples of DocArray for two different data types:

Manipulating Text with DocArray

Let’s see a simple example of representing text in DocArray. Representing text is as simple as creating an instance of Document and adding the text to it. The Document supports data in any language. In this first step, data is converted into a Document to be ready for further processing.

from docarray import Document

# convert a simple sentence into a Document in Jina
doc = Document(text='hello, world')

# convert a sentence in Hindi to a Document in Jina
doc = Document(text='👋 नमस्ते दुनिया!')

Diving a bit deeper, let’s see how we can build a very simple text matching function. The idea is to input a query sentence, match it with sentences in our dataset, and return the matched sentences.

First, we need to load the dataset from a URL, convert it into text, and put it into a Document. The dataset is the e-book of Pride and Prejudice that can be found here.

from docarray import Document
doc = Document(uri="https://www.gutenberg.org/files/1342/1342-0.txt").load_uri_to_text()

Next, since our dataset is one long block of text, we need to break it into smaller pieces that can be stored in a DocumentArray. We split the text on the ‘\n’ symbol, i.e. whenever a new line is encountered, we store that non-empty line as a Document in the DocumentArray.

from docarray import Document, DocumentArray
docs = DocumentArray(Document(text=s.strip()) for s in doc.text.split('\n') if s.strip())

Next comes the vectorization of features (i.e. we need to convert our features into indices in a vector/matrix). The resulting vectors become the embeddings of each Document in our DocumentArray. There are many ways to do this, but a fast and space-efficient one is feature hashing. It works by applying a hash function to each feature and using the resulting hash value as an index into a fixed-size vector. DocArray saves us from writing this ourselves, and using feature hashing is as easy as a single line of code:

docs.apply(lambda doc: doc.embed_feature_hashing())
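
For intuition, feature hashing can be sketched in a few lines of plain Python. This illustrates the general technique only; it is not how embed_feature_hashing() is implemented, and the function name, the whitespace tokenizer, the MD5 hash, and the default of 16 dimensions are all assumptions made for the example.

```python
import hashlib
from typing import List


def hash_features(text: str, n_dim: int = 16) -> List[int]:
    """Map a sentence to a fixed-length count vector via the hashing trick."""
    vector = [0] * n_dim
    for token in text.lower().split():
        # Use a stable hash; the built-in hash() is salted per process for strings.
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_dim  # the hash value picks the vector index
        vector[index] += 1               # colliding tokens share a slot
    return vector


vec = hash_features("she entered the room")
```

The vector length is fixed in advance, so no vocabulary has to be built or stored. The price is that different tokens may collide in the same slot.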

Finally, we take our query sentence, convert it into a Document, vectorize it, and then match it against the vectors of the Documents in the DocumentArray. We use "she entered the room" from Pride and Prejudice as the query sentence and try to fetch similar sentences.

# query sentence
query = (
    Document(text="she entered the room")
    .embed_feature_hashing()
    .match(docs, limit=5, exclude_self=True, metric="jaccard", use_scipy=True)
)

# print the results
print(query.matches[:, ('text', 'scores__jaccard')])

Here are the arguments used in the .match() method:

  • limit : It specifies the number of results to be returned. In our example, we specified it to be 5, so 5 sentences will be returned.
  • exclude_self: If set to True, the sentence won’t be matched to itself. Otherwise, it will match itself and return the query sentence as one of the results.
  • metric: It defines which metric is used for calculating the nearest neighbors. We use the jaccard metric in this example.
  • use_scipy: By default, the computation backend is chosen automatically depending on the embeddings. Here, we set it to True so that scipy is used to compute the distances.
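
To show what limit, exclude_self, and metric control, here is a small nearest-neighbor search written in plain Python. It is a sketch under simplifying assumptions, not DocArray’s .match(): sentences are compared as sets of lowercase words rather than as hashed embeddings, "self" is detected by comparing text, and the names jaccard_distance and match are hypothetical.

```python
from typing import List, Tuple


def jaccard_distance(a: str, b: str) -> float:
    """One minus intersection-over-union of the two sentences' word sets."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    if not union:
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


def match(query: str, corpus: List[str], limit: int = 5,
          exclude_self: bool = True) -> List[Tuple[str, float]]:
    """Return up to `limit` sentences closest to `query`, smallest distance first."""
    candidates = [s for s in corpus if not (exclude_self and s == query)]
    scored = [(s, jaccard_distance(query, s)) for s in candidates]
    scored.sort(key=lambda pair: pair[1])  # stable sort keeps corpus order on ties
    return scored[:limit]


corpus = [
    "she entered the room",
    "she left the room",
    "he entered the house",
    "it was a truth universally acknowledged",
]
results = match("she entered the room", corpus, limit=2)
```

With exclude_self=False, the query sentence itself comes back first with a distance of 0, which is rarely useful when the query is drawn from the dataset being searched.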

Manipulating Audio with DocArray

Besides text, we can also manipulate audio files using DocArray. In this example, we will reverse an audio file. The first step is to convert an audio file (a .wav file in this example) into a Document by loading it as an audio blob. Then we reverse this blob, which in turn reverses the audio. You can listen to the input and output files for this example here.

from docarray import Document
doc = Document(uri="hello.wav").load_uri_to_audio_blob()
doc.blob = doc.blob[::-1]
doc.save_audio_blob_to_file("elloh.wav")
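
The same reversal can be reproduced with the standard library alone, which makes the slicing step easier to inspect. This is a sketch, assuming 16-bit mono PCM audio; the function name reverse_pcm16 is hypothetical, and the samples are synthesized in memory so that no audio file is needed.

```python
from array import array


def reverse_pcm16(frames: bytes) -> bytes:
    """Reverse 16-bit mono PCM audio sample by sample."""
    samples = array("h")   # "h" = signed 16-bit integers
    samples.frombytes(frames)
    samples.reverse()      # same effect as blob[::-1] on a 1-D array
    return samples.tobytes()


# Four synthetic samples standing in for the contents of a .wav file.
original = array("h", [0, 1000, -2000, 3000])
reversed_frames = reverse_pcm16(original.tobytes())

# Decode the reversed bytes back into samples for inspection.
restored = array("h")
restored.frombytes(reversed_frames)
```

The reversal operates on whole samples rather than on raw bytes. Reversing the byte string directly would also swap the two bytes inside each 16-bit sample and produce noise instead of backwards audio.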

Being able to manipulate different data types with a single tool like DocArray speeds up the process of building deep learning applications. You can find examples of all the data types here.

Summary

DocArray closely aligns with Python’s list, making it easy and intuitive for developers to get started. Behind the simple syntax, DocArray offers many capabilities that Python’s list lacks. It extends from one-dimensional data types like text to multidimensional, complex data types like images and audio. It also lets you perform many data processing tasks out of the box through simple, abstracted functions. Data professionals will agree that understanding and pre-processing data efficiently is key to building scalable, fast AI applications.
