For data scientists and engineers, speed is important along with accuracy. For accuracy, we built Finetuner, which lets you finetune neural networks to achieve top performance on downstream tasks. Concerning speed, Jina was already fast, but now it’s even faster. That’s all down to DocArray – the perfect scalable data structure for deep learning.
DocArray was created to address the shortcomings of existing data structures, especially for ML and data science tasks. Here is a comparison of DocArray with other data structures.
DocArray is included with Jina 3.0 and above, and it is available as soon as Jina is installed. You can also install it on its own:
- Via Pip (no extra dependency installed):
pip install docarray
- Via conda (no extra dependency installed):
conda install -c conda-forge docarray
- Full install:
pip install "docarray[full]"
For more details on installation, please refer to this documentation.
DocArray is the first step towards standardizing a data structure for machine learning and data science. The aim of DocArray is to provide a single, robust, efficient, and powerful data structure for all your data processing needs. It comes packed with the following features:
Easy to use: DocArray is completely independent of Jina. It can be used to handle unstructured data without any prior knowledge of Jina’s framework.
Support for multiple data types: DocArray supports all kinds of data, including text, image, audio, and video, as well as data types like ndarray, JSON, and Pandas dataframes.
Easy operations: Sharing data over a network is easier with DocArray since it supports serialization and deserialization of data, which keeps transfers fast and reliable. It’s also easier to create, vectorize, and embed Documents without external processing.
DocArray is much faster than the previous DocumentArray that was built into Jina core. We conducted tests on 100,000 DocumentArrays averaged over 5 repetitions, and the results speak for themselves:
Neural Search Ready: DocArray is a one-stop solution for all your data pre-processing needs. It comes pre-loaded with basic functionality like .match() to ensure easy compatibility with Jina core, letting you build neural search solutions in no time.
Less overhead for networking: Now being independent, DocArray has fewer layers and allows better access. DocArray lets you interoperate with other frameworks very easily.
Operations on DocArray
DocArray has two components: Document and DocumentArray.
Document is the basic data type in Jina: every piece of data, be it text, audio, or video, is converted into a Document for further processing. A DocumentArray is a group of Documents.
DocArray allows users to manipulate and work with the data stored in DocumentArrays. Let’s look at them in detail:
- Without any attributes:
from docarray import Document
doc = Document()
- With attributes:
from docarray import Document
# pass attributes as keyword arguments, for example the text content
doc = Document(text='hello, world')
Nested data: This is all about Documents inside Documents. When we talk about nesting, the important things to understand are granularity and adjacency:
a. Granularity: nesting Documents vertically. This is achieved by the .chunks attribute.
b. Adjacency: nesting Documents horizontally. This is achieved by the .matches attribute.
To see these in action and know more about nesting data in DocArray, please see here.
from docarray import Document
doc = Document(id="one", chunks=[Document(id="two")])
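To make the two nesting directions concrete, here is a minimal plain-Python sketch. It does not use DocArray: the `Node` class and the `depth` helper are hypothetical stand-ins, and the `chunks` and `matches` fields only mimic the idea of vertical and horizontal nesting.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical stand-in for a Document; not DocArray's class."""
    id: str
    chunks: List["Node"] = field(default_factory=list)   # vertical nesting (granularity)
    matches: List["Node"] = field(default_factory=list)  # horizontal nesting (adjacency)


def depth(node: Node) -> int:
    """Count the granularity levels at or below this node, following chunks only."""
    return 1 + max((depth(c) for c in node.chunks), default=0)


# A book split into two paragraphs, one of which is split into a sentence.
sentence = Node(id="s1")
book = Node(id="book", chunks=[Node(id="p1", chunks=[sentence]), Node(id="p2")])

# A related item sits next to the book at the same level; it adds no depth.
book.matches.append(Node(id="similar-book"))

print(depth(book))                   # levels reached through chunks
print([m.id for m in book.matches])  # neighbours reached through matches
```

Going down through chunks changes the level of detail, while matches only attach neighbours at the same level.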
Serialization: DocArray makes it really easy to send and receive Documents, and it is designed to be "ready-to-wire". Serialization is supported for JSON, bytes, dict, and protobuf, and the code for all of them can be found here.
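As an illustration of what "ready-to-wire" means in practice, the sketch below round-trips a nested record through JSON bytes using only the standard library. The plain dict stands in for a Document, and the `encode` and `decode` helpers are hypothetical names chosen for this example, not DocArray methods.

```python
import json

# A plain dict stands in for a Document here (an assumption for illustration).
record = {
    "id": "one",
    "text": "hello, world",
    "chunks": [{"id": "two", "text": "नमस्ते"}],
}


def encode(item: dict) -> bytes:
    """Serialize to UTF-8 JSON bytes that can be written to a socket or HTTP body."""
    return json.dumps(item, ensure_ascii=False).encode("utf-8")


def decode(payload: bytes) -> dict:
    """Rebuild the original structure on the receiving side."""
    return json.loads(payload.decode("utf-8"))


payload = encode(record)
restored = decode(payload)
print(restored == record)
```

The useful property is the round trip: whatever is sent can be rebuilt unchanged on the other side, including nested items and non-ASCII text.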
Embeddings: Embeddings are multi-dimensional representations of Documents. The Document’s .embedding attribute contains its vector embedding. Examples and sample usage can be found here.
Visualization: Visualization is very important with image and video data. DocArray helps you achieve this using the .plot() method. If you want to see the organization of your nested data, you can use the
- Constructing a DocumentArray: You can construct a DocumentArray from a single file, multiple local files, a list of Documents, or even empty Documents. A simple DocumentArray would look something like this:
from docarray import DocumentArray
docs = DocumentArray()
That’s not all. You can perform many other operations while creating a DocumentArray:
Serialization: Serialization of a DocumentArray is similar to that of a Document. You can also serialize a DocumentArray to be sent across a network, in bytes, JSON, cloud, base64, Protobuf, list, or dataframe. An in-depth overview of all the processes is listed here.
DocumentArray elements can be accessed as easily as Python lists. DocumentArray lets you access deeply-nested Documents by id, ellipsis, multiple ids, boolean masks, or nested structure, and the usage can be seen here.
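The access patterns listed above can be pictured with an ordinary Python list. This is only a sketch of the idea: the dicts stand in for Documents, and the lookup table and mask handling are written by hand here rather than taken from DocArray.

```python
docs = [
    {"id": "a", "text": "first"},
    {"id": "b", "text": "second"},
    {"id": "c", "text": "third"},
]

# By position and by slice, exactly like a Python list.
first = docs[0]
last_two = docs[1:]

# By id: build a lookup table once, then fetch one or several ids.
by_id = {d["id"]: d for d in docs}
picked = [by_id[i] for i in ("c", "a")]

# By boolean mask: keep the items whose mask entry is True.
mask = [True, False, True]
masked = [d for d, keep in zip(docs, mask) if keep]

print(first["text"])
print([d["id"] for d in last_two])
print([d["id"] for d in picked])
print([d["id"] for d in masked])
```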
DocArray also makes it easy to find nearest neighbors among the Documents in a DocumentArray. If the .embeddings attribute is set for a DocumentArray, we can use the .match() method to find the nearest neighbors.
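Conceptually, a nearest-neighbor search compares the query embedding with every stored embedding and ranks the results by distance. The following brute-force sketch shows that idea with cosine distance in plain Python; the toy vectors and the `cosine_distance` helper are assumptions made for this example and say nothing about how DocArray implements matching internally.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity; smaller means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


# Toy two-dimensional embeddings keyed by a label.
embeddings = {
    "cat": [1.0, 0.0],
    "kitten": [0.9, 0.1],
    "car": [0.0, 1.0],
}
query = [0.8, 0.3]

# Rank every stored vector by its distance to the query.
ranked = sorted(embeddings, key=lambda name: cosine_distance(query, embeddings[name]))
print(ranked)
```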
Evaluate matches: You can evaluate the results using DocArray. The results are stored in the .evaluations field of the Documents. You can get more information about performance metrics such as average precision, reciprocal rank, and f1 score in this document.
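For readers unfamiliar with these ranking metrics, here is a small standalone sketch of three of them. The function names and the toy ranking are hypothetical, and the formulas follow common textbook definitions, which may differ in detail from what DocArray computes.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k results that are relevant."""
    return sum(1 for item in ranked[:k] if item in relevant) / k


def reciprocal_rank(ranked, relevant):
    """1 / position of the first relevant result, or 0 if there is none."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values taken at each relevant hit."""
    hits, total = 0, 0.0
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


ranked = ["d1", "d3", "d2", "d7"]  # results as returned, best first
relevant = {"d1", "d2"}            # ground truth for this query

print(precision_at_k(ranked, relevant, 2))
print(reciprocal_rank(ranked, relevant))
print(average_precision(ranked, relevant))
```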
Now that we’ve covered the features of DocArray and what makes it the most suitable data structure for deep learning, it’s time to see its capabilities in action. Let’s look at two examples of DocArray for two different data types:
Manipulating Text with DocArray
Let’s see a simple example of representing text in DocArray. Representing text is as simple as creating an instance of Document and adding the text to it. A Document supports text in any language. In this first step, data is converted into a Document to be ready for further processing.
from docarray import Document
# convert a simple sentence into a Document in Jina
doc = Document(text='hello, world')
# convert a sentence in Hindi to a Document in Jina
doc = Document(text='👋 नमस्ते दुनिया!')
Diving a bit deeper, let’s see how we can build a very simple text matching function. The idea is to input a query sentence, match it with sentences in our dataset, and return the matched sentences.
First, we need to load the dataset from a URL, convert it into text, and put it into a Document. The dataset is the e-book of Pride and Prejudice that can be found here.
from docarray import Document
doc = Document(uri="https://www.gutenberg.org/files/1342/1342-0.txt").load_uri_to_text()
Next, since our dataset is one long block of text, we need to break it into smaller chunks that can be converted into a DocumentArray. We split the text on the newline character '\n', i.e. whenever a new line is encountered, we store that line as a Document in the DocumentArray.
from docarray import DocumentArray
docs = DocumentArray(Document(text=s.strip()) for s in doc.text.split('\n') if s.strip())
Next comes the vectorization of features (i.e. we need to convert our features into indices in a vector/matrix). The features in this example become the embeddings of each Document in our DocumentArray. There are many ways to do this, but a fast and space-efficient way is feature hashing. It works by applying a hash function to each feature and using the hash value as an index into the vector. DocArray saves us from implementing this ourselves, and using feature hashing is as easy as a single line of code:
docs.apply(lambda doc: doc.embed_feature_hashing())
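To show what happens behind that single line, here is a minimal feature-hashing sketch in plain Python. It is an illustration under stated assumptions, not DocArray's implementation: the `feature_hash` function, the whitespace tokenizer, the SHA-256 hash, and the 16-dimensional vector are all choices made for this example.

```python
import hashlib


def feature_hash(text: str, n_dim: int = 16) -> list:
    """Map each token to a bucket via a hash function and count occurrences."""
    vector = [0] * n_dim
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # The hash value, reduced modulo the vector size, is the token's index.
        index = int.from_bytes(digest[:8], "big") % n_dim
        vector[index] += 1
    return vector


a = feature_hash("she entered the room")
b = feature_hash("the room she entered")

print(len(a), sum(a))  # fixed dimension, one count per token
print(a == b)          # word order does not matter in this scheme
```

No vocabulary has to be stored: the vector size is fixed up front, at the cost of occasional collisions where two different tokens land in the same bucket.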
Finally, we take our query sentence, convert it into a Document, vectorize it, and then match it against the vectors of the Documents in the DocumentArray. We take the query sentence "she entered the room" from Pride and Prejudice and try to fetch similar sentences.
# query sentence
query = (
    Document(text="she entered the room")
    .embed_feature_hashing()
    .match(docs, limit=5, exclude_self=True, metric="jaccard", use_scipy=True)
)
# print the results
print(query.matches[:, ('text', 'scores__jaccard')])
Here are the arguments used in the .match() call:
- limit: It specifies the number of results to be returned. In our example, we specified it to be 5, so 5 sentences will be returned.
- exclude_self: If set to True, the sentence won’t be matched to itself. Otherwise, it will match itself and return the query sentence as one of the results.
- metric: It defines which metric is used for calculating the nearest neighbors. We use the jaccard metric in this example.
- use_scipy: The computation framework is normally chosen automatically depending on the embeddings. Here, we set it to True so that SciPy performs the distance computation.
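The effect of limit and exclude_self can be seen in a small standalone sketch. It is a simplification and not DocArray's code: the `match` function below is a hypothetical helper, it computes the Jaccard distance on sets of words rather than on hashed vectors, and it detects "self" by comparing the text rather than a Document id.

```python
def jaccard_distance(a: set, b: set) -> float:
    """1 - |intersection| / |union|; 0.0 means the sets are identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def match(query: str, corpus: list, limit: int = 5, exclude_self: bool = False) -> list:
    """Return up to `limit` (distance, sentence) pairs, closest first."""
    query_tokens = set(query.lower().split())
    scored = []
    for sentence in corpus:
        if exclude_self and sentence == query:
            continue  # skip the query itself when it is part of the corpus
        tokens = set(sentence.lower().split())
        scored.append((jaccard_distance(query_tokens, tokens), sentence))
    scored.sort(key=lambda pair: pair[0])
    return scored[:limit]


corpus = [
    "she entered the room",
    "she left the room",
    "he entered the house",
    "it was a truth universally acknowledged",
]

with_self = match("she entered the room", corpus, limit=2)
without_self = match("she entered the room", corpus, limit=2, exclude_self=True)

print([sentence for _, sentence in with_self])
print([sentence for _, sentence in without_self])
```

Without exclude_self, the best hit is the query sentence itself at distance 0, which uses up one of the limit slots.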
Manipulating Audio with DocArray
Besides text, we can even manipulate audio files using DocArray. In this example, we will reverse an audio file. The first step is to convert an audio file (a .wav file in this example) into a Document by loading it and then converting it into an audio blob. Then we can reverse this blob, which in turn reverses the audio. You can listen to the input and output files for this example here.
from docarray import Document
doc = Document(uri="hello.wav").load_uri_to_audio_blob()
doc.blob = doc.blob[::-1]
doc.save_audio_blob_to_file("elloh.wav")
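For comparison, the same idea can be sketched with Python's standard `wave` module alone. This is an illustration, not what DocArray does internally: the `reverse_wav` helper is a hypothetical name, and the example builds a tiny four-sample clip in memory so that it runs without an audio file.

```python
import io
import struct
import wave


def reverse_wav(data: bytes) -> bytes:
    """Return WAV bytes whose frames play in reverse order."""
    with wave.open(io.BytesIO(data), "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())
    frame_size = params.sampwidth * params.nchannels
    # Reverse whole frames, not single bytes, so each sample stays intact.
    chunks = [frames[i:i + frame_size] for i in range(0, len(frames), frame_size)]
    out = io.BytesIO()
    with wave.open(out, "wb") as dst:
        dst.setparams(params)
        dst.writeframes(b"".join(reversed(chunks)))
    return out.getvalue()


# Build a tiny mono 16-bit clip in memory.
samples = [0, 1000, 2000, 3000]
buf = io.BytesIO()
with wave.open(buf, "wb") as writer:
    writer.setnchannels(1)
    writer.setsampwidth(2)
    writer.setframerate(8000)
    writer.writeframes(struct.pack("<4h", *samples))

reversed_bytes = reverse_wav(buf.getvalue())
with wave.open(io.BytesIO(reversed_bytes), "rb") as reader:
    reversed_samples = list(struct.unpack("<4h", reader.readframes(4)))

print(reversed_samples)
```

Working on the raw file means the frame boundaries have to be handled by hand, which is the bookkeeping that loading the audio into an array-like blob removes.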
Being able to manipulate different data types with a powerful tool like DocArray speeds up the building of deep learning applications. You can find examples of all the data types here.
DocArray closely aligns with Python’s list, making it easy and intuitive for developers to get started. Behind the simple syntax, DocArray offers many capabilities that Python’s list lacks. It extends from one-dimensional data types like text to multidimensional, complex data types like images and audio. It also lets you perform many data processing tasks out of the box with its simple, abstracted functions. Data professionals will agree that efficient understanding and pre-processing of data is key to building scalable, fast AI applications.