Resolving Practical Challenges of Data Lakes and Versioning

In today’s world of software and the internet, it is fair to say that almost all applications depend on data: user data, page views, ad clicks, and so on. Managing all this data has evolved through the years. From the days of storing structured data in blocks, files, and databases, we’ve moved to cloud-based storage systems that leverage object storage models.

These object storage models are highly effective at managing unstructured data because they store the original data along with its metadata, attributes, and a unique key that identifies it. Most data-reliant applications today use data lakes to store structured as well as unstructured data. The data is converted into a structured format only when it needs to be read, which keeps the process simple and flexible.
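
To make the model concrete, here is a rough sketch of what writing and inspecting an object looks like with the AWS CLI against S3; the bucket name, key, and metadata values are made up for illustration.

# Write an object: the data itself plus metadata, addressed by a unique key (hypothetical bucket/key)
aws s3api put-object --bucket my-data-lake --key raw/events/2023/clicks.json \
  --body clicks.json --metadata source=web,schema-version=1

# Read back only the metadata and attributes stored for that key
aws s3api head-object --bucket my-data-lake --key raw/events/2023/clicks.json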

Challenges with Data Lakes

If you’ve ever worked with large data lakes and data sets, you have probably run into sensitive data being deleted from object storage with no way to recover it. Sometimes recovery policies don’t work, or they are extremely tedious and require too many permissions. At other times, you may have struggled to modify data or code without affecting the original data.

These challenges can be detrimental to a project if not resolved in time. The best way to deal with them is data versioning. Git is the most popular versioning tool, but for data lakes and large-scale storage, another solution is LakeFS.

Resolving Data Issues with Open Source LakeFS

LakeFS is a data management and data versioning tool that provides git-like functionality for data. It allows you to create atomic, versioned data lakes and time-travel from one version of the data to another using simple git-like commands or a web interface. You can then use this data to run all kinds of data jobs, from complex ETL to ML models and analytics.

The best part: LakeFS is open source.

Using LakeFS, we can version data in much the same way we version code with git. We can commit new changes, create new branches for new experiments, and merge them back into the main branch or reset them. You can also revert to an old commit if there is bad data in your current version.
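
As a rough sketch, that workflow looks something like this with the lakectl command-line tool; the repository and branch names are placeholders, and the commit ID in the last line stands in for a real one (the demo below walks through the concrete steps).

# Create an isolated branch for an experiment (like "git checkout -b")
lakectl branch create lakefs://my-repo/experiment --source lakefs://my-repo/main

# Commit the staged changes on that branch (like "git commit")
lakectl commit lakefs://my-repo/experiment -m "try a new feature set"

# Merge the branch back into main (like "git merge")
lakectl merge lakefs://my-repo/experiment lakefs://my-repo/main

# Undo the changes introduced by a bad commit (like "git revert")
lakectl branch revert lakefs://my-repo/main <commit-id>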

LakeFS Architecture

LakeFS uses object-based storage models and a copy-on-write mechanism to version data. Let me break it down into small pieces for easy understanding. We discussed object storage in the introduction; copy-on-write means there is no actual duplication of data. LakeFS creates logical copies based on the metadata of each version of the data, in effect a snapshot at that point in time, and links each object to a path that the LakeFS protocol understands.

LakeFS uses SSTables (the Graveler model) for commits, storing metadata about the snapshot of the data at each commit. When you access a particular commit, it uses those key-value pairs of metadata to rebuild that version. By leveraging the object store this way, we can create as many logical copies of the data (commits) as we like and travel back to any point in time, which is what makes LakeFS so scalable.

How to Use LakeFS for Data Versioning

Now let’s run through a quick demo. I’ll walk you through a short use case of LakeFS for a small repo. I’ll be creating a new object for the iris data, a teeny-tiny dataset.

First, to create a local LakeFS setup, I use docker-compose to create the containers for the object model and the PostgreSQL database. Run this command from the official LakeFS documentation to start the containers.

curl https://compose.lakefs.io | docker-compose -f - up

Verify your installation at http://127.0.0.1:8000/setup

If your setup is successful, you’ll see the initial user setup screen.

Once you create the user, you’ll get a set of credentials that you’ll use to access the repos.

Save the credentials somewhere or download the YAML file provided by LakeFS, because they won’t be shown again. Now, head over to the login page and enter the access key ID and secret key to access the repositories. You’ll find an empty repository list with the option to create one.
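
If you also want to follow along from the command line, the same credentials can be used to point the lakectl CLI at the local server; this is optional, and the rest of the walkthrough uses the web UI with rough CLI equivalents sketched alongside.

# lakectl prompts interactively for the credentials and endpoint
lakectl config
# Access key ID: <key ID from the setup screen>
# Secret access key: <secret key from the setup screen>
# Server endpoint URL: http://127.0.0.1:8000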

Here, you can either create a new repo on your own or import it. I’m creating a repository with the name “test-repo”.

The storage namespace is local://test and the default branch will be main. Now, click on Create repository. The empty repo is now created and we can upload our data into it.
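
For reference, the CLI equivalent of this step would look roughly like the following; the flag name is as I recall it from the lakectl docs, so double-check with lakectl repo create --help.

# Create the repository over the local storage namespace, with main as the default branch
lakectl repo create lakefs://test-repo local://test --default-branch main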

We can see several options — commits, branches, tags, actions, user settings and administration settings.

In the upload object option, I’m uploading the iris.data file, which contains the Iris dataset. After the upload is complete, we’ll commit the changes with the message “first-commit”.
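
From the CLI, the upload-and-commit step would look roughly like this, assuming iris.data sits in the current directory.

# Stage the file on the main branch, then commit it
lakectl fs upload lakefs://test-repo/main/iris.data --source ./iris.data
lakectl commit lakefs://test-repo/main -m "first-commit"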

Now, under the Commits tab, we can see the list of commits along with details about each one. If we click on a commit, we can see a detailed list of the changes it introduced.
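
The same history is available from the command line, analogous to git log.

# Show the commit history of the main branch
lakectl log lakefs://test-repo/main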

Now, we’ll create a new branch called “test-branch” using the Create Branch option under the Branches tab. Here, add the iris.names file.

Similarly, we’ll commit the new changes on that branch, which will not disturb the main branch because commits in LakeFS are atomic and versioned.
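
The CLI version of this branch-and-commit flow would look roughly like this.

# Branch off main, add the new file on the branch, and commit it there
lakectl branch create lakefs://test-repo/test-branch --source lakefs://test-repo/main
lakectl fs upload lakefs://test-repo/test-branch/iris.names --source ./iris.names
lakectl commit lakefs://test-repo/test-branch -m "add iris.names"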

We can see that “test-branch” now has one more commit than the main branch.
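
To confirm this from the command line, a diff between the two refs should show the file added on the branch, analogous to git diff.

# Compare main against test-branch
lakectl diff lakefs://test-repo/main lakefs://test-repo/test-branch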

Now, with this data in place, we can perform data analysis, run jobs, create git-like actions, add hooks, build ML models, and, most importantly, plug all of this into a pipeline and cloud storage easily.
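
One way to plug this into existing tooling: LakeFS exposes an S3-compatible endpoint, so standard S3 clients can read from a specific branch by treating the repository as a bucket and the branch as a key prefix. Here is a sketch with the AWS CLI, assuming it has been configured with the LakeFS access key and secret from the setup step.

# Read iris.data from test-branch through the local lakeFS S3 gateway
aws s3 cp s3://test-repo/test-branch/iris.data ./iris.data --endpoint-url http://127.0.0.1:8000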

Last but not least, LakeFS lets you configure default policies, such as a retention policy for garbage collection and branch protection rules, and it provides a command-line interface through the lakectl command.
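
A couple of hedged examples of those policy commands; the subcommand names are as I recall them from the lakectl documentation, so verify with lakectl --help before relying on them.

# Protect main from direct writes and deletes
lakectl branch-protect add lakefs://test-repo main

# Apply garbage-collection retention rules from a local JSON file (hypothetical file name)
lakectl gc set-config lakefs://test-repo -f gc-rules.json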

Conclusion

Data science is already booming, and data versioning will gain more traction soon. Knowing a tool like LakeFS will definitely give you an edge over your peers. I encourage you to try it for yourself and see how simple and easy it is to use.

Thank you for reading this article!
