Technology – Paurian Café

A 7-Minute Introduction to DataOps

It was a typical business morning for the Technology Consulting Group in early 2000. The Internet was booming, technology was growing in every possible genre, and we were hiring. The secretary saw the new recruit’s email come up on screen. Ten minutes later, all the documents on the network had been overwritten with zeros… the document files were there, but the content was erased.

What first looked like an email containing a resumé was actually a virus. Millions of dollars of research, invoices and contacts were gone. We had backups, of course, but Iomega Jaz drives turned out to be unreliable. Within a few months, the prosperous and growing company closed its doors. All it took was losing its intellectual property – its data.

Fast-Forward to Today

There’s a new trend across many companies today regarding data, databases and reporting. For the past few years, companies have begun treating data, its structure, and its presentation with the same regard that software managers give to code. This is because businesses have experienced some serious hurt over the past two decades by not applying DataOps principles.

It can drive small businesses into bankruptcy; e.g. Technology Consulting Group.
It costs large businesses billions of dollars in damages; e.g. Samsung, Uber, Progressive.
It can even make large businesses go under; e.g. it nearly ended Pixar.

What is DataOps?

DataOps is a workflow process that ensures the quality, reliability, governance and security of data, schemata, queries and reports. It offers easy delivery, quick recovery, new insights, and change transparency. It captures metadata such as the person who made the change, the associate who requested the change, what exactly was changed, tickets associated with the change, and why. It amplifies feedback loops and encourages experimentation, allowing the team to learn from mistakes to achieve mastery.
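
As a rough illustration of the metadata described above, here is a minimal sketch of a change record. The class name, field names, and sample values are all hypothetical; real tooling (a commit message, a pull request template, a ticket system) would capture the same facts in its own way.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """One audited change to data, a schema, a query, or a report (hypothetical shape)."""
    changed_by: str                  # the person who made the change
    requested_by: str                # the associate who requested the change
    what_changed: str                # what exactly was changed
    reason: str                      # why the change was made
    tickets: tuple[str, ...] = ()    # tickets associated with the change
    changed_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )

    def summary(self) -> str:
        """Render a one-line, commit-message-style summary."""
        refs = ", ".join(self.tickets) if self.tickets else "no ticket"
        return (
            f"[{refs}] {self.what_changed} "
            f"(by {self.changed_by} for {self.requested_by}): {self.reason}"
        )


# Sample values are made up for illustration.
record = ChangeRecord(
    changed_by="avery",
    requested_by="jordan",
    what_changed="Raised list price in product_type seed",
    reason="Annual price adjustment",
    tickets=("DATA-101",),
)
print(record.summary())
```

Keeping the record immutable (`frozen=True`) is a design choice for the sketch: an audit entry that can be edited after the fact is not much of an audit entry.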

DataOps takes processes from three other well-known workflows in the IT industry: Agile, DevOps and Lean. The concept has been loosely applied in software engineering businesses since the late 2000s, but hadn’t been formalized until mid-2021.

The Trinity of DataOps

Agile methodology is a framework for project management that focuses on broken-down iterations of work called sprints. At the start of each sprint, a set of work items is divided amongst members of the team, or placed on a kanban board. At the end of each sprint, teams reflect on the work performed to find improvements in their strategy for the next sprint.

DevOps is the practice of team-sharing files through a central repository to coordinate and collaborate within and outside the team while communicating with each other through a set of tools. DevOps provides additional tools to consolidate work, build the product, document, unit test, systems test and, if testing is successful, deploy the product.

Lean is used to continually verify the quality of a product and the security methods that protect it. For example, definitions of data can change over time, and not having a system in place to check this allows misleading data to enter the system. Data that at one time meant “A”, now means “B” and should be handled differently. Data that could compromise the company or its clients also needs to be handled in special ways to ensure privacy.
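
One way to picture the "meant A, now means B" problem is to compare two versions of a data dictionary and flag every code whose meaning changed. This is only a sketch: the dictionary contents and the function name are invented for illustration, and a real system would pull definitions from wherever your governance team keeps them.

```python
def definition_drift(old, new):
    """Return {code: (old_meaning, new_meaning)} for codes whose meaning changed."""
    return {
        code: (old[code], new[code])
        for code in sorted(old.keys() & new.keys())
        if old[code] != new[code]
    }


# Hypothetical data dictionary, before and after a business rule change.
dictionary_v1 = {
    "status.A": "Account is active",
    "status.C": "Account is closed",
}
dictionary_v2 = {
    "status.A": "Account is awaiting approval",
    "status.C": "Account is closed",
}

drift = definition_drift(dictionary_v1, dictionary_v2)
for code, (before, after) in drift.items():
    print(f"{code}: '{before}' -> '{after}'")
```

A check like this could run whenever the dictionary changes, so that reports built on the old meaning get reviewed before anyone trusts them again.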

How Does DataOps Differ From DevOps?

DataOps incorporates many aspects of DevOps within its process and adds several factors of its own, but DataOps is not a superset of DevOps.

Feature	DataOps	DevOps
Sharing work on the same file?	Reports are siloed and SQL scripts are functionally atomic. This makes splitting work on the same entity very difficult. Pair programming is required to share development on the same file.	Source code is mapped by lines that are easily split between developers. Multiple developers can check out work on the same file at the same time.
What teams are involved?	Business Operations, Data Science, Business Intelligence, Data Governance, Data Management, Data Operations, IT Operations, Compliance	Engineering, IT Operations, Software Development, Quality Assurance, Security, User Experience Design, Operations
What skills are involved?	Data management, Data science, Data analysis, Data integration, Data quality, Data security, Statistics, Reporting, Business, IT operations, Data operations, Application engineering, Data engineering, Data governance	Requirements gathering, Application architecture, Software engineering, Software development, Application integrations, Coding, Testing, Quality control, Quality assurance, Security, IT operations, Continuous Integration, Continuous Delivery
What is the pipeline like?	Develop the data product; Manage the data resources (ETL); Test to ensure quality; Release to users; Manage usage; Monitor usage and results	Design the application / changes; Develop and build the application; Test to ensure quality; Release to users; Monitor usage and error logs
Agile Planning	Usually Kanban based; some work is planned up front for each sprint, but most work flows through the board as requests come in. Loosely structured; more organic.	Usually Scrum based; all work is planned up front for each sprint. Highly structured, mechanical and organized.
Lean	Focuses on source-of-truth and data governance principles while cards are pulled or distributed from the board throughout the sprint.	Focuses on DRY and SOLID coding principles after the sprint has begun and work has been assigned.

What tools can be used to apply DataOps?

DevOps Source Control

TFS, Subversion, or Git can hold source files like schemas, type table seeds, utility scripts, views, functions, procedures, configuration files, and certain types of ETL packages and reports. Basically anything that you can load in Notepad and read is a good candidate for Git. However, the nature of Tableau and Power BI report files requires a little more tooling.

Power BI recently added source control to its server and it is amazing! It separates the report definition from the other components and uploads the pieces to a Git repository. You can then compare the XML of the reports’ definitions (and other components) to see what changes took place. It provides a text box for developers to comment on version changes and commit only the reports that apply.

Tableau, however, only keeps a rolling backup of the latest 10 versions (the current one plus the prior 9) and ditches the rest. So for Tableau reports, either Git with LFS enabled or Bitbucket is a better option than relying on the server. Either option allows commits with comments, versioning, tagging, merging and conflict resolution. Of the two, I would recommend a cloud-managed Git repository system.

For data changes that are handled outside an administration console, such as values in type or look-up tables and system data like price changes, put the data change in a script that can be checked in, reviewed, and vetted in a test environment. Redgate provides some good tools for this, such as Flyway and SQL Data Compare.
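
To make the "data change as a script" idea concrete, here is a small sketch that applies versioned change scripts to an in-memory SQLite database and records which ones have run. It is not Redgate's tooling and does not use any of its APIs. The table names, prices, the `change_history` table, and the `V<number>__<description>.sql` naming are assumptions chosen for illustration.

```python
import sqlite3

# Hypothetical change scripts, as they might sit in a repository.
MIGRATIONS = {
    "V1__create_product_type.sql": """
        CREATE TABLE product_type (code TEXT PRIMARY KEY, list_price REAL NOT NULL);
        INSERT INTO product_type (code, list_price) VALUES ('BASIC', 10.00), ('PRO', 25.00);
    """,
    "V2__raise_pro_price.sql": """
        UPDATE product_type SET list_price = 27.50 WHERE code = 'PRO';
    """,
}


def version_of(name):
    """Extract the numeric version from a 'V<number>__<description>.sql' name."""
    return int(name.split("__")[0][1:])


def apply_migrations(conn, migrations):
    """Apply scripts not yet recorded in change_history, in version order."""
    conn.execute("CREATE TABLE IF NOT EXISTS change_history (script TEXT PRIMARY KEY)")
    done = {row[0] for row in conn.execute("SELECT script FROM change_history")}
    applied = []
    for name in sorted(migrations, key=version_of):
        if name in done:
            continue  # already applied in an earlier run
        conn.executescript(migrations[name])
        conn.execute("INSERT INTO change_history (script) VALUES (?)", (name,))
        conn.commit()
        applied.append(name)
    return applied


conn = sqlite3.connect(":memory:")
first_run = apply_migrations(conn, MIGRATIONS)
second_run = apply_migrations(conn, MIGRATIONS)  # nothing left to apply
pro_price = conn.execute(
    "SELECT list_price FROM product_type WHERE code = 'PRO'"
).fetchone()[0]
print(first_run, second_run, pro_price)
```

Because each price change lives in its own reviewed script, the history table doubles as a record of which changes reached which environment.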

Recommendation: Azure DevOps Git for SQL Server & Power BI, GitHub Actions for Oracle & Tableau, Redgate Flyway and SQL Backup Pro for data

Kanban Board

Jira, Azure DevOps, Monday and Wrike all come highly recommended. Since this is the central location for filing and working off requests and features, it’s best to research which of these would be best suited for you and your team.

Recommendation: Azure DevOps for SQL Server & Power BI, Jira or Monday for Oracle & Tableau

Lean methodology

Lean methodology rests on two pillars that provide a framework for all Lean projects: Continuous improvement and respect for people. It’s more about how to use the tools, people and resources at hand to create a feedback loop that improves process and product for the client and the workers. It can use the tools already mentioned, but adapts for the unique needs of your service. Consider a system that allows you to extend it with plugins and contains workflows and pipelines to automate as much of the development and deployment as possible.

Recommendation: Azure DevOps, Jira, Monday, Jenkins, GitLab … whatever fits your team and client needs best.

Conclusion

With tools and a structured process that involves the whole team, you can implement a process that protects the core components and intellectual property of your company. It allows you to efficiently and confidently release changes that affect everyone. If there’s ever a failure because of database, report, or data changes, you have a way to roll back quickly. Backups can only go so far, and should be last-ditch efforts to recover from a disaster. DataOps is a methodology that keeps your data and its availability safe and operational while keeping your teams productive.

Encountered x file(s) that should have been pointers, but weren’t

We encountered this issue when trying to merge master after someone had committed a slew of PDFs. Although we couldn’t identify the exact situation that caused this problem, the result was that nobody could merge master into their branches.

It looks like this was once filed as an issue with Git LFS itself: https://github.com/git-lfs/git-lfs/issues/1939

But amongst the threaded comments, it looks like there is a root issue with the file types that are used to compare files and that on strange and rare occasions a unicorn fart fills the git-void with pain by changing these on the server.

After several dozen attempts, I came across this gem of a command:

$ git lfs migrate import --everything --include='*.pdf'

migrate: override changes in your working copy? [Y/n] Y
migrate: changes in your working copy will be overridden ...
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (5940/5940), done
migrate: Updating refs: ..., done
migrate: checkout: ..., done

The pain was almost over. Let’s see how merge does, now.

$ git merge --ff-only

fatal: Not possible to fast-forward, aborting.

Hmmm… Okay. Let’s not force a fast-forward merge; let’s just do a regular merge.

$ git merge

warning: Cannot merge binary files: Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf (HEAD vs. refs/remotes/origin/bugfix/13032-ems-error-saving-ppc)
:
:
CONFLICT (add/add): Merge conflict in Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf
Auto-merging Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf
:
:

Then I performed a manual merge on all the files by taking the server’s file (on the right in Sublime Merge, when resolving conflicts).
But Sublime Merge couldn’t actually commit the non-changes, so I went back to the git console to wrap up tackling the lfs-merge nightmare (like a Dream Warrior to Krueger):

$ git commit

[bugfix/13032-ems-error-saving-ppc f093b0c49] Merge remote-tracking branch 'refs/remotes/origin/bugfix/13032-ems-error-saving-ppc' into bugfix/13032-ems-error-saving-ppc
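
As I understand it, the message means that files matched by an LFS rule were committed with their full content instead of as small LFS pointer files. A pointer file is plain text whose first line names the LFS spec version. The sketch below checks whether a blob's bytes look like such a pointer. The function name and the sample blobs are made up, and you would have to fetch the committed content yourself (for example with `git show HEAD:<path>`).

```python
# First line of every Git LFS pointer file.
LFS_SPEC_LINE = b"version https://git-lfs.github.com/spec/v1\n"


def looks_like_lfs_pointer(blob: bytes) -> bool:
    """True if the committed content starts like a Git LFS pointer file."""
    return blob.startswith(LFS_SPEC_LINE)


# A well-formed pointer; the oid and size are placeholders, not real values.
pointer_blob = (
    b"version https://git-lfs.github.com/spec/v1\n"
    b"oid sha256:" + b"0" * 64 + b"\n"
    b"size 12345\n"
)

# What a PDF committed without LFS would start with.
raw_pdf_blob = b"%PDF-1.7\n(binary content would follow)"

print(looks_like_lfs_pointer(pointer_blob))
print(looks_like_lfs_pointer(raw_pdf_blob))
```

A tracked PDF whose committed content fails this check is one of the files that "should have been pointers, but weren’t".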
