Many Businesses Aren’t Protecting This Valuable Asset

A 7-Minute Introduction to DataOps

It was a typical business morning for the Technology Consulting Group in early 2000. The Internet was booming, technology was growing in every possible genre, and we were hiring. The secretary saw the new recruit’s email come up on screen. Ten minutes later, all the documents on the network had been overwritten with zeros… the document files were there, but the content was erased.

What first looked like an email containing a resumé was actually a virus. Millions of dollars of research, invoices and contacts were gone. We had backups, of course, but Iomega Jaz drives turned out to be unreliable. Within a few months, the prosperous and growing company closed its doors. All it took was losing its intellectual property – its data.

Fast-Forward to Today

There’s a new trend across many companies today regarding data, databases and reporting. For the past few years, companies have been treating data, its structure, and its presentation with the same regard that software managers give code. This is because businesses have been seriously hurt over the past two decades by not applying DataOps principles.

  • It drives small businesses into bankruptcy; e.g. Technology Consulting Group.
  • It costs large businesses billions of dollars in damages; e.g. Samsung, Uber, Progressive.
  • It can even make large businesses go under; e.g. it nearly ended Pixar.

What is DataOps?

DataOps is a workflow process that ensures the quality, reliability, governance and security of data, schemata, queries and reports. It offers easy delivery, quick recovery, new insights, and change transparency. It captures metadata such as the person who made the change, the associate who requested the change, what exactly was changed, tickets associated with the change, and why. It amplifies feedback loops and encourages experimentation, allowing the team to learn from mistakes to achieve mastery.
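
As a rough illustration of the change metadata described above, the sketch below models a single change record in Python. The class and field names (`ChangeRecord`, `requested_by`, and so on) and the sample values are hypothetical; they are not taken from any particular DataOps tool.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ChangeRecord:
    """One audited change to data, a schema, a query or a report (hypothetical shape)."""
    changed_by: str        # the person who made the change
    requested_by: str      # the associate who requested the change
    target: str            # what exactly was changed
    reason: str            # why it was changed
    tickets: tuple = ()    # tickets associated with the change
    changed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def summary(self) -> str:
        """Render a one-line audit entry."""
        tickets = ", ".join(self.tickets) or "no ticket"
        return (
            f"{self.changed_by} changed {self.target} "
            f"for {self.requested_by} ({tickets}): {self.reason}"
        )


record = ChangeRecord(
    changed_by="dba.alice",
    requested_by="finance.bob",
    target="view dbo.vw_MonthlyRevenue",
    reason="exclude refunded orders",
    tickets=("DATA-101",),
)
print(record.summary())
```

In practice most of these fields would come from the commit and the work item rather than being typed by hand; the point is only that each change carries its who, what and why.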

DataOps takes processes from three other well-known workflows in the IT industry: Agile, DevOps and Lean. The concept has been loosely applied in software engineering businesses since the late 2000s, but wasn’t formalized until mid-2021.

The Trinity of DataOps

Agile methodology is a framework for project management that focuses on broken-down iterations of work called sprints. At the start of each sprint, a set of work items is divided amongst members of the team, or placed on a kanban board. At the end of each sprint, teams reflect on the work performed to find improvements in their strategy for the next sprint.

DevOps is the practice of team-sharing files through a central repository to coordinate and collaborate within and outside the team while communicating with each other through a set of tools. DevOps provides additional tools to consolidate work, build the product, document, unit test, systems test and, if testing is successful, deploy the product.

Lean is used to continually verify the quality of a product and the security methods that protect it. For example, definitions of data can change over time, and not having a system in place to check this allows misleading data to enter the system. Data that at one time meant “A”, now means “B” and should be handled differently. Data that could compromise the company or its clients also needs to be handled in special ways to ensure privacy.
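
One way to catch that kind of drift is to keep the agreed definitions in a small data dictionary and check incoming rows against it. This is a minimal sketch; `DATA_DICTIONARY`, the column names and the sample rows are all invented for illustration.

```python
# Hypothetical data dictionary: the values each column is currently allowed to hold.
DATA_DICTIONARY = {
    "order_status": {"A", "B", "C"},
    "region": {"NA", "EU"},
}


def find_definition_drift(rows, dictionary=DATA_DICTIONARY):
    """Return (row_index, column, value) for every value the dictionary does not define."""
    problems = []
    for index, row in enumerate(rows):
        for column, allowed in dictionary.items():
            value = row.get(column)
            if value not in allowed:
                problems.append((index, column, value))
    return problems


rows = [
    {"order_status": "A", "region": "NA"},
    {"order_status": "X", "region": "EU"},    # "X" was never defined
    {"order_status": "B", "region": "APAC"},  # a region nobody documented
]
for problem in find_definition_drift(rows):
    print(problem)
```

A check like this could run as part of the test stage of the pipeline, so an undefined value fails the build instead of quietly reaching a report.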

How Does DataOps Differ From DevOps?

DataOps borrows many practices from DevOps and adds several factors of its own, but it is not a superset of DevOps.

| Feature | DataOps | DevOps |
|---|---|---|
| Sharing work on the same file? | Reports are siloed and SQL scripts are functionally atomic. This makes splitting work on the same entity very difficult. Pair programming is required to share development on the same file. | Source code is mapped by lines that are easily split between developers. Multiple developers can check-out work on the same file at the same time. |
| What teams are involved? | Business Operations; Data Science; Business Intelligence; Data Governance; Data Management; Data Operations; IT Operations; Compliance | Engineering; IT Operations; Software Development; Quality Assurance; Security; User Experience; Design; Operations |
| What skills are involved? | Data management; Data science; Data analysis; Data integration; Data quality; Data security; Statistics; Reporting; Business; IT operations; Data operations; Application engineering; Data engineering; Data governance | Requirements gathering; Application architecture; Software engineering; Software development; Application integrations; Coding; Testing; Quality control; Quality assurance; Security; IT operations; Continuous Integration; Continuous Delivery |
| What is the pipeline like? | 1. Develop the data product; 2. Manage the data resources (ETL); 3. Test to ensure quality; 4. Release to users; 5. Manage usage; 6. Monitor usage and results | 1. Design the application / changes; 2. Develop and build the application; 3. Test to ensure quality; 4. Release to users; 5. Monitor usage and error logs |
| Agile Planning | Usually Kanban based; some work is planned up-front of each sprint, but most work flows through the board as requests come in. Loosely structured; more organic. | Usually Scrum based; all work is planned up-front of each sprint. Highly structured, mechanical and organized. |
| Lean | Focuses on source-of-truth and data governance principles while cards are pulled or distributed from the board throughout the Sprint. | Focuses on DRY, SOLID coding principles after the Sprint has begun and work has been assigned. |

What tools can be used to apply DataOps?

DevOps Source Control

TFS, Subversion, or Git can hold source files such as schemas, type table seeds, utility scripts, views, functions, procedures, configuration files, and certain types of ETL packages and reports. Basically anything that you can load in Notepad and read is a good candidate for Git. However, the nature of Tableau and Power BI report files requires a little more tooling.

Power BI recently added source control to its server and it is amazing! It separates the report definition from the other components and uploads the pieces to a Git repository. You can then compare the XML of the reports’ definitions (and other components) to see what changes took place. It provides a text box for developers to comment on version changes and commit only the reports that apply.

Tableau, however, keeps only a rolling history of the latest 10 versions (the current one plus the prior 9) and discards the rest. So for Tableau reports, either Git with LFS enabled or Bitbucket is a better option than relying on the server. Either option allows commits with comments, versioning, tagging, merging and conflict resolution. Of the two, I would recommend a cloud-managed Git repository system.

Some data changes happen outside an administration console: values in type or look-up tables, or system data such as price changes. Put each of these changes in a script that can be checked in, reviewed, and vetted in a test environment. Redgate provides some good tools for this, such as Flyway and SQL Data Compare.
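
The idea of a scripted, reviewable data change can be sketched in a few lines. The example below is not one of the Redgate tools and does not use their APIs; it is a minimal stand-in built on Python's standard `sqlite3` module. The table names, version labels and prices are all made up for illustration.

```python
import sqlite3

# Hypothetical versioned data-change scripts, as they might sit in source control.
DATA_CHANGES = [
    ("V1__seed_prices", "INSERT INTO price (sku, amount) VALUES ('WIDGET', 9.99)"),
    ("V2__raise_widget_price", "UPDATE price SET amount = 10.49 WHERE sku = 'WIDGET'"),
]


def apply_data_changes(conn, changes):
    """Apply each script once, in order, and record it in a history table."""
    conn.execute("CREATE TABLE IF NOT EXISTS change_history (version TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT version FROM change_history")}
    newly_applied = []
    for version, sql in changes:
        if version in applied:
            continue  # already run against this environment
        with conn:  # commit the change and its history row together, or roll back both
            conn.execute(sql)
            conn.execute("INSERT INTO change_history (version) VALUES (?)", (version,))
        newly_applied.append(version)
    return newly_applied


conn = sqlite3.connect(":memory:")  # stands in for a test environment
conn.execute("CREATE TABLE price (sku TEXT PRIMARY KEY, amount REAL)")
first_run = apply_data_changes(conn, DATA_CHANGES)
second_run = apply_data_changes(conn, DATA_CHANGES)  # nothing left to apply
print(first_run, second_run)
print(conn.execute("SELECT amount FROM price WHERE sku = 'WIDGET'").fetchone()[0])
```

Because every change is a named script with a history row, a reviewer can see exactly which data changes reached which environment, and re-running the set is harmless.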

Recommendation: Azure DevOps Git for SQL Server & Power BI, GitHub Actions for Oracle & Tableau, Redgate Flyway and SQL Backup Pro for data

Kanban Board

Jira, Azure DevOps, Monday and Wrike all come highly recommended. Since this is the central location for filing and working off requests and features, it’s best to research which of these would be best suited for you and your team.

Recommendation: Azure DevOps for SQL Server & Power BI, Jira or Monday for Oracle & Tableau

Lean methodology

Lean methodology rests on two pillars that provide a framework for all Lean projects: Continuous improvement and respect for people. It’s more about how to use the tools, people and resources at hand to create a feedback loop that improves process and product for the client and the workers. It can use the tools already mentioned, but adapts for the unique needs of your service. Consider a system that allows you to extend it with plugins and contains workflows and pipelines to automate as much of the development and deployment as possible.

Recommendation: Azure DevOps, Jira, Monday, Jenkins, GitLab … whatever fits your team and client needs best.

Conclusion

With tools and a structured process that involves the whole team, you can implement a process that protects the core components and intellectual property of your company. It allows you to efficiently and confidently release changes that affect everyone. If there’s ever a failure because of database, report, or data changes, you have a way to roll back quickly. Backups can only go so far, and should be last-ditch efforts to recover from a disaster. DataOps is a methodology that keeps your data and its availability safe and operational while keeping your teams productive.

GIT Gotchas and other things that make you cry

Lions and Tigers and Bears, Oh My!

Encountered x file(s) that should have been pointers, but weren’t

We encountered this issue when trying to merge master after someone had committed a slew of PDFs. Although we couldn’t identify the exact situation that caused this problem, the result is that nobody could merge master into their branches.

Not No Body! Not No How!

It looks like this was once filed as an issue with Git LFS itself: https://github.com/git-lfs/git-lfs/issues/1939

But amongst the threaded comments, it looks like there is a root issue with the file types that are used to compare files and that on strange and rare occasions a unicorn fart fills the git-void with pain by changing these on the server.
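
For context, a file tracked by Git LFS is stored in the repository as a small text pointer rather than as the binary itself, and, as far as I understand it, the error above appears when Git finds the full binary where it expected a pointer. Here is a hedged sketch of how one might tell the two apart in Python. The function name and sample bytes are illustrative, and the check only looks at the pointer's first line rather than validating the whole pointer.

```python
# First line of a Git LFS pointer file, per the Git LFS pointer format.
LFS_POINTER_PREFIX = b"version https://git-lfs.github.com/spec/v1\n"


def looks_like_lfs_pointer(blob: bytes) -> bool:
    """Return True when the content starts like a Git LFS pointer, not a real binary."""
    return blob.startswith(LFS_POINTER_PREFIX)


# A made-up pointer: version line, object id, and size of the real file.
pointer = b"\n".join([
    b"version https://git-lfs.github.com/spec/v1",
    b"oid sha256:" + b"0" * 64,
    b"size 12345",
    b"",
])
raw_pdf = b"%PDF-1.7\n..."  # a PDF committed directly, bypassing LFS

print(looks_like_lfs_pointer(pointer))  # True
print(looks_like_lfs_pointer(raw_pdf))  # False
```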

After several dozen attempts, I came across this gem of a command:

$ git lfs migrate import --everything --include='*.pdf'

migrate: override changes in your working copy? [Y/n] Y
migrate: changes in your working copy will be overridden ...
migrate: Sorting commits: ..., done
migrate: Rewriting commits: 100% (5940/5940), done
migrate: Updating refs: ..., done
migrate: checkout: ..., done

The pain was almost over. Let’s see how merge does, now.

$ git merge --ff-only

fatal: Not possible to fast-forward, aborting.

Hmmm… Okay. Let’s not force a fast-forward merge; let’s just do a regular merge.

$ git merge

warning: Cannot merge binary files: Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf (HEAD vs. refs/remotes/origin/bugfix/13032-ems-error-saving-ppc)
:
:
CONFLICT (add/add): Merge conflict in Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf
Auto-merging Deployment Scripts/CopyFiles/UpperGreatLakesSinglesChallenge.pdf
:
:

Then I performed a manual merge on all the files by taking the server’s file (on the right in Sublime Merge, when resolving conflicts).
But Sublime Merge couldn’t actually commit the non-changes, so I went back to the git console to wrap up tackling the lfs-merge nightmare (like a Dream Warrior to Krueger):

$ git commit

[bugfix/13032-ems-error-saving-ppc f093b0c49] Merge remote-tracking branch 'refs/remotes/origin/bugfix/13032-ems-error-saving-ppc' into bugfix/13032-ems-error-saving-ppc

Stack Overflow: over 1k and back again

A month ago, I checked into Stack Overflow to see if I could help someone out – you know, be a good boy and give back to the community.

A person had a basic question about string concatenation in SQL. I promptly answered and provided a code snippet example. The OP was overjoyed! Angels heralded from the heavens, baby kittens were born. I broke the 1,000 point barrier when he checked my answer.

The next day I hopped on and my credits were back down below 1k. Wait, what!?

He unchecked my answer. Was it wrong? Nope. And neither was he.

It turns out someone posted an answer he liked better, from which I learned a few things, which is awesome!!

  • It’s not about me or my points. I was disappointed that I lost my lovely 1k status, but that just reflected a selfish motive. I learned that I need to work on my character.
  • SQL Server 2017 has a new function, STRING_AGG(), which finally performs what people have been wanting and hacking together in SQL scripts since 2008. (Oracle 11gR2 had this ability in 2009 with the LISTAGG function.)
  • Stack Overflow doesn’t have an alert for when your answer is unaccepted. When someone accepts your answer, upvotes it or downvotes it, a little red or green badge appears over your score. If someone unchecks your answer, you don’t see a red badge. It would be nice if they drew the user back to the question so that, as in this case, he can learn some new tricks.
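
To illustrate the kind of string aggregation that STRING_AGG() performs, here is a small sketch using Python's built-in `sqlite3` module. It swaps in SQLite's `group_concat`, which plays a similar role, so the SQL shown is not SQL Server syntax. The table and data are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pet (owner TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO pet (owner, name) VALUES (?, ?)",
    [("ann", "Rex"), ("ann", "Tom"), ("bob", "Kit")],
)

# group_concat(value, separator) collapses each group's values into one string.
# SQLite does not promise an order for the values inside the aggregate.
rows = conn.execute(
    "SELECT owner, group_concat(name, ', ') FROM pet GROUP BY owner ORDER BY owner"
).fetchall()
for owner, names in rows:
    print(owner, names)
```

In SQL Server 2017 and later, the equivalent expression would be STRING_AGG(name, ', ') with the same GROUP BY.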

In this industry, one of the properties that sets a junior developer apart from a seasoned one is the variety, value and vastness of knowledge built from experiences and experimentation. A little adjustment and this could have been an automated lesson. Why doesn’t Stack Overflow take its vast knowledge base and perform a medical analysis on people’s posts (questions) to find similar questions that have been answered? Stack Overflow has more opportunities to be mined for those who look for the potential.

(Image by Pawel Janiak on Unsplash)
