2.1 Data Analytics Lifecycle Overview

The Data Analytics Lifecycle is designed specifically for Big Data problems and data science projects. The lifecycle has six phases, and project work can occur in several phases at once. For most phases in the lifecycle, the movement can be either forward or backward. This iterative depiction of the lifecycle is intended to more closely portray a real project, in which aspects of the project move forward and may return to earlier stages as new information is uncovered and team members learn more about various stages of the project. This enables participants to move iteratively through the process and drive toward operationalizing the project work.

2.1.1 Key Roles for a Successful Analytics Project

In recent years, substantial attention has been placed on the emerging role of the data scientist. In October 2012, Harvard Business Review featured an article titled “Data Scientist: The Sexiest Job of the 21st Century” [1], in which experts DJ Patil and Tom Davenport described the new role and how to find and hire data scientists. More and more conferences are held annually focusing on innovation in the areas of Data Science and topics dealing with Big Data. Despite this strong focus on the emerging role of the data scientist specifically, there are actually seven key roles that need to be fulfilled for a high-functioning data science team to execute analytic projects successfully.

Figure 2.1 depicts the various roles and key stakeholders of an analytics project. Each plays a critical part in a successful analytics project. Although seven roles are listed, fewer or more people can accomplish the work depending on the scope of the project, the organizational structure, and the skills of the participants. For example, on a small, versatile team, these seven roles may be fulfilled by only 3 people, but a very large project may require 20 or more people. The seven roles follow.

• Business User: Someone who understands the domain area and usually benefits from the results. This person can consult and advise the project team on the context of the project, the value of the results, and how the outputs will be operationalized. Usually a business analyst, line manager, or deep subject matter expert in the project domain fulfills this role.

• Project Sponsor: Responsible for the genesis of the project. Provides the impetus and requirements for the project and defines the core business problem. Generally provides the funding and gauges the degree of value from the final outputs of the working team. This person sets the priorities for the project and clarifies the desired outputs.

• Project Manager: Ensures that key milestones and objectives are met on time and at the expected quality.

• Business Intelligence Analyst: Provides business domain expertise based on a deep understanding of the data, key performance indicators (KPIs), key metrics, and business intelligence from a reporting perspective. Business Intelligence Analysts generally create dashboards and reports and have knowledge of the data feeds and sources.

• Database Administrator (DBA): Provisions and configures the database environment to support the analytics needs of the working team. These responsibilities may include providing access to key databases or tables and ensuring the appropriate security levels are in place related to the data repositories.

• Data Engineer: Leverages deep technical skills to assist with tuning SQL queries for data management and data extraction, and provides support for data ingestion into the analytic sandbox, which was discussed in Chapter 1, “Introduction to Big Data Analytics.” Whereas the DBA sets up and configures the databases to be used, the data engineer executes the actual data extractions and performs substantial data manipulation to facilitate the analytics. The data engineer works closely with the data scientist to help shape data in the right ways for analyses.

• Data Scientist: Provides subject matter expertise for analytical techniques, data modeling, and applying valid analytical techniques to given business problems. Ensures overall analytics objectives are met. Designs and executes analytical methods and approaches with the data available to the project.

Figure 2.1 Key roles for a successful analytics project

Although most of these roles are not new, the last two roles, data engineer and data scientist, have become popular and are in high demand [2] as interest in Big Data has grown.

2.1.2 Background and Overview of Data Analytics Lifecycle

The Data Analytics Lifecycle defines analytics process best practices spanning discovery to project completion. The lifecycle draws from established methods in the realm of data analytics and decision science. This synthesis was developed after gathering input from data scientists and consulting established approaches that provided input on pieces of the process. Several of the processes that were consulted include these:

• Scientific method [3], in use for centuries, still provides a solid framework for thinking about and deconstructing problems into their principal parts. One of the most valuable ideas of the scientific method relates to forming hypotheses and finding ways to test ideas.

• CRISP-DM [4] provides useful input on ways to frame analytics problems and is a popular approach for data mining.

• Tom Davenport’s DELTA framework [5]: The DELTA framework offers an approach for data analytics projects, including the context of the organization’s skills, datasets, and leadership engagement.

• Doug Hubbard’s Applied Information Economics (AIE) approach [6]: AIE provides a framework for measuring intangibles and provides guidance on developing decision models, calibrating expert estimates, and deriving the expected value of information.

• “MAD Skills” by Cohen et al. [7] offers input for several of the techniques mentioned in Phases 2–4 that focus on model planning, execution, and key findings.

Figure 2.2 presents an overview of the Data Analytics Lifecycle that includes six phases. Teams commonly learn new things in a phase that cause them to go back and refine the work done in prior phases based on new insights and information that have been uncovered. For this reason, Figure 2.2 is shown as a cycle. The circular arrows convey iterative movement between phases until the team members have sufficient information to move to the next phase. The callouts include sample questions to ask to help guide whether each of the team members has enough information and has made enough progress to move to the next phase of the process. Note that these phases do not represent formal stage gates; rather, they serve as criteria to help test whether it makes sense to stay in the current phase or move to the next.

Figure 2.2 Overview of Data Analytics Lifecycle

Here is a brief overview of the main phases of the Data Analytics Lifecycle:

• Phase 1—Discovery: In Phase 1, the team learns the business domain, including relevant history such as whether the organization or business unit has attempted similar projects in the past from which they can learn. The team assesses the resources available to support the project in terms of people, technology, time, and data. Important activities in this phase include framing the business problem as an analytics challenge that can be addressed in subsequent phases and formulating initial hypotheses (IHs) to test and begin learning the data.

• Phase 2—Data preparation: Phase 2 requires the presence of an analytic sandbox, in which the team can work with data and perform analytics for the duration of the project. The team needs to execute extract, load, and transform (ELT) or extract, transform, and load (ETL) to get data into the sandbox. The ELT and ETL are sometimes abbreviated as ETLT. Data should be transformed in the ETLT process so the team can work with it and analyze it. In this phase, the team also needs to familiarize itself with the data thoroughly and take steps to condition the data (Section 2.3.4).

• Phase 3—Model planning: Phase 3 is model planning, where the team determines the methods, techniques, and workflow it intends to follow for the subsequent model building phase. The team explores the data to learn about the relationships between variables and subsequently selects key variables and the most suitable models.

• Phase 4—Model building: In Phase 4, the team develops datasets for testing, training, and production purposes. In addition, in this phase the team builds and executes models based on the work done in the model planning phase. The team also considers whether its existing tools will suffice for running the models, or if it will need a more robust environment for executing models and workflows (for example, fast hardware and parallel processing, if applicable).

• Phase 5—Communicate results: In Phase 5, the team, in collaboration with major stakeholders, determines if the results of the project are a success or a failure based on the criteria developed in Phase 1. The team should identify key findings, quantify the business value, and develop a narrative to summarize and convey findings to stakeholders.

• Phase 6—Operationalize: In Phase 6, the team delivers final reports, briefings, code, and technical documents. In addition, the team may run a pilot project to implement the models in a production environment.
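The difference between the ETL and ELT orderings mentioned under Phase 2 can be illustrated with a minimal Python sketch. The record fields, the `transform` rule, and the list and dictionary standing in for the analytic sandbox are all assumptions made for illustration; they are not taken from any particular tool.

```python
# Hypothetical raw records; field names and values are illustrative only.
raw_records = [
    {"customer": " Alice ", "amount": "120.50"},
    {"customer": "BOB", "amount": "80"},
]


def transform(record):
    """Condition one record: tidy the name and parse the amount as a number."""
    return {
        "customer": record["customer"].strip().title(),
        "amount": float(record["amount"]),
    }


def etl(records):
    """Extract, transform, then load: only conditioned data reaches the sandbox."""
    sandbox = []
    for record in records:                 # extract
        sandbox.append(transform(record))  # transform, then load
    return sandbox


def elt(records):
    """Extract, load, then transform: the raw data is kept in the sandbox too."""
    sandbox = {"raw": list(records)}       # extract and load as-is
    sandbox["conditioned"] = [transform(r) for r in sandbox["raw"]]
    return sandbox


etl_result = etl(raw_records)
elt_result = elt(raw_records)
print(etl_result)
print(elt_result["raw"])
```

Both orderings yield the same conditioned records in this sketch; the ELT variant additionally keeps the untouched raw data available inside the sandbox.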

Once team members have run models and produced findings, it is critical to frame these results in a way that is tailored to the audience that engaged the team. Moreover, it is critical to frame the results of the work in a manner that demonstrates clear value. If the team performs a technically accurate analysis but fails to translate the results into a language that resonates with the audience, people will not see the value, and much of the time and effort on the project will have been wasted.

The rest of the chapter is organized as follows. Sections 2.2–2.7 discuss in detail how each of the six phases works, and Section 2.8 shows a case study of incorporating the Data Analytics Lifecycle in a real-world data science project.

2.2 Phase 1: Discovery

The first phase of the Data Analytics Lifecycle involves discovery (Figure 2.3). In this phase, the data science team must learn and investigate the problem, develop context and understanding, and learn about the data sources needed and available for the project. In addition, the team formulates initial hypotheses that can later be tested with data.

Figure 2.3 Discovery phase

2.2.1 Learning the Business Domain

Understanding the domain area of the problem is essential. In many cases, data scientists will have deep computational and quantitative knowledge that can be broadly applied across many disciplines. An example of this role would be someone with an advanced degree in applied mathematics or statistics.

These data scientists have deep knowledge of the methods, techniques, and ways for applying heuristics to a variety of business and conceptual problems. Others in this area may have deep knowledge of a domain area, coupled with quantitative expertise. An example of this would be someone with a Ph.D. in life sciences. This person would have deep knowledge of a field of study, such as oceanography, biology, or genetics, with some depth of quantitative knowledge.

At this early stage in the process, the team needs to determine how much business or domain knowledge the data scientist needs to develop models in Phases 3 and 4. The earlier the team can make this assessment the better, because the decision helps dictate the resources needed for the project team and ensures the team has the right balance of domain knowledge and technical expertise.

2.2.2 Resources

As part of the discovery phase, the team needs to assess the resources available to support the project. In this context, resources include technology, tools, systems, data, and people.

During this scoping, consider the available tools and technology the team will be using and the types of systems needed for later phases to operationalize the models. In addition, try to evaluate the level of analytical sophistication within the organization and gaps that may exist related to tools, technology, and skills. For instance, for the model being developed to have longevity in an organization, consider what types of skills and roles will be required that may not exist today. For the project to have long-term success, what types of skills and roles will be needed for the recipients of the model being developed? Does the requisite level of expertise exist within the organization today, or will it need to be cultivated? Answering these questions will influence the techniques the team selects and the kind of implementation the team chooses to pursue in subsequent phases of the Data Analytics Lifecycle.

In addition to the skills and computing resources, it is advisable to take inventory of the types of data available to the team for the project. Consider if the data available is sufficient to support the project’s goals. The team will need to determine whether it must collect additional data, purchase it from outside sources, or transform existing data. Often, projects are started looking only at the data available. When the data is less than hoped for, the size and scope of the project are reduced to work within the constraints of the existing data.

An alternative approach is to consider the long-term goals of this kind of project, without being constrained by the current data. The team can then consider what data is needed to reach the long-term goals and which pieces of this multistep journey can be achieved today with the existing data. Considering longer-term goals along with short-term goals enables teams to pursue more ambitious projects and treat a project as the first step of a more strategic initiative, rather than as a standalone initiative. It is critical to view projects as part of a longer-term journey, especially if executing projects in an organization that is new to Data Science and may not yet have collected the optimum datasets to support robust analyses.

Ensure the project team has the right mix of domain experts, customers, analytic talent, and project management to be effective. In addition, evaluate how much time is needed and if the team has the right breadth and depth of skills.

After taking inventory of the tools, technology, data, and people, consider if the team has sufficient resources to succeed on this project, or if additional resources are needed. Negotiating for resources at the outset of the project, while scoping the goals, objectives, and feasibility, is generally more useful than later in the process and ensures sufficient time to execute it properly. Project managers and key stakeholders have better success negotiating for the right resources at this stage rather than later once the project is underway.

2.2.3 Framing the Problem

Framing the problem well is critical to the success of the project. Framing is the process of stating the analytics problem to be solved. At this point, it is a best practice to write down the problem statement and share it with the key stakeholders. Each team member may hear slightly different things related to the needs and the problem and have somewhat different ideas of possible solutions. For these reasons, it is crucial to state the analytics problem, as well as why and to whom it is important. Essentially, the team needs to clearly articulate the current situation and its main challenges.

As part of this activity, it is important to identify the main objectives of the project, identify what needs to be achieved in business terms, and identify what needs to be done to meet the needs. Additionally, consider the objectives and the success criteria for the project. What is the team attempting to achieve by doing the project, and what will be considered “good enough” as an outcome of the project? This is critical to document and share with the project team and key stakeholders. It is best practice to share the statement of goals and success criteria with the team and confirm alignment with the project sponsor’s expectations.

Perhaps equally important is to establish failure criteria. Most people doing projects prefer only to think of the success criteria and what the conditions will look like when the participants are successful. This, however, amounts to taking a best-case scenario approach, assuming that everything will proceed as planned and the project team will reach its goals. No matter how well planned, it is almost impossible to plan for everything that will emerge in a project. The failure criteria will guide the team in understanding when it is best to stop trying or settle for the results that have been gleaned from the data. Many times people will continue to perform analyses past the point when any meaningful insights can be drawn from the data. Establishing criteria for both success and failure helps the participants avoid unproductive effort and remain aligned with the project sponsors.
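One way to make success and failure criteria concrete is to write them down as explicit thresholds that can be checked after each round of analysis. The following Python sketch is only an illustration: the `ProjectCriteria` fields, the single numeric metric, the threshold values, and the four outcome labels are hypothetical, and a real project would agree on its own measures with the sponsor.

```python
from dataclasses import dataclass


@dataclass
class ProjectCriteria:
    """Hypothetical success and failure criteria agreed with the sponsor."""
    success_threshold: float  # metric value considered "good enough"
    failure_threshold: float  # below this at the iteration limit, stop trying
    max_iterations: int       # analysis rounds the team is willing to spend


def assess(criteria: ProjectCriteria, metric: float, iteration: int) -> str:
    """Return 'success', 'continue', 'settle', or 'stop' for the current round."""
    if metric >= criteria.success_threshold:
        return "success"
    if iteration < criteria.max_iterations:
        return "continue"
    # The iteration budget is spent: settle for usable results, or stop.
    if metric >= criteria.failure_threshold:
        return "settle"
    return "stop"


criteria = ProjectCriteria(success_threshold=0.80, failure_threshold=0.60, max_iterations=5)
for round_number, metric in [(1, 0.55), (5, 0.70), (5, 0.50), (3, 0.85)]:
    print(round_number, metric, assess(criteria, metric, round_number))
```

Writing the stopping rule down before the analysis starts makes it harder to keep iterating past the point of meaningful insight.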

2.2.4 Identifying Key Stakeholders

Another important step is to identify the key stakeholders and their interests in the project. During these discussions, the team can identify the success criteria, key risks, and stakeholders, which should include anyone who will benefit from the project or will be significantly impacted by the project. When interviewing stakeholders, learn about the domain area and any relevant history from similar analytics projects. For example, the team may identify the results each stakeholder wants from the project and the criteria it will use to judge the success of the project.

Keep in mind that the analytics project is being initiated for a reason. It is critical to articulate the pain points as clearly as possible to address them and be aware of areas to pursue or avoid as the team gets further into the analytical process. Depending on the number of stakeholders and participants, the team may consider outlining the type of activity and participation expected from each stakeholder and participant. This will set clear expectations with the participants and avoid delays later when, for example, the team may feel it needs to wait for approval from someone who sees their role as that of an adviser rather than an approver of the work product.

2.2.5 Interviewing the Analytics Sponsor

The team should plan to collaborate with the stakeholders to clarify and frame the analytics problem. At the outset, project sponsors may have a predetermined solution that may not necessarily realize the desired outcome. In these cases, the team must use its knowledge and expertise to identify the true underlying problem and appropriate solution.

For instance, suppose in the early phase of a project, the team is told to create a recommender system for the business and that the way to do this is by speaking with three people and integrating the product recommender into a legacy corporate system. Although this may be a valid approach, it is important to test the assumptions and develop a clear understanding of the problem. The data science team may have a more objective understanding of the problem set than the stakeholders, who may be suggesting solutions to a given problem. Therefore, the team can probe deeper into the context and domain to clearly define the problem and propose possible paths from the problem to a desired outcome. In essence, the data science team can take a more objective approach, as the stakeholders may have developed biases over time, based on their experience. Also, what may have been true in the past may no longer be a valid working assumption. One possible way to circumvent this issue is for the project sponsor to focus on clearly defining the requirements, while the other members of the data science team focus on the methods needed to achieve the goals.

When interviewing the main stakeholders, the team needs to take time to thoroughly interview the project sponsor, who tends to be the one funding the project or providing the high-level requirements. This person understands the problem and usually has an idea of a potential working solution. It is critical to thoroughly understand the sponsor’s perspective to guide the team in getting started on the project. Here are some tips for interviewing project sponsors:

• Prepare for the interview; draft questions, and review with colleagues.

• Use open-ended questions; avoid asking leading questions.

• Probe for details and pose follow-up questions.

• Avoid filling every silence in the conversation; give the other person time to think.

• Let the sponsors express their ideas and ask clarifying questions, such as “Why? Is that correct? Is this idea on target? Is there anything else?”

• Use active listening techniques; repeat back what was heard to make sure the team heard it correctly, or reframe what was said.

• Try to avoid expressing the team’s opinions, which can introduce bias; instead, focus on listening.

• Be mindful of the body language of the interviewers and stakeholders; use eye contact where appropriate, and be attentive.

• Minimize distractions.

• Document what the team heard, and review it with the sponsors.

Following is a brief list of common questions that are helpful to ask during the discovery phase when interviewing the project sponsor. The responses will begin to shape the scope of the project and give the team an idea of the goals and objectives of the project.

• What business problem is the team trying to solve?

• What is the desired outcome of the project?

• What data sources are available?

• What industry issues may impact the analysis?

• What timelines need to be considered?

• Who could provide insight into the project?

• Who has final decision-making authority on the project?

• How will the focus and scope of the problem change if the following dimensions change:

  ◦ Time: Analyzing 1 year or 10 years’ worth of data?

  ◦ People: Assess impact of changes in resources on project timeline.

  ◦ Risk: Conservative to aggressive

  ◦ Resources: None to unlimited (tools, technology, systems)

  ◦ Size and attributes of data: Including internal and external data sources

2.2.6 Developing Initial Hypotheses

Developing a set of IHs is a key facet of the discovery phase. This step involves forming ideas that the team can test with data. Generally, it is best to come up with a few primary hypotheses to test and then be creative about developing several more. These IHs form the basis of the analytical tests the team will use in later phases and serve as the foundation for the findings in Phase 5. Hypothesis testing from a statistical perspective is covered in greater detail in Chapter 3, “Review of Basic Data Analytic Methods Using R.”

In this way, the team can compare its answers with the outcome of an experiment or test to generate additional possible solutions to problems. As a result, the team will have a much richer set of observations to choose from and more choices for agreeing upon the most impactful conclusions from a project.

Another part of this process involves gathering and assessing hypotheses from stakeholders and domain experts who may have their own perspective on what the problem is, what the solution should be, and how to arrive at a solution. These stakeholders would know the domain area well and can offer suggestions on ideas to test as the team formulates hypotheses during this phase. The team will likely collect many ideas that may illuminate the operating assumptions of the stakeholders. These ideas will also give the team opportunities to expand the project scope into adjacent spaces where it makes sense or design experiments in a meaningful way to address the most important interests of the stakeholders. As part of this exercise, it can be useful to obtain and explore some initial data to inform discussions with stakeholders during the hypothesis-forming stage.

2.2.7 Identifying Potential Data Sources

As part of the discovery phase, identify the kinds of data the team will need to solve the problem. Consider the volume, type, and time span of the data needed to test the hypotheses. Ensure that the team can access more than simply aggregated data. In most cases, the team will need the raw data to avoid introducing bias for the downstream analysis. Recalling the characteristics of Big Data from Chapter 1, assess the main characteristics of the data, with regard to its volume, variety, and velocity of change. A thorough diagnosis of the data situation will influence the kinds of tools and techniques to use in Phases 2–4 of the Data Analytics Lifecycle. In addition, performing data exploration in this phase will help the team determine the amount of data needed, such as the amount of historical data to pull from existing systems and the data structure. Develop an idea of the scope of the data needed, and validate that idea with the domain experts on the project.
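A small Python sketch can show why aggregated data alone may mislead. The two regions and their daily sales figures below are made-up values chosen only to illustrate the point; they do not come from any real dataset.

```python
from statistics import mean, pstdev

# Hypothetical daily sales for two regions; the values are made up.
region_a = [100, 100, 100, 100]
region_b = [10, 10, 10, 370]

for name, sales in [("A", region_a), ("B", region_b)]:
    # The aggregate (mean) is identical, but the spread is not.
    print(name, "mean:", mean(sales), "std dev:", round(pstdev(sales), 1))
```

Both regions report the same average, yet only the raw values reveal that region B's figure is driven by a single unusual day, which is the kind of detail an aggregate-only feed would hide from downstream analysis.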

The team should perform five main activities during this step of the discovery phase:

• Identify data sources: Make a list of candidate data sources the team may need to test the initial hypotheses outlined in this phase. Make an inventory of the datasets currently available and those that can be purchased or otherwise acquired for the tests the team wants to perform.

• Capture aggregate data sources: This is for previewing the data and providing high-level understanding. It enables the team to gain a quick overview of the data and perform further exploration on specific areas. It also points the team to possible areas of interest within the data.

• Review the raw data: Obtain preliminary data from initial data feeds. Begin understanding the interdependencies among the data attributes, and become familiar with the content of the data, its quality, and its limitations.

• Evaluate the data structures and tools needed: The data type and structure dictate which tools the team can use to analyze the data. This evaluation gets the team thinking about which technologies may be good candidates for the project and how to start getting access to these tools.

• Scope the sort of data infrastructure needed for this type of problem: In addition to the tools needed, the data influences the kind of infrastructure that’s required, such as disk storage and network capacity.
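The inventory and infrastructure-scoping activities above can be supported by a rough, back-of-the-envelope calculation. The Python sketch below assumes a hypothetical inventory of two data sources with invented row counts and row sizes, and it ignores compression, indexes, and replication; it is meant only to show the shape of such an estimate.

```python
# Hypothetical inventory of candidate data sources; all figures are invented.
sources = [
    {"name": "transactions", "rows_per_day": 1_000_000, "bytes_per_row": 200},
    {"name": "web_logs", "rows_per_day": 5_000_000, "bytes_per_row": 500},
]


def estimate_storage_gb(source, years):
    """Rough raw size in gigabytes for the given number of years of history."""
    days = years * 365
    return source["rows_per_day"] * source["bytes_per_row"] * days / 1e9


for years in (1, 10):
    total = sum(estimate_storage_gb(s, years) for s in sources)
    print(f"{years} year(s): about {total:,.0f} GB before compression or replication")
```

Comparing the 1-year and 10-year totals gives the team an early sense of how the choice of time span affects disk storage needs.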

Unlike many traditional stage-gate processes, in which the team can advance only when specific criteria are met, the Data Analytics Lifecycle is intended to accommodate more ambiguity. This more closely reflects how data science projects work in real-life situations. For each phase of the process, it is recommended to pass certain checkpoints as a way of gauging whether the team is ready to move to the next phase of the Data Analytics Lifecycle.

The team can move to the next phase when it has enough information to draft an analytics plan and share it for peer review. Although a peer review of the plan may not actually be required by the project, creating the plan is a good test of the team’s grasp of the business problem and the team’s approach to addressing it. Creating the analytics plan also requires a clear understanding of the domain area, the problem to be solved, and scoping of the data sources to be used. Developing success criteria early in the project clarifies the problem definition and helps the team when it comes time to make choices about the analytical methods being used in later phases.