GCP: Streaming Analytics Pipelines

Streaming is data processing on unbounded data; a streaming engine is an execution engine (system, service, runner) capable of processing such unbounded data.
When designed correctly, a streaming engine can provide:
  • low latency,
  • speculative or partial results,
  • the ability to flexibly reason about time,
  • controls for correctness,
  • the power to perform complex analysis.
Batch vs. stream: if you want to look for fraudulent transactions from 3 days ago, that's batch. If you want to catch fraud as it happens, that's streaming.

GCP: Machine Learning

Label = true answer

Input = predictor variable(s): what you can use to predict the label

Example = input + corresponding label

Model = math function that takes input variables and produces an approximation of the label

Training = adjusting model to minimize error

Prediction = using model on unlabeled data

Batch: a group of samples from the training dataset (typically around 100-500 samples per batch).

During the learning and optimization process, instead of loading all the training dataset at once, they are loaded in sets of specific size. These sets are called batches.
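As a concrete illustration of the idea above, here is a minimal sketch of slicing a dataset into batches; the dataset size and batch size are made up for the example.

```python
# Sketch: splitting a training dataset into fixed-size batches.
# The sample count (1000) and batch size (256) are invented for illustration.

def make_batches(dataset, batch_size):
    """Yield successive batches of at most batch_size samples."""
    for start in range(0, len(dataset), batch_size):
        yield dataset[start:start + batch_size]

samples = list(range(1000))            # stand-in for 1000 training samples
batches = list(make_batches(samples, 256))

print(len(batches))       # 4 batches: 256 + 256 + 256 + 232
print(len(batches[-1]))   # 232
```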

Supervised learning: each input in the data comes with a label

Unsupervised learning: the data has no labels

Structured data consists of rows and columns

Regression model: linear data representation (f : x -> y), used when the label is continuous (e.g., tip: 3.8, baby weight: 7.9, etc.)
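To make the f : x -> y idea concrete, here is a sketch of fitting a line by ordinary least squares in plain Python; the data points are invented and lie exactly on y = 2x + 1.

```python
# Sketch: fitting f(x) = w*x + b by ordinary least squares on toy data.

def fit_line(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least-squares solution for one input variable.
    w = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    b = mean_y - w * mean_x
    return w, b

# Toy data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w, b = fit_line(xs, ys)
print(round(w, 6), round(b, 6))  # 2.0 1.0
```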

Regression Algorithms

  • Linear Regression
  • Regression Trees(e.g. Random Forest)
  • Support Vector Regression (SVR)
  • …etc

Classification model: non-linear ("curvy") data representation, used when the label has a discrete number of values or classes. A discrete variable can only take certain values, like a die roll (1, 2, 3, 4, 5, 6)

Algorithms for classification

  • Decision Trees
  • Logistic Regression
  • Naive Bayes
  • K Nearest Neighbors
  • Linear SVC (Support Vector Classifier)
  • …..etc
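As a tiny worked example of one algorithm from the list, here is a 1-nearest-neighbor classifier on made-up 2-D points with discrete labels.

```python
# Sketch: 1-nearest-neighbor classification on invented 2-D points.

def nearest_neighbor(point, examples):
    """Return the label of the training example closest to `point`."""
    def sq_dist(a, b):
        return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2
    closest = min(examples, key=lambda ex: sq_dist(point, ex[0]))
    return closest[1]

# (input, label) examples -- discrete labels, as in classification.
train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"),
         ((5.0, 5.0), "B"), ((5.2, 4.8), "B")]

print(nearest_neighbor((0.3, 0.1), train))  # A
print(nearest_neighbor((4.9, 5.1), train))  # B
```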



Learning process to find the label, given the input values (x1, x2, …, xn)

Training:

  1. Assign random weights to the input values
  2. Run the model on a batch and compare its output with the examples (input + label)
  3. Calculate the error on the labeled dataset
  4. Change the weights so that the error goes down; the size of this change is controlled by the learning rate
  5. Repeat until the error stops going down, then evaluate on the overall dataset
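The training loop above can be sketched for a single weight w in the model y = w*x, using gradient descent on mean squared error; the toy data and learning rate are invented for illustration.

```python
# Sketch of the training steps above for y = w*x (toy data, made-up numbers).
import random

random.seed(0)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]        # labels generated by the true rule y = 3x

w = random.random()                # step 1: random initial weight
learning_rate = 0.01

for step in range(500):
    # step 2: run the model on the batch and compare with the labels
    preds = [w * x for x in xs]
    # step 3: calculate the error (mean squared error)
    error = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
    # step 4: change w so the error goes down; the learning rate
    # controls how large the change is
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    w -= learning_rate * grad

print(round(w, 3))  # 3.0 -- converged to the true weight
```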





Weights/bias = parameters we optimize
Batch size = the amount of data we compute error on
Epoch = one pass through the entire dataset
Gradient descent = process of reducing error, used to find the best values for the weights
Evaluation = is the model good enough? Has to be done on the full dataset
Training = process of optimizing the weights; includes gradient descent and evaluation

Neuron = one unit of combining inputs
Hidden layer = set of neurons that operate on the same set of inputs
Features = transformations of inputs, such as x^2
Feature engineering = coming up with what transformations to include

Mean Square Error: the loss measure for regression problems
Cross-entropy: the loss measure for classification problems
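The two loss measures can be computed directly; below is a sketch on tiny made-up examples (binary cross-entropy, with labels 0/1 and predicted probabilities).

```python
# Sketch: the two loss measures on tiny invented examples.
import math

def mean_squared_error(labels, preds):
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

def cross_entropy(labels, probs):
    """Binary cross-entropy: labels are 0/1, probs are predicted P(label=1)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

# Regression: predictions close to the labels -> small MSE.
print(mean_squared_error([3.0, 5.0], [2.5, 5.5]))    # 0.25

# Classification: confident correct predictions -> small cross-entropy.
print(round(cross_entropy([1, 0], [0.9, 0.1]), 4))   # 0.1054
```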

Confusion matrix


Accuracy = ∑ correct answers of the model / Total

Precision = Positive Predictive Value: when the model says it is positive, how often is it right?

Precision = ∑ True Positives / ∑ (True Positives + False Positives)

Recall is the true positive rate: ∑ True Positives / ∑ (True Positives + False Negatives)
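These metrics can be computed straight from the confusion-matrix counts; the tp/fp/fn/tn values below are made up for illustration.

```python
# Sketch: accuracy, precision, and recall from confusion-matrix counts
# (the counts are invented for illustration).

def metrics(tp, fp, fn, tn):
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total     # all correct answers / total
    precision = tp / (tp + fp)       # of everything predicted positive
    recall = tp / (tp + fn)          # of everything actually positive
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=8, fp=2, fn=4, tn=6)
print(acc, prec, round(rec, 3))  # 0.7 0.8 0.667
```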

More info: ML introduction

GCP: Dataflow

A pipeline can connect services such as BigQuery and Cloud Storage.

A pipeline is a directed graph of steps: the beginning of the graph is called a source, each processing step within the graph is a transform, and the end of the graph is called a sink.

Source –> Transform –> Sink
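Real Dataflow pipelines are written with the Apache Beam SDK; as a stand-in, here is a plain-Python sketch of the source -> transform -> sink shape, with invented element values.

```python
# Plain-Python sketch of the source -> transform -> sink shape.
# (Not Beam code; element values are invented.)

def source():                       # beginning of the graph: read in data
    for line in ["3", "7", "10"]:
        yield line

def transform(elements):            # a step in the graph: transform data
    for e in elements:
        yield int(e) * 2

def sink(elements):                 # end of the graph: write out results
    return list(elements)

print(sink(transform(source())))    # [6, 14, 20]
```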

A pipeline is executed on the cloud by a Runner; each step is elastically scaled.

A pipeline processes data in three steps:

  1. Read in data,
  2. transform it,
  3. write it out.

A pipeline developer can branch, merge, use if-then statements, etc.

Data in a pipeline are represented by PCollection
○ Supports parallel processing
○ Not an in-memory collection; can be unbounded

ParDo allows for parallel processing, acting on one item at a time.

Useful for:
○ Filtering (choosing which inputs to emit)
○ Converting one Java type to another
○ Extracting parts of an input (e.g., fields of TableRow)
○ Calculating values from different parts of inputs
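A serial stand-in for ParDo-style per-element processing, combining two of the uses above (filtering and extracting a field); the rows and the "amount" field are hypothetical.

```python
# Serial stand-in for ParDo-style per-element processing:
# filter some inputs and extract a field from the rest.
# (The rows and the "amount" field are hypothetical.)

rows = [{"user": "a", "amount": 12},
        {"user": "b", "amount": 0},
        {"user": "c", "amount": 7}]

def process(row):
    if row["amount"] > 0:           # filtering: choose which inputs to emit
        yield row["amount"]         # extracting part of an input

output = [value for row in rows for value in process(row)]
print(output)  # [12, 7]
```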

GCP: DataLab

Setting Up DataLab

gcloud auth list

gcloud config set core/project <PROJECT_ID>

gcloud config set compute/zone us-central1-f

datalab create --no-create-repository --machine-type n1-standard-2 image-class









If you lose connection to Datalab for some reason, use this command to reconnect:

datalab connect --zone us-central1-c --port 8081 my-datalab

When creating a notebook, you need to define directories for

  • preprocessing
  • model
  • prediction


Preprocessing uses a Dataflow pipeline to convert the image format, resize images, and run the converted image through a pre-trained model to get the features or embeddings. You can also do this step using alternate technologies like Spark or plain Python code if you like.

GCP: BigQuery



bq ls

bq ls publicdata:


bq mk


bq ls <dataset>

bq show <dataset>.<table>

bq show publicdata:samples.shakespeare

bq help query

bq query "SELECT word, corpus, COUNT(word) FROM publicdata:samples.shakespeare WHERE word CONTAINS 'huzzah' GROUP BY word, corpus"

GCP: Kubernetes

This is my summary of the Google cloud quick start for Kubernetes found here.

In Kubernetes Engine, a container cluster consists of at least one cluster master and multiple worker machines called nodes. These master and node machines run the Kubernetes cluster orchestration system.


  • Cluster master runs the Kubernetes control plane processes, including the Kubernetes API server, scheduler, and core resource controllers. It decides what runs on all of the cluster's nodes, e.g., scheduling workloads like containerized applications and managing the workloads' lifecycle, scaling, and upgrades.
  • Nodes run (i) the Docker runtime and (ii) the Kubernetes node agent (kubelet), which communicates with the master and is responsible for starting and running the Docker containers scheduled on that node. (Default node machine type: n1-standard-1, with 1 virtual CPU and 3.75 GB of memory; the OS image can be customized.)
  • Kubernetes API calls : The master and nodes also communicate using Kubernetes APIs, directly via HTTP/gRPC, or indirectly, by running commands from the Kubernetes command-line client (kubectl) or interacting with the UI in the GCP Console.
  • In Kubernetes Engine, there are also a number of special containers that run as per-node agents to provide functionality such as log collection and intra-cluster network connectivity.