- low latency
- speculative or partial results
- the ability to flexibly reason about time
- controls for correctness
- the power to perform complex analysis
Label = true answer
Input = predictor variable(s): what you can use to predict the label
Example = input + corresponding label
Model = math function that takes input variables and produces an approximation to the label
Training = adjusting model to minimize error
Prediction = using model on unlabeled data
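The vocabulary above can be illustrated with a toy sketch (the numbers and the fixed weights are made up for illustration):

```python
# Examples: each pairs an input with its corresponding label (true answer).
examples = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]  # (input, label)

def model(x, w=2.0, b=0.0):
    """A math function approximating the label from the input."""
    return w * x + b

# Prediction: applying the model to unlabeled data.
prediction = model(4.0)

# Training would adjust w and b to minimize an error like this one:
error = sum((model(x) - y) ** 2 for x, y in examples) / len(examples)
print(prediction, round(error, 3))
```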
Batch: a group of training samples (commonly on the order of 100-500 samples per batch).
During the learning and optimization process, instead of loading the whole training dataset at once, it is loaded in sets of a fixed size. These sets are called batches.
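A minimal sketch of splitting a dataset into batches of a fixed size (toy data):

```python
# Yield consecutive slices of the dataset; the last batch may be smaller.
def batches(dataset, batch_size):
    for i in range(0, len(dataset), batch_size):
        yield dataset[i:i + batch_size]

data = list(range(10))
print(list(batches(data, 4)))  # three batches: sizes 4, 4, 2
```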
Supervised learning: the data comes with a label for each input
Unsupervised learning: the data has no labels
Structured data consists of rows and columns
Regression model: linear data representation (f : x -> y), used when the label is continuous (e.g., tip: 3.8, baby weight: 7.9, etc.)
- Linear Regression
- Regression Trees(e.g. Random Forest)
- Support Vector Regression (SVR)
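A minimal linear-regression sketch with NumPy's least-squares solver, fitting y ≈ w*x + b on toy data (the values are invented for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.0, 6.0, 8.0])       # continuous label, here exactly 2*x

A = np.vstack([x, np.ones_like(x)]).T    # design matrix [x, 1]
(w, b), *_ = np.linalg.lstsq(A, y, rcond=None)
print(w, b)                              # slope ≈ 2.0, intercept ≈ 0.0
```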
Classification model: curvy (non-linear) data representation, used when the label has a discrete number of values or classes. A discrete variable can only take certain values, like rolling a die (1, 2, 3, 4, 5, 6)
Algorithms for classification
- Decision Trees
- Logistic Regression
- Naive Bayes
- K Nearest Neighbors
- Linear SVC (Support vector Classifier)
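One of the algorithms above, K Nearest Neighbors, is simple enough to sketch in a few lines: predict the discrete class of the single closest training example (toy data, hypothetical labels):

```python
# Training examples: ((feature1, feature2), class label)
train = [((1.0, 1.0), "a"), ((1.2, 0.8), "a"), ((5.0, 5.0), "b"), ((4.8, 5.2), "b")]

def predict(point):
    """Return the label of the closest training example (1-nearest-neighbor)."""
    dist = lambda p, q: (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2
    return min(train, key=lambda ex: dist(ex[0], point))[1]

print(predict((1.1, 0.9)), predict((5.1, 4.9)))  # "a" "b"
```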
Learning process to find the label, given input values (x1, x2, …, xn):
- Assign random weights to the input values
- Run the model on a batch and compare its output against the examples (label + input values)
- Calculate the error on the labeled dataset
- Change the weights so that the error goes down; the size of this change is determined by the learning rate
- If the error goes down, evaluate on the overall dataset
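The loop above can be sketched in plain Python for a one-weight linear model (toy data; the weight starts fixed at zero rather than random, for reproducibility):

```python
examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # label = 2 * input
w = 0.0              # initial weight
learning_rate = 0.05

for epoch in range(100):                 # one epoch = one pass over the data
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in examples) / len(examples)
    w -= learning_rate * grad            # change the weight so the error goes down

print(round(w, 3))  # converges toward 2.0
```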
Weights/bias = parameters we optimize
Batch size = the amount of data we compute error on
Epoch = one pass through the entire dataset
Gradient descent = process of reducing the error, used to find the best parameter values for the weights
Evaluation = is the model good enough? Has to be done on full dataset
Training = process of optimizing the weights; includes gradient descent +
Neuron = one unit of combining inputs
Hidden layer = set of neurons that operate on the same set of inputs
Features = transformations of inputs, such as x^2
Feature engineering = coming up with what transformations to include
Mean Square Error: the loss measure for regression problems
Cross-entropy: the loss measure for classification problems
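Minimal sketches of both loss measures on toy values (the numbers are invented for illustration; the cross-entropy shown is the binary form):

```python
import math

def mse(labels, preds):
    """Mean squared error: average squared difference (regression)."""
    return sum((y - p) ** 2 for y, p in zip(labels, preds)) / len(labels)

def cross_entropy(labels, probs):
    """Binary cross-entropy: penalizes confident wrong probabilities (classification)."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(labels, probs)) / len(labels)

print(mse([3.0, 5.0], [2.5, 5.5]))                  # 0.25
print(round(cross_entropy([1, 0], [0.9, 0.1]), 4))  # small loss: confident and correct
```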
Accuracy = ∑ correct answers / Total = (True Positive + True Negative) / Total
Precision = Positive Predictive Value: when the model says it is true, how often is it right?
∑ True Positive / (True Positive + False Positive)
Recall = true positive rate: ∑ True Positive / (True Positive + False Negative)
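The three metrics computed from confusion-matrix counts (the counts are hypothetical):

```python
tp, tn, fp, fn = 40, 45, 5, 10          # true/false positives/negatives
total = tp + tn + fp + fn

accuracy = (tp + tn) / total            # fraction of all answers that are correct
precision = tp / (tp + fp)              # of the "true" calls, how many were right
recall = tp / (tp + fn)                 # of the actual positives, how many were found

print(accuracy, precision, recall)      # 0.85 0.888... 0.8
```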
More info : ML introduction
Pipeline: BigQuery + Cloud Storage
A Pipeline is a directed graph of steps: the beginning of this graph is called the source, any step performed within the graph is called a transformer, and the end of the graph is called the sink.
Source –> Transformer –> Sink
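The source -> transformer -> sink shape can be mimicked with plain Python generators (a conceptual stand-in for a real Dataflow pipeline, not the Beam API; the data and names are illustrative):

```python
def source():
    """Read in data."""
    yield from ["alpha", "beta", "gamma"]

def transformer(records):
    """Transform each record as it streams through."""
    for r in records:
        yield r.upper()

def sink(records):
    """Write out the results (here: collect into a list)."""
    return list(records)

result = sink(transformer(source()))
print(result)  # ['ALPHA', 'BETA', 'GAMMA']
```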
A pipeline is executed on the cloud by a Runner; each step is elastically scaled.
A pipeline process :
- Read in data,
- transform it,
- write out
A pipeline developer can branch, merge, use if-then statements, etc.
Data in a pipeline are represented by PCollection
○ Supports parallel processing
○ Not an in-memory collection; can be unbounded
ParDo allows for parallel processing, acting on one item at a time. Typical uses:
○ Filtering (choosing which inputs to emit)
○ Converting one Java type to another
○ Extracting parts of an input (e.g., fields of TableRow)
○ Calculating values from different parts of inputs
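A conceptual ParDo sketch in plain Python: a per-element function that may emit zero, one, or many outputs, here filtering and extracting a field at once (an illustrative stand-in for Beam's ParDo/DoFn, not the real API; the rows are made up):

```python
rows = [{"word": "huzzah", "count": 3}, {"word": "the", "count": 100}]

def process(row):
    # Filtering: only emit rare words; extracting: keep just one field.
    if row["count"] < 10:
        yield row["word"]

# Each input element is processed independently, so this maps naturally
# onto parallel execution across workers.
output = [item for row in rows for item in process(row)]
print(output)  # ['huzzah']
```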
Setting Up DataLab
gcloud auth list
gcloud config set core/project <PROJECT_ID>
gcloud config set compute/zone us-central1-f
datalab create --no-create-repository --machine-type n1-standard-2 image-class
If you lose connection to Datalab for some reason, use this command to reconnect:
datalab connect --zone us-central1-c --port 8081 my-datalab
When creating a notebook, you need to define directories for
Preprocessing uses a Dataflow pipeline to convert the image format, resize images, and run the converted image through a pre-trained model to get the features or embeddings. You can also do this step using alternate technologies like Spark or plain Python code if you like.
bq ls publicdata:
bq ls <dataset>
bq show <dataset>.<table>
bq show publicdata:samples.shakespeare
bq help query
bq query "SELECT word, corpus, COUNT(word) FROM publicdata:samples.shakespeare WHERE word CONTAINS 'huzzah' GROUP BY word, corpus"
This is my summary of the Google cloud quick start for Kubernetes found here.
In Kubernetes Engine, a container cluster consists of at least one cluster master and multiple worker machines called nodes. These master and node machines run the Kubernetes cluster orchestration system.
- The cluster master runs the Kubernetes control plane processes, including the Kubernetes API server, scheduler, and core resource controllers. It decides what runs on all of the cluster's nodes, e.g., scheduling workloads like containerized applications and managing the workloads' lifecycle, scaling, and upgrades.
- Nodes run (i) the Docker runtime and (ii) the Kubernetes node agent (kubelet), which communicates with the master and is responsible for starting and running the Docker containers scheduled on that node. (The default node machine type is n1-standard-1, with 1 virtual CPU and 3.75 GB of memory; the OS image can be customized.)
- Kubernetes API calls : The master and nodes also communicate using Kubernetes APIs, directly via HTTP/gRPC, or indirectly, by running commands from the Kubernetes command-line client (kubectl) or interacting with the UI in the GCP Console.
- In Kubernetes Engine, there are also a number of special containers that run as per-node agents to provide functionality such as log collection and intra-cluster network connectivity.
Creating a Node.js API with IBM Bluemix