Saturday, 30 March 2019

WebServices Security

Areas to be addressed as part of security (WS-Security)
  • Authentication
    • Username token profile, X.509 certificates, SAML (Single Sign-On)
  • Confidentiality
    • Encryption/Decryption
  • Integrity
    • Signatures(hash value of message)
  • Non-repudiation
    • Timestamp (prevents replay attacks)
Stateless Authentication Mechanisms
  • Basic Auth
    • Client sends username/password for every request as state is not maintained on the server
    • Username/password details are sent as part of the request header
    • Concatenate username & password with ':' as delimiter
    • Base 64 encode the concatenated string
    • The encoded string is passed as the value of the header key 'Authorization' with the prefix 'Basic '
    • This is not secure as the encoded string can easily be decoded
    • So always send it over HTTPS to protect it (see the sketch after this list)
    • Base 64 encoding is used to handle non-HTTP-compatible characters in the username/password
    • Advantages
      • Simple, stateless on the server, supported by all browsers
    • Disadvantages
      • requires https to protect
      • It is subject to replay attacks
      • Logout is tricky (browser caching)
  • Digest access Authentication
    • This mechanism hashes the credentials instead of sending them in clear text (https://en.wikipedia.org/wiki/Digest_access_authentication)
  • Asymmetric cryptography
    • https://en.wikipedia.org/wiki/Public-key_cryptography
    • Both client and server use public/private key pairs
  • OAuth
    • https://en.wikipedia.org/wiki/OAuth
  • JSON Web Tokens
    • https://en.wikipedia.org/wiki/JSON_Web_Token
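A minimal sketch of building the Basic Auth header described above, using Java 11's built-in HttpClient (the endpoint URL and credentials are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class BasicAuthExample {
        public static void main(String[] args) throws Exception {
            // Concatenate username & password with ':' as the delimiter
            String credentials = "myUser" + ":" + "myPassword";
            // Base 64 encode the concatenated string
            String encoded = Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
            // Pass it as the 'Authorization' header with the 'Basic ' prefix;
            // always over HTTPS, since Base 64 is trivially reversible
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/resource"))
                    .header("Authorization", "Basic " + encoded)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }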
Interceptors & Filters
  • Interceptors are designed to manipulate entities (input and output streams)
  • Filters manipulate headers/URIs/metadata information
  • Interceptors manipulate actual body of request/response
  • Filters are used for cross cutting concerns like Logging, security
  • Interceptors are used to Encode an entity response
  • Filters & Interceptors work on Client too.

SOAP
  1. UsernameToken
    • Most widely used approach for SOAP-based web services
    • It defines a standard to pass username & password inside the SOAP header
    • Steps to configure
      • Create interceptors in cxf-servlet.xml
      • Create a password callback handler (a sketch follows below)
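A sketch of the password callback handler (the WSPasswordCallback package name varies with the CXF/WSS4J version - older releases use org.apache.ws.security.WSPasswordCallback - and the username/password are placeholders):

    import java.io.IOException;
    import javax.security.auth.callback.Callback;
    import javax.security.auth.callback.CallbackHandler;
    import javax.security.auth.callback.UnsupportedCallbackException;
    import org.apache.wss4j.common.ext.WSPasswordCallback;

    // Supplies the password that the WSS4J interceptor checks against the
    // UsernameToken received in the incoming SOAP header
    public class ServerPasswordCallback implements CallbackHandler {
        @Override
        public void handle(Callback[] callbacks)
                throws IOException, UnsupportedCallbackException {
            for (Callback callback : callbacks) {
                if (callback instanceof WSPasswordCallback) {
                    WSPasswordCallback pc = (WSPasswordCallback) callback;
                    if ("myUser".equals(pc.getIdentifier())) {
                        pc.setPassword("myPassword"); // normally looked up from a user store
                    }
                }
            }
        }
    }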
Encryption/Decryption Concepts
  • Symmetric or private key
    • Sender encrypts the data with a secret key
    • Receiver decrypts with the same secret key
    • This is expensive as the vendor has to maintain a separate key for every application/user
  • Public key cryptography
    • Data is encrypted with the public key
    • Decryption is done with the private key
    • The private key cannot be derived even if a hacker knows the public key
    • RSA is a public key encryption algorithm
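A minimal sketch of public key encryption/decryption with the standard JDK crypto API (RSA, with a 2048-bit key pair generated on the fly just for illustration):

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import javax.crypto.Cipher;

    public class RsaExample {
        public static void main(String[] args) throws Exception {
            // Generate an RSA public/private key pair
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keyPair = generator.generateKeyPair();

            // Data is encrypted with the public key
            Cipher cipher = Cipher.getInstance("RSA");
            cipher.init(Cipher.ENCRYPT_MODE, keyPair.getPublic());
            byte[] encrypted = cipher.doFinal("secret data".getBytes(StandardCharsets.UTF_8));

            // Decryption is done with the private key
            cipher.init(Cipher.DECRYPT_MODE, keyPair.getPrivate());
            byte[] decrypted = cipher.doFinal(encrypted);
            System.out.println(new String(decrypted, StandardCharsets.UTF_8));
        }
    }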
Java Keytool
  • Key and Certificate management utility
  • When Public & private key are generated KEYSTORE file will be created. This is the place where Private and Public keys are stored. This file is password protected. 
  • For each private key, we give an alias/username & also a password
  • Public key can be exported into Certificate which can be distributed across our client applications. 
  • keytool -genkeypair -alias mykey -keypass mykeypass -keystore mykeystore.jks -storepass mystorepass -validity 100 -dname "cn=Venkat Desu, ou=ws, o=VenkatInc, c=IN"
  • Export public key out of keystore(to distribute to client)
    • keytool -export -rfc -keystore mykeystore.jks -storepass mystorepass -alias mykey -file MyCert.cer 
  • Import certificates into alternate keystores
    • keytool -import -trustcacerts -keystore servicekeystore.jks -storepass mystorepass -alias mykey -file MyCert.cer -noprompt
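A sketch of reading the generated keystore from Java, reusing the alias and passwords from the keytool command above:

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.PrivateKey;
    import java.security.cert.Certificate;

    public class KeystoreExample {
        public static void main(String[] args) throws Exception {
            // Open the password-protected keystore created by keytool
            KeyStore keyStore = KeyStore.getInstance("JKS");
            try (FileInputStream in = new FileInputStream("mykeystore.jks")) {
                keyStore.load(in, "mystorepass".toCharArray());
            }
            // Each private key is addressed by its alias and key password
            PrivateKey privateKey =
                    (PrivateKey) keyStore.getKey("mykey", "mykeypass".toCharArray());
            // The certificate wraps the public key that is distributed to clients
            Certificate certificate = keyStore.getCertificate("mykey");
            System.out.println(privateKey.getAlgorithm() + " / "
                    + certificate.getPublicKey().getAlgorithm());
        }
    }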
Signatures
  • To ensure integrity of data, i.e. that it is not tampered with on the way
  • It is a fixed-length value that is calculated from the content of the message by applying an algorithm to it. This value is also known as a hash. The hash is signed using a private key
  • The hash/signature is sent to the server side along with the message
  • On the server side, the hash is recalculated and verified using the public key corresponding to that private key
  • The content is deemed not tampered with when both hashes are the same
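A minimal sketch of signing and verifying a message with the JDK Signature API (a SHA-256 hash signed with an RSA private key and verified with the public key):

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class SignatureExample {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keyPair = generator.generateKeyPair();
            byte[] message = "important payload".getBytes(StandardCharsets.UTF_8);

            // Sender: hash the message and sign the hash with the private key
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(keyPair.getPrivate());
            signer.update(message);
            byte[] signature = signer.sign();

            // Receiver: recompute the hash and verify it with the public key
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(keyPair.getPublic());
            verifier.update(message);
            System.out.println("Untampered: " + verifier.verify(signature));
        }
    }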
Non Repudiation
  • TimeStamp
    • This has both creation & expiry times
    • The server rejects the message when the current time is past the expiry time
OAuth2 security for REST
  • RESTful applications are slightly different from web applications. Web applications are used directly by the end user, and it is the end user who authenticates. With RESTful applications, it is typically a client application that authenticates against the RESTful service.



Friday, 29 March 2019

HTTP Clients

HTTP Requests
  • Initial Line
    • method (GET, POST, Delete..)
    • path
  • Headers
    • Language
    • client version
    • type(response type)
  • Body
    • input fields, user/pwd details

HTTP Response
  • Initial/status Line
    • Status code
    • message
    • version
  • Headers
    • type
    • length
    • Language
  • Body(optional)
    • formatted/plain response
HTML forms in browsers support ONLY GET & POST requests.
GET
  • Doesn't have BODY
  • Pass parameters directly in URL(Query String)
POST
  • To Create
Put and Patch
  • To update existing resource
  • PUT to update whole resource
  • PATCH to update some of the attributes of a resource
  • Don't receive/support multipart form data; they receive/support only form-urlencoded bodies.
Delete
  • form-urlencoded method for body
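A minimal sketch of the GET (query string, no body) and POST (form-urlencoded body) requests described above, using Java 11's HttpClient; the URLs and parameters are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpMethodsExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // GET: no body, parameters travel in the URL as a query string
            HttpRequest get = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/users?id=42"))
                    .GET()
                    .build();

            // POST: creates a resource, body sent as form-urlencoded
            HttpRequest post = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/users"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("name=venkat&role=admin"))
                    .build();

            System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).statusCode());
            System.out.println(client.send(post, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }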
HTTP Status/Response
  • 404 - resource doesn't exist
  • https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Oauth
  • grant_type: client_credentials //In body
    • client_id
    • client_secret
  • grant_type:password
    • Client credentials
    • User credentials
Ways to send access token to the actual request
  • Body of the request
    • access_token:<value of the token>
  • Header of the request
    • Authorization: Bearer <value of the token>
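A sketch of passing the access token in the Authorization header (the endpoint and token value are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BearerTokenExample {
        public static void main(String[] args) throws Exception {
            String accessToken = "<value of the token>"; // obtained earlier from the token endpoint
            // Token sent in the request header rather than the body
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/orders"))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }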

Sunday, 17 March 2019

Microservices

Webservice - Software system designed to support interoperable machine-to-machine interaction over a network.

Microservices are small apps that together make up a bigger app and communicate by means of uniformly defined interfaces.
  • REST
  • Small, well-chosen deployable units
  • Cloud enabled
An architectural approach for developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms.

Advantages and Enhancements compared to monolithic application.
  • Monoliths
    • Intimidating to new developers
    • Overburdened developer environment - lower productivity
    • Huge code base
    • Scaling can be very difficult, as scaling up (getting better hardware) is the only option.
    • Scaling out, which brings (near) infinite capacity, is NOT an option here.
    • The entire application has to be upgraded whenever an underlying Java/Java EE/app server upgrade happens.
    • Things may get worse when one tries to use different languages or frameworks for different parts of the application.
  • Microservices
    • Individually deployable with each having separate storage system
    • Decentralized data management
    • Decentralized governance 
    • Decomposition of the application into smaller services brings smaller and reasonable code bases which are much more developer friendly. 
    • We can scale out each part independently depending on load.
    • Independent upgrades from rest of the system. 
    • Flexibility of technological choices both in frameworks and languages.
    • Easier to evolve
Microservices are not a silver bullet. They bring in quite a lot of complexity in management, maintenance, deployment, coordination and system design. While individual components are simple, the big picture is more complex.
Breaking Monolith into Microservices
  • There is no single way of breaking it up
  • Start with the app's functionality and business capabilities.
  • Try to isolate components that perform a single business capability. Identify their dependencies and connections and draw boundaries so that two components are not directly connected.
  • Where to start
    • Begin with the component with least connections/dependencies
    • Separate database access and entity definitions and not share them between components. 
    • Each micro component is responsible for its data.
    • Avoid EJB remote or similar calls and SWITCH to HTTP (SOAP/REST)
    • Avoid stateful timers and schedulers - externalize them
    • Separate UI
    • Services everywhere for everything
    • Microservices should be stateless.
Java Microservice Frameworks overview
  • Java EE based- Payara Micro
    • Payara is derived from GlassFish and provides a full app server
    • The Micro version is cut down to run from a single JAR for microservices. Internally it is still an app server
  • Combination of Java EE & other libraries - Dropwizard
    • Java framework for developing lightweight services
    • Servlet, Jax-RS and Bean validation from Java EE
    • Several other libraries like Jackson, JDBI, Guava, Liquibase, and so on
    • This is a healthy mix of both worlds - enterprise and non-enterprise
    • It can be added as a dependency and run directly from a fat JAR
  •  Spring based - Spring boot
    • Enables/supports Standalone spring applications
    • Tailored for modular, lightweight apps or microservices
Stateless microservices do not keep any data that is required to execute future requests
  • Use exclusively "Application Scoped" or "Request Scoped" CDI beans, or singleton instances in general
  • SPA web applications are ideal as they keep state in the user's browser.
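A sketch of a stateless, application-scoped CDI bean (no per-user state is held between requests; the tax rate is just an illustrative value):

    import javax.enterprise.context.ApplicationScoped;

    // One shared instance for the whole application; it keeps no data that
    // later requests depend on, so any instance on any node can serve a call
    @ApplicationScoped
    public class PriceCalculator {
        public double withTax(double net) {
            return net * 1.19; // pure function of its input
        }
    }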


  • Continuous Integration: Practice of merging all developers' code changes into a shared copy as frequently as possible.
  • Continuous Deployment/Delivery: Producing software in small cycles that can be reliably released at any time
  • Infrastructure as code: Provisioning and managing the resources through code/definition files
  • Monitoring and logging: Ability to detect and mitigate issues quickly and easily.
Serverless Computing: 
  • A cloud computing execution model in which the cloud provider dynamically manages the allocation of machine resources
  • Pricing is based on resources consumed
Coupling & Cohesion
  • Coupling is the degree of interdependence between the modules
  • Cohesion is the degree to which the elements inside a module belong together
Domain driven design
  • Domain is the area for which software is being developed
  • Model is the representation of domain. A set of abstractions that describes some aspects of domain such as relationships
Challenges
  • Deciding boundaries is evolutionary
  • Configuration management due to the number of environments & instances
    • Configuration server to maintain the properties of all environments & instances
  • Dynamically distributing load
    • Naming server (Eureka) for service registry & service discovery
  • Centralized logging
  • Fault-tolerant systems
Advantages
  • New technology & process adoption(different process in different technologies)
  • Dynamic scaling
  • Faster release cycles
Standardizing
  • Standardize ports & applications
API Gateways - common features that we need to implement for all the microservices. Every call to every microservice needs to be authenticated & authorized.
  • Authentication, Authorization and security
  • Rate limits - # of calls per hour
  • Fault tolerance - default response when any service is down
  • Service aggregation - an external consumer who wants to call 15 services can do so as part of a single call

  • Spring Eureka - naming server for service discovery
  • Ribbon - client-side load balancer
  • Zuul - API Gateway (Zuul filters for logging, etc.)
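A sketch of registering a Spring Boot microservice with the Eureka naming server (assumes the spring-cloud-starter-netflix-eureka-client dependency and a eureka.client.serviceUrl.defaultZone property pointing at the naming server; the service name is a placeholder):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

    // Registers this service with Eureka on startup so that other services
    // (and Ribbon/Zuul) can discover it by its logical name
    @SpringBootApplication
    @EnableEurekaClient
    public class CurrencyServiceApplication {
        public static void main(String[] args) {
            SpringApplication.run(CurrencyServiceApplication.class, args);
        }
    }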
Distributed Tracing
  • Spring Cloud Sleuth - assigns a unique ID to a request so that it can be traced across components/systems
  • RabbitMQ (or we can use Elasticsearch to consolidate the logs)
    • Whenever there is a log message, the microservice puts it on the queue; Zipkin takes it from the queue
  • Zipkin Distributed Tracing Server - to get a consolidated view across all services
Spring Cloud Bus
  • At application startup, all the microservices register with the cloud bus. When there is any change in configuration and refresh is called on any one of the instances, that microservice instance sends an event to the Spring Cloud Bus, which propagates the event to all instances registered with it.
Fault Tolerance with Hystrix
  • Circuit breaker pattern - return a default/fallback response when a downstream service is down instead of propagating the failure
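A sketch of a Hystrix fallback (assumes spring-cloud-starter-netflix-hystrix and @EnableHystrix on the application class; the downstream URL and values are placeholders):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.RestTemplate;

    // If the downstream call fails (or the circuit is open), Hystrix returns
    // the fallback response instead of propagating the error to the caller
    @Service
    public class ExchangeRateClient {

        private final RestTemplate restTemplate = new RestTemplate();

        @HystrixCommand(fallbackMethod = "defaultRate")
        public Double fetchRate(String currency) {
            return restTemplate.getForObject(
                    "http://localhost:8000/rates/" + currency, Double.class);
        }

        // Default response used when the rate service is down
        public Double defaultRate(String currency) {
            return 1.0;
        }
    }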

Sunday, 3 March 2019

Regression Analysis

Inferential Statistics

Regression Analysis is a common method of prediction. It is used whenever there is a causal relationship between variables.

Points to note
  • Correlation doesn't imply causation. Some variables are strangely correlated while a few are unexpectedly not correlated
Linear Regression is a linear approximation of a causal relationship between two or more variables.
  • Regression models are highly valuable as they are one of the most common ways to make inferences and predictions. 
  • Process of Linear regression
    • Get sample data
    • Design model that works for the sample
    • Make predictions for the whole population
    • Dependent variables are predicted from Independent variables.
      Y=F(x1,x2,x3.....)
      Dependent variable Y is a function of the independent variables x1,x2,...
Simple Linear Regression is the simplest regression model.
  • y hat=b0 +b1x1 (hat stands for estimated or predicted value)
    • b0 is intercept on the line graph
    • b1 is the slope of the line
Correlation vs Regression
  • Correlation doesn't imply causation. It is degree of relationship between two variables.
  • Correlation is degree of inter relation between two variables. 
  • Regression Analysis is about how one variable effects another. 
  • Regression is based on causality. It shows not merely a degree of connection but cause and effect.
  • Correlation P(x,y) is the same as P(y,x)
  • Regression is one-way
  • Correlation can be represented as a single point on a graph.
  • Regression is the best-fitting line through the data points, the one that minimizes the distance between them.
Decomposing Linear model
  • Sum of squares total (SST) = sigma(yi - ybar)^2 - difference between the observed values & the mean
  • Sum of squares regression (SSR) = sigma(yhat - ybar)^2 - difference between the predicted values & the mean
  • Sum of squares error (SSE) = sigma(yi - yhat)^2 - difference between the observed & predicted values
SST=SSR+SSE
Total variability = Explained variability + unexplained variability

R ^2(R squared ) = SSR/SST = variability explained by the regression/total variability
    The R-squared shows how much of the total variability of the dataset is explained by your regression model. This may be expressed as: how well your model fits your data. It is incorrect to say your regression line fits the data, as the line is the geometrical representation of the regression equation. It is also incorrect to say the data fits the model or the regression line, as you are trying to explain the data with a model, not vice versa.
  • R Squared measures the goodness of fit of your model
  • The more factors you include, the higher the R-squared
  • R Squared ranges between 0 & 1. 1 means the model explains entire variability of data.
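Written out (ybar is the sample mean of y, yhat the predicted value), the decomposition and R-squared are:

    \underbrace{\sum_i (y_i - \bar{y})^2}_{SST}
      = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{SSR}
      + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{SSE},
    \qquad
    R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}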

Ordinary Least Squares (minimizes SSE)
    = min sigma(ei^2)
    S(b) is the OLS objective (sum of squared errors) for a linear regression; the OLS estimate of beta minimizes it
    S(b) = sigma(yi - xi^T b)^2 = (y - Xb)^T (y - Xb)
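For reference, the standard closed-form solution that minimizes this sum of squared errors is:

    \hat{\beta} = \arg\min_b \,(y - Xb)^{\top}(y - Xb) = (X^{\top}X)^{-1}X^{\top}y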

Regression Tables
  • Model summary
    • Multiple R
    • R square
    • Adjusted R Square
    • Standard error = sqrt(SSE/(n-2))
    • Observations
  • Anova table (Analysis of Variance)
    • SSE
    • SSR
    • SST 
  • Table with coefficients (this is the heart of the regression)
    • intercept (beta 0)
    • independent variable (beta 1)

Adjusted R Square
  • It penalizes excessive use of variables
  • The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
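The usual formula, with n observations and p predictors:

    \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}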

Saturday, 2 March 2019

Statistics for Data Science & Business Analysis

Types of Data
    • Categorical - 
      • Categories or groups like car brands 
      • Questions to Yes/No answers
    • Numerical
      • Discrete (finite number of values)
      • Continuous (infinitely many values)
 Measurement Levels
  • Qualitative
    • Nominal (cannot be ordered)
      • Car brands, seasons in year, 
    • Ordinal
      • tastes disgusting, unappetizing, neutral, tasty, delicious
  • Quantitative
    • Interval
      • Doesn't have a true zero
      • Temperatures
    • Ratio
      • Has a true zero
 Representation of Categorical variables.
  • Frequency distribution tables
  • Bar charts
  • Pie charts(relative percentage/frequency)
  • Pareto diagrams
    • This is special type of Bar chart, where categories are shown in descending order of frequency. And a curve on the same graph showing cumulative frequency. 
    • This combines strong sides of the bar and pie charts
Pareto Principle
80% of the effect comes from 20% of the causes.(80-20 rule).
Numerical Variables
  • Frequency distribution table with intervals
  • interval width = (largest number-smallest number)/#desired intervals
  • relative frequency which is equal to frequency/total frequency
Representation of Numerical variables.
  • Histogram
Represent Relationship between two variables
  • Cross tables -> Variation of Bar chart or side by side bar chart
  • Scatter plots-> two numerical variables
Measures of Central Tendency
  • Mean(denoted by mu or xbar)
    • Simple average 
    • Downside: Affected by outliers. Hence, not enough to make definite conclusions
  • Median
    • Middle number
  • Mode
    • Value that occurs most often
Measures of Central tendency should be used together rather than independently. There is no single best measure. But using only one is definitely worse.

Measures of Asymmetry
  • Skewness - indicates whether the data is concentrated on one side
    • Positive skew or right skew
      • Outliers are towards the right in the graph
      • mean > median
      • Tail is towards the right, i.e. towards the outliers
    • Zero skew
      • mean = median = mode
    • Negative skew or left skew
      • mean < median
      • Outliers are towards the left in the graph
      • Tail leads towards the outliers (to the left)
Measures of Variability
  • Variance
    • It measures the dispersion of set of data points around their mean
    • (Population) sigma^2 = sum of squared differences between the observed values and the population mean, divided by #observations
    • (Sample) S^2 = sum of squared differences between the observed values and the sample mean, divided by (#sample observations - 1)
    • Why do we square
      • Dispersion is distance and cannot be negative. Hence we square the differences.
      • It amplifies large differences
    • Why is sample variance bigger than population variance
      • They are corrected upwards, to reflect higher potential variability
  • Standard Deviation(sigma)
    • Sqrt of Population Variance
    • Sqrt of Sample Variance
  • Coefficient of Variation(cv)
    • Standard deviation / mean OR standard deviation relative to mean
    • This is also called relative standard deviation
    • cv = sigma/mu (population)
    • cv-hat = S/xbar (sample)
    • Why cv
      • Standard deviation is the most common measure of variability for a single Dataset
      • Comparing the standard deviations of 2 datasets (on different scales) is meaningless, but comparing their CVs is not
      • Comparing CVs means comparing the variability of the 2 datasets
Measures of Relationship between Variables
  • Covariance
    • When two variables are correlated, the main statistic to measure this correlation is called covariance.
    • cov(x,y) = Sxy = sum of (xi - xbar)*(yi - ybar) / (n-1)
  • Linear Correlation Coefficient
    • cov(x,y) / (std(x)*std(y)) = Sxy / (Sx*Sy)
    • The correlation coefficient is always between -1 & 1
    • Correlation of 0 means there is no linear relationship between them
    • Correlation between x & y is the same as the correlation between y & x
    • Causality
      • Direction of causal relationships
    • Correlation doesn't imply causation
    • Correlation <0.2 is considered as very low and can be disregard 
Inferential/predictive Statistics
  • Predicting Population value based on Sample data
Distributions(Probability distributions) - It is a function that shows the possible values for a variable and how often they occur

  • Normal(Gaussian distribution) or bell curve
    • N~(mu,variance)
    • mean=median=mode
    • It is symmetrical around mean
    • Standard Normal Distribution
      • Z~N(0,1) 
      • z score: z = (x - mu)/sigma
    • Central Limit Theorem(Sampling distribution of the mean)
      • No matter the distribution of the population, the sampling distribution of the mean approximates a normal distribution. Its mean is the same as the population mean & its variance is the population variance divided by the sample size
      • Sampling distribution ~ N(mu, variance/n)
      • Standard error is the standard deviation of the distribution formed by the sample means (see the formulas after this list)
  • Binomial
  • Uniform
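The central limit theorem result, the standard error and the z-score from the Normal distribution notes above, written out:

    \bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right),
    \qquad SE = \frac{\sigma}{\sqrt{n}},
    \qquad z = \frac{x - \mu}{\sigma}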
Estimators & Estimates 
  • An estimator is an approximation depending solely on sample information. A specific value is called an estimate.
  • There may be many estimators of the same variable. They all have 2 properties
    • Efficiency
    • Bias
  • We judge estimators by looking for the most efficient unbiased estimator. An unbiased estimator has an expected value equal to the population parameter.
  • Types of estimate
    • Point estimate
      • e.g. the sample mean xbar is a point estimate of the population mean mu. Anything systematically added on top of mu is bias
      • S^2 is an estimate of sigma^2
      • The most efficient estimator is the unbiased estimator with the smallest variance.
    • Confidence interval estimates
      • Confidence interval is the range within which you expect the population parameter to be.
      • Confidence level is denoted by 1 - alpha (alpha is between 0 & 1). If the confidence level required is 99%, then alpha is 1%
      • Formula for confidence interval is
        [point estimate - reliability factor * standard error, point estimate + reliability factor * standard error]
        [xbar - z(alpha/2) * sigma/sqrt(n), xbar + z(alpha/2) * sigma/sqrt(n)]
    • Confidence interval when the population variance is known
      • [xbar - z(alpha/2) * sigma/sqrt(n), xbar + z(alpha/2) * sigma/sqrt(n)]
    • Confidence interval when the population variance is unknown
      • [xbar - t(n-1, alpha/2) * S/sqrt(n), xbar + t(n-1, alpha/2) * S/sqrt(n)]
    • The quantity 'reliability factor * standard error' (half the width of the interval) is also called the margin of error.
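The two interval formulas side by side:

    \text{known } \sigma^2:\; \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
    \qquad\qquad
    \text{unknown } \sigma^2:\; \bar{x} \pm t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}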
Confidence intervals for two means with dependent samples
  • eg: Blood samples before and after pill
  • Take the difference between the before and after values for each observation
  • Calculate the confidence interval with the differences
Confidence intervals for two means with independent samples
  • Known population variances(sample sizes can be different)
    • variance of the difference = variance of set one/set one size + variance of set two/set two size
  • Confidence interval
    • (xbar - ybar) ± z(alpha/2) * sqrt(variance of set one/set one size + variance of set two/set two size)

Confidence intervals for two means with independent samples
    • Unknown population variances, assumed equal (sample sizes can be different)
      • Pooled variance = ((nx-1)*sx^2 + (ny-1)*sy^2) / (nx+ny-2)
  • Confidence interval (see the formula below, which uses the pooled variance and a t statistic)
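Written out, the pooled variance and the resulting confidence interval (a t interval with nx + ny - 2 degrees of freedom, assuming equal population variances):

    s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2},
    \qquad
    (\bar{x} - \bar{y}) \pm t_{n_x+n_y-2,\,\alpha/2}\,\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}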

A hypothesis is an idea that can be tested. (A confidence interval alone doesn't give a yes/no answer; the approach for that is a hypothesis test.)
Alpha (significance level) - the probability of rejecting the null hypothesis when it is in fact true.
  1. Formulate a hypothesis
  2. find right test
  3. execute the tests
  4. make a decision
Type 1 error: Rejecting a true null hypothesis. This is also called a false positive and is denoted by alpha.
Type 2 error: Accepting a false null hypothesis. This is also called a false negative and is denoted by beta.

Regression Analysis




------------
Discrete probability distributions
  • Binomial distribution
    • Each trial has a fixed probability of success
    • The number of trials/sample size is fixed
    • Trials are identical, and each trial has the same possible outcomes
    • The probability of getting x successes in n trials is
      • nCx * p^x * q^(n-x)
  • Multinomial distribution
    • It is a generalization of the binomial distribution
    • This has many outcomes, each with a fixed probability
    • The probability is computed as
      • n! / (x1! x2! x3! ...) * p1^x1 * p2^x2 * ...
  • Hypergeometric distribution
    • A sample of n individuals selected WITHOUT replacement
    • The probability is
      • (MCx * (N-M)C(n-x)) / NCn
        • x is the number of successes in a sample of n
        • N - total population size
        • n - selected sample size
        • M - total number of successes in the population
  • Poisson distribution
    • Counting the number of times an event occurs in an interval of time, area, volume etc.
    • If X is the number of occurrences in an interval of fixed length t, then the probability distribution formula is
      • (lambda*t)^x * e^(-lambda*t) / x!
      • lambda is the average number of occurrences of the event in an interval of length 1
  • Geometric distribution
    • A trial is repeated until a success occurs
      • p * q^(x-1)
      • mean = 1/p
      • variance = q/p^2
  • Negative binomial distribution
    • Number of trials until the rth success
    • When r=1, the negative binomial becomes the geometric distribution
    • (x-1)C(r-1) * q^(x-r) * p^r