Saturday, 30 March 2019

WebServices Security

Areas to be addressed as part of security (WS-Security)
  • Authentication
    • Username token profile, X.509 certificates, SAML (Single Sign-On)
  • Confidentiality
    • Encryption/Decryption
  • Integrity
    • Signatures(hash value of message)
  • Non-repudiation
    • Timestamp (prevents replay attacks)
Stateless Authentication Mechanisms
  • Basic Auth
    • Client sends username/password for every request as state is not maintained on the server
    • Username/password details are sent as part of the request header
    • Concatenate username & password with ':' as delimiter
    • Base 64 encode the concatenated string
    • The encoded string is passed as the value of the header key 'Authorization' with the prefix 'Basic '
    • This is not secure as the encoded string can easily be decoded
    • So always send it over HTTPS to protect it (see the sketch after this list)
    • Base 64 encoding is used to handle non-HTTP-compatible characters in the username/password
    • Advantages
      • Simple, stateless on the server, supported by all browsers
    • Disadvantages
      • requires https to protect
      • It is subject to replay attacks
      • Logout is tricky (browser caching)
  • Digest access Authentication
    • This mechanism hashes the credentials instead of sending them in clear text (https://en.wikipedia.org/wiki/Digest_access_authentication)
  • Asymmetric cryptography
    • https://en.wikipedia.org/wiki/Public-key_cryptography
    • Both client and server use public/private key pairs
  • OAuth
    • https://en.wikipedia.org/wiki/OAuth
  • JSON Web Tokens
    • https://en.wikipedia.org/wiki/JSON_Web_Token
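A minimal sketch of building the Basic Auth header described above, using Java 11's built-in HttpClient (the endpoint URL and credentials are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;
    import java.nio.charset.StandardCharsets;
    import java.util.Base64;

    public class BasicAuthExample {
        public static void main(String[] args) throws Exception {
            // Concatenate username & password with ':' as the delimiter
            String credentials = "myUser" + ":" + "myPassword";
            // Base 64 encode the concatenated string
            String encoded = Base64.getEncoder()
                    .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
            // Pass it as the 'Authorization' header with the 'Basic ' prefix;
            // always over HTTPS, since Base 64 is trivially reversible
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/resource"))
                    .header("Authorization", "Basic " + encoded)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }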
Interceptors & Filters
  • Interceptors are designed to manipulate entities (input and output streams)
  • Filters manipulate headers/URIs/metadata information
  • Interceptors manipulate actual body of request/response
  • Filters are used for cross cutting concerns like Logging, security
  • Interceptors are used to Encode an entity response
  • Filters & Interceptors work on Client too.

SOAP
  1. UsernameToken
    • Most widely used approach for SOAP-based web services
    • It defines a standard to pass username & password inside the SOAP header
    • Steps to configure
      • Create interceptors in cxf-servlet.xml
      • Create a password callback handler (a sketch follows below)
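A sketch of the password callback handler (the WSPasswordCallback package name varies with the CXF/WSS4J version - older releases use org.apache.ws.security.WSPasswordCallback - and the username/password are placeholders):

    import java.io.IOException;
    import javax.security.auth.callback.Callback;
    import javax.security.auth.callback.CallbackHandler;
    import javax.security.auth.callback.UnsupportedCallbackException;
    import org.apache.wss4j.common.ext.WSPasswordCallback;

    // Supplies the password that the WSS4J interceptor checks against the
    // UsernameToken received in the incoming SOAP header
    public class ServerPasswordCallback implements CallbackHandler {
        @Override
        public void handle(Callback[] callbacks)
                throws IOException, UnsupportedCallbackException {
            for (Callback callback : callbacks) {
                if (callback instanceof WSPasswordCallback) {
                    WSPasswordCallback pc = (WSPasswordCallback) callback;
                    if ("myUser".equals(pc.getIdentifier())) {
                        pc.setPassword("myPassword"); // normally looked up from a user store
                    }
                }
            }
        }
    }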
Encryption/Decryption Concepts
  • Symmetric or private key
    • Sender encrypts the data with a secret key
    • Receiver decrypts with the same secret key
    • This is expensive as the vendor has to maintain a separate key for every application/user
  • Public key cryptography
    • Data is encrypted with the public key
    • Decryption is done with the private key
    • The private key cannot be derived even if a hacker knows the public key
    • RSA is a public key encryption algorithm
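A minimal sketch of public key encryption/decryption with the standard JDK crypto API (RSA, with a 2048-bit key pair generated on the fly just for illustration):

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import javax.crypto.Cipher;

    public class RsaExample {
        public static void main(String[] args) throws Exception {
            // Generate an RSA public/private key pair
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keyPair = generator.generateKeyPair();

            // Data is encrypted with the public key
            Cipher cipher = Cipher.getInstance("RSA");
            cipher.init(Cipher.ENCRYPT_MODE, keyPair.getPublic());
            byte[] encrypted = cipher.doFinal("secret data".getBytes(StandardCharsets.UTF_8));

            // Decryption is done with the private key
            cipher.init(Cipher.DECRYPT_MODE, keyPair.getPrivate());
            byte[] decrypted = cipher.doFinal(encrypted);
            System.out.println(new String(decrypted, StandardCharsets.UTF_8));
        }
    }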
Java Keytool
  • Key and Certificate management utility
  • When Public & private key are generated KEYSTORE file will be created. This is the place where Private and Public keys are stored. This file is password protected. 
  • For each private key, we give an alias/username & also a password
  • Public key can be exported into Certificate which can be distributed across our client applications. 
  • keytool -genkeypair -alias mykey -keypass mykeypass -keystore mykeystore.jks -storepass mystorepass -validity 100 -dname "cn=Venkat Desu, ou=ws, o=VenkatInc, c=IN"
  • Export public key out of keystore(to distribute to client)
    • keytool -export -rfc -keystore mykeystore.jks -storepass mystorepass -alias mykey -file MyCert.cer 
  • Import certificates into alternate keystores
    • keytool -import -trustcacerts -keystore servicekeystore.jks -storepass mystorepass -alias mykey -file MyCert.cer -noprompt
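A sketch of reading the generated keystore from Java, reusing the alias and passwords from the keytool command above:

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import java.security.PrivateKey;
    import java.security.cert.Certificate;

    public class KeystoreExample {
        public static void main(String[] args) throws Exception {
            // Open the password-protected keystore created by keytool
            KeyStore keyStore = KeyStore.getInstance("JKS");
            try (FileInputStream in = new FileInputStream("mykeystore.jks")) {
                keyStore.load(in, "mystorepass".toCharArray());
            }
            // Each private key is addressed by its alias and key password
            PrivateKey privateKey =
                    (PrivateKey) keyStore.getKey("mykey", "mykeypass".toCharArray());
            // The certificate wraps the public key that is distributed to clients
            Certificate certificate = keyStore.getCertificate("mykey");
            System.out.println(privateKey.getAlgorithm() + " / "
                    + certificate.getPublicKey().getAlgorithm());
        }
    }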
Signatures
  • To ensure integrity of data, i.e. that it is not tampered with on the way
  • It is a fixed-length value that is calculated from the content of the message by applying an algorithm to it. This value is also known as a hash. The hash is signed using a private key
  • The hash/signature is sent to the server side along with the message
  • On the server side, the hash is recalculated and verified using the public key corresponding to that private key
  • The content is deemed not tampered with when both hashes are the same
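A minimal sketch of signing and verifying a message with the JDK Signature API (a SHA-256 hash signed with an RSA private key and verified with the public key):

    import java.nio.charset.StandardCharsets;
    import java.security.KeyPair;
    import java.security.KeyPairGenerator;
    import java.security.Signature;

    public class SignatureExample {
        public static void main(String[] args) throws Exception {
            KeyPairGenerator generator = KeyPairGenerator.getInstance("RSA");
            generator.initialize(2048);
            KeyPair keyPair = generator.generateKeyPair();
            byte[] message = "important payload".getBytes(StandardCharsets.UTF_8);

            // Sender: hash the message and sign the hash with the private key
            Signature signer = Signature.getInstance("SHA256withRSA");
            signer.initSign(keyPair.getPrivate());
            signer.update(message);
            byte[] signature = signer.sign();

            // Receiver: recompute the hash and verify it with the public key
            Signature verifier = Signature.getInstance("SHA256withRSA");
            verifier.initVerify(keyPair.getPublic());
            verifier.update(message);
            System.out.println("Untampered: " + verifier.verify(signature));
        }
    }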
Non Repudiation
  • TimeStamp
    • This has both creation & expiry times
    • The server rejects the message when the current time is past the expiry time
OAuth2 security for REST
  • RESTful applications are slightly different from web applications. Web applications are used directly by the end user, and it is the end user who authenticates. With RESTful applications, it is typically a client application that authenticates against the RESTful service.



Friday, 29 March 2019

HTTP Clients

HTTP Requests
  • Initial Line
    • method (GET, POST, Delete..)
    • path
  • Headers
    • Language
    • client version
    • type(response type)
  • Body
    • input fields, user/pwd details

HTTP Response
  • Initial/status Line
    • Status code
    • message
    • version
  • Headers
    • type
    • length
    • Language
  • Body(optional)
    • formatted/plain response
HTML forms in browsers support ONLY GET & POST requests.
GET
  • Doesn't have BODY
  • Pass parameters directly in URL(Query String)
POST
  • To Create
Put and Patch
  • To update existing resource
  • PUT to update whole resource
  • PATCH to update some of the attributes of a resource
  • Don't receive/support multipart form data; they receive/support only form-urlencoded bodies.
Delete
  • form-urlencoded method for body
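A minimal sketch of the GET (query string, no body) and POST (form-urlencoded body) requests described above, using Java 11's HttpClient; the URLs and parameters are placeholders:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class HttpMethodsExample {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();

            // GET: no body, parameters travel in the URL as a query string
            HttpRequest get = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/users?id=42"))
                    .GET()
                    .build();

            // POST: creates a resource, body sent as form-urlencoded
            HttpRequest post = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/users"))
                    .header("Content-Type", "application/x-www-form-urlencoded")
                    .POST(HttpRequest.BodyPublishers.ofString("name=venkat&role=admin"))
                    .build();

            System.out.println(client.send(get, HttpResponse.BodyHandlers.ofString()).statusCode());
            System.out.println(client.send(post, HttpResponse.BodyHandlers.ofString()).statusCode());
        }
    }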
HTTP Status/Response
  • 404 - resource doesn't exist
  • https://en.wikipedia.org/wiki/List_of_HTTP_status_codes

Oauth
  • grant_type: client_credentials //In body
    • client_id
    • client_secret
  • grant_type:password
    • Client credentials
    • User credentials
Ways to send access token to the actual request
  • Body of the request
    • access_token:<value of the token>
  • Header of the request
    • Authorization: Bearer <value of the token>
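A sketch of passing the access token in the Authorization header (the endpoint and token value are placeholders):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class BearerTokenExample {
        public static void main(String[] args) throws Exception {
            String accessToken = "<value of the token>"; // obtained earlier from the token endpoint
            // Token sent in the request header rather than the body
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create("https://example.com/api/orders"))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET()
                    .build();
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
        }
    }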

Sunday, 17 March 2019

Microservices

Webservice - Software system designed to support interoperable machine-to-machine interaction over a network.

Microservices are small apps that together make up a bigger app and communicate by means of uniformly defined interfaces.
  • REST
  • Small, well-chosen deployable units
  • Cloud enabled
An architectural approach for developing a single application as a suite of small services, each running in its own process and communicating with lightweight mechanisms.

Advantages and Enhancements compared to monolithic application.
  • Monoliths
    • Intimidating to new developers
    • Overburdened developer environment - lower productivity
    • Huge code base
    • Scaling can be very difficult, as scaling up (getting better hardware) is the only option.
    • Scaling out, which brings (near) infinite capacity, is NOT an option here.
    • The entire application has to be upgraded whenever an underlying Java/Java EE/app server upgrade happens.
    • Things may get worse when one tries to use different languages or frameworks for different parts of the application.
  • Microservices
    • Individually deployable with each having separate storage system
    • Decentralized data management
    • Decentralized governance 
    • Decomposition of the application into smaller services brings smaller and reasonable code bases which are much more developer friendly. 
    • We can scale out each part independently depending on load.
    • Independent upgrades from rest of the system. 
    • Flexibility of technological choices both in frameworks and languages.
    • Easier to evolve
Microservices are not a silver bullet. They bring in quite a lot of complexity in management, maintenance, deployment, coordination and system design. While individual components are simple, the big picture is more complex.
Breaking Monolith into Microservices
  • There is no single way of breaking it up
  • Start with the app's functionality and business capabilities.
  • Try to isolate components that perform a single business capability. Identify their dependencies and connections and draw boundaries so that two components are not directly connected.
  • Where to start
    • Begin with the component with least connections/dependencies
    • Separate database access and entity definitions and not share them between components. 
    • Each micro component is responsible for its data.
    • Avoid EJB remote or similar calls and SWITCH to HTTP (SOAP/REST)
    • Avoid stateful timers and schedulers - externalize them
    • Separate UI
    • Services everywhere for everything
    • Microservices should be stateless.
Java Microservice Frameworks overview
  • Java EE based- Payara Micro
    • Payara is derived from GlassFish and provides a full app server
    • The Micro version is cut down to run from a single JAR for microservices. Internally it is still an app server
  • Combination of Java EE & other libraries - Dropwizard
    • Java framework for developing lightweight services
    • Servlet, Jax-RS and Bean validation from Java EE
    • Several other libraries like Jackson, JDBI, Guava, Liquibase, and so on
    • This is a healthy mix of both worlds - enterprise and non-enterprise
    • It can be added as a dependency and run directly from a fat JAR
  •  Spring based - Spring boot
    • Enables/supports Standalone spring applications
    • Tailored for modular, lightweight apps or microservices
Stateless microservices do not keep any data that is required to execute future requests
  • Use exclusively "Application Scoped" or "Request Scoped" CDI beans, or singleton instances in general
  • SPA web applications are ideal as they keep state in the user's browser.
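A sketch of a stateless, application-scoped CDI bean (no per-user state is held between requests; the tax rate is just an illustrative value):

    import javax.enterprise.context.ApplicationScoped;

    // One shared instance for the whole application; it keeps no data that
    // later requests depend on, so any instance on any node can serve a call
    @ApplicationScoped
    public class PriceCalculator {
        public double withTax(double net) {
            return net * 1.19; // pure function of its input
        }
    }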


  • Continuous Integration: Practice of merging all developers' code changes into a shared copy as frequently as possible.
  • Continuous Deployment/Delivery: Producing software in small cycles that can be reliably released at any time
  • Infrastructure as code: Provisioning and managing the resources through code/definition files
  • Monitoring and logging: Ability to detect and mitigate issues quickly and easily.
Serverless Computing: 
  • A cloud computing execution model in which the cloud provider dynamically manages the allocation of machine resources
  • Pricing is based on resources consumed
Coupling & Cohesion
  • Coupling is the degree of interdependence between the modules
  • Cohesion is the degree to which the elements inside a module belong together
Domain driven design
  • Domain is the area for which software is being developed
  • Model is the representation of domain. A set of abstractions that describes some aspects of domain such as relationships
Challenges
  • Deciding boundaries is evolutionary
  • Configuration management due to the number of environments & instances
    • Configuration server to maintain the properties of all environments & instances
  • Dynamically distributing load
    • Naming server (Eureka) for service registry & service discovery
  • Centralized logging
  • Fault-tolerant systems
Advantages
  • New technology & process adoption(different process in different technologies)
  • Dynamic scaling
  • Faster release cycles
Standardizing
  • Standardize ports & applications
API Gateways - common features that we need to implement for all the microservices. Every call to every microservice needs to be authenticated & authorized.
  • Authentication, Authorization and security
  • Rate limits - # of calls per hour
  • Fault tolerance - default response when any service is down
  • Service aggregation - an external consumer who wants to call 15 services can do so as part of a single call

  • Spring Eureka - naming server for service discovery
  • Ribbon - client-side load balancer
  • Zuul - API Gateway (Zuul filters for logging, etc.)
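A sketch of registering a Spring Boot microservice with the Eureka naming server (assumes the spring-cloud-starter-netflix-eureka-client dependency and a eureka.client.serviceUrl.defaultZone property pointing at the naming server; the service name is a placeholder):

    import org.springframework.boot.SpringApplication;
    import org.springframework.boot.autoconfigure.SpringBootApplication;
    import org.springframework.cloud.netflix.eureka.EnableEurekaClient;

    // Registers this service with Eureka on startup so that other services
    // (and Ribbon/Zuul) can discover it by its logical name
    @SpringBootApplication
    @EnableEurekaClient
    public class CurrencyServiceApplication {
        public static void main(String[] args) {
            SpringApplication.run(CurrencyServiceApplication.class, args);
        }
    }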
Distributed Tracing
  • Spring Cloud Sleuth - assigns a unique ID to a request so that it can be traced across components/systems
  • RabbitMQ (or we can use Elasticsearch to consolidate the logs)
    • Whenever there is a log message, the microservice puts it on the queue; Zipkin takes it from the queue
  • Zipkin Distributed Tracing Server - to get a consolidated view across all services
Spring Cloud Bus
  • At application startup, all the microservices register with the cloud bus. When there is any change in configuration and refresh is called on any one of the instances, that microservice instance sends an event to the Spring Cloud Bus, which propagates the event to all instances registered with it.
Fault Tolerance with Hystrix
  • Circuit breaker pattern - return a default/fallback response when a downstream service is down instead of propagating the failure
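A sketch of a Hystrix fallback (assumes spring-cloud-starter-netflix-hystrix and @EnableHystrix on the application class; the downstream URL and values are placeholders):

    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.springframework.stereotype.Service;
    import org.springframework.web.client.RestTemplate;

    // If the downstream call fails (or the circuit is open), Hystrix returns
    // the fallback response instead of propagating the error to the caller
    @Service
    public class ExchangeRateClient {

        private final RestTemplate restTemplate = new RestTemplate();

        @HystrixCommand(fallbackMethod = "defaultRate")
        public Double fetchRate(String currency) {
            return restTemplate.getForObject(
                    "http://localhost:8000/rates/" + currency, Double.class);
        }

        // Default response used when the rate service is down
        public Double defaultRate(String currency) {
            return 1.0;
        }
    }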

Sunday, 3 March 2019

Regression Analysis

Inferential Statistics

Regression Analysis is a common method of prediction. It is used whenever there is a causal relationship between variables.

Points to note
  • Correlation doesn't imply causation. Some variables are strangely correlated while a few are unexpectedly not correlated
Linear Regression is a linear approximation of a causal relationship between two or more variables.
  • Regression models are highly valuable as they are one of the most common ways to make inferences and predictions. 
  • Process of Linear regression
    • Get sample data
    • Design model that works for the sample
    • Make predictions for the whole population
    • Dependent variables are predicted from Independent variables.
      Y=F(x1,x2,x3.....)
      Dependent variable Y is a function of the independent variables x1,x2,...
Simple Linear Regression is the simplest regression model.
  • y hat=b0 +b1x1 (hat stands for estimated or predicted value)
    • b0 is intercept on the line graph
    • b1 is the slope of the line
Correlation vs Regression
  • Correlation doesn't imply causation. It is degree of relationship between two variables.
  • Correlation is degree of inter relation between two variables. 
  • Regression Analysis is about how one variable effects another. 
  • Regression is based on causality. It shows not merely a degree of connection but cause and effect.
  • Correlation P(x,y) is the same as P(y,x)
  • Regression is one-way
  • Correlation can be represented as a single point on a graph.
  • Regression is the best-fitting line through the data points, the one that minimizes the distance between them.
Decomposing Linear model
  • Sum of squares total (SST) = sigma(yi - ybar)^2 - difference between the observed values & the mean
  • Sum of squares regression (SSR) = sigma(yhat - ybar)^2 - difference between the predicted values & the mean
  • Sum of squares error (SSE) = sigma(yi - yhat)^2 - difference between the observed & predicted values
SST=SSR+SSE
Total variability = Explained variability + unexplained variability

R ^2(R squared ) = SSR/SST = variability explained by the regression/total variability
    The R-squared shows how much of the total variability of the dataset is explained by your regression model. This may be expressed as: how well your model fits your data. It is incorrect to say your regression line fits the data, as the line is the geometrical representation of the regression equation. It is also incorrect to say the data fits the model or the regression line, as you are trying to explain the data with a model, not vice versa.
  • R Squared measures the goodness of fit of your model
  • The more factors you include, the higher the R-squared
  • R Squared ranges between 0 & 1. 1 means the model explains entire variability of data.
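Written out (ybar is the sample mean of y, yhat the predicted value), the decomposition and R-squared are:

    \underbrace{\sum_i (y_i - \bar{y})^2}_{SST}
      = \underbrace{\sum_i (\hat{y}_i - \bar{y})^2}_{SSR}
      + \underbrace{\sum_i (y_i - \hat{y}_i)^2}_{SSE},
    \qquad
    R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}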

Ordinary Least Squares (minimizes SSE)
    = min sigma(ei^2)
    S(b) is the OLS objective (sum of squared errors) for a linear regression; the OLS estimate of beta minimizes it
    S(b) = sigma(yi - xi^T b)^2 = (y - Xb)^T (y - Xb)
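For reference, the standard closed-form solution that minimizes this sum of squared errors is:

    \hat{\beta} = \arg\min_b \,(y - Xb)^{\top}(y - Xb) = (X^{\top}X)^{-1}X^{\top}y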

Regression Tables
  • Model summary
    • Multiple R
    • R square
    • Adjusted R Square
    • Standard error = sqrt(SSE/(n-2))
    • Observations
  • Anova table (Analysis of Variance)
    • SSE
    • SSR
    • SST 
  • Table with coefficients (this is the heart of the regression)
    • intercept (beta 0)
    • independent variable (beta 1)

Adjusted R Square
  • It penalizes excessive use of variables
  • The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance.
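The usual formula, with n observations and p predictors:

    \bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - p - 1}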

Saturday, 2 March 2019

Statistics for Data Science & Business Analysis

Types of Data
    • Categorical - 
      • Categories or groups like car brands 
      • Questions to Yes/No answers
    • Numerical
      • Discrete (finite number of values)
      • Continuous (infinitely many values)
 Measurement Levels
  • Qualitative
    • Nominal (cannot be ordered)
      • Car brands, seasons in year, 
    • Ordinal
      • tastes disgusting, unappetizing, neutral, tasty, delicious
  • Quantitative
    • Interval
      • Doesn't have a true zero
      • Temperatures
    • Ratio
      • Has a true zero
 Representation of Categorical variables.
  • Frequency distribution tables
  • Bar charts
  • Pie charts(relative percentage/frequency)
  • Pareto diagrams
    • This is special type of Bar chart, where categories are shown in descending order of frequency. And a curve on the same graph showing cumulative frequency. 
    • This combines strong sides of the bar and pie charts
Pareto Principle
80% of the effect comes from 20% of the causes.(80-20 rule).
Numerical Variables
  • Frequency distribution table with intervals
  • interval width = (largest number-smallest number)/#desired intervals
  • relative frequency which is equal to frequency/total frequency
Representation of Numerical variables.
  • Histogram
Represent Relationship between two variables
  • Cross tables -> Variation of Bar chart or side by side bar chart
  • Scatter plots-> two numerical variables
Measures of Central Tendency
  • Mean(denoted by mu or xbar)
    • Simple average 
    • Downside: Affected by outliers. Hence, not enough to make definite conclusions
  • Median
    • Middle number
  • Mode
    • Value that occurs most often
Measures of Central tendency should be used together rather than independently. There is no single best measure. But using only one is definitely worse.

Measures of Asymmetry
  • Skewness - indicates whether the data is concentrated on one side
    • Positive skew or right skew
      • Outliers are towards the right in the graph
      • mean > median
      • Tail is towards the right, i.e. towards the outliers
    • Zero skew
      • mean = median = mode
    • Negative skew or left skew
      • mean < median
      • Outliers are towards the left in the graph
      • Tail leads towards the outliers (to the left)
Measures of Variability
  • Variance
    • It measures the dispersion of set of data points around their mean
    • (Population) sigma^2 = sum of squared differences between the observed values and the population mean, divided by #observations
    • (Sample) S^2 = sum of squared differences between the observed values and the sample mean, divided by (#sample observations - 1)
    • Why do we square
      • Dispersion is distance and cannot be negative. Hence we square the differences.
      • It amplifies large differences
    • Why is sample variance bigger than population variance
      • They are corrected upwards, to reflect higher potential variability
  • Standard Deviation(sigma)
    • Sqrt of Population Variance
    • Sqrt of Sample Variance
  • Coefficient of Variation(cv)
    • Standard deviation / mean OR standard deviation relative to mean
    • This is also called relative standard deviation
    • cv = sigma/mu (population)
    • cv-hat = S/xbar (sample)
    • Why cv
      • Standard deviation is the most common measure of variability for a single Dataset
      • Comparing the standard deviations of 2 datasets (on different scales) is meaningless, but comparing their CVs is not
      • Comparing CVs means comparing the variability of the 2 datasets
Measures of Relationship between Variables
  • Covariance
    • When two variables are correlated, the main statistic to measure this correlation is called covariance.
    • cov(x,y) = Sxy = sum of (xi - xbar)*(yi - ybar) / (n-1)
  • Linear Correlation Coefficient
    • cov(x,y) / (std(x)*std(y)) = Sxy / (Sx*Sy)
    • The correlation coefficient is always between -1 & 1
    • Correlation of 0 means there is no linear relationship between them
    • Correlation between x & y is the same as the correlation between y & x
    • Causality
      • Direction of causal relationships
    • Correlation doesn't imply causation
    • Correlation <0.2 is considered as very low and can be disregard 
Inferential/predictive Statistics
  • Predicting Population value based on Sample data
Distributions(Probability distributions) - It is a function that shows the possible values for a variable and how often they occur

  • Normal(Gaussian distribution) or bell curve
    • N~(mu,variance)
    • mean=median=mode
    • It is symmetrical around mean
    • Standard Normal Distribution
      • Z~N(0,1) 
      • z score: z = (x - mu)/sigma
    • Central Limit Theorem(Sampling distribution of the mean)
      • No matter the distribution of the population, the sampling distribution of the mean approximates a normal distribution. Its mean is the same as the population mean & its variance is the population variance divided by the sample size
      • Sampling distribution ~ N(mu, variance/n)
      • Standard error is the standard deviation of the distribution formed by the sample means (see the formulas after this list)
  • Binomial
  • Uniform
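The central limit theorem result, the standard error and the z-score from the Normal distribution notes above, written out:

    \bar{X} \sim N\!\left(\mu, \frac{\sigma^2}{n}\right),
    \qquad SE = \frac{\sigma}{\sqrt{n}},
    \qquad z = \frac{x - \mu}{\sigma}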
Estimators & Estimates 
  • An estimator is an approximation depending solely on sample information. A specific value is called an estimate.
  • There may be many estimators of the same variable. They all have 2 properties
    • Efficiency
    • Bias
  • We judge estimators by looking for the most efficient unbiased estimator. An unbiased estimator has an expected value equal to the population parameter.
  • Types of estimate
    • Point estimate
      • e.g. the sample mean xbar is a point estimate of the population mean mu. Anything systematically added on top of mu is bias
      • S^2 is an estimate of sigma^2
      • The most efficient estimator is the unbiased estimator with the smallest variance.
    • Confidence interval estimates
      • Confidence interval is the range within which you expect the population parameter to be.
      • Confidence level is denoted by 1 - alpha (alpha is between 0 & 1). If the confidence level required is 99%, then alpha is 1%
      • Formula for confidence interval is
        [point estimate - reliability factor * standard error, point estimate + reliability factor * standard error]
        [xbar - z(alpha/2) * sigma/sqrt(n), xbar + z(alpha/2) * sigma/sqrt(n)]
    • Confidence interval when the population variance is known
      • [xbar - z(alpha/2) * sigma/sqrt(n), xbar + z(alpha/2) * sigma/sqrt(n)]
    • Confidence interval when the population variance is unknown
      • [xbar - t(n-1, alpha/2) * S/sqrt(n), xbar + t(n-1, alpha/2) * S/sqrt(n)]
    • The quantity 'reliability factor * standard error' (half the width of the interval) is also called the margin of error.
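The two interval formulas side by side:

    \text{known } \sigma^2:\; \bar{x} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
    \qquad\qquad
    \text{unknown } \sigma^2:\; \bar{x} \pm t_{n-1,\,\alpha/2}\,\frac{s}{\sqrt{n}}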
Confidence intervals for two means with dependent samples
  • eg: Blood samples before and after pill
  • Take the difference between the before and after values for each observation
  • Calculate the confidence interval with the differences
Confidence intervals for two means with independent samples
  • Known population variances(sample sizes can be different)
    • variance of the difference = variance of set one/set one size + variance of set two/set two size
  • Confidence interval
    • (xbar - ybar) ± z(alpha/2) * sqrt(variance of set one/set one size + variance of set two/set two size)

Confidence intervals for two means with independent samples
    • Unknown population variances, assumed equal (sample sizes can be different)
      • Pooled variance = ((nx-1)*sx^2 + (ny-1)*sy^2) / (nx+ny-2)
  • Confidence interval (see the formula below, which uses the pooled variance and a t statistic)
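Written out, the pooled variance and the resulting confidence interval (a t interval with nx + ny - 2 degrees of freedom, assuming equal population variances):

    s_p^2 = \frac{(n_x - 1)s_x^2 + (n_y - 1)s_y^2}{n_x + n_y - 2},
    \qquad
    (\bar{x} - \bar{y}) \pm t_{n_x+n_y-2,\,\alpha/2}\,\sqrt{\frac{s_p^2}{n_x} + \frac{s_p^2}{n_y}}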

A hypothesis is an idea that can be tested. (A confidence interval alone doesn't give a yes/no answer; the approach for that is a hypothesis test.)
Alpha (significance level) - the probability of rejecting the null hypothesis when it is in fact true.
  1. Formulate a hypothesis
  2. find right test
  3. execute the tests
  4. make a decision
Type 1 error: Rejecting a true null hypothesis. This is also called a false positive and is denoted by alpha.
Type 2 error: Accepting a false null hypothesis. This is also called a false negative and is denoted by beta.

Regression Analysis




------------
Discrete probability distributions
  • Binomial distribution
    • Each trial has a fixed probability of success
    • The number of trials/sample size is fixed
    • Trials are identical, and each trial has the same possible outcomes
    • The probability of getting x successes in n trials is
      • nCx * p^x * q^(n-x)
  • Multinomial distribution
    • It is a generalization of the binomial distribution
    • This has many outcomes, each with a fixed probability
    • The probability is computed as
      • n! / (x1! x2! x3! ...) * p1^x1 * p2^x2 * ...
  • Hypergeometric distribution
    • A sample of n individuals selected WITHOUT replacement
    • The probability is
      • (MCx * (N-M)C(n-x)) / NCn
        • x is the number of successes in a sample of n
        • N - total population size
        • n - selected sample size
        • M - total number of successes in the population
  • Poisson distribution
    • Counting the number of times an event occurs in an interval of time, area, volume etc.
    • If X is the number of occurrences in an interval of fixed length t, then the probability distribution formula is
      • (lambda*t)^x * e^(-lambda*t) / x!
      • lambda is the average number of occurrences of the event in an interval of length 1
  • Geometric distribution
    • A trial is repeated until a success occurs
      • p * q^(x-1)
      • mean = 1/p
      • variance = q/p^2
  • Negative binomial distribution
    • Number of trials until the rth success
    • When r=1, the negative binomial becomes the geometric distribution
    • (x-1)C(r-1) * q^(x-r) * p^r