The intent of this paper is to draw a baseline for the knowledge I have gathered while diving into the ML area so far (work in progress) and to briefly describe the key points that will serve as anchors for refreshing it in memory later.

If you imagine a neural network as a black box, you pass in a dataset that describes the classification problem you are working on, and you get back a dataset of predictions, where each item is the class the model matched to the original input with the highest probability among all classes.

In a nutshell, this black box is a math function

    classes = dataset * weights + biases

which is called a decision rule; how it is trained is additionally controlled by hyperparameters chosen beforehand. A composition of such decision rules (one per layer) is called a model.
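To make this concrete, here is a rough sketch of such a decision rule in numpy (the dataset values, shapes and number of classes below are made up for illustration):

    import numpy as np

    # toy dataset: 4 samples with 3 features each (made-up numbers)
    dataset = np.array([[0.1, 0.2, 0.7],
                        [0.9, 0.1, 0.0],
                        [0.3, 0.3, 0.4],
                        [0.5, 0.4, 0.1]])

    n_classes = 2
    weights = np.random.randn(3, n_classes)  # found during training
    biases = np.zeros(n_classes)             # found during training

    scores = dataset @ weights + biases      # the decision rule
    classes = scores.argmax(axis=1)          # class with the highest score per sample
    print(classes)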

To make this model work on the whole dataset, we have to find the right weights and hyperparameters, those for which this equation holds (as closely as possible) for every input. We have to formalise this additional requirement.

This formalisation is done with the cost and loss functions: the loss function represents the difference between a correct label (from the training data) and the predicted one for a single example, while the cost function is the mean of the loss over all input items (y - f(x) vs sum(y - f(x))/n). The cost function is what gets minimised to find the best weights. (Cost and loss are often mixed up and used as synonyms, usually in the sense of the cost function.)
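A minimal sketch of the difference, using a squared error as the per-example loss (the labels and predictions below are made up):

    import numpy as np

    def loss(y, y_pred):
        # loss: difference between the correct label and the prediction, per example
        return (y - y_pred) ** 2

    def cost(y, y_pred):
        # cost: mean of the per-example losses over the whole dataset
        return np.mean(loss(y, y_pred))

    y = np.array([1.0, 0.0, 1.0])        # correct labels
    y_pred = np.array([0.8, 0.2, 0.6])   # model predictions
    print(cost(y, y_pred))               # mean of (0.04, 0.04, 0.16) = 0.08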

Often the cost function also includes regularisation, an extra term that penalises the model (typically large weights), with the intent to prevent overfitting (L1, L2).
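For example, an L2 penalty simply adds the sum of squared weights to the cost; an L1 penalty would use the sum of absolute values instead (a sketch, with a made-up lambda):

    import numpy as np

    lam = 0.01   # regularisation strength, a hyperparameter

    def cost_with_l2(y, y_pred, weights):
        data_term = np.mean((y - y_pred) ** 2)
        penalty = lam * np.sum(weights ** 2)   # L2: large weights get penalised
        return data_term + penalty             # L1 would use: lam * np.sum(np.abs(weights))

    w = np.array([0.5, -1.2, 0.3])
    print(cost_with_l2(np.array([1.0, 0.0]), np.array([0.7, 0.1]), w))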

The minimization of the cost function is done via gradient descent. A full explanation of GD is out of scope for this paper, but I want to stress the following (see the sketch after the matrix below):
- split the complex (compound) function into partial functions
- take the derivative of each function with respect to the weight matrix
- build the Jacobian: the derivative of a complex (compound) function is taken via the chain rule, and the derivatives of the individual functions give us the Jacobian. In the case of D(classes = dataset*weights + biases), each row of the Jacobian is a partial derivative with respect to W_ij, filled with zeros in the places where that W_ij does not appear and with the corresponding input component where it does.


    [
        [D1 G1   ...   Dnt G1],
                 ...
        [D1 Gt   ...   Dnt Gt]
    ]

    where Gi = Wi1*x1 + Wi2*x2 + ... is the i-th output and Dj denotes the partial derivative with respect to the j-th entry of the flattened weight matrix (n*t entries in total).
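To make this structure concrete, here is a rough sketch in numpy: for G = W*x the derivative of output G_i with respect to weight W_jk is x_k when i == j and zero otherwise, so each row of the Jacobian is mostly zeros with a copy of the input in the block of "its own" weights (shapes and values below are made up):

    import numpy as np

    x = np.array([2.0, 3.0])      # input vector, n = 2 features
    W = np.random.randn(3, 2)     # weight matrix, t = 3 outputs
    G = W @ x                     # G_i = W_i1*x_1 + W_i2*x_2

    t, n = W.shape
    J = np.zeros((t, t * n))      # Jacobian of G w.r.t. the flattened W
    for i in range(t):            # output index
        for j in range(t):        # weight row index
            for k in range(n):    # weight column index
                # dG_i / dW_jk = x_k if i == j else 0
                J[i, j * n + k] = x[k] if i == j else 0.0
    print(J)                      # each row: zeros except a copy of x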
			

Jacobian of ML model

Training a neural network is done via a forward pass, where the output of each layer is the input of the next layer and the result at each layer is the product of the feature vector and the weight matrix (plus the bias). After that, an output function such as softmax with a cross-entropy loss is applied to obtain the loss and the gradient, and back propagation is performed from the last layer to the first one with that gradient as input. At the last step the regularisation term is applied.
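A minimal sketch of one forward and backward pass for a two-layer network with softmax cross-entropy and L2 regularisation (the layer sizes, toy data, learning rate and lambda below are made up; this is an illustration, not a full training loop):

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(5, 4)          # 5 samples, 4 features (toy data)
    y = np.array([0, 1, 2, 1, 0])      # labels for 3 classes

    W1, b1 = np.random.randn(4, 8) * 0.1, np.zeros(8)
    W2, b2 = np.random.randn(8, 3) * 0.1, np.zeros(3)
    lr, lam = 0.1, 0.01                # learning rate and L2 strength

    # forward pass: the output of each layer is the input of the next
    h = np.maximum(0, X @ W1 + b1)     # hidden layer with ReLU activation
    scores = h @ W2 + b2

    # softmax + cross-entropy cost (mean over samples) + L2 regularisation
    exp = np.exp(scores - scores.max(axis=1, keepdims=True))
    probs = exp / exp.sum(axis=1, keepdims=True)
    data_cost = -np.log(probs[np.arange(len(y)), y]).mean()
    cost = data_cost + lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))

    # backward pass: start from the gradient of the cost w.r.t. the scores
    dscores = probs.copy()
    dscores[np.arange(len(y)), y] -= 1
    dscores /= len(y)

    dW2 = h.T @ dscores + 2 * lam * W2
    db2 = dscores.sum(axis=0)
    dh = dscores @ W2.T
    dh[h <= 0] = 0                     # gradient through the ReLU
    dW1 = X.T @ dh + 2 * lam * W1
    db1 = dh.sum(axis=0)

    # one gradient descent step
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
    print(cost)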

You have to be careful when extracting and preparing the input dataset.
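For instance, one common preparation step is standardising the features to zero mean and unit variance, which usually helps gradient descent converge (the feature values below are made up):

    import numpy as np

    X = np.array([[180.0, 75.0],
                  [165.0, 60.0],
                  [172.0, 68.0]])   # e.g. height in cm, weight in kg

    # fit the scaler on the training data only, then reuse mean/std for new data
    mean, std = X.mean(axis=0), X.std(axis=0)
    X_scaled = (X - mean) / std
    print(X_scaled)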

TODO: Cross validation, activation functions, metrics: confusion matrix, precision and recall, F1 score (macro, micro, harmonic), necessary math, best practices for debugging the model

A few related posts:
Getting skills in ML Computer Vision
Case study: ML. Housing price prediction