This is my first attempt on implementing a statistical method. This problem was given to me by lejmer (who also helped me later on building more efficient code to achieve this) when we were debating on the need for higher resource allocation to run scripts so it can run longer and faster. The major problem faced by those who want to implement statistics based methods is that they run out of processing time or need to limit the data samples. My point was that such things need be implemented with an algorithm which suits pine instead of trying to port a python code directly. And yes, I am able to demonstrate that by using this implementation of Chatterjee Correlation.
🎲 What is Chatterjee Correlation? The Chatterjee rank Correlation Coefficient (CCC) is a method developed by Sourav Chatterjee which can be used to study non linear correlation between two series. Full documentation on the method can be found here: arxiv.org/pdf/1909.10140.pdf
Algorithm can be simplified as follows: 1. Get the ranks of X 2. Get the ranks of Y 3. Sort ranks of Y in the order of X (Lets call this SortedYIndices) 4. Calculate the sum of adjacent Y ranks in SortedYIndices (Lets call it as SumOfAdjacentSortedIndices) 5. And finally the correlation coefficient can be calculated by using simple formula CCC = 1 - (3*SumOfAdjacentSortedIndices)/(n^2 - 1)
🎲 Looks simple? What is the catch? Mistake many people do here is that they think in Python/Java/C etc while coding in Pine. This makes code less efficient if it involves arrays and loops. And the simple code may look something like this.
But, problem here is the number of loops run. Remember pine executes the code on every bar. There are loops run in array.sort_indices and another loop we are running to calculate SumOfAdjacentSortedIndices. Due to this, chances of program throwing runtime errors due to script running for too long are pretty high. This limits greatly the number of samples against which we can run the study. The options to overcome are
Limit the sample size and calculate only between certain bars - this is not ideal as smaller sets are more likely to yield false or inconsistent results.
Start thinking in pine instead of python and code in such a way that it is optimised for pine. - This is exactly what we have done in the published code.
🎲 How to think in Pine? In order to think in pine, you should try to eliminate the loops as much as possible. Specially on the data which is continuously growing.
My first thought was that sorting takes lots of time and need to find a better way to sort series - specially when it is a growing data set. Hence, I came up with this library which implements Binary Insertion Sort. https://www.tradingview.com/script/ZudhdK9O-BinaryInsertionSort/
Replacing array.sort_indices with binary insertion sort will greatly reduce the number of loops run on each bar. In binary insertion sort, the array will remain sorted and any item we add, it will keep adding it in the existing sort order so that there is no need to run separate sort. This allows us to work with bigger data sets and can utilise full 20,000 bars for calculation instead of few 100s.
However, last loop where we calculate SumOfAdjacentSortedIndices is not replaceable easily. Hence, we only limit these iterations to certain bars (Even though we use complete sample size). Plots are made for only those bars where the results need to be printed.
🎲 Implementation Current implementation is limited to few combinations of x and fixed y. But, will be converting this into library soon - which means, programmers can plug any x and y and get the correlation.
Our X here can be
Average volume
ATR
And our Y is distance of price from moving average - which identifies trend.
Thus, the indicator here helps to understand the correlation coefficient between volume and trend OR volatility and trend for given ticker and timeframe. Value closer to 1 means highly correlated and value closer to 0 means least correlated. Please note that this method will not tell how these values are correlated. That is, we will not be able to know if higher volume leads to higher trend or lower trend. But, we can say whether volume impacts trend or not.
Please note that values can differ by great extent for different timeframes. For example, if you look at 1D timeframe, you may get higher value of correlation coefficient whereas lower value for 1m timeframe. This means, volume to trend correlation is higher in 1D timeframe and lower in lower timeframes.