Comparison of Linear Regression and LSTM (Long Short-Term Memory) in Cryptocurrency Prediction

Cryptocurrency, particularly Bitcoin, has become a major topic in the financial and digital trading sectors due to its ability to facilitate direct transactions without intermediaries and the transparency offered by blockchain technology. However, the high volatility of Bitcoin prices necessitates accurate prediction methods to support better investment decisions. This research aims to compare the accuracy of Linear Regression and Long Short-Term Memory (LSTM) methods in predicting Bitcoin prices using historical data from Yahoo Finance. The research process begins with the collection of historical Bitcoin price data from September 17, 2014, to July 15, 2024, followed by data processing that includes cleaning and splitting the dataset into training and test data. Linear Regression and LSTM models are fitted to the training data and then tested to evaluate their performance in price prediction. The findings indicate that the LSTM model significantly outperforms the Linear Regression model in terms of prediction accuracy, achieving much lower Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), an R² score of 1.00 (to two decimal places) on both the training and test data, and an F1 Score of 0.99. In contrast, the Linear Regression model shows higher errors and F1 Scores of 0.88 (training) and 0.89 (test), indicating its limitations in capturing the complexities of Bitcoin price dynamics. These findings suggest that LSTM is more effective in modeling temporal patterns and fluctuations in Bitcoin prices, providing better accuracy and guidance for investors in this highly dynamic market.


Introduction
Cryptocurrency has become a frequently discussed topic in recent years. As the first digital currency to use cryptographic systems for direct transactions between two parties without intermediaries, cryptocurrency has seen rapid growth in the financial, business, and trading sectors [1]. Cryptocurrency represents the first implementation of blockchain technology, utilizing a distributed system and consensus-based database with high cryptographic security and transparency. This enables the use of a distributed and immutable ledger, ensuring that every transaction cannot be manipulated, thereby eliminating the need for a trusted third party [2].
One of the most famous cryptocurrencies is Bitcoin, introduced by Satoshi Nakamoto in January 2009. Bitcoin is governed by an open-source software system that allows anyone to modify it [3]. Since its introduction, Bitcoin has shown remarkable value growth with significant price fluctuations, peaking in November 2021 at $68,000 per coin. However, the Bitcoin market exhibits very high volatility, up to 10 times higher than the volatility of foreign exchange rates [4]. Figure 1 illustrates the historical growth of Bitcoin, highlighting significant price peaks and fluctuations. This contextualizes the volatility and the need for accurate predictive models in cryptocurrency trading.
In the context of cryptocurrency price research and analysis, various methods have been used to predict price movements. Linear Regression is one of the commonly used methods due to its simplicity [5]. It is used to build a model that identifies the linear relationship between independent variables (such as opening price, highest price, and trading volume) and the dependent variable (closing price) [6].
Linear Regression is chosen for its simplicity and ability to establish a linear relationship between historical prices and future prices. However, it has limitations in capturing the non-linear patterns and temporal dependencies inherent in financial time series data. Alternatively, the Long Short-Term Memory (LSTM) method, a type of artificial neural network, has gained significant attention. LSTM is capable of handling sequential data, such as stock or cryptocurrency price data, and is designed to model temporal patterns in Bitcoin price data, which can aid in future price predictions [7].
LSTM, a type of recurrent neural network, is selected for its ability to capture long-term dependencies, making it well-suited for predicting highly volatile and temporally dependent cryptocurrency prices. Using both models allows us to compare a traditional statistical approach with a more advanced machine learning method [8].
A previous study conducted by Khalis Sofi, Aswan, Supriyadi Sunge, Sasmitoh Rahmad Riady, and Antika Zahrotul Kamalia in 2021 compared the linear regression, LSTM, and GRU algorithms for predicting stock prices. The results demonstrated that LSTM had an advantage in stock price prediction. The study reported an RMSE value of 0.048, an MSE of 0.002, and an MAE of 0.038 for LSTM, whereas for Linear Regression, the RMSE value was 4.621, the MSE was 2.136, and the MAE was 2.890 [9].
This research aims to improve the accuracy of cryptocurrency price predictions, particularly Bitcoin, by utilizing historical data from Yahoo Finance. The methods employed, namely Linear Regression and Long Short-Term Memory (LSTM), are expected to address the limitations of previous predictions by identifying temporal patterns and enhancing prediction accuracy. This, in turn, provides more accurate information regarding cryptocurrency price fluctuations to support more effective investment decision-making and increase investor confidence in this dynamic market.
Based on the problem formulation above, the research questions are as follows:
1. How does the performance of linear regression compare to LSTM in predicting Bitcoin prices?
2. Which method provides higher prediction accuracy and reliability in the context of the high volatility often seen in cryptocurrency prices?

Research Methodology
This research was carried out in several stages. It began with problem identification, followed by a literature review, data collection, data processing, model implementation using Linear Regression and LSTM, and analysis and comparison, and it concluded with results and conclusions. The stages involved in the research process are illustrated in Figure 2.

Problem Identification
The research began by identifying the main problem to be addressed, which is to improve the accuracy of Bitcoin price predictions by comparing the performance of Linear Regression and LSTM.

Literature Review
This stage involves reviewing relevant literature to understand the context and theories underlying the research. The literature review helps identify gaps in previous research and forms the basis for the approach used in this study.

Data Collection
Historical Bitcoin price data was obtained from Yahoo Finance, covering the period from September 17, 2014, to July 15, 2024. The collected data includes variables such as date, opening price, highest price, lowest price, closing price, adjusted price, and trading volume.

Data Processing
The collected data is processed through the following stages:

Reading and Processing Data
The data is read from a CSV file containing the columns Date, Open, High, Low, Close, and Volume. The Date column is converted to a date format and set as the index of the data frame. Subsequently, the data is sorted by date to ensure the correct order and displayed for an initial check to verify the accuracy and consistency of the information.

Data Selection and Cleaning
The Close column is selected as the target variable to be predicted, while all other columns are used as features. Next, the data is cleaned by removing rows containing missing values (NaN) to ensure the quality and consistency of the dataset.
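The reading, sorting, and cleaning steps described above can be sketched as follows. The text refers to a data frame; this sketch instead uses only the Python standard library so that it runs without dependencies. The sample rows, the `load_prices` helper, and the `FEATURES` list are hypothetical and are not taken from the study's dataset.

```python
import csv
import io
from datetime import datetime

# Hypothetical sample in the described column layout; values are made up.
SAMPLE_CSV = """\
Date,Open,High,Low,Close,Volume
2024-01-03,102.0,108.0,101.0,107.0,1500
2024-01-01,100.0,105.0,99.0,104.0,1200
2024-01-02,104.0,106.0,100.0,,1300
"""

FEATURES = ["Open", "High", "Low", "Volume"]  # assumed feature columns
TARGET = "Close"                              # target named in the text


def load_prices(text):
    """Parse CSV text, drop rows with missing values, and sort by date."""
    rows = []
    for record in csv.DictReader(io.StringIO(text)):
        # Skip rows where any needed field is empty (the NaN-removal step).
        if any(record[col].strip() == "" for col in FEATURES + [TARGET]):
            continue
        row = {"Date": datetime.strptime(record["Date"], "%Y-%m-%d")}
        for col in FEATURES + [TARGET]:
            row[col] = float(record[col])
        rows.append(row)
    # Sort chronologically so later steps see the correct order.
    rows.sort(key=lambda r: r["Date"])
    return rows


rows = load_prices(SAMPLE_CSV)
X = [[r[c] for c in FEATURES] for r in rows]  # feature matrix
y = [r[TARGET] for r in rows]                 # closing prices
```

In this sample, the row for 2024-01-02 is dropped because its Close value is missing, and the remaining two rows come out in date order.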

Data Splitting
The dataset is divided into training and testing data using `train_test_split()` from scikit-learn, with 80% of the data used for training and 20% for testing. This splitting ensures that the model is trained on the training data and validated on unseen testing data.
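To make the 80/20 split concrete, the following NumPy sketch shows what such a split does. It is not scikit-learn's implementation; `split_80_20`, its `shuffle` flag, and the seed are assumptions for illustration. The text does not state whether the rows were shuffled. scikit-learn's `train_test_split()` shuffles by default, whereas a chronological split (`shuffle=False` here) keeps all test rows later in time than the training rows, which is often preferred for time series.

```python
import numpy as np


def split_80_20(X, y, test_size=0.2, shuffle=True, seed=0):
    """Split arrays into train and test parts (hypothetical helper)."""
    X = np.asarray(X)
    y = np.asarray(y)
    n = len(X)
    n_test = max(1, int(round(n * test_size)))
    idx = np.arange(n)
    if shuffle:
        # Random permutation of row indices, reproducible via the seed.
        idx = np.random.default_rng(seed).permutation(n)
    train_idx, test_idx = idx[: n - n_test], idx[n - n_test:]
    return X[train_idx], X[test_idx], y[train_idx], y[test_idx]


X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Chronological split: the last 20% of rows become the test set.
X_tr, X_te, y_tr, y_te = split_80_20(X_demo, y_demo, shuffle=False)

# Shuffled split: rows are assigned at random.
Xs_tr, Xs_te, ys_tr, ys_te = split_80_20(X_demo, y_demo, shuffle=True)
```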

Model Implementation
Linear Regression and Long Short-Term Memory (LSTM) models are applied to the processed data. The models are trained using the training set and tested using the testing set to evaluate their performance.

Linear Regression
The process of predicting Bitcoin prices using linear regression involves several important steps. First, Bitcoin price data is collected and prepared, including normalization to ensure consistent feature scaling. Next, the data is split into two sets: a training set to build the model and a testing set for evaluation. The linear regression model is built using the formula:
$$Y = \beta_0 + \beta_1 X + \varepsilon$$
where $Y$ is the predicted Bitcoin price, $X$ represents independent variables such as historical prices, $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient, and $\varepsilon$ is the prediction error. After the model is trained using the training data, predictions are made on the testing data, and the results are compared with the actual prices to evaluate the model's accuracy [10].
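As an illustration of how the intercept and regression coefficients can be estimated, the sketch below fits an ordinary least squares model with NumPy. It is a minimal example on synthetic data, not the study's code; `fit_linear_regression` and `predict` are hypothetical helper names, and the true coefficients (3, 2, -1) are chosen only so the result can be checked.

```python
import numpy as np


def fit_linear_regression(X, y):
    """Estimate the intercept and coefficients by least squares."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first coefficient is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef[0], coef[1:]


def predict(intercept, weights, X):
    """Apply the fitted linear model to new feature rows."""
    return intercept + np.asarray(X, dtype=float) @ weights


# Synthetic, noise-free data: y = 3 + 2*x1 - 1*x2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 2))
y_demo = 3.0 + 2.0 * X_demo[:, 0] - 1.0 * X_demo[:, 1]

intercept, weights = fit_linear_regression(X_demo, y_demo)
y_hat = predict(intercept, weights, X_demo)
```

Because the synthetic data has no noise, the fit recovers the generating coefficients up to floating-point error; on real price data the residual term would be non-zero.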

LSTM
The process of predicting Bitcoin prices using LSTM begins with the collection and normalization of Bitcoin price data to ensure consistent feature scaling. The data is then split into training and testing sets. The LSTM model is built using Keras, which involves components such as the forget gate to determine which information to discard, the input gate to decide which new information to store, and the output gate to determine the next hidden state value. After training the model, predictions are made on the testing data, and the predicted results are compared with the actual prices to evaluate the model's performance [11].
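The normalization step, and the reshaping that a sequence model typically needs, might look like the sketch below. The text does not specify the scaler or how input sequences were formed, so min-max scaling, the sliding-window construction, and the window length of 3 are all assumptions made for illustration.

```python
import numpy as np


def minmax_fit(train_values):
    """Return the min and max of the training values only."""
    lo = float(np.min(train_values))
    hi = float(np.max(train_values))
    if hi == lo:
        raise ValueError("constant series cannot be min-max scaled")
    return lo, hi


def minmax_apply(values, lo, hi):
    """Scale values so the training range maps onto [0, 1]."""
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)


def make_windows(series, window):
    """Turn a 1-D series into (samples, window, 1) inputs and next-step targets."""
    series = np.asarray(series, dtype=float)
    X = np.stack([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    return X[..., np.newaxis], y


prices = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]  # made-up closing prices
lo, hi = minmax_fit(prices)
scaled = minmax_apply(prices, lo, hi)
X_seq, y_seq = make_windows(scaled, window=3)   # window length is an assumption
```

Fitting the scaler on the training portion only, and reusing `lo` and `hi` for the test portion, avoids leaking test-set information into training.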
The main formulas used in LSTM (Long Short-Term Memory) include [12]:
• Forget Gate: Determines which part of the previous cell state to discard.
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
Where $\sigma$ is the sigmoid function, $W_f$ is the weight matrix for the forget gate, $h_{t-1}$ is the previous hidden state, $x_t$ is the input at time $t$, and $b_f$ is the bias term for the forget gate.
• Input Gate: Decides the extent of new information to add to the cell state.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
Where $W_i$ is the weight matrix for the input gate and $b_i$ is the bias term for the input gate.
• Candidate Cell State: Generates candidate updates for the new cell state, with values between -1 and 1.
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
Where $\tanh$ is the hyperbolic tangent function, $W_C$ is the weight matrix for the cell state, and $b_C$ is the bias term for the cell state.
• Cell State Update: Combines old and new information to update the cell state.
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$
Where $C_{t-1}$ is the previous cell state.
• Output Gate: Regulates the information output from the cell state.
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
Where $W_o$ is the weight matrix for the output gate and $b_o$ is the bias term for the output gate.
• Hidden State: Generates the output based on the updated cell state and output gate.
$$h_t = o_t \odot \tanh(C_t)$$
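The six gate computations listed above can be sketched as a single forward step of an LSTM cell in NumPy. This is an illustrative reimplementation of the standard LSTM formulation, not the Keras layer used in the study; `lstm_step`, `init_params`, the hidden size of 4, and the random initialization are assumptions.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM time step; p maps names like 'W_f', 'b_f' to arrays."""
    z = np.concatenate([h_prev, x_t])            # [h_{t-1}, x_t]
    f_t = sigmoid(p["W_f"] @ z + p["b_f"])       # forget gate
    i_t = sigmoid(p["W_i"] @ z + p["b_i"])       # input gate
    c_hat = np.tanh(p["W_C"] @ z + p["b_C"])     # candidate cell state
    c_t = f_t * c_prev + i_t * c_hat             # cell state update
    o_t = sigmoid(p["W_o"] @ z + p["b_o"])       # output gate
    h_t = o_t * np.tanh(c_t)                     # hidden state
    return h_t, c_t


def init_params(input_size, hidden_size, seed=0):
    """Small random weights and zero biases (hypothetical initialization)."""
    rng = np.random.default_rng(seed)
    shape = (hidden_size, hidden_size + input_size)
    params = {}
    for gate in ("f", "i", "C", "o"):
        params[f"W_{gate}"] = rng.normal(scale=0.1, size=shape)
        params[f"b_{gate}"] = np.zeros(hidden_size)
    return params


# Run three made-up scaled prices through the cell, carrying state forward.
params = init_params(input_size=1, hidden_size=4)
h, c = np.zeros(4), np.zeros(4)
for price in [0.1, 0.2, 0.3]:
    h, c = lstm_step(np.array([price]), h, c, params)
```

The cell state `c` is what lets information persist across many steps, since it is only rescaled by the forget gate and incremented by the gated candidate. A trained model would learn the weights rather than draw them at random.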

Analysis and Comparison
The next step is analysis and comparison, where the results of applying linear regression and LSTM to cryptocurrency prediction are evaluated. This stage presents a detailed assessment of each model's performance in predicting cryptocurrency prices, as well as a comparison between the results obtained from the two methods. The findings from this analysis are used to draw conclusions about the advantages and disadvantages of each model in the context of cryptocurrency prediction.

Dataset Management
Historical Bitcoin price data was obtained from a CSV file covering the period from September

Linear Regression
The linear regression model is trained using the training data and tested with the test data. Evaluation is performed by calculating the Mean Squared Error (MSE), Root Mean Squared Error (RMSE), R² Score, and F1 Score to measure the model's performance. The evaluation results are presented in Table 1.
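For reference, the three regression metrics named above can be computed as in the following sketch, which uses the standard definitions. The `regression_metrics` helper and the sample numbers are hypothetical.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, and R² for a set of predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    mse = float(np.mean(residuals ** 2))
    rmse = float(np.sqrt(mse))
    ss_res = float(np.sum(residuals ** 2))                 # unexplained variation
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))  # total variation
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "rmse": rmse, "r2": r2}


# Made-up values: three exact predictions and one that is off by 2.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 6.0])
```

Note that RMSE is in the same units as the target (USD here), while R² is unitless, so a model can show an RMSE in the hundreds of dollars and still have an R² close to 1 when prices range over tens of thousands of dollars.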

LSTM
The LSTM model is built using the Keras framework. This model is trained with the training data and tested with the test data. Evaluation is conducted by calculating MSE, RMSE, R² Score, and F1 Score. The evaluation results are presented in Table 2.
Figure 4 shows the predicted Bitcoin prices using the Linear Regression model on the test data. As in the previous graph, the x-axis represents the date, and the y-axis represents the Bitcoin price in USD. The blue line depicts the actual Bitcoin prices, while the red dashed line illustrates the predicted prices on the test data. This graph visualizes the performance of the Linear Regression model on data that was not used in training, thereby indicating the model's ability to generalize.
On the test data, the F1 Scores highlight that the LSTM maintains its superior predictive capability with a score of 0.99, while linear regression achieves an F1 Score of 0.89. This indicates that while linear regression provides reasonable results, it struggles to capture complex price dynamics, particularly in the context of Bitcoin's high volatility.

Limitations and Recommendations
The linear regression model may not adequately capture the nonlinear patterns in Bitcoin price data, which can affect prediction accuracy. Additionally, the size of the dataset and feature selection can also impact model performance. While the LSTM model has demonstrated strong performance, its effectiveness can also be influenced by the chosen hyperparameters and the amount of data used. We recommend further tuning of LSTM hyperparameters and considering additional features to enhance model performance in the future.

Conclusion
Overall, the LSTM model demonstrates superior performance in terms of prediction accuracy compared to Linear Regression. While Linear Regression still provides reasonable results, LSTM offers better accuracy and may be better suited to capturing complex temporal patterns in the data, making it a more effective choice for predicting Bitcoin prices.

Figure 3
Figure 3 shows the predicted Bitcoin prices using the Linear Regression model on the training data. In the graph, the x-axis represents the date, and the y-axis represents the Bitcoin price in USD. The blue line displays the actual Bitcoin prices, while the orange dashed line indicates the predicted prices on the training data. This graph provides an overview of how well the Linear Regression model fits the data it was trained on.
In the analysis of the Long Short-Term Memory (LSTM) model, the evaluation results show strong performance on both the training and test data. On the training data, the model achieved a Mean Squared Error (MSE) of 1,192,873.76 and a Root Mean Squared Error (RMSE) of 1,092.19, with an R² Score of 1.00 (to two decimal places). This indicates that the LSTM model fits the training data with very high accuracy and low prediction error relative to the price range. On the test data, the model demonstrated an MSE of 949,865.30 and an RMSE of 974.61, with the R² Score remaining at 1.00. This suggests that the LSTM model is not only highly effective in learning from the training data but also generalizes well to unseen data. The F1 Score for both the training and test data was 0.99, indicating a strong balance between precision and recall. This high F1 Score reinforces the model's ability to minimize false positives and false negatives and to capture the underlying patterns and trends in the data on both datasets.
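F1 is a classification metric, and the text does not state how the continuous price predictions were converted into classes. One possible convention, assumed here purely for illustration, is to label each step as "up" when the price is above the previous actual price and to compare actual and predicted labels. The `direction_f1` helper and the sample prices below are hypothetical.

```python
import numpy as np


def direction_f1(y_true, y_pred):
    """F1 for the 'price went up' class, relative to the previous actual price."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    prev = y_true[:-1]
    actual_up = y_true[1:] > prev   # did the price actually rise?
    pred_up = y_pred[1:] > prev     # did the model predict a rise?
    tp = int(np.sum(actual_up & pred_up))
    fp = int(np.sum(~actual_up & pred_up))
    fn = int(np.sum(actual_up & ~pred_up))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


actual = [10.0, 12.0, 11.0, 13.0, 14.0]     # made-up prices
predicted = [10.0, 11.0, 12.5, 12.0, 13.5]  # made-up predictions
score = direction_f1(actual, predicted)
```

In this example the model predicts a rise at every step, so it catches all three real rises (recall 1.0) but raises one false alarm (precision 0.75). Other conventions, such as thresholding the percentage error, would give different F1 values.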

Figure 5
Figure 5 shows the predicted Bitcoin prices using the LSTM model on the training data. In the graph, the x-axis represents the date, and the y-axis represents the Bitcoin price in USD. The blue line displays the actual Bitcoin prices, while the orange dashed line indicates the predicted prices on the training data. This graph provides an overview of how well the LSTM model fits the data it was trained on.

Figure 6. Test Data Prediction Graph for LSTM

Table 1. Evaluation Table for the Linear Regression Model
In the analysis of the Linear Regression model, the evaluation results indicate reasonably good performance but with some limitations. On the training data, the model produced a Mean Squared Error (MSE) of 130,148,318.66 and a Root Mean Squared Error (RMSE) of 11,408.26. The R² Score on the training data was 0.63, suggesting that the model explains about 63% of the variability in the training data, leaving 37% unexplained. On the test data, the model showed an MSE of 127,306,850.20 and an RMSE of 11,283.03, with a slightly higher R² Score of 0.66. This indicates that the linear regression model performs comparably on the test data, with slightly lower error than on the training data. The F1 Score on the training data was 0.88, and on the test data it improved slightly to 0.89. These scores reflect a balance between precision and recall, suggesting that the model limits false positives and false negatives reasonably well. However, there is still room for improvement in prediction accuracy on both the training and test data, as roughly a third of the variability remains unexplained.

Table 2. Evaluation Table for the LSTM Model

Table 3
As illustrated in Table 3, the LSTM model demonstrates considerably lower Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) than linear regression on the training data, with values of 1,192,873.76 and 1,092.19, while linear regression reports an MSE of 130,148,318.66 and an RMSE of 11,408.26. Additionally, LSTM achieves an R² Score of 1.00 (to two decimal places), in contrast to the 0.63 achieved by the linear regression model. In terms of F1 Score, which balances precision and recall, the LSTM model reaches 0.99 on the training data, compared to 0.88 for linear regression.
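A short worked calculation shows how an R² Score that rounds to 1.00 can coexist with an RMSE above 1,000 USD. It assumes that both models were scored against the same training targets in the same units, and it uses the rounded R² of 0.63 reported for linear regression, so the results are approximate. Since $R^2 = 1 - \mathrm{MSE} / \mathrm{Var}(y)$, the linear regression figures imply the variance of the closing price:

$$\mathrm{Var}(y) \approx \frac{130{,}148{,}318.66}{1 - 0.63} \approx 3.52 \times 10^{8}$$

Substituting the LSTM training MSE then gives:

$$R^2_{\mathrm{LSTM}} \approx 1 - \frac{1{,}192{,}873.76}{3.52 \times 10^{8}} \approx 0.997$$

Under these assumptions the LSTM's R² is about 0.997, which is reported as 1.00 at two decimal places. The residual error is small relative to the spread of prices but is not zero.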

Table 4
Table 4 further reinforces these findings, showing that on the test data the LSTM model continues to perform well, with an MSE of 949,865.30 and an RMSE of 974.61, while linear regression records higher errors, with an MSE of 127,306,850.20 and an RMSE of 11,283.03. The R² Score is 1.00 for LSTM and 0.66 for linear regression.
The evaluation of the Linear Regression and Long Short-Term Memory (LSTM) models reveals a clear difference in prediction accuracy. The Linear Regression model yields higher Mean Squared Error (MSE) and Root Mean Squared Error (RMSE) values, with 130,148,318.66 and 11,408.26 on the training data, and 127,306,850.20 and 11,283.03 on the test data. The R² Scores are 0.63 for the training data and 0.66 for the test data, indicating that while the model explains a substantial portion of the data variability, significant errors remain. Additionally, the F1 Score for the Linear Regression model is 0.88 on the training data and 0.89 on the test data, reflecting its ability to balance precision and recall but also highlighting its limitations in capturing complex patterns. In contrast, the LSTM model reports significantly lower MSE and RMSE values, with 1,192,873.76 and 1,092.19 on the training data, and 949,865.30 and 974.61 on the test data. The R² Score of 1.00 (to two decimal places) on both datasets shows that the LSTM model explains nearly all of the variability in the data, with small residual error. The F1 Score for the LSTM model stands at 0.99 on both the training and test datasets, underscoring its strong balance between precision and recall.