Excursus: An introduction to databases
In my previous post I explained the SAP Coding Masters Challenge and the assignments we had to solve: analysing data taken from the Twitter API to answer different questions. In this post I will describe the second challenge and give a brief introduction to databases.
Second Challenge
In the second challenge our teams had to solve similar exercises, but this time not with our own technologies but with SAP's new HANA database. The challenge was organised to give us the opportunity to get to know this new technology and to work with it. At this point I would like to show how data is stored in conventional databases and how SAP HANA differs from them.
A Brief Introduction to Databases
Almost every computer program consists of application logic and program data. For a business, the data is even more important than the logic. It is said that a business without application logic adapted to its needs will limp along on both feet, but without its data it will die. You can imagine that a traditional business can have huge amounts of data, and this data has to be managed somehow. That is what a database is for: a database is the data itself together with a data management system, which controls access to the data and executes user queries on it.
You can picture the data as being stored in tables. For example, a TWEETS table storing tweets would have attributes like user_id, tweet_id, tweet_text, geo information and so on. Each row represents a single tweet and each column a corresponding attribute. A tweet can be identified by its tweet_id: if you know a tweet_id, you can unambiguously find the tweet you need. tweet_id is therefore called the primary key of the TWEETS table.
But how can we store the user's name, address or date of birth? If we put them into the TWEETS table, we create redundant data, because this information would have to be saved with every single tweet. And what if the user has many thousands of tweets? An additional disadvantage of redundant data is very painful change management: if any of the user's information changes, it has to be updated in every tweet the user has ever posted. To avoid these redundancies it is better to put all the user information into a separate USERS table. This table contains only user information, and to identify a user we use user_id as the primary key.
OK, now we have the TWEETS and USERS tables, but how can we combine them so that we can find, for example, all tweets of a particular user? For that we have to put a link to the user into each tweet the user has written, and what could be better for this than the user's primary key? So we add user_id to the TWEETS table. In the TWEETS table, user_id is called a foreign key. In this way we have created a relation between TWEETS and USERS. A conventional database has a lot of such relations; that is why it is called a relational database. The TWEETS table then looks like this:
| tweet_id | tweet_text | user_id | latitude | longitude | ... |
|---|---|---|---|---|---|
| 1 | look at my food | user_1 | 12.3456 | 44.2456 | ... |
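To make this concrete, here is a minimal sketch of how such a table could be created and filled in SQL. I am using Python's built-in sqlite3 module here, and the column types are my own assumptions based on the table above, not the schema we actually used in the challenge:

```python
import sqlite3

# An in-memory database is enough for a sketch; a real system
# would of course use a persistent server.
conn = sqlite3.connect(":memory:")

# tweet_id is the primary key: it uniquely identifies every row.
conn.execute("""
    CREATE TABLE tweets (
        tweet_id   INTEGER PRIMARY KEY,
        tweet_text TEXT,
        user_id    TEXT,
        latitude   REAL,
        longitude  REAL
    )
""")
conn.execute(
    "INSERT INTO tweets VALUES (?, ?, ?, ?, ?)",
    (1, "look at my food", "user_1", 12.3456, 44.2456),
)
```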
And the corresponding USERS table:
| user_id | username | name | email | ... |
|---|---|---|---|---|
| user_1 | johndoe | john | johndoe@email.com | ... |
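Continuing the sqlite3 sketch from above, the USERS table and a query that follows the foreign key to find all tweets of a particular user could look like this (again, the schema details are my assumptions):

```python
# The USERS table, with user_id as its primary key.
conn.execute("""
    CREATE TABLE users (
        user_id  TEXT PRIMARY KEY,
        username TEXT,
        name     TEXT,
        email    TEXT
    )
""")
conn.execute(
    "INSERT INTO users VALUES (?, ?, ?, ?)",
    ("user_1", "johndoe", "john", "johndoe@email.com"),
)

# tweets.user_id is the foreign key; a JOIN follows it to collect
# all tweets of one particular user.
rows = conn.execute("""
    SELECT t.tweet_text
    FROM tweets AS t
    JOIN users AS u ON t.user_id = u.user_id
    WHERE u.username = 'johndoe'
""").fetchall()
print(rows)  # [('look at my food',)]
```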
Row-based data storage
Actually, the data cannot be physically stored as tables in memory, because memory is linear. So the information contained in the tables has to be serialized. The simplest way to do this is to store the data row by row. The TWEETS table would be serialized like this:
1 | text1 | user_1 | 12.3456 | 44.2456 | ... | 2 | text2 | user_2 | 12.3456 | 44.2456 | ... |
This is how data has been stored in databases since their creation in the early 1960s. The main advantage of this layout is that new rows can be appended to the end of the table without restructuring it.
Additionally, reading all columns of a row is very fast. But what if we need only a few columns? For example, if we want to find only tweets posted in a particular geographical area, we need just two columns: latitude and longitude. Nevertheless, we have to read much more than that, because the storage medium reads data in chunks, and one chunk can be much larger than the attributes we actually need.
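To make the trade-off concrete, here is a toy sketch in plain Python. It is purely my own simplification of a row-wise layout in linear memory, not how a real engine is implemented:

```python
# Toy model of row-wise storage: all attributes of one tweet lie
# next to each other in one flat list.
FIELDS_PER_ROW = 5  # tweet_id, tweet_text, user_id, latitude, longitude

row_store = [
    1, "text1", "user_1", 12.3456, 44.2456,
    2, "text2", "user_2", 12.3456, 44.2456,
]

# Appending a new tweet is cheap: just extend the end of the list.
row_store.extend([3, "text3", "user_3", 50.1100, 8.6800])

# But a geo query needs only latitude and longitude, and still has
# to step over every other field of every row to reach them.
for i in range(0, len(row_store), FIELDS_PER_ROW):
    latitude, longitude = row_store[i + 3], row_store[i + 4]
    print(latitude, longitude)
```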
Column-based data storage
For the case described above it is better to store the data column-wise. The TWEETS table would then be serialized like this:
1 | 2 | text1 | text2 | user_1 | user_2 | 12.3456 | 12.3456 | 44.2456 | 44.2456 | ... | ... |
Now only the required attributes are read, and the database can answer the geo query more quickly, without retrieving unnecessary information. But if we need to execute an operation that touches many attributes, such as inserting a new tweet, the write has to jump between the different column areas and update every one of them.
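The same toy data laid out column-wise, again only a sketch under the same assumptions as above:

```python
# Toy model of column-wise storage: one list per attribute.
column_store = {
    "tweet_id":   [1, 2],
    "tweet_text": ["text1", "text2"],
    "user_id":    ["user_1", "user_2"],
    "latitude":   [12.3456, 12.3456],
    "longitude":  [44.2456, 44.2456],
}

# The geo query now reads exactly the two columns it needs,
# without touching tweet texts or user ids at all.
for lat, lon in zip(column_store["latitude"], column_store["longitude"]):
    print(lat, lon)

# Inserting a whole new tweet, however, must visit every column.
new_tweet = {"tweet_id": 3, "tweet_text": "text3", "user_id": "user_3",
             "latitude": 50.1100, "longitude": 8.6800}
for column, value in new_tweet.items():
    column_store[column].append(value)
```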
SAP HANA
SAP HANA combines both technologies to get the best of each. The main data is stored column-wise, which makes analytical queries fast. But when a new row has to be inserted into the database, it first goes into a temporary delta store, where data is kept row-wise and can therefore be inserted very quickly. After a certain period of time, or on user request, the delta store is combined with the column-based main store; this process is called a merge. This simple trick, together with many other technologies, allows SAP HANA to execute user requests much faster than traditional databases. The following SAP HANA marketing video shows the performance difference with a race metaphor.
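As a rough illustration of the delta-merge idea, and emphatically not of HANA's actual implementation or API, the mechanism could be modelled like this:

```python
# Toy model: inserts go into a row-wise delta buffer, reads combine
# delta and main store, and a merge folds the delta into the
# column-wise main store.
main_store = {"tweet_id": [1, 2], "tweet_text": ["text1", "text2"]}
delta_store = []  # whole rows, appended as-is: cheap inserts

def insert(row):
    delta_store.append(row)  # fast, row-wise append

def read_column(name):
    # A query sees the main store plus whatever still waits in the delta.
    index = list(main_store).index(name)
    return main_store[name] + [row[index] for row in delta_store]

def merge():
    # Periodically (or on request) fold the delta into the column store.
    for row in delta_store:
        for column, value in zip(main_store, row):
            main_store[column].append(value)
    delta_store.clear()

insert((3, "text3"))
print(read_column("tweet_text"))  # ['text1', 'text2', 'text3']
merge()
```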
Conclusion
OK, this was a very short introduction to database technologies and to HANA in particular. I suppose that for readers without a technical background the topic could be too complex, while for readers with a technical background it is too simple and maybe even boring. But I need these database basics to be able to explain the second and the third challenge in more detail.
That's why I need your feedback. Was it difficult or interesting for you to understand the technical details?
Should I go deeper into the topic, or would you rather read more general things about the competition?
I would be very grateful for every comment.