Friday, December 22, 2006

The invisible I/O thread failures are no more

To get the status of the replication slave, it is possible to check the Last_Error and Last_Errno fields from SHOW SLAVE STATUS. Unfortunately, they only give information about the status of the SQL thread (and not always that either). If the I/O thread fails, for example because the server configuration is not set up correctly or because the connection to the master was lost in a network outage, it is necessary to dig through the error log to find the reason. That might be possible, although annoying, for a DBA, since he has access to the files on the machine where the server runs, but for automatic recovery applications that watch the status of the replication it is not practical. It is also easier to check the status of the server through a normal client connection than to log in to the machine and start locating files.

This is actually quite stupid, especially since it is possible to check individually whether the threads are running. So, to make it possible to check the status of the threads from a client (an application or a user connecting directly to the server), I just added four new fields to the output of SHOW SLAVE STATUS: Last_SQL_Error, Last_SQL_Errno, Last_IO_Error, and Last_IO_Errno. The new fields were added last, and the two old fields Last_Error and Last_Errno are now just aliases for Last_SQL_Error and Last_SQL_Errno respectively. Adding the new fields last and keeping the two old fields intact allows old applications to work as before, since they either use positional arguments or find the columns by name. New applications, however, can take advantage of the new fields.
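
With these new fields, a monitoring application can tell the two threads apart through a normal client connection. Here is a sketch of what that might look like; the field values shown are illustrative, not from a real session (2013 is the client error code for a lost connection):
slave> SHOW SLAVE STATUS\G
...
    Slave_IO_Running: No
   Slave_SQL_Running: Yes
...
       Last_IO_Errno: 2013
       Last_IO_Error: error connecting to master
      Last_SQL_Errno: 0
      Last_SQL_Error:
A non-zero Last_IO_Errno together with a zero Last_SQL_Errno tells the application that it is the connection to the master, and not the applying of events, that needs attention.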

Friday, October 20, 2006

Documentation for the unit tests API used by MySQL

For some time, MySQL has been using a unit test framework based on the Test Anything Protocol (TAP) used by Perl and PHP.

The framework consists of a C library that can be used to generate TAP output suitable for processing with, for example, the Test::Harness Perl module. In order to allow Test::Harness to execute the compiled programs, a simple wrapper called unit.pl exists in the unittest/ directory in the MySQL server tree.
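
To illustrate, here is the kind of output such a test program produces (the test names are made up for this example): the first line is the plan, stating how many tests to expect, each test then reports ok or not ok, and diagnostic lines are prefixed with #. Test::Harness parses exactly this format:
1..3
ok 1 - can connect to server
ok 2 - can create table
not ok 3 - can drop table
# Failed test 'can drop table'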

The documentation for the MyTAP API is available at http://www.kindahl.net/mytap/doc/, until I can find another home for it. All comments are welcome.

Friday, September 15, 2006

Replication and the disappearing statements

After reading Todd Huss's blog post about the gotcha when using statement-based replication, where statements can "disappear" (that is, not be applied to the slave database), I believe that I can shed some light on the reason for this behavior.

Before that, some background.

Traditionally, MySQL has been using what is called statement-based replication. Statement-based replication propagates changes by sending the actual statement that was executed on the master over to the slave, and the slave subsequently executes that statement. Of course, only statements that change something are sent to the slave.

Sometimes you don't want to send all changes to the slaves, so it is possible to prevent the master from logging changes to some databases using the --binlog-do-db and --binlog-ignore-db switches, which let you filter out statements that update certain databases (this is not the whole story; more about the filtering below).

This works well for most queries, such as:

INSERT INTO products   SET name='Gizmo2000', price='$2000'
But suppose that we have two databases db1 and db2, and we decide not to replicate changes to db1 but to replicate changes to db2. Now, consider the following statement:
UPDATE db1.foo, db2.foo
SET db1.foo.a = db2.foo.a,
    db2.foo.b = db2.foo.b;
The statement updates both db1 and db2, so shall we replicate it or shall we not? Since we need to handle even this situation in a consistent manner, the current database is used to decide whether the statement shall be replicated. (I didn't actually write the code, since it pre-dates my joining MySQL, but after being immersed in it for almost two years, I'm pretty sure this is the reason.) This works well for most users, since one usually works with a single database and sets the current database before making any changes. For some special cases, however, like the one mentioned by Todd, it starts to look strange.
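
To make the rule concrete, here is a sketch of the behavior, assuming statement-based replication and a master started with --binlog-ignore-db=db1 (the table names are from the example above):
master> USE db1;
master> UPDATE db2.foo SET b = 0;  -- current database is db1: not logged, not replicated
master> USE db2;
master> UPDATE db1.foo SET a = 0;  -- current database is db2: logged and replicated
In both cases, it is the USE statement, not the qualified table name, that decides the fate of the statement.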

Recently (that is, in 5.1), MySQL released something called row-based replication, where the master sends the actual rows that were inserted/deleted/updated to the slave, and the slave subsequently inserts/deletes/updates those rows in its database. For each row, the database and table that the row belongs to are known, so if you are using row-based replication (option --binlog-format=row to the server, or SET GLOBAL BINLOG_FORMAT=ROW), the filtering is done on the actual table being changed, even if the statement updates several tables in different databases.

I'll summarize with some general advice when using statement-based replication:

  • Don't qualify your table names with a database name. If you do, replication may skip or apply the statement depending on the current database, so this is something to look for.
  • If you are going to make changes to tables in a database, always USE the database to set the current database correctly.
  • Don't use multi-table updates (or other statements that manipulate several tables) unless the tables are all in the same database.
  • Even if you are using multi-table updates on tables in the same database, you might have problems. So watch out for any statement that manipulates several tables, and verify by testing that it replicates as expected.

Thursday, August 10, 2006

More ways to encourage ideas

Again, I cannot help but follow up on Zack's post on How to Come Up With Ideas. In these modern days of hypercompetition, where today's state-of-the-art solution quickly becomes yesterday's news, you have to create an environment where you continuously come up with new ideas and new solutions. Creating such an environment is not an easy task, since people are... well, people.

With this in mind, I find Zack's first five points especially critical, but I would like to add two more items that I personally feel are missing.

  1. Set an example

    When I was working as a doctoral student, it was mandatory for all researchers to attend research seminars given by visiting fellows. The department's head professor always attended these seminars, and it was invariably he who asked the most "stupid" and obvious questions. On the surface the questions looked stupid, but they were actually very much to the point.

    This, of course, made all of us doctoral students start asking "stupid" questions. Many of the questions were indeed stupid, but that didn't stop us, since nobody frowned or criticized. Over time, the questions grew more and more precise, and the students started finding many gems of insight into the subjects that were presented.

    As time passed, I started realizing that the asking of questions was deliberate: it was meant to create inquisitive, questioning researchers out of students who were used to learning just enough to pass the exam. By setting an example, he created an atmosphere where questions were asked, and where it was expected that some of the questions would be stupid and some silly, but that is nevertheless what you have to do as a researcher: constantly ask questions.

  2. Enjoy yourself

    People's minds work best when relaxed and not forced to produce ideas. Treating the session as an experiment and taking the occasional break are good ways to ensure that people are relaxed and not trying to force ideas to come (which never works; at least, it has never worked for me). In general, if you enjoy the session, you will also be more relaxed, and more ideas will appear. Some ways to make the occasion enjoyable and more relaxed: provide food (for some reason, many insights come to me when I am fetching an extra sandwich or pouring a cup of coffee), make sure you have plenty of time ("Hey guys! We need 10 fresh ideas before lunch!" is a sure killer), and don't keep a strict schedule (it depends on the people involved, but maybe you could bring lunch to the room instead of going out?).

Tuesday, August 08, 2006

Two more ways to kill good ideas

Since Zack brought up 8 ways to kill good ideas, I thought I'd add two of my own that I see popping up frequently.

  1. Insist on following procedure

    People work differently, and those coming up with ideas that they want to try out are usually not good rule-followers. When they are forced to follow a certain procedure only because it's company policy or because management wants to reduce risk (which is usually what procedures are for), the idea will surely not get implemented.

  2. Punish failures

    In an interview, I was once asked the question "do you have many bad ideas?". I answered that "90% of my ideas are usually bad", to which the interviewer smirked and said "Not more? That's pretty good." This was for a job where the continuous creativity of the employees was the very essence of the company's survival, so the managers assumed that there would be many bad ideas and only a few good ones. They assumed that time would be wasted on bad ideas, but considered that part of the deal.

    By punishing failures, you send a clear message that risk-taking is not looked kindly upon. Therefore nobody will take any risks; you do not waste time on any bad ideas, but in effect you will not get any of those world-changing good ideas either.

Tuesday, May 16, 2006

Replication of DELETE FROM versus TRUNCATE TABLE

A bug titled DELETE FROM inconsistency for NDB (Bug#19066) dropped into my lap, and while fixing it, we had to make some hard decisions about what should be considered the "correct" way to solve it.

The bug is related to the difference between TRUNCATE TABLE and DELETE FROM with no WHERE clause. On the surface they seem to be equivalent, but when digging deeper, we will see that there is a big difference between the statements once replication comes into play.

Before delving into the problem and the solution, I'll start by recapitulating some selected parts of the manual.

  • TRUNCATE TABLE and DELETE FROM with no condition are "logically equivalent": TRUNCATE TABLE empties a table completely. Logically, this is equivalent to a DELETE statement that deletes all rows, but there are practical differences under some circumstances. (MySQL 5.1 Reference Manual, Section 13.2.9)
  • TRUNCATE TABLE is implemented as a DROP + CREATE: Truncate operations drop and re-create the table, which is much faster than deleting rows one by one. (MySQL 5.1 Reference Manual, Section 13.2.9)
  • DELETE FROM without a condition deletes each row in the table: ...a DELETE statement with no WHERE clause deletes all rows. (MySQL 5.1 Reference Manual, Section 13.2.1)
In other words, both statements are supposed to empty the table. When dealing with replication, however, the concept of logical equivalence is pushed to the limit. Normally, the operational behaviour and the post-condition are what we replicants use to decide what a certain statement should do. So what are the operational behaviour and the post-condition of the two statements?

The operational behaviour of TRUNCATE TABLE (let's just call it TRUNCATE henceforth) is to drop the table and re-create it, and the post-condition is an empty table. In contrast, the operational behaviour of a DELETE FROM with no WHERE clause (let's call it DELETE-ALL henceforth) is to remove each row from the table, and the post-condition is that every row that was in the table is removed.

Wait a minute... there is no difference between these two. In both cases, the end result is an empty table! Isn't it?

Yes, that is correct... if you are running an isolated statement and executing the operations on the original table. Isolation at the statement level is typically achieved using locks (either table locks or row locks), so there will usually not be any other process interfering with the operation.

Now, let us see what it is that makes these two statements behave differently even though they superficially appear to be equivalent.

NDB Cluster swings, but misses. The original bug report mentions NDB, which is special for this particular bug in that the storage engine does not have (global) table locks. When NDB deletes the rows of a table using DELETE-ALL, it does it row by row. If another client starts inserting rows into that table at the same time, some of those rows may remain in the table after the DELETE-ALL has completed. So the table will not be empty.

Interestingly enough, TRUNCATE for NDB is implemented as a DELETE-ALL, so neither statement will empty the table if another client is inserting rows at the same time.

Circular replication grapples and throws. Suppose that you have a setup like the multi-master setup described by Giuseppe Maxia.

In this case, the replication can progress in a circle. Suppose that you're adding rows to one of the servers (A) in the circle, while at the same time trying to empty the same table on another server (B). Does it make sense to remove the inserted rows together with all the rows that were present on B? Not really... that would make B delete things it has no idea about. It could well be that the administrator checked the table visually, decided that nothing in it was needed, and issued a DELETE-ALL. It doesn't even help if the operator locks the table to make sure that nobody changes it. In this sense, DELETE-ALL should delete what the server knows about, and nothing else.

Multi-source replication is standing by to finish the job. We don't have multi-source replication yet, but we will in the near future. Once we have and use it, there is a huge difference between emptying the table and deleting all rows in the table. Reading data from several different masters and inserting it into one table is a convenient way to aggregate data from, e.g., several different branches of a company (where the table contains a branch id as well, to avoid conflicts). If one branch decides to drop all the data it has, only the rows belonging to that branch should be deleted, not the data about all branches.

The moral of the story

  • If you want an empty table, you should use TRUNCATE TABLE, which is guaranteed to empty the table regardless of how the replication is set up.
  • If you want to delete the rows that are in the table (on the master), use DELETE FROM.
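
To make the distinction concrete, here is a sketch of the two statements side by side (the table name is illustrative):
TRUNCATE TABLE log;  -- empties the table on the master and on every server replicating it
DELETE FROM log;     -- removes the rows the master knows about; rows arriving from
                     -- elsewhere (another master in the circle, the NDB API) may survive
In short, TRUNCATE TABLE promises an empty table everywhere, while DELETE FROM promises only that the rows the server saw are gone.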

Wednesday, April 26, 2006

Row-based replication for the future

I just read Eric Bergen's blog on row-based replication and application development, following the presentation of row-based replication at the Users Conference. Eric gives all kinds of new ways to use replication, for example that you can now configure replication to pick out the changes to just one of the tables in a multi-table statement. He is, however, missing the most important aspect: everything we can do now, and will be able to do in the future, that we couldn't do with statement-based replication alone. Here are some things you can do with row-based replication that were not possible with statement-based replication.

Cluster replication. Cluster replication is already in 5.1, but it's worth mentioning since it could not be handled with statement-based replication. Inside the cluster, everything is rows: rows are passed back and forth between the nodes, and rows are collected to form the results of queries. If you insert data into the tables through the MySQL NDB Cluster handler ha_ndbcluster, it is replicated as usual, and both statement-based and row-based replication work. However, if you use the NDB API to insert rows into the cluster, those rows are "lost", since the server never sees them. To solve this problem, we invented an "injector" whose sole purpose is to inject rows into the binary log. The injector is created inside ha_ndbcluster, and the rows that are inserted into the cluster are injected into the binary log.

Replication of individual partitions to different servers. This is not in any version of the server, and there are not (yet) any plans to add it, but it is feasible once you have row-based replication. Eric mentions ways to separate the rows going to different tables and replicate only one of the tables. But this need not be limited to replicating different tables to different servers: you could even replicate different parts of the same table to different servers. For example, assume that you want to partition your data according to how frequently it is accessed. Now, imagine that you could set up partitions for your table of blogs like this:

CREATE TABLE blogs (id INT ..., freq INT, ...)
  PARTITION BY RANGE (freq) (
    PARTITION blog0 VALUES LESS THAN (10),
    PARTITION blog1 VALUES LESS THAN (100),
    ...
    PARTITION blog5 VALUES LESS THAN (1000000)
  );
Further imagine that you could set up replication to send the different partitions to different servers. We could, for example, place the blog5 partition on high-end servers dedicated to handling high loads, while the less frequently accessed blogs would be placed on low-end servers. Each access would then also update the statistics, causing the row to move to other servers as it becomes less frequently accessed. To prevent a row from bouncing back and forth between partitions when it is bordering one of the ranges, we could partition on the result of a UDF that implements hysteresis by taking trends into account.

Friday, April 21, 2006

Replication of ALTER TABLE with AUTO_INCREMENT

Just got Bug#16993 closed, which provides a good lesson in the complications of replication. The original title was RBR: ALTER TABLE ZEROFILL AUTO_INCREMENT is not replicated correctly, but the problem is related neither to row-based replication (RBR) nor to ZEROFILL. The culprit is adding an AUTO_INCREMENT column to a table, and it is not actually a bug (to be frank, that depends on your point of view, but "fixing" this bug causes more headaches than it cures, as you'll see in a moment). What the example code in the bug description does is create a table on the slave and the master, but insert (identical) rows into the tables in different order on the slave and the master; for example like this:
master> CREATE TABLE ages(name CHAR(30), age INT);
master> SET SQL_LOG_BIN=FALSE;
master> INSERT INTO ages SET name='Mats', age=37;
master> INSERT INTO ages SET name='Lill', age=20;
master> INSERT INTO ages SET name='Jon',  age=4;
master> SET SQL_LOG_BIN=TRUE;
 slave> INSERT INTO ages SET name='Mats', age=37;
 slave> INSERT INTO ages SET name='Jon',  age=4;
 slave> INSERT INTO ages SET name='Lill', age=20;
Now, if you look at the tables, you will see that the same rows are present in both tables (I'm just showing the result on one of the servers since they are identical):
mysql> SELECT * FROM ages ORDER BY name, age;
+------+------+
| name | age  |
+------+------+
| Jon  |    4 | 
| Lill |   20 | 
| Mats |   37 | 
+------+------+
3 rows in set (0.00 sec)
Why the ORDER BY? Well, the rows could potentially be listed in a different order on the master and the slave for one of the following reasons:
  • If this were a real database, we would have been running the master and the slave separately for a while before deciding that we should replicate them and, even though the tables contain the same rows now, the rows would potentially be listed in different order.
  • We are using NDB Cluster as storage engine, and there we have no guarantee on the order of the rows, regardless of the order they were inserted.
Now we (or management) decide that we need to assign a unique id to each person in the table. That is easy: just add a column with the AUTO_INCREMENT option. Since we have replication running, the same change will be made to the table on the slave.
master> ALTER TABLE ages
    ..>   ADD id INT AUTO_INCREMENT PRIMARY KEY;
To our surprise, we do not get the same ids assigned to the people on the master as on the slave. Look here:
master> SELECT * FROM ages ORDER BY name, age;
+------+------+----+
| name | age  | id |
+------+------+----+
| Jon  |    4 |  3 | 
| Lill |   20 |  2 | 
| Mats |   37 |  1 | 
+------+------+----+

slave> SELECT * FROM ages ORDER BY name, age;
+------+------+----+
| name | age  | id |
+------+------+----+
| Jon  |    4 |  2 | 
| Lill |   20 |  3 | 
| Mats |   37 |  1 | 
+------+------+----+
So, what is happening under the hood? When executing an ALTER TABLE:
  1. a new table is created with the extra column
  2. the rows are copied over one by one from the old table to the new table in a storage engine-specific order
  3. the old table is dropped
  4. the new table is renamed to the name of the old table
Observe that the ALTER TABLE statement is replicated as a statement, so it gets executed on both the master and the slave. If we're using MyISAM as storage engine, the rows are copied in the order they were inserted; other storage engines give other row orders.

So much for the problem; what about the solution, and why is this not considered a bug? I will start with the solution, since it explains why we decided not to produce a bug fix for it, but rather added a caveat to the documentation. The solution is simple: sort the rows when adding them to the new table. To do this, you have to repeat the steps above yourself, but add an ORDER BY clause when inserting the rows. So, the steps to add a column are:
  1. Create a new table with an extra column. Since you might want to add the new column at an arbitrary place, I give a generic solution.
    CREATE TABLE new_ages LIKE ages;
    ALTER TABLE new_ages
      ADD id INT AUTO_INCREMENT PRIMARY KEY;
    
  2. Copy the rows into the new table, in a deterministic order.
    INSERT INTO new_ages(name,age)
      SELECT name,age FROM ages ORDER BY name,age;
    
  3. Drop the old table.
    DROP TABLE ages;
    
  4. Rename the new table to the old name.
    ALTER TABLE new_ages RENAME ages;
    
Now, why don't we write a patch to fix this "bug"? Simply put, the ORDER BY involves a complete sort of the table, so a patch along these lines would force a sort of the whole table for every execution of an ALTER TABLE. In addition, a generic fix would have to sort the rows on every column of the table to get a deterministic order, which is usually not necessary. Forcing this cost on every user who needs to do an ALTER TABLE would be a bigger evil than leaving the implementation as it is. And of course, if you don't care where the new column goes, you can replace the first two steps with:
CREATE TABLE new_ages(id INT AUTO_INCREMENT PRIMARY KEY)
  SELECT name,age FROM ages ORDER BY name,age;
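
As a quick sanity check after the rebuild, you can run the same ordered query on both servers and compare; since the rows were copied in a deterministic order, the assigned ids should now be identical (using the table from the example above):
master> SELECT * FROM ages ORDER BY id;
 slave> SELECT * FROM ages ORDER BY id;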

Friday, March 24, 2006

Row-based replication and user defined functions

A little more than a year ago, I was hired to implement row-based replication for the MySQL database server. Since the principles are easy enough, I thought this would be a straightforward task, done in a few months, tops. As always, I was quickly punished for my hubris: getting a basic row-based replication up and running was relatively straightforward, but as the saying goes, the devil is in the details.

Row-based replication is now safely tucked away in MySQL 5.1 for anybody who wishes to use it, but the obvious question is then: what does it give me, and why should I use it?

When using statement-based replication, the replication is accomplished by sending the actual SQL statements to the slave server. That works fine for most statements, but in some situations this does not work as expected. This time, we will look into only one such situation. Consider the following SQL statements:
UPDATE account
 SET   balance = balance + 100
 WHERE name = "Sakila";
INSERT INTO transactions
      VALUES ('Sakila', 'deposit', 100, UUID(), NULL);
The purpose is to update the balance and add a line to the transaction log, to keep track of all the transactions done to the accounts, including the clerk that performed each transaction. Here UUID() is used to generate a value for the transaction; this works fine when executed on the master, but when the slave's SQL thread executes the statement, UUID() returns a different value on the slave, which is really not what we want.

It is, of course, possible to handle this for the built-in SQL functions, but for user defined functions (UDFs) there is no chance at all that we can make it work. Instead of logging the entire statement, we can log just the row change made to the table, and this is what row-based replication is about. That way, we insert exactly the same row on the slave as we did on the master.

To control the logging format, we introduced a new server variable BINLOG_FORMAT that can take the values STATEMENT, MIXED, and ROW. The STATEMENT and ROW formats do what you expect: use statement-based or row-based replication respectively. The MIXED mode temporarily switches to row-based replication for a statement if it uses a function that would give a different result when executed on the master and by the slave's SQL thread. So, assume that I've got the following two tables:
master> describe transactions;
+--------+------------------------------+------+-----+-------------------+-------+
| Field  | Type                         | Null | Key | Default           | Extra |
+--------+------------------------------+------+-----+-------------------+-------+
| name   | char(20)                     | YES  |     |                   |       |
| kind   | enum('deposit','withdrawal') | YES  |     |                   |       |
| amount | decimal(10,2)                | YES  |     |                   |       |
| clerk  | int(11)                      | YES  |     |                   |       |
| time   | timestamp                    | YES  |     | CURRENT_TIMESTAMP |       |
+--------+------------------------------+------+-----+-------------------+-------+
5 rows in set (0.00 sec)

master> describe account;
+---------+---------------+------+-----+---------+-------+
| Field   | Type          | Null | Key | Default | Extra |
+---------+---------------+------+-----+---------+-------+
| name    | char(20)      | YES  |     |         |       |
| balance | decimal(10,2) | YES  |     |         |       |
+---------+---------------+------+-----+---------+-------+
2 rows in set (0.01 sec)
Here's how to set the binlog format to MIXED and execute the statements above:
master> SET BINLOG_FORMAT=MIXED;
Query OK, 0 rows affected (0.00 sec)

master> UPDATE account
    -> SET balance = balance + 100
    -> WHERE name = 'Sakila';
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0

master> INSERT INTO transactions
    -> VALUES ('Sakila', 'deposit', 100, UUID(), NULL);
Query OK, 1 row affected, 1 warning (0.00 sec)

master> SHOW BINLOG EVENTS FROM 975;
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
| Log_name          | Pos  | Event_type | Server_id | End_log_pos | Info                                                                         |
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
| master-bin.000001 | 975  | Query      | 1         | 1102        | use `test`; UPDATE account SET balance = balance + 100 WHERE name = 'Sakila' |
| master-bin.000001 | 1102 | Table_map  | 1         | 1155        | table_id: 17 (test.transactions)                                             |
| master-bin.000001 | 1155 | Write_rows | 1         | 1206        | table_id: 17 flags: STMT_END_F                                               |
+-------------------+------+------------+-----------+-------------+------------------------------------------------------------------------------+
3 rows in set (0.00 sec)
The UPDATE statement is logged as a statement using a query event, while the INSERT statement is logged using a table map event and an event containing the inserted rows (one, in this case). I'll explain more about the table map events and the various kinds of row-containing events another time, but for now you have to trust me when I say that there are rows in the Write_rows event. :)

Tuesday, March 21, 2006

First day back from the Developers Conference in Sorrento, and the first post to this blog (I thought I might just as well get started; I've been thinking about it for a while but have always been distracted). Since I've been to Italy before, I knew part of what was waiting: excellent food, terrific coffee, and nice, friendly people. What I wasn't prepared for was the intense discussions that were continuously going on. Last year, in Prague, it was to a large part just the down-to-earth work of getting the code ready for the 5.0 release; this time, however, the conference was vibrant with ideas for the future and ways to leverage the skills of the people in MySQL to produce even better services.

Before the conference, I was mostly busy getting the row-based replication in shape for shipping. All in all, I would say that it holds up very well. Since it's a fresh feature, there is (of course) a list of bugs, but these are mostly annoyances, like the occasional extra event in the binary log that doesn't affect the result but is not needed. I feel I have pretty good control of the code right now, and know where the problematic spots are. There is, however, nothing like community testing to find the really difficult problems. I really want this feature to be rock-solid, so I'm prepared to get swamped with bug reports. Please give me your worst. :)