XX’s Blog

New site for This Week in RisingWave

2023-04-02T00:00:00+00:00

This Week in RisingWave is now migrated to a new dedicated site: this-week-in-risingwave.vercel.app.

By the way, xxchan’s blog also has a new domain name: xxchan.me. The old domain name xxchan.github.io should still work.

Please tell me if any of the links or other things like RSS are broken! Thanks! 🥰

This Week in RisingWave #6 (20230325)

2023-03-25T00:00:00+00:00

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Features Updates 🌟

`NULLS {FIRST | LAST}`

feat(common): support NULLS {FIRST | LAST} by richardchien · Pull Request #8485

NULL s can be very tricky in SQL. Do you known you can specify its ordering?

> select * from t order by x /* nulls last */; 
+--------+
| x      |
|--------|
| 1      |
| 2      |
| <null> |
+--------+

> select * from t order by x nulls first;
+--------+
| x      |
|--------|
| <null> |
| 1      |
| 2      |
+--------+

It was tricky to support because previously in RisingWave, we only considered ascending/descending ordering, but not NULL s’ ordering. That caused inconsistent NULL s ordering around the system. When it comes to SQL query results, the NULL values were sometimes the largest, sometimes the smallest.

This also blocked our progress to test RisingWave against other database’s test suites, because in PostgreSQL, NULL s are the largest by default when compared against non-NULL values, while in SQLite and DuckDB they are the smallest by default. With inconsistent ordering of NULL s and no support for NULLS { FIRST | LAST } in order-by clauses, we can neither align with the behavior of one of these databases, nor specify a fixed NULL ordering in test queries to align behaviors of all databases.

Recently, we did a thorough refactoring to force all components in our system to use an unified struct type representing ordering, OrderType . As a result, this week we were finally able to implement NULLS { FIRST | LAST } for order-by clauses at ease, by simply adding a new field, NullsAre , to OrderType , specifying whether NULL s are largest or smallest. By setting Largest as default value of NullsAre , we achieved the same default ordering behavior as PostgreSQL.

Rename relations

feat: support alter rename relations including table/mview/view/sink/index by yezizp2012 · Pull Request #7745

As I mentioned earlier, we are working on DDL support ( ALTER TABLE ), which is quite tricky. We’ve already done quite some work for supporting ADD/DROP COLUMN . Here’s another DDL: renaming relations is supported. This might be slightly easier than ADD/DROP COLUMN , but there are still some tricky parts to consider carefully, like updating all related relations all at the same time.

For example, you can use the command ALTER TABLE t_1 RENAME TO t_2 to rename a table and its related relations will be modified recursively, which you can check with command such as SHOW CREATE MATERIALIZED VIEW mv_x . There is no effect on the stored data.

Performance Optimizations 💪

Bushy tree join ordering

feat: Bushy tree join ordering by KveinAxel · Pull Request #8316

RisingWave is a streaming processing system that aims to provide real-time low latency for our users. By reducing the depth of the join tree, we can effectively minimize latency.

See the illustration below:

To reduce the distance that barriers from join inputs need to travel to the join output, we convert a left-deep tree with a height of 3 into a bushy tree with a height of 5.

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

zld

Cannot run risingwave binary on Mac OS M1 (when linked by zld ) · Issue #8608

@ahmedriza reported that he cannot run the binary on his Macbook M1 due to “symbol not found”.

$ ./target/debug/risingwave
dyld[14116]: symbol not found in flat namespace (__ZN15protobuf_native2io23DeleteCodedOutputStreamEPN6google8protobuf2io17CodedOutputStreamE)
Abort trap: 6

After some investigation, he found that it’s because he used zld in his global cargo config, and RisingWave has a repository-level cargo config that uses lld. Although it is still not clear whether the error is caused by zld or the conflict between the cargo configs, he could solve the problem by just using lld.

This is also one of the reasons why I enjoy open source: hackers are very willing to raise issues and even analyze and solve problems themselves. I can also learn a lot from their thoroughly analysis. From this issue, I learned:

How cargo config ($HOME/.cargo/config.toml, and /projects/.cargo/config.toml) works in more detail
There’s another linker zld. (But it’s already deprecated in favor of lld .)

New Contributors

(Of course, @ahmedriza is also a great new contributor!)

feat: Add support for array_length function in psql by kamalesh0406 · Pull Request #8636

This week we have another first-time contributor @kamalesh0406. He said “This is my first time writing rust code” 😲.

So it seems RisingWave’s good first issues are a good way to learn and practice Rust. 😄 Don’t hesitate and just join us if you are also interested in the development of an open source database system, or Rust!

Finally, welcome to join the RisingWave Slack community.

So much for this week. See you next week! 🤗

This Week in RisingWave #5 (20230318)

2023-03-18T00:00:00+00:00

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Notable changes 🌟

Temporal join

Lots of production scenarios contain a fact table and several dimension tables, where users want to enrich (join) their fact table with dimension tables. Different from regular stream joins, under the enrichment scenario users may want to keep the previous join outputs unaffected when the dimension table is updated. This is because we only want to enrich the fact table without duplicated outputs.

Temporal join is for this scenario. More technically speaking, it joins an append-only stream (such as Kafka) with a temporal table (aka versioned table, e.g. backed by MySQL CDC). The stream side lookups the temporal table, which means the join is driven by the stream side only.

The syntax is like:

SELECT * FROM stream LEFT JOIN versioned FOR SYSTEM_TIME AS OF NOW() ON stream.col = versioned.id

Interesting SQL features 😄

I don’t know that much about SQL before I becoming a database developer. Every now and then I got some new surprise from SQL…

Server local timezone

Do you know SQL standard has two timestamp types: timestamp with/without time zone?

Support SET TIME ZONE LOCAL syntax · Issue #8551

The new syntax allows us to set the server’s local timezone, which is useful for local testing.

dev=> select now();
              now              
-------------------------------
 2023-03-16 10:41:10.951+00:00
(1 row)

dev=> set time zone local;
SET_VARIABLE
dev=> select now();
              now              
-------------------------------
 2023-03-16 11:41:36.958+01:00
(1 row)

BTW, this is done via strawlab/iana-time-zone: Rust crate to get the IANA time zone for the current system.

Intersting Bug

Inverse of column index mapping

fix(optimizer): fix hash join distribution by chenzl25 #8598

I talked about ColIndexMapping in This Week in RisingWave #3.

Althought mathematically simple and intuitive, it’s not easy to do such mappings correctly in programs.

Well, then we met another bug related to ColIndexMapping this week 🥲. (Luckily, it’s not very easy to trigger it.) This time, it’s about the inverse of the mapping. Shortly speaking, suppose we have an array of index pairs [(l1, r1), (l2, r2), ...], naturally we can build two mappings l -> r and r -> l. However, the inverse of l -> r is not r -> l! Can you tell why?

Reliability Improvements 💪

The Great MadSim!

fix: avoid panic when upstream input is closed for lookup #8529

This week, we identified a new bug through MadSim that deterministically shuts down and restarts nodes in a RisingWave cluster. This time, the bug was found during the execution path of the lookup executor. Thanks to MadSim, we were able to quickly identify the issue and resolve it.

Interval bugfixes and tests

Intervals are a fundamental data type for a streaming SQL database, but they can also be sophisticated in some ways. Recently, RisingWave has enhanced its support for intervals and migrated many related tests from Postgres.

OpenDAL

feat(test): add e2e test for OpenDAL fs backend #8528

Since February, RisingWave has been using OpenDAL as one of its underlying object storage implementations. OpenDAL greatly reduces our efforts in supporting various cloud storage systems, especially HDFS. This PR uses opendal fs engine to mock memory objects store.

By the way, OpenDAL is now an Apache Incubator project! 🎉

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

Be more careful about error creation!

fix(expr): do not construct error for extracting time subfield by BugenZhao #8538

Error creation can be very expensive!

In This Week in RisingWave #1, I mentioned we can use ok_or_else to create expensive error lazily. This time the errors are not actually needed. Option is enough. Basically, I mean cases like this:

// Don't do this!
fn inner() -> Result<T> {}
fn outer() -> Result<T> {
    match inner() {
        Ok(t) => Ok(t),
        Err(_) => { 
          // try a different computation
          ...
        },
    }
}

My takeaway is: Think more about the definition of error types and try to keep it small. If it’s unavoidably large, then we have to think more when we use it.

BTW, kudos to @BugenZhao for catching this issue (again)!

P.S., this PR brings us 1000%+ throughput improvement (🤯) on nexmark q14, which is a simple SELECT with extract(hour from date_time).

New Contributors

Support optional parameter offset in tumble and hop by Eridanus117 #8490

This is the second PR by @Eridanus117.

feat(expr): support builtin function pi. by broccoliSpicy #8509

This is the second PR by @broccoliSpicy.

It’s great to see new contributors joining in, and even better when they show interest in diving deeper and contributing continuously! 🥰

CREATE SINK panic · Issue #8482

I remember @JuchangGit had submitted 2 issues in the past. This week he submits another one. I’d like to mention this because open source contribution is not only about code (PRs). Playing with the software and reporting issues are also very important contributions!

Finally, welcome to join the RisingWave Slack community. Also check out the good first issue and help wanted issues if you want to join the development of an open source database system!

So much for this week. See you next week (hopefully)! 🤗

This Week in RisingWave #4 (20230310)

2023-03-10T00:00:00+00:00

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Notable changes 🌟

Serial type

As a distributed database system, RisingWave has multiple instances of operators to achieve a high degree of parallelism. Some operators require specific input distribution to ensure the correctness of the result, so data must be shuffled at that point, which is represented by the Exchange operator.

At the same time, RisingWave generates hidden _row_id columns for sources without primary keys (aka append-only sources), also for the correctness of the system. Previously the _row_id is randomly distributed, which leads to the result that we must insert a HashExchange later after the source operator to enforce its distribution.

Since _row_id is fully controlled by us, why don’t we directly generate _row_id with desired distribution? So an optimization of using a new internal type Serial for _row_id with specialized shuffling logic is proposed to remove the unnecessary Exchange operators.

Auto execution mode selection

feat(batch): support auto execution mode #8274

Currently, RisingWave has 2 kinds of execution mode: local and distributed, which can be tuned by SET query_mode = [local | distributed] . Lower latency can be achieved by running OLTP queries in local execution mode, and running OLAP queries in distributed execution mode.

However, previously we leave the choice to users which makes our distributed execution useless to some extent, because most users don’t know or understand what those options exactly mean. As a database, we have the ability and should make a good choice for users to reduce their tuning works.

AWS Private link support

feat(source): support private link for kafka connector #8247

When using a cloud environment, users may face connection issues when attempting to create a source in RisingWave due to their AWS MSK service being located in a different VPC. To solve this, AWS PrivateLink can be used to establish a connection between RisingWave’s VPC and the user’s VPC. Users can set up an endpoint service to expose their MSK service, allowing RisingWave to easily create an endpoint to access it.

Sink Validation

feat(meta): introduce sink validation in meta #8417

Recently we already implemented MySQL/Postgres sink support. We are now working hard to keep improving its stability and usability. This PR Introduced validation for sink so that we can catch errors in an earlier phase and give better error messages.

Reliability Improvements 💪

Fuzzing Tests

feat(sqlsmith): Generate more join expressions #8395

This week we added more targeted testing for join expressions. We increased the number of joins, added generation for non_equi joins and added more equi join predicates.

The goal of these changes is to increase the coverage of join executors in streaming and batch, since the functionality of JOIN s are often complex.

Deterministic Tests

During the past week, MadSim, our deterministic testing framework, identified more than five issues that we promptly resolved. We are confident in the effectiveness of our framework. Additionally, I’m eagerly anticipating @wangrunji0408’s upcoming blog post about MadSim internals. 🤩

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

Reduce debuginfo size

Although I already briefly mentioned this in the previous blog, let me elaborate it a little here.

fix: reduce debuginfo size #8326

We noticed unexpected 2GB of memory consumption on Meta node. The cause is quite interesting.

We found the memory consumption happens when meta node writes a log with backtrace included. To get the backtrace, debuginfo in the risingwave binary is loaded (and cached!). This is somewhat reasonable, but should the debuginfo be so large?

Why are debug symbols so huge? is a good post explaining what is debuginfo. Inspired by that, we tuned the level of debuginfo. This is a trade-off between binary size and utility (we still have most useful information for debug). At the same time, we also disabled debuginfo compression. This is another trade-off between binary size and memory overhead.

Capture unrecognized fields in serde

https://github.com/risingwavelabs/risingwave/pull/8325

By default, serde ignores unknown fields. By #[serde(deny_unknown_fields)] , serde rejects unknown fields. But what if you want to tolerate them, but produce a useful warning about it at the same time? Here’s a useful tip: By #[serde(flatten)] you can capture the unknown fields:

#[derive(Serialize, Deserialize)]
pub struct S {
  // ...

  #[serde(flatten)]
  pub unrecognized: HashMap<String, serde_json::Value>,
}

Auto derive a prefixed alias for protobuf message types

refactor(proto): auto derive a Pb-prefixed alias for proto message types by BugenZhao · Pull Request #8426 · risingwavelabs/risingwave

When using protobuf for communication, one thing might be defined multiple times in different places for different purposes. e.g., if we have a protobuf message Msg , a crate using Msg might have it’s own type for Msg at the same time, which might include useful methods and maybe a different data structure representation with the protobuf one. Previously, we need to manually alias it to avoid conflict. e.g.,

use pb_crate::Msg as PbMsg;

struct Msg {
  // ...
}

impl Msg {
  pub fn to_proto(self) -> PbMsg {
     // ...
  }
}

Now we came up with a method to mitigate this problem. Thanks to prost ’s ability to customize generated code for protobuf messages, we can add type alias pub type PbMsg = Msg at the same time of defining Msg . In this way, when trying to type PbMsg , the IDE will be able to find the type and automatically import it.

Finally, welcome to join the RisingWave Slack community. Also check out the good first issue and help wanted issues if you want to join the development of an open source database system!

So much for this week. See you next week (hopefully)! 🤗

This Week in RisingWave #3 (20230304)

2023-03-04T00:00:00+00:00

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Notable changes

Memory control policy

Recently we have implemented memory control mechanism, and now we are introducing the policy. Since RisingWave is a “streaming database”, it is crucial for us to consider both batch and streaming tasks while balancing the usage of memory between them.

Move memtable

feat(storage): move memtable down to local state store by wenym1 #7183

Our streaming executors access storage via StateTable interface, and storage layer exposes LocalStateStore interface. This PR wants to move part of the former (memtable) to the latter. It might help to implement more features in the storage layer, and also make it easier to implement other storage engines for bench purpose.

Optimizer updates

feat: Bushy tree join ordering by KveinAxel #8316

RisingWave is a streaming processing system which aims to provide real time low latency for our users. Making the join tree shallower helps to reduce the latency.

See the illustration below:

Async expr

refactor(expr): make evaluation async by wangrunji0408 #8229

… to prevent blocking when evaluting UDFs

RFC: suspend MV

RFC: Suspend MV on Non-Recoverable Errors by hzxa21 #54

Last time, I mentioned that RisingWave currently tolerates compute errors by default and we are in the process of implementing an error reporting mechanism. However, we had previously discussed another mechanism for handling errors which involved suspending MV. This RFC reintroduces this idea.

Intersting Bug

Injectivity of column index mapping

bug: mv with join and duplicate output columns has row indices hidden #8216

ColIndexMapping is an important utility in optimizer used when we wants to map an input column (index) to an output column. It’s a partial mapping from [0, n) to [0, m) , where n is the number of columns in the input, and m is the number of columns in the output.

When I began to work on RisingWave (one year ago 😲), I worked on column pruning, which removes unnecessary columns from the plan nodes. Naturally it involves a lot with ColIndexMapping . Althought mathematically simple and intuitive, it’s not easy to do such mappings correctly in programs. I had a hard time understanding ColIndexMapping at that time, and added many comments and did some refactor to make it less confusing and error-prone.

After long, we met this new bug related to ColIndexMapping again.. Under curtain circumstances, a plan node will have a non-injective mapping and duplicate columns in the output. In this case, the duplicate columns are unexpectedly hidden after mapping. We fixed this bug by considering the injectivity of the mapping. A more systematic solution is also proposed: Proposal (frontend): Prune duplicate columns #8277.

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

Publish `await-tree`

We have a tracing mechanism called async stack trace in our system, which allows us to “capture a snapshot” of where, why, and how long the async tasks are pending in real-time. It helped us locate some stuck issues, e.g., streaming deadlock, which would be very hard to debug in other ways.

Now we have published it to crates.io, and renamed it to await-tree . I already invited @BugenZhao to write a blog post to introduce it, and can’t wait to see it. 😍

feat(dashboard): dump await-tree of compute nodes by BugenZhao #8330

We also integrated it into the RisingWave dashboard to make it more convenient to use. 😋

Large memory usage by backtrace and debuginfo

fix: reduce debuginfo size by fuyufjh #8326

We noticed unexpected memory consumption on Meta node. And the cause is (again) backtrace…

New contributor

So happy to see another 3 first-time contributors this week! 😲🥳

(BTW, I’m also quite interested in where they come from. Are they encouraged by my posts? If so, please let me know! 🤔🤣)

@erichgess submitted 2 PRs. I’m quite impressed by the thoroughness of their communication in issue and PR discussions regarding the problem’s situation, cause, and solution. This is one of the biggest reasons why I love open source or working in public; transparent over-communication benefits both current collaborators’ reviews and future contributors’ learning. 🤗

Check out the good first issue and help wanted issues if you also want to join force! (They are usually carefully chosen. Not just random chore work. It’s a good way to get started!)

P.S. Welcome to join the RisingWave Slack community.

So much for this week. See you next week (hopefully)!

This Week in RisingWave #2 (20230225)

2023-02-25T00:00:00+00:00

See #1 for an introduction of this series.

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Most exciting things 🤩

We decided to refactor tree-structured plan node into a DAG some time ago (See RFC #28). The largest benefit is to enable common sub-plan sharing.

Previously we already implemented LogicalShare , which represents the plan node that will be shared, and used it to share source/subquery/CTE/view. Now comes the last piece of the puzzle: common sub-plan sharing.

Python UDF

Unlike Flink, RisingWave’s uses SQL as the main interface, to make it easier for users to use. But sometimes users do want some custom logic… We are doing some interesting experiments on this topic, and this is a little milestone. Basically you can already play with Python UDF now!

feat(udf): minimal Python UDF SDK by wangrunji0408 #7943

Other notable things

Telemetry

Telemetry can greatly help us understand and improve how the system behaves in real-world scenarios!

Replace Bloom filter with XOR filter

Stream error reporting

RisingWave tolerates compute errors by default (see #4625), but previously errors are only shown in the log. Now we are trying to report the errors to users (See #7824).

Table schema change

ALTER TABLE is quite tricky for streaming… ADD COLOMN is somewhat reasonable, and we are working on it as the first step.

feat(streaming): support output indices in dispatchers by BugenZhao #8094: This ensures existing downstream MV to receive old output columns and new downstream MV receive all columns. It also enables a perf optimization for MV on MV.
feat: adding column of table schema change by BugenZhao #8063

New major SQL features

`jsonb` data type

The json data type stores an exact copy of the input text, which processing functions must reparse on each execution; while jsonb data is stored in a decomposed binary format that makes it slightly slower to input due to added conversion overhead, but significantly faster to process, since no reparsing is needed. jsonb also supports indexing, which can be a significant advantage.

https://www.postgresql.org/docs/current/datatype-json.html

New array function: `array_to_string`

feat(expr): support array_to_string by fuyufjh #8027

postgres=> select array_to_string(array[1, 2, 3, NULL, 5], ',');
 array_to_string 
-----------------
 1,2,3,5
(1 row)

postgres=> select array_to_string(array[1, 2, 3, NULL, 5], ',', '*');
 array_to_string 
-----------------
 1,2,3,*,5
(1 row)

It’s also called array_join in some other systems, and we added that alias as well.

New aggregate function: `stddev` / `stdvar`

feat: implement stddev/var function by shanicky #7952

Mind-blowing SQL surprise 🤯

I don’t know that much about SQL before I becoming a database developer. Every now and then I got some new surprise from SQL…

Operator `|/`

fix(sqlparser): align operator precedence with PostgreSQL by xiangjinwu #8174

Do you know Postgres has a square root operator |/ …? The operator precedence might also surprise you.

postgres=> select |/4;
 ?column? 
----------
        2
(1 row)

postgres=> select |/4+12;
 ?column? 
----------
        4
(1 row)

`NULL` s hurt our head – FULL OUTER JOIN in steraming

Should we ban full outer join for streaming query? · Issue #8084

create table t (a int primary key);
insert into t values(null);
create materialized view v as select t1.* from t as t1 full join t as t2 on t1.a = t2.a;

Then v will have primary key (t1.a, t2.a) , but …

left  side: +[null] --> Full Join -> +[null, null]
right side: +[null] --> Full Join -> +[null, null]

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

Error handling

Is appropriate return a Enum Type Error? · Issue #8074

Error handling is quite a topic in Rust (or any language). We ran into this discussion again.

`ChaChaRng`

feat(sqlsmith): use ChaChaRng for determinism by kwannoel #8068: Use a reproducible rng so that deterministic fuzz test results can be (more) reproduceable.

Private marker trait

refactor(meta): list all implementors of MetadataModel by zwang28 #8122

Just show you the code! 😲

mod private {
    /// A marker trait helps to collect all implementors of `MetadataModel` in
    /// `for_all_metadata_models`. The trait should only be implemented by adding item in
    /// `for_all_metadata_models`.
    pub trait MetadataModelMarker {}
}

pub trait MetadataModel: std::fmt::Debug + Sized + private::MetadataModelMarker {
    // ...
}

macro_rules! for_all_metadata_models {
    ($macro:ident) => {
        $macro! {
            // These items should be included in a meta snapshot.
            // So be sure to update meta backup/restore when adding new items.
            { risingwave_pb::hummock::HummockVersion },
            // ...
        }
    };
}

macro_rules! impl_metadata_model_marker {
    ($({ $target_type:ty },)*) => {
        $(
            impl private::MetadataModelMarker for $target_type {}
        )*
    }
}

for_all_metadata_models!(impl_metadata_model_marker);

New Contributors

As I mentioned last time, RisingWave is an open source project, and new contributors are always welcome. So happy to see we do have some new contributors this week:

Check out the good first issue and help wanted issues if you also want to join force! (They are usually carefully chosen. Not just random chore work. It’s a good way to get started!)

P.S. Welcome to join the RisingWave Slack community.

So much for this week. (I also havn’t fully learned the details of many of them yet…) See you next week (hopefully)! 🤗

P.P.S I’m also considering deep diving into one or few interesting things every week, instead of writing a weekly summary like this. What do you think? 🤔

Stupidly effective ways to optimize Rust compile time

2023-02-17T00:00:00+00:00

本文的中文版

Although there are often complaints saying Rust’compilation speed is notorious slow, our project RisingWave is not very slow to compile, especially since previously contributors like (skyzh, BugenZhao) have put in a lot of effort. After using an M1 MacBook Pro, compiling is not a problem at all. A full debug compilation only takes 2-3 minutes.

However, over time, more and more things have been added to our CI, making it increasingly bloated. The main workflow now takes about 40 minutes, while the PR workflow takes about 25 minutes 30 seconds. Although it is still not intolerably slow, it is already noticeably slower than before.

So a few days ago, I decided to spend some time researching whether I could optimize the compilation speed a bit more.

What shocked me was that there were some very simple methods that, with just a little effort, produced astonishing results. I feel like I can describe them as low-hanging fruits, silver bullets, or even free lunch 🤯.

P.S. I highly recommend matklad’s blog (He is the original author of IntelliJ Rust and rust-analyzer):

Most of the methods I used are discussed there, and he explains them clearly. If not otherwise indicated, all quotes in this article come from there.

Although there are quite some articles talking about how to optimize Rust compilation speed (e.g., Tips for Faster Rust Compile Times), I still want to write another one to share my step-by-step process. Each optimization point comes with a corresponding PR, and you can combine the commit history to see the effect of each optimization point by comparing the CI pages before and after its PR.

P.P.S. Results after optimization: main workflow is now 27 minutes at the fastest, and PR workflow is now 16 minutes at the fastest, with most taking around 17-19 minutes.

Valuable data and charts to find the bottlenecks

Build times are a fairly easy optimization problem: it’s trivial to get direct feedback (just time the build), there are a bunch of tools for profiling, and you don’t even need to come up with a representative benchmark.

When trying to optimize anything, it would be good to have some profiling data and charts to find out the bottlenecks. Luckily, we do have some nice ones for optimizing CI time.

CI Waterfall & DAG Graph

We use Buildkite for our CI, and the normal view of a page (such as Build #17099) looks like this:

Buildkite has two very useful hidden pages, located at /waterfall and /dag, respectively, which show:

From the waferfall graph, we can see recovery test finishes last. Two large steps finish before it: build (deterministic simulation) and check. The DAG graph shows that recovery test depends only on simulation build, so we can forget about the check step for now, and conclude the biggest bottleneck is in the path of simulation build -> recovery test.

`cargo build --timings`

Cargo comes with built-in support for profiling build times (it was stabilized last year), which can be enabled by running cargo build --timings. It produces output like this:

We can see that the compilation times for some dependencies such as zstd-sys and protobuf-src are very long, so we should try to optimize them.

Step 1: Compilation cache

ci: try sccache #7799

If you think about it, it’s pretty obvious how a good caching strategy for CI should work.

Unfortunately, almost nobody does this.

Why should you give Sccache a try? With xuanwo’s strong recommendation, I was very tempted to try sccache, which was also a major trigger for my optimization efforts this time.

It’s so easy to use. Just add two environment variables to start it up:

ENV RUSTC_WRAPPER=sccache
ENV SCCACHE_BUCKET=ci-sccache-bucket

(Well, behind the scenes, you actually need to study Buildkite and AWS configurations - which are also very simple. Buildkite can obtain permissions through IAM roles, so I just need to a policy for the role to access an S3 bucket, without the need to configure things like secret keys. I had been thinking about whether I could echo the key out in CI before, but it seems there’s no need to worry about that. 😄)

The effect was immediately apparent, reducing the simulation build time by 2.5 minutes and the non-bottleneck debug build time by 4 minutes. Although it didn’t bring about a qualitative change, why not make use of the (almost free) quantitative change?

Step 2: Remove unused dependencies

build: remove unused deps #7816

All dependencies declared in Cargo.toml will be compiled regardless of whether they are actually used or not. Moreover, they may introduce unnecessary synchronization points, affecting the parallelism of compilation.

An old tool cargo-udeps is used to remove unused dependencies. But firstly, it does not support automatic fixing, and it is also very slow. Also, I had an impression that it cannot be used together with workspace-hack. This has led to RisingWave not cleaning up unused dependencies for a long time – a typical broken window effect 🥲!

In an issue of cargo-udeps about automatic fix, someone mentioned cargo-machete. Without many investigation I just gave it a shot, hoping it works. It turned out to be very fast and there were not many false positives! Although there were a few small problems (see the commit history of the above PR), they were easily fixed.

The author of cargo-machete has a blog introducing the harm of unused dependencies and the solution of cargo-machete. Specifically, cargo-udeps first compiles the project via cargo check and then analyzes it, while cargo-machete uses a simple and stupid approach: just ripgrep it.

This PR immediately removed dozens of unused dependencies, which surprised me again 🤯. Unfortunately, the CI time did not decrease further, which seems to indicate that sccache works very well… I roughly tested it locally, and it was faster by about ten to twenty seconds. It seems not a thing, but anyway it’s free :)

P.S. In fact, cargo-udeps can also be used with workspace-hack by configuring it: feat(risedev): add check-udeps #7836

Step 3: Disable incremental compilation

build: disable incremental build in CI #7838

After finishing the previous two steps, I almost wanted to finish my work, but I still felt a bit itchy and thought that the simulation build was still a little slow. So I decided to do some profiling. Then I saw the monsters in the --timings graph that I posted above. I felt that it didn’t make sense.

I tried to search the possible reasons why the build artifacts can be non-cacheable for sccache, and found that incremental compilation is a big caveat. I tried to disable it immediately and was shocked again. The effect was stupidly good:

This instantly reduced the time for simulation build by 4 minutes…

Actually, we turned off incremental compilation for our debug build a long time ago:

[profile.ci-dev]
incremental = false

But when we added a new build profile ci-sim later, we didn’t consider this issue. If you think about it, you can find although incremental compilation is good, it doesn’t make sense in CI!

CI builds often are closer to from-scratch builds, as changes are typically much bigger than from a local edit-compile cycle. For from-scratch builds, incremental adds an extra dependency-tracking overhead. It also significantly increases the amount of IO and the size of ./target, which make caching less effective.

So I simply added a global env var in CI to turn it off once and for all.

ENV CARGO_INCREMENTAL=0

Step 4: Single binary integration test

build: single-binary integration test #7842

It’s another stupidly effective optimization. tl;dr:

Don’t do this:

tests/
  foo.rs
  bar.rs

Do this instead:

tests/
  integration/
    main.rs
    foo.rs
    bar.rs

It’s because every file under tests/ will be compiled into a separate binary (meaning every one will link dependencies). Apart from slow compilation, this can even slow down test runnings (a flaw in cargo test).

This optimization didn’t reduce our test time (probably due to the superiority of cargo nextest), but it immediately reduced the compilation time by another 2 minutes… It’s also a bit funny that it also reduced the time for uploading/downloading, compressing/decompressing artifacts by 2 minutes…(although the latter did not affect the bottleneck).

Some previous efforts

The above is the main process of my optimization this time, and now I can finally be satisfied with the work. Finally, I would like to summarize some of our previous efforts for reference.

Use cargo nextest instead of cargo test.
Use the workspace-hack technique: see cargo hakari.
Add cache to the cargo registry, or use the recently stabilized sparse index.
Split a huge crate into multiple smaller crates.
Try to reduce linking time: linking takes a lot of time and is single-threaded, so it may probably become a bottleneck.
- Use a faster linker: mold for Linux, zld for macOS. lld is the most mature option for production use.
- Turn off Link Time Optimization (LTO) on debug builds.
Trade-off between compile time and performance: The total time of CI is compile time + test time, so whether to turn on compile optimization (including LTO mentioned above), and how much to turn on, is actually a trade-off between the two. You can test and adjust that in order to achieve an overall optimal choice. For example, here’s our build profile tuned by BugenZhao:

# The profile used for CI in pull requests.
# External dependencies are built with optimization enabled, while crates in this workspace are built
# with `dev` profile and full debug info. This is a trade-off between build time and e2e test time.
[profile.ci-dev]
inherits = "dev"
incremental = false
[profile.ci-dev.package."*"] # external dependencies
opt-level = 1
[profile.ci-dev.package."tokio"]
opt-level = 3
[profile.ci-dev.package."async_stack_trace"]
opt-level = 3
[profile.ci-dev.package."indextree"]
opt-level = 3
[profile.ci-dev.package."task_stats_alloc"]
opt-level = 3

# The profile used for deterministic simulation tests in CI.
# The simulator can only run single-threaded, so optimization is required to make the running time
# reasonable. The optimization level is customized to speed up the build.
[profile.ci-sim]
inherits = "dev"
opt-level = 2
incremental = false

For more optimization techniques, you may refer to other posts like Tips for Faster Rust Compile Times.

Conclusion

Things like CI and DX are easy to become messy if they are not taken care of regularly. My story shows that if you do some maintenance from time to time, you may get unexpected gains. A little effort can bring huge improvements.

Finally I’d like to quote matklad’s blog again as a conclusion:

Compilation time is a multiplier for basically everything. Whether you want to ship more features, to make code faster, to adapt to a change of requirements, or to attract new contributors, build time is a factor in that.

It also is a non-linear factor. Just waiting for the compiler is the smaller problem. The big one is losing the state of the flow or (worse) mental context switch to do something else while the code is compiling. One minute of work for the compiler wastes more than one minute of work for the human.

Let’s take some time to prevent “broken windows”. The effort would pay off!

This Week in RisingWave #1 (20230217)

2023-02-17T00:00:00+00:00

RisingWave is a distributed SQL database for stream processing written in Rust. As a developer of RisingWave, I’m always excited (and also a little bit overwhelmed) about its progress every day.

So why not share the excitement with more people (and also help myself to get a better understanding)? That’s why I decided to write this blog post about what’s happening in the project. Hope you will enjoy it!

This blog series is my personal comments about (part of) the development of RisingWave.

Please take it as an unofficial and no-promise supplement.

Most exciting things 🤩

Huge reduce of CI time

As my last post mentioned, last week I found some stupidly effective ways to optimize Rust compile time. I managed to reduce the CI time from main 40min/PR 25min30s to main 28min/PR 16-19min, and it looks good this week!

It’s quite a DX improvement. I’m very happy and would like to quote matklad’s blog again:

Compilation time is a multiplier for basically everything. Whether you want to ship more features, to make code faster, to adapt to a change of requirements, or to attract new contributors, build time is a factor in that.

It also is a non-linear factor. Just waiting for the compiler is the smaller problem. The big one is losing the state of the flow or (worse) mental context switch to do something else while the code is compiling. One minute of work for the compiler wastes more than one minute of work for the human.

Let’s take some time to prevent “broken windows”. The effort would pay off!

DDL UX improvements

DDL can take very long time if there are already a lot of existing data to consume. Previously you can only sit there and wait. But now, you can:

Show DDL’s progress: (by chenzl25 #7914)

dev=> select * from rw_catalog.rw_ddl_progress;
  ddl_id |         ddl_statement         | progress
--------+-------------------------------+----------
    1026 | CREATE INDEX idx ON sbtest1(c) | 69.02%
(1 row)

and cancel streaming jobs by ctrl-c! (by yezizp2012 #7917)

Most intersting things 😄

OpenDAL into RisingWave!

OpenDAL is a unified data access layer which aims to help access data freely, painlessly, and efficiently.

Previously, RisingWave only supports s3 as a storage backend (and also support other s3-compatible storage). Recently we are trying to add more storage backends.

Last week, we used OpenDAL to add support for using HDFS as a storage backend. This week, we tried more things:

Tried using google cloud storage in RisingWave (by wcy-fdu #7920). In our initial benchmark, it seems OpenDAL can be faster than s3-compatible protocol!
Changed the implementation for oss from s3-compatible mode to OpenDAL (by wcy-fdu #7969). S3-compatible mode for oss doesn’t support delete_objects, and also suffers from some unstable issues.

It seems that OpenDAL is quite promising!

Rusty stuff 🦀️

We ❤️ Rust! This section is about some general Rust related issues.

`clippy::or_fun_call`

Compared with or , The or_else methods on Option and Result can help to avoid expensive computation in the None or Err case. (i.e., lazy evaluation!) And the clippy::or_fun_call lint is for detecting that. It used to be warn by default, but unfortunately it seems to have a high false positive rate, and is now allow by default.

But we met a case where it’s very wanted. In a function, we used ok_or::<RwError> 9 times, so 9 RwError is created regardless of whether it’s needed. But constructing RwError is very expensive because it captures the backtrace…

What’s worse, on M1 Mac, capturing backtrace is VERY SLOW (~300ms #6131) and can even cause SEGFAULT (#6205)! 😇 It’s not completely resolved yet. We mitigated it by reducing unncessary backtrace capturing, but it’s still a problem.

So for this issue, of course we should use ok_or_else instead (by xxchan #7945).

`zip_eq` & `ExactSizeIterator`

zip_eq is a safer version of zip which checks if the two iterators have the same length. However, it’s notoriously slow (see #6856 by wangrunji0408 for a benchmark), because in the naive implementation, every item in the iterator is checked.

There’s a ExactSizeIterator trait and naturally it can have an optimized implementation. Such speciallized implementation does not exist because Rust doesn’t support specialization yet. (See tracking issue for specialization (RFC 1210))

So last week I added two new separated traits to replace zip_eq : (code here)

zip_eq_debug: uses zip_eq when debug_assertions is enabled, otherwise use zip. It’s a good trade-off between safety and performance, and can be a drop-in replacement for zip_eq.
zip_eq_fast: speciallized implementation for ExactSizeIterator. (Actually zip_eq_debug can be good enough, but I still added this after playing with specialization for a while. Just for fun…!😄)

This week, ExactSizeIterator are implemented for more our iterator implementations, so we can use zip_eq_fast more often… (by BugenZhao #7939)

Other notable things

System Parameters

Recently we are doing a large refactoring for system parameters in order to achive consistency and mutability for the cluster configurations (Tracking Issue #7381). This is a serious issue as RisingWave grows mature.

This week, the meta part for ALTER SYTEM is implemented by Gun9niR #7954.

`pg_catalog`

Since RisingWave is PostgreSQL compatible, I had thought that supporting other database tools would be easy. But it turns out to be very hard 😭. The largest obstacles is that database tools ususally rely heavily on the system tables in pg_catalog to get the metadata of the database so that they can provide a better UX. But there are so many features in the system tables!

We have been constantly making efforts to support more and more system tables in order to integrate RisingWave into other tools. This week, we:

Added pg_catalog.pg_conversion by yezizp2012 #7964. This is for DBeaver support.
Found it’s not possible to add typarray column in pg_catalog.pg_type #7555, which affects sqlalchemy support. But sqlalchemy-risingwave can be used as a substitute for the PG native plugin.

`EXPLAIN` format

EXPLAIN format is another thing I had thought would be easy. Although current situation is not so bad, it’s still far from perfect.

This week, we tried to use a “global expression id” to make the plan more readable, but it still has some problems:

By the way, here’s another intersting stuff: rewrite the EXPLAIN implementation using the (modified) Wadler-style algebraic pretty printer! See the RFC by ice1000

Query optimizer

Although OLAP batch queries are not the main focus of RisingWave, we still want to make it better. This week, we have these optimizer improvements to make batch queries faster:

We are also trying to add constant relation in streaming (#7854). As a first step, we are doing “plan-level constant folding”, e.g., merge Union with Values inputs into Values by jon-chuang #7923

Last but not least

fix: fix idle exit in playground mode of compactor by yezizp2012 #8014. During an HTTP server’s graceful shutdown (serve_with_shutdown), it will wait for all the connections to be closed. We ran into a situation where some connections were not closed and the server waits forever…
Tracking: Monitoring, logging and tooling improvements #8018. Observability matters!
Support error line/column number reporting during parsing #7863. This is a lovely feature to increase UX, but it will require hard work :)

By the way, welcome to join the RisingWave Slack community. Also check out the good first issue and help wanted issues if you want to join the development of an open source database system!

So much for this week. See you next week (hopefully)! 🤗

我如何动动小手就让 CI 时间减少了 10 分钟

2023-02-11T00:00:00+00:00

English version of this post

虽然经常有逸闻抱怨 Rust 编译速度臭名昭著地慢，但我们的项目 RisingWave 在经过前人比如（skyzh，BugenZhao）的一些努力后，编译速度并不算慢，特别是我自从用上 M1 的 Macbook Pro 后，编译速度根本不是问题，全量 debug 编译也就两三分钟。

然而随着时间推移，CI 里加了越来越多的东西，越来越臃肿了。现在 main workflow 需要大概 40min，PR workflow 需要大概 25min30s。虽然并不算特别慢，但是可以体感到比以前变慢了不少。

于是我前两天心血来潮，决定花点时间研究一下能不能再优化一点编译速度。

令我非常震惊的是，没想到存在着一些非常简单的方法，动动小手就产生了惊人的成效。感觉完全可以用 low-hanging fruits、silver bullet 甚至是 free lunch 来形容🤯。

P.S. 很推荐 matklad（IntelliJ Rust 和 rust-analyzer 的原作者）的 blog：

我用到的大部分方法这里面都有讲到，而且他讲的清晰明了。如果没有另外说明，那么文中的 quote 都来自这里。

本文算是我的实践记录，或者大概可以也当成一个 tl; dr 版。每一个优化点都带上了相应的 PR，可以结合 commit history 点开每个优化点前后的页面对比效果。

P.P.S. 优化完的结果：main 最快 27min，PR 最快 16min，大多在 17-19min 左右。

可供参考的数据、图表

Build times are a fairly easy optimization problem: it’s trivial to get direct feedback (just time the build), there are a bunch of tools for profiling, and you don’t even need to come up with a representative benchmark.

前两天在研究 profiling，那现在提到要优化，当然应该看看有没有什么数据、图表看看，找到瓶颈在哪里再来优化。

CI waterfall & dag graph

我们的 CI 用的是 Buildkite，正常点开一个页面（例如 Build #17099）长这样：

Buildkite 有两个非常好用的隐藏页面，分别是在 /waterfall 和 /dag 里，可以看到：

从图上我们可以清晰地看出，最大的瓶颈是 simulation build -> recovery test

`cargo build --timings`

Cargo 自带 profiling 编译时间的支持（貌似是去年 stablize 的），通过 cargo build –timings 启用，它长这样：

可以发现 zstd-sys ， protobuf-src 等几个依赖的编译时间非常长，应该想办法看看能不能优化掉。

Step 1: Compilation cache

ci: try sccache #7799

If you think about it, it’s pretty obvious how a good caching strategy for CI should work.

Unfortunately, almost nobody does this.

2023-04: 为什么你该试试 Sccache？在 xuanwo 的大力鼓吹下，我非常心动，也想尝试一下 sccache。这也算是我这次搞优化的一大 trigger。

不用多说，非常简单好用。只需加两个环境变量就一键启动了：

ENV RUSTC_WRAPPER=sccache
ENV SCCACHE_BUCKET=ci-sccache-bucket

（在这背后其实需要研究一下 Buildkite 和 AWS 的配置——实际上也非常傻瓜。Buildkite 可以通过 IAM Role 来获得权限，加一个 S3 bucket 的 policy 就 work 了，完全不用配置 secret key 之类的东西。我之前还在思考能不能在 CI 里把 key echo 出来，看来是完全不用担心这种事😄）

效果立竿见影，simulation build 减少了 2.5min，非瓶颈的 debug build 减少了 4min。虽然并没有质变，但是免费的量变何乐而不为呢？

Step 2: Remove unused dependencies

build: remove unused deps #7816

在 Cargo.toml 中声明的依赖不管实际上有没有用到，都会被编译。更甚它可能会引入不必要的 syncronization point，影响编译的并行度。

有个老工具 cargo-udeps 就是干这个的，但是首先它并不支持自动修复，而且它很慢。另外印象中有一个毛病是它不能和 workspace-hack 一起用。这导致 RisingWave 中长期没有清理过 unused dependencies。典型的破窗效应🥲！

在 cargo-udeps 里关于自动 fix 的 issue 下面看到有人提了 cargo-machete （这个名字是大砍刀的意思🤣），觉得是骡子是马拉出来遛遛，发现它跑的飞快，也没有几个 false postive。虽然有几个小问题（参考上面 PR 的 commit history），但是都能容易地修掉。

大砍刀的作者有一篇 blog 介绍了 unused dependencies 的危害以及 cargo-machete 的解法。具体说来，cargo-udeps 是用 cargo check 先编译了一遍再分析的，而 cargo-machete 是简单粗暴的 ripgrep。

这个 PR 一下子删掉了大几十个 udeps，也是让我大吃一惊🤯。可惜的是，CI 的时间并没有进一步缩短，感觉这侧面说明了 cache 效果很好……我本地粗略地测了一下，大概快了十几二十秒。蚊子腿也是肉嘛，anyway it’s free!

P.S. 其实 cargo-udeps 配一下也是能和 workspace-hack 用的：feat(risedev): add check-udeps #7836

Step 3: Disable incremental compilation

build: disable incremental build in CI #7838

干完上面两个小工作之后本来已经想收工了，但有点心痒痒，觉得 simulation build 还是有点慢。于是我决定 profiling 一下看看。然后就看到了一开始贴的 --timings 的图中的几个庞然大物，我觉得这很不 make sense。

我搜了搜 sccache non-cacheable 的原因，发现 incremental compilation 是个很大的 caveat，立马尝试了一下，然后我再次震惊了，效果 stupidly 好：

这让 simulation build 的时间瞬间下降了 4 分钟……

实际上我们的 debug build 是很早之前就关掉了 incremental compilation：

[profile.ci-dev]
incremental = false

但是后来加上新的 build profile 的时候没有考虑到这个问题。仔细想一想，incremental compilation 虽好，但它在 CI 里不太 make sense 啊！

CI builds often are closer to from-scratch builds, as changes are typically much bigger than from a local edit-compile cycle. For from-scratch builds, incremental adds an extra dependency-tracking overhead. It also significantly increases the amount of IO and the size of ./target, which make caching less effective.

于是我干脆在 CI 里加了个全局的 env var 来把它关掉，一劳永逸。

Step 4: Single binary integration test

build: single-binary integration test #7842

又是一个 stupidly effective 的优化。tl;dr:

Don’t do this:

tests/
  foo.rs
  bar.rs

Do this instead:

tests/
  integration/
    main.rs
    foo.rs
    bar.rs

因为 tests/ 下面的每个文件都会编译成一个单独的 binary（意味着会每个都 link 一下依赖）。除了编译慢，这甚至还可能导致测试跑的慢（cargo test 的缺陷）。

这个优化没有减少我们的测试时间（可能是因为 cargo nextest 的优越性），但它一下子又减少了快 2 分钟的编译时间……另外说来有点可笑的是，它还减少了 2 分钟的 artifacts 上传下载、压缩解压的时间……（虽然后者在瓶颈上并没有影响）

其他一些先前就存在的优化

以上就是我这次优化的主要过程了，这下终于可以心满意足地收工了。最后想再总结一些前人的努力，以供参考。

使用 cargo nextest 替代 cargo test。
使用 workspace-hack 技术：见 cargo hakari。
给 cargo registry 加 cache，或者使用刚刚 stablize 的 sparse index，可参考 DCjanus 的这篇 blog。
把巨大的 crate 拆分成多个小 create。
link time 的优化：link 很花时间，而且是单线程的，很可能成为瓶颈
- 使用更快的 linker：mold for Linux, zld for macOS. lld is the most mature option for production use.
- 在 debug build 上关掉 Link Time Optimization (LTO)。
Trade-off between compile time and performance：CI 的总时间是编译+测试，那么编译优化（包括上面的 LTO）开不开，开多少实际上就是在前后者之间 trade-off，可以调整测试来达到一个整体最优的选择。例如 bugen gg 在我们的 build profile 上的骚操作：

# The profile used for CI in pull requests.
# External dependencies are built with optimization enabled, while crates in this workspace are built
# with `dev` profile and full debug info. This is a trade-off between build time and e2e test time.
[profile.ci-dev]
inherits = "dev"
incremental = false
[profile.ci-dev.package."*"] # external dependencies
opt-level = 1
[profile.ci-dev.package."tokio"]
opt-level = 3
[profile.ci-dev.package."async_stack_trace"]
opt-level = 3
[profile.ci-dev.package."indextree"]
opt-level = 3
[profile.ci-dev.package."task_stats_alloc"]
opt-level = 3

# The profile used for deterministic simulation tests in CI.
# The simulator can only run single-threaded, so optimization is required to make the running time
# reasonable. The optimization level is customized to speed up the build.
[profile.ci-sim]
inherits = "dev"
opt-level = 2
incremental = false

除此以外的更多优化也有很多人都总结过，我就不多说（不懂）了，例如这篇 blog：Tips for Faster Rust Compile Times

总结

CI、开发者体验这种东西很容易就会在无人照料的情况下变得杂草丛生，但如果定期打理一下，可能会有意想不到的收获，一点点微小的努力就带来巨大的提升。

最后再摘两段 matklad blog 里的话作结：

Compilation time is a multiplier for basically everything. Whether you want to ship more features, to make code faster, to adapt to a change of requirements, or to attract new contributors, build time is a factor in that.

It also is a non-linear factor. Just waiting for the compiler is the smaller problem. The big one is losing the state of the flow or (worse) mental context switch to do something else while the code is compiling. One minute of work for the compiler wastes more than one minute of work for the human.

Profiling 101

2023-02-08T00:00:00+00:00

以前也浅尝辄止地试图 profile 过，但是被一大堆概念和工具搞的头昏脑涨，全是问题：

什么是 profiling，这个词什么意思？
CPU profiling 和 memory profiling 有什么区别？为什么都是火焰图？
Instrument 又是什么意思？
Golang 的 pprof package 和 pprof tool 有什么区别？
Mac 的 XCode Instruments 可以比较傻瓜式地 profile，但是没有 flamegraph，怎么办？

然后平时要么没有需求用不到 profiling，要么遇到问题很急，就病急乱投医，各种工具瞎试一通，最后不了了之，也没搞懂在干什么。

这两天在尝试用 profiling 工具查内存问题，虽然最后问题没让我看出来，但是这回终于基本搞懂他们到底是在干什么了，趁热打铁水一篇博客记录一下。

本文不包括：如何根据 profiling 的结果来进行性能优化（这是最终目标，需要大量经验），profiling 的各种具体实现技术。

本文主要科普基本概念，理解 profiling 在干什么，这算是一切后续工作的前提。因此本文目标读者是像我一样可能听说过一些，但是没怎么上手搞过 profiling，对基本概念也不太理解的朋友。

TL; DR

Profiling 其实包括分开的两个步骤：

收集程序运行的信息
分析/展示信息

Flamegraph 只是一种数据可视化方法，属于 profiling 的第二步。它不包括第一步收集信息，profiling 也可以不用它来展示信息。

什么是 flamegraph

就算不太懂 profiling 原理或者实践，看到这样的火焰图也会觉得非常直观，基本上能猜个大概是什么意思。

这一定程度上导致对门外汉来说提到 profiling 和 flamegraph 就觉得是绑定在一起的两个概念，但实际上不是。

其实它只是一种可视化工具，只负责展示信息。它的 input 可以是任何长得像 stacktrace 的东西。它的主要优点也正是可以明了地看出 stacktrace 和占比。

brendangregg/FlameGraph 这个 repo 作者 Brendan Gregg 是 flamegraph 的发明者。这个 repo 里包含了一系列（perl 写的）flamegraph 工具。里面包括：

flamegraph.pl 是最重要的用来生成 flamegraph SVG 图片的工具
stackcollapse-*.pl 系列可以把各种格式的 stacktrace 转换成 flamegraph.pl 的 input 格式

所以实际上我手写一个 flamegraph.pl 的输入格式的东西，也可以生成一个 flamegraph。例如这样：

a 1
a;b 0
a;b;c 1
a;b;c;e 1.5
a;b;d 2

上面这个例子应该很好地解释了 flamegraph 的原理。每一行是一个函数的完整的调用栈，最后一个数字是它自己的（时间/内存/…）占用数量。

把 stacktrace 转换成 flamegraph

现在很多 profiling 工具应该都可以直接生成 flamegraph，如果不行的话，可以自己想办法弄个脚本转换一下。例如 XCode Instruments 其实可以导出（它里面叫 Deep Copy… 并且还需要先选中才好导）这样的 stacktrace 文本（内存，CPU 什么的都可以导）：

# Allocation
Bytes Used	Count		Symbol Name
   9.75 KB     100.0%	56	 	_pthread_start
   9.75 KB     100.0%	56	 	 std::sys::unix::thread::Thread::new::thread_start::h8134a9cc26f143c2
...

# Time Profiler
Weight	Self Weight		Symbol Name
74.00 ms  100.0%	0 s	 	thread_start
74.00 ms  100.0%	0 s	 	 _pthread_start
74.00 ms  100.0%	0 s	 	  std::sys::unix::thread::Thread::new::thread_start::h8134a9cc26f143c2
...

# CPU Profiler
Weight	Self Weight		Symbol Name
34.35 Mc  100.0%	-	 	thread_start
34.35 Mc  100.0%	-	 	 _pthread_start
34.35 Mc  100.0%	-	 	  std::sys::unix::thread::Thread::new::thread_start::h8134a9cc26f143c2
...

我想要把 Allocation（内存 profile）的结果弄成 flamegraph，发现没有现成的工具。上面的工具包里有一个 stackcollapse-instruments.pl ，实际上处理的是 Time Profiler 的格式：

<>;
foreach (<>) {
	chomp;
	/\d+\.\d+ (?:min|s|ms)\s+\d+\.\d+%\s+(\d+(?:\.\d+)?) (min|s|ms)\t \t(\s*)(.+)/ or die;
	my $func = $4;
	my $depth = length ($3);
	$stack [$depth] = $4;
	foreach my $i (0 .. $depth - 1) {
		print $stack [$i];
		print ";";
	}

	my $time = 0 + $1;
	if ($2 eq "min") {
		$time *= 60*1000;
	} elsif ($2 eq "s") {
		$time *= 1000;
	}

	printf("%s %.0f\n", $func, $time);
}

这个 perl 代码和正则表达式虽然都看不懂，但是可以目测感受一下就是先做个匹配，然后根据空格缩进的数量处理 stack，最后再把时间单位转换一下就行了。

祭出 regex101 （其实我也第一次用）鉴定一下正则表达式：

再回头结合代码中 $2, $3, $4 的用法，可以非常确信每个变量的含义。

然后想要转 Allocation 的格式的话，依葫芦画瓢小改一下正则表达式和单位就行了（可以结合 regex101 的 debug 功能，看正则匹配到哪儿挂了，相当好用）。注意 Allocation 的信息里默认不包括 Self Bytes，可以手动打开一下。

什么是 profiling

profiling 这个词的含义

翻了翻词典和 wiki 的解释：

the recording and analysis of a person’s psychological and behavioral characteristics, so as to assess or predict their capabilities in a certain sphere or to assist in identifying a particular subgroup of people

Profiling, the extrapolation of information about something, based on known qualities

感觉是个非常 general 的词，但感觉差不多就是通过收集信息（观测）来分析推断的意思。常见其他用法包括心理学中的罪犯心理画像侧写，都差不多是一回事。

从第一个解释也可以看出，profiling 其实有两个步骤：收集信息和分析信息。例如 golang 自带的 pprof 包提供了多种方式启用 profiling：go test、HTTP 服务、手动在程序里启用，其实做的事情都是收集信息。以前我还搞不清楚它和 pprof tool 的关系，实际上后者是个分析、可视化工具，前者收集信息，收集成后者定义的输入格式。

收集信息

主要有两种收集信息的方式：

Statistical sampling：probes the target program’s call stack at regular intervals using operating system interrupts.
Code instrumentation：either inject code into a binary file that captures timing information or by using callback hooks.

后者比前者更精确，但是 overhead 也更大。前面提到的 Mac 上的 XCode Instruments，想必就是以后面这种方式来收集信息的。

instrumenation 这个词的含义

Instrumentation a collective term for measuring instruments that are used for indicating, measuring and recording physical quantities. The term has its origins in the art and science of scientific instrument-making.

Scientific instruments 实际上指科学仪器，that are used for indicating, measuring and recording physical quantities。Instrumentation 可以指使用科学仪器。

那么在软件的语境下， instrumentation 就指往代码中加入一些“仪器”，从而可以观测到软件的行为。它的中文好像叫“插桩”，也挺形象，就是感觉不太符合原意。

插入软件里的东西可能包括：

Collect profiling statistics：profiling 实际上就是 measuring dynamic program behaviors，非常像做科学实验
Run-time checking：像 AddressSanitizer 这种也被算作是 instrumentation，感觉稍微有点词义扩大的感觉

不过这么说来，我感觉 sampling 在更广义的角度看也可以算是一种“instrumentation”……

分析/展示信息

分析除了 flamegraph，其他还有像下面这样的 graphviz 图等等。

也可以不可视化，就看一看一些统计信息，例如 TopN。

不同的收集工具接受不同的 input 格式，相应地做的事情也会不一样，例如有的分析工具可能还有把 debug symbol 转成 source code address 这一步，而像之前的 flamegraph 就是接受现成的处理好的（collapsed）stacktrace。pprof 的输入格式是一个 protobuffer 定义。

CPU profiling & memory profiling

CPU profiling 和 memory profiling 有什么区别？为什么都是火焰图？

现在看这个问题就很简单了，它们要做的事情都是收集信息、分析信息这两步，只是具体收集的方法不一样而已。例如 jemalloc 的 heap profile 就是把 memory dump 一坨下来，然后可以用它的 jeprof 工具来分析，包括转成火焰图的格式。

总结

我终于理解了一切，所以那然后该怎么看懂火焰图，定位解决性能问题呢？😭

XX’s Blog

New site for This Week in RisingWave

This Week in RisingWave #6 (20230325)

Features Updates 🌟

NULLS {FIRST | LAST}

Rename relations

Performance Optimizations 💪

Bushy tree join ordering

Rusty stuff 🦀️

zld

New Contributors

This Week in RisingWave #5 (20230318)

Notable changes 🌟

Temporal join

Interesting SQL features 😄

Server local timezone

Intersting Bug

Inverse of column index mapping

Reliability Improvements 💪

The Great MadSim!

Interval bugfixes and tests

OpenDAL

Rusty stuff 🦀️

Be more careful about error creation!

New Contributors

This Week in RisingWave #4 (20230310)

Notable changes 🌟

Serial type

Auto execution mode selection

AWS Private link support

Sink Validation

Reliability Improvements 💪

Fuzzing Tests

Deterministic Tests

Rusty stuff 🦀️

Reduce debuginfo size

Capture unrecognized fields in serde

Auto derive a prefixed alias for protobuf message types

This Week in RisingWave #3 (20230304)

Notable changes

Memory control policy

Move memtable

Optimizer updates

Async expr

RFC: suspend MV

Intersting Bug

Injectivity of column index mapping

Rusty stuff 🦀️

Publish await-tree

Large memory usage by backtrace and debuginfo

New contributor

This Week in RisingWave #2 (20230225)

Most exciting things 🤩

Common sub-plan sharing

Python UDF

Other notable things

Telemetry

Replace Bloom filter with XOR filter

Stream error reporting

Table schema change

New major SQL features

jsonb data type

New array function: array_to_string

New aggregate function: stddev / stdvar

Mind-blowing SQL surprise 🤯

Operator |/

NULL s hurt our head – FULL OUTER JOIN in steraming

Rusty stuff 🦀️

Error handling

ChaChaRng

Private marker trait

New Contributors

Stupidly effective ways to optimize Rust compile time

Valuable data and charts to find the bottlenecks

CI Waterfall & DAG Graph

cargo build --timings

Step 1: Compilation cache

Step 2: Remove unused dependencies

Step 3: Disable incremental compilation

Step 4: Single binary integration test

`NULLS {FIRST | LAST}`

Publish `await-tree`

`jsonb` data type

New array function: `array_to_string`

New aggregate function: `stddev` / `stdvar`

Operator `|/`

`NULL` s hurt our head – FULL OUTER JOIN in steraming

`ChaChaRng`

`cargo build --timings`

`clippy::or_fun_call`

`zip_eq` & `ExactSizeIterator`

`pg_catalog`

`EXPLAIN` format

`cargo build --timings`