#
03921b97 |
| 21-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] No transitive type change (#92511)
Following of https://github.com/llvm/llvm-project/pull/92085.
#### motivation
The motivation is still cutting of the unnecessary change in t
[serialization] No transitive type change (#92511)
Following of https://github.com/llvm/llvm-project/pull/92085.
#### motivation
The motivation is still cutting of the unnecessary change in the
dependency chain. See the above link (recursively) for details.
And this will be the last patch of the `no-transitive-*-change` series.
If there are any following patches, they might be C++20 Named modules
specific to handle special grammars like `ADL` (See the reply in
https://discourse.llvm.org/t/rfc-c-20-modules-introduce-thin-bmi-and-decls-hash/74755/53
for example). So they won't affect the whole serialization part as the
series patch did.
#### example
After this patch, finally we are able to cut of unnecessary change of
types. For example,
```
//--- m-partA.cppm
export module m:partA;
//--- m-partA.v1.cppm
export module m:partA;
namespace NS {
class A {
public:
int getValue() {
return 43;
}
};
}
//--- m-partB.cppm
export module m:partB;
export inline int getB() {
return 430;
}
//--- m.cppm
export module m;
export import :partA;
export import :partB;
//--- useBOnly.cppm
export module useBOnly;
import m;
export inline int get() {
return getB();
}
```
The BMI of `useBOnly.cppm` is expected to not change if we only add a
new class in `m:partA`. This will be pretty useful in practice.
#### implementation details
The key idea of this patch is similar with the previous patches: extend
the 32bits type ID to 64bits so that we can store the module file index
in the higher bits. Then the encoding of the type ID is independent on
the imported modules.
But there are two differences from the previous patches:
- TypeID is not completely an index of serialized types. We used the
lower 3 bits to store the qualifiers.
- TypeID won't take part in any lookup process. So the uses of TypeID is
much less than the previous patches.
The first difference make we have some more slightly complex bit
operations. And the second difference makes the patch much simpler than
the previous ones.
show more ...
|
#
2f2ea355 |
| 20-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[Serialization] No transitive identifier change (#92085)
Following of https://github.com/llvm/llvm-project/pull/92083
The motivation is still cutting of the unnecessary change in the
dependency
[Serialization] No transitive identifier change (#92085)
Following of https://github.com/llvm/llvm-project/pull/92083
The motivation is still cutting of the unnecessary change in the
dependency chain. See the above link (recursively) for details.
After this patch, (and the above patch), we can already do something
pretty interesting. For example,
#### Motivation example
```
//--- m-partA.cppm
export module m:partA;
export inline int getA() {
return 43;
}
export class A {
public:
int getMem();
};
export template <typename T>
class ATempl {
public:
T getT();
};
//--- m-partA.v1.cppm
export module m:partA;
export inline int getA() {
return 43;
}
// Now we add a new declaration without introducing a new type.
// The consuming module which didn't use m:partA completely is expected to be
// not changed.
export inline int getA2() {
return 88;
}
export class A {
public:
int getMem();
// Now we add a new declaration without introducing a new type.
// The consuming module which didn't use m:partA completely is expected to be
// not changed.
int getMem2();
};
export template <typename T>
class ATempl {
public:
T getT();
// Add a new declaration without introducing a new type.
T getT2();
};
//--- m-partB.cppm
export module m:partB;
export inline int getB() {
return 430;
}
//--- m.cppm
export module m;
export import :partA;
export import :partB;
//--- useBOnly.cppm
export module useBOnly;
import m;
export inline int get() {
return getB();
}
```
In this example, module `m` exports two partitions `:partA` and
`:partB`. And a consumer `useBOnly` only consumes the entities from
`:partB`. So we don't hope the BMI of `useBOnly` changes if only
`:partA` changes. After this patch, we can make it if the change of
`:partA` doesn't introduce new types. (And we can get rid of this if we
make no-transitive-type-change).
As the example shows, when we change the implementation of `:partA` from
`m-partA.cppm` to `m-partA.v1.cppm`, we add new function declaration
`getA2()` at the global namespace, add a new member function `getMem2()`
to class `A` and add a new member function to `getT2()` to class
template `ATempl`. And since `:partA` is not used by `useBOnly`
completely, the BMI of `useBOnly` won't change after we made above
changes.
#### Design details
Method used in this patch is similar with
https://github.com/llvm/llvm-project/pull/92083 and
https://github.com/llvm/llvm-project/pull/86912. It extends the 32 bit
IdentifierID to 64 bits and use the higher 32 bits to store the module
file index. So that the encoding of the identifier won't get affected by
other modules.
#### Overhead
Similar with https://github.com/llvm/llvm-project/pull/92083 and
https://github.com/llvm/llvm-project/pull/86912. The change is only
expected to increase the size of the on-disk .pcm files and not affect
the compile-time performances. And from my experiment, the size of the
on-disk change only increase 1%+ and observe no compile-time impacts.
#### Future Plans
I'll try to do the same thing for type ids. IIRC, it won't change the
dependency graph if we add a new type in an unused units. I do think
this is a significant win. And this will be a pretty good answer to "why
modules are better than headers."
show more ...
|
Revision tags: llvmorg-18.1.8 |
|
#
5a0181f5 |
| 06-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the dependent modules of `X` changes, if the changes is not relevant with `X`, we hope the BMI of `X` won't change. For the specific patch, we hope if the changes was about irrelevant declaration changes, we hope the BMI of `X` won't change. **However**, I found the patch itself is not very useful in practice, since the adding or removing declarations, will change the state of identifiers and types in most cases.
That said, for the most simple example,
``` // partA.cppm export module m:partA;
// partA.v1.cppm export module m:partA; export void a() {}
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new identifiers and types (the function prototype).
So in this patch, we have to write the tests as:
``` // partA.cppm export module m:partA; export int getA() { ... } export int getA2(int) { ... }
// partA.v1.cppm export module m:partA; export int getA() { ... } export int getA(int) { ... } export int getA2(int) { ... }
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
so that the new introduced declaration `int getA(int)` doesn't introduce new identifiers and types, then the BMI of `onlyUseB` can keep unchanged.
While it looks not so great, the patch should be the base of the patch to erase the transitive change for identifiers and types since I don't know how can we introduce new types and identifiers without introducing new declarations. Given how tightly the relationship between declarations, types and identifiers, I think we can only reach the ideal state after we made the series for all of the three entties.
The design of the patch is similar to https://github.com/llvm/llvm-project/pull/86912, which extends the 32-bit DeclID to 64-bit and use the higher bits to store the module file index and the lower bits to store the Local Decl ID.
A slight difference is that we only use 48 bits to store the new DeclID since we try to use the higher 16 bits to store the module ID in the prefix of Decl class. Previously, we use 32 bits to store the module ID and 32 bits to store the DeclID. I don't want to allocate additional space so I tried to make the additional space the same as 64 bits. An potential interesting thing here is about the relationship between the module ID and the module file index. I feel we can get the module file index by the module ID. But I didn't prove it or implement it. Since I want to make the patch itself as small as possible. We can make it in the future if we want.
Another change in the patch is the new concept Decl Index, which means the index of the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration is simply the Decl ID minus PREDEFINED_DECL_NUMs. So there are some places they got used ambiguously. But this patch tried to split these two concepts.
As https://github.com/llvm/llvm-project/pull/86912 did, the change will increase the on-disk PCM file sizes. As the declaration ID may be the most IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change will bring 6.6% increase of the on-disk PCM size. No compile-time performance regression observed. Given the benefits in the motivation example, I think the cost is worthwhile.
show more ...
|
#
4f70c5ec |
| 07-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit 5c104879c1a98eeb845c03e7c45206bd48e88f0c.
The ArmV7 bot is complaining the change breaks the alignment.
|
#
5c104879 |
| 06-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the dependent modules of `X` changes, if the changes is not relevant with `X`, we hope the BMI of `X` won't change. For the specific patch, we hope if the changes was about irrelevant declaration changes, we hope the BMI of `X` won't change. **However**, I found the patch itself is not very useful in practice, since the adding or removing declarations, will change the state of identifiers and types in most cases.
That said, for the most simple example,
``` // partA.cppm export module m:partA;
// partA.v1.cppm export module m:partA; export void a() {}
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new identifiers and types (the function prototype).
So in this patch, we have to write the tests as:
``` // partA.cppm export module m:partA; export int getA() { ... } export int getA2(int) { ... }
// partA.v1.cppm export module m:partA; export int getA() { ... } export int getA(int) { ... } export int getA2(int) { ... }
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
so that the new introduced declaration `int getA(int)` doesn't introduce new identifiers and types, then the BMI of `onlyUseB` can keep unchanged.
While it looks not so great, the patch should be the base of the patch to erase the transitive change for identifiers and types since I don't know how can we introduce new types and identifiers without introducing new declarations. Given how tightly the relationship between declarations, types and identifiers, I think we can only reach the ideal state after we made the series for all of the three entties.
The design of the patch is similar to https://github.com/llvm/llvm-project/pull/86912, which extends the 32-bit DeclID to 64-bit and use the higher bits to store the module file index and the lower bits to store the Local Decl ID.
A slight difference is that we only use 48 bits to store the new DeclID since we try to use the higher 16 bits to store the module ID in the prefix of Decl class. Previously, we use 32 bits to store the module ID and 32 bits to store the DeclID. I don't want to allocate additional space so I tried to make the additional space the same as 64 bits. An potential interesting thing here is about the relationship between the module ID and the module file index. I feel we can get the module file index by the module ID. But I didn't prove it or implement it. Since I want to make the patch itself as small as possible. We can make it in the future if we want.
Another change in the patch is the new concept Decl Index, which means the index of the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration is simply the Decl ID minus PREDEFINED_DECL_NUMs. So there are some places they got used ambiguously. But this patch tried to split these two concepts.
As https://github.com/llvm/llvm-project/pull/86912 did, the change will increase the on-disk PCM file sizes. As the declaration ID may be the most IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change will bring 6.6% increase of the on-disk PCM size. No compile-time performance regression observed. Given the benefits in the motivation example, I think the cost is worthwhile.
show more ...
|
#
e2858189 |
| 06-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit 97c866f6c86456b3316006e6beff47e68a81c00a.
This fails on 32bit machines. See https://github.com/llvm/llvm-project/pull
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit 97c866f6c86456b3316006e6beff47e68a81c00a.
This fails on 32bit machines. See https://github.com/llvm/llvm-project/pull/92083
show more ...
|
#
97c866f6 |
| 06-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the dependent modules of `X` changes, if the changes is not relevant with `X`, we hope the BMI of `X` won't change. For the specific patch, we hope if the changes was about irrelevant declaration changes, we hope the BMI of `X` won't change. **However**, I found the patch itself is not very useful in practice, since the adding or removing declarations, will change the state of identifiers and types in most cases.
That said, for the most simple example,
``` // partA.cppm export module m:partA;
// partA.v1.cppm export module m:partA; export void a() {}
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new identifiers and types (the function prototype).
So in this patch, we have to write the tests as:
``` // partA.cppm export module m:partA; export int getA() { ... } export int getA2(int) { ... }
// partA.v1.cppm export module m:partA; export int getA() { ... } export int getA(int) { ... } export int getA2(int) { ... }
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
so that the new introduced declaration `int getA(int)` doesn't introduce new identifiers and types, then the BMI of `onlyUseB` can keep unchanged.
While it looks not so great, the patch should be the base of the patch to erase the transitive change for identifiers and types since I don't know how can we introduce new types and identifiers without introducing new declarations. Given how tightly the relationship between declarations, types and identifiers, I think we can only reach the ideal state after we made the series for all of the three entties.
The design of the patch is similar to https://github.com/llvm/llvm-project/pull/86912, which extends the 32-bit DeclID to 64-bit and use the higher bits to store the module file index and the lower bits to store the Local Decl ID.
A slight difference is that we only use 48 bits to store the new DeclID since we try to use the higher 16 bits to store the module ID in the prefix of Decl class. Previously, we use 32 bits to store the module ID and 32 bits to store the DeclID. I don't want to allocate additional space so I tried to make the additional space the same as 64 bits. An potential interesting thing here is about the relationship between the module ID and the module file index. I feel we can get the module file index by the module ID. But I didn't prove it or implement it. Since I want to make the patch itself as small as possible. We can make it in the future if we want.
Another change in the patch is the new concept Decl Index, which means the index of the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration is simply the Decl ID minus PREDEFINED_DECL_NUMs. So there are some places they got used ambiguously. But this patch tried to split these two concepts.
As https://github.com/llvm/llvm-project/pull/86912 did, the change will increase the on-disk PCM file sizes. As the declaration ID may be the most IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change will bring 6.6% increase of the on-disk PCM size. No compile-time performance regression observed. Given the benefits in the motivation example, I think the cost is worthwhile.
show more ...
|
Revision tags: llvmorg-18.1.7 |
|
#
cb60667b |
| 04-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit d8ec452db016f359feeec28994f6560b30b49824.
This fails on LLDB macOS CI. See https://github.com/llvm/llvm-project/pull/
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit d8ec452db016f359feeec28994f6560b30b49824.
This fails on LLDB macOS CI. See https://github.com/llvm/llvm-project/pull/92083 for details.
show more ...
|
#
d8ec452d |
| 03-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
The motivation of the patch series is that, for a module interface unit `X`, when the dependent modules of `X` changes, if the changes is not relevant with `X`, we hope the BMI of `X` won't change. For the specific patch, we hope if the changes was about irrelevant declaration changes, we hope the BMI of `X` won't change. **However**, I found the patch itself is not very useful in practice, since the adding or removing declarations, will change the state of identifiers and types in most cases.
That said, for the most simple example,
``` // partA.cppm export module m:partA;
// partA.v1.cppm export module m:partA; export void a() {}
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
the BMI of `onlyUseB` will change after we change the implementation of `partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new identifiers and types (the function prototype).
So in this patch, we have to write the tests as:
``` // partA.cppm export module m:partA; export int getA() { ... } export int getA2(int) { ... }
// partA.v1.cppm export module m:partA; export int getA() { ... } export int getA(int) { ... } export int getA2(int) { ... }
// partB.cppm export module m:partB; export void b() {}
// m.cppm export module m; export import :partA; export import :partB;
// onlyUseB; export module onlyUseB; import m; export inline void onluUseB() { b(); } ```
so that the new introduced declaration `int getA(int)` doesn't introduce new identifiers and types, then the BMI of `onlyUseB` can keep unchanged.
While it looks not so great, the patch should be the base of the patch to erase the transitive change for identifiers and types since I don't know how can we introduce new types and identifiers without introducing new declarations. Given how tightly the relationship between declarations, types and identifiers, I think we can only reach the ideal state after we made the series for all of the three entties.
The design of the patch is similar to https://github.com/llvm/llvm-project/pull/86912, which extends the 32-bit DeclID to 64-bit and use the higher bits to store the module file index and the lower bits to store the Local Decl ID.
A slight difference is that we only use 48 bits to store the new DeclID since we try to use the higher 16 bits to store the module ID in the prefix of Decl class. Previously, we use 32 bits to store the module ID and 32 bits to store the DeclID. I don't want to allocate additional space so I tried to make the additional space the same as 64 bits. An potential interesting thing here is about the relationship between the module ID and the module file index. I feel we can get the module file index by the module ID. But I didn't prove it or implement it. Since I want to make the patch itself as small as possible. We can make it in the future if we want.
Another change in the patch is the new concept Decl Index, which means the index of the very big array `DeclsLoaded` in ASTReader. Previously, the index of a loaded declaration is simply the Decl ID minus PREDEFINED_DECL_NUMs. So there are some places they got used ambiguously. But this patch tried to split these two concepts.
As https://github.com/llvm/llvm-project/pull/86912 did, the change will increase the on-disk PCM file sizes. As the declaration ID may be the most IDs in the PCM file, this can have the biggest impact on the size. In my experiments, this change will bring 6.6% increase of the on-disk PCM size. No compile-time performance regression observed. Given the benefits in the motivation example, I think the cost is worthwhile.
show more ...
|
#
6b30180b |
| 03-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit ccb73e882b2d727877cfda42a14a6979cfd31f04.
It looks like there are some bots complaining about the patch. See the post
Revert "[serialization] no transitive decl change (#92083)"
This reverts commit ccb73e882b2d727877cfda42a14a6979cfd31f04.
It looks like there are some bots complaining about the patch. See the post commit comment in https://github.com/llvm/llvm-project/pull/92083 to track it.
show more ...
|
#
ccb73e88 |
| 03-Jun-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
#### Motivation Example
The motivation of the patch series is that, for a module
[serialization] no transitive decl change (#92083)
Following of https://github.com/llvm/llvm-project/pull/86912
#### Motivation Example
The motivation of the patch series is that, for a module interface unit
`X`, when the dependent modules of `X` changes, if the changes is not
relevant with `X`, we hope the BMI of `X` won't change. For the specific
patch, we hope if the changes was about irrelevant declaration changes,
we hope the BMI of `X` won't change. **However**, I found the patch
itself is not very useful in practice, since the adding or removing
declarations, will change the state of identifiers and types in most
cases.
That said, for the most simple example,
```
// partA.cppm
export module m:partA;
// partA.v1.cppm
export module m:partA;
export void a() {}
// partB.cppm
export module m:partB;
export void b() {}
// m.cppm
export module m;
export import :partA;
export import :partB;
// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
b();
}
```
the BMI of `onlyUseB` will change after we change the implementation of
`partA.cppm` to `partA.v1.cppm`. Since `partA.v1.cppm` introduces new
identifiers and types (the function prototype).
So in this patch, we have to write the tests as:
```
// partA.cppm
export module m:partA;
export int getA() { ... }
export int getA2(int) { ... }
// partA.v1.cppm
export module m:partA;
export int getA() { ... }
export int getA(int) { ... }
export int getA2(int) { ... }
// partB.cppm
export module m:partB;
export void b() {}
// m.cppm
export module m;
export import :partA;
export import :partB;
// onlyUseB;
export module onlyUseB;
import m;
export inline void onluUseB() {
b();
}
```
so that the new introduced declaration `int getA(int)` doesn't introduce
new identifiers and types, then the BMI of `onlyUseB` can keep
unchanged.
While it looks not so great, the patch should be the base of the patch
to erase the transitive change for identifiers and types since I don't
know how can we introduce new types and identifiers without introducing
new declarations. Given how tightly the relationship between
declarations, types and identifiers, I think we can only reach the ideal
state after we made the series for all of the three entties.
#### Design details
The design of the patch is similar to
https://github.com/llvm/llvm-project/pull/86912, which extends the
32-bit DeclID to 64-bit and use the higher bits to store the module file
index and the lower bits to store the Local Decl ID.
A slight difference is that we only use 48 bits to store the new DeclID
since we try to use the higher 16 bits to store the module ID in the
prefix of Decl class. Previously, we use 32 bits to store the module ID
and 32 bits to store the DeclID. I don't want to allocate additional
space so I tried to make the additional space the same as 64 bits. An
potential interesting thing here is about the relationship between the
module ID and the module file index. I feel we can get the module file
index by the module ID. But I didn't prove it or implement it. Since I
want to make the patch itself as small as possible. We can make it in
the future if we want.
Another change in the patch is the new concept Decl Index, which means
the index of the very big array `DeclsLoaded` in ASTReader. Previously,
the index of a loaded declaration is simply the Decl ID minus
PREDEFINED_DECL_NUMs. So there are some places they got used
ambiguously. But this patch tried to split these two concepts.
#### Overhead
As https://github.com/llvm/llvm-project/pull/86912 did, the change will
increase the on-disk PCM file sizes. As the declaration ID may be the
most IDs in the PCM file, this can have the biggest impact on the size.
In my experiments, this change will bring 6.6% increase of the on-disk
PCM size. No compile-time performance regression observed. Given the
benefits in the motivation example, I think the cost is worthwhile.
show more ...
|
Revision tags: llvmorg-18.1.6 |
|
#
947b0628 |
| 06-May-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Reland "[Modules] No transitive source location change (#86912)"
This relands 6c31104.
The patch was reverted due to incorrectly introduced alignment. And the patch was re-commited after fixing the
Reland "[Modules] No transitive source location change (#86912)"
This relands 6c31104.
The patch was reverted due to incorrectly introduced alignment. And the patch was re-commited after fixing the alignment issue.
Following off are the original message:
This is part of "no transitive change" patch series, "no transitive source location change". I talked this with @Bigcheese in the tokyo's WG21 meeting.
The idea comes from @jyknight posted on LLVM discourse. That for:
``` // A.cppm export module A; ...
// B.cppm export module B; import A; ...
//--- C.cppm export module C; import C; ```
Almost every time A.cppm changes, we need to recompile `B`. Due to we think the source location is significant to the semantics. But it may be good if we can avoid recompiling `C` if the change from `A` wouldn't change the BMI of B.
This patch only cares source locations. So let's focus on source location's example. We can see the full example from the attached test.
``` //--- A.cppm export module A; export template <class T> struct C { T func() { return T(43); } }; export int funcA() { return 43; }
//--- A.v1.cppm export module A;
export template <class T> struct C { T func() { return T(43); } }; export int funcA() { return 43; }
//--- B.cppm export module B; import A;
export int funcB() { return funcA(); }
//--- C.cppm export module C; import A; export void testD() { C<int> c; c.func(); } ```
Here the only difference between `A.cppm` and `A.v1.cppm` is that `A.v1.cppm` has an additional blank line. Then the test shows that two BMI of `B.cppm`, one specified `-fmodule-file=A=A.pcm` and the other specified `-fmodule-file=A=A.v1.pcm`, should have the bit-wise same contents.
However, it is a different story for C, since C instantiates templates from A, and the instantiation records the source information from module A, which is different from `A` and `A.v1`, so it is expected that the BMI `C.pcm` and `C.v1.pcm` can and should differ.
To fully understand the patch, we need to understand how we encodes source locations and how we serialize and deserialize them.
For source locations, we encoded them as:
``` | | | _____ base offset of an imported module | | | |_____ base offset of another imported module | | | | | ___ 0 ```
As the diagram shows, we encode the local (unloaded) source location from 0 to higher bits. And we allocate the space for source locations from the loaded modules from high bits to 0. Then the source locations from the loaded modules will be mapped to our source location space according to the allocated offset.
For example, for,
``` // a.cppm export module a; ...
// b.cppm export module b; import a; ... ```
Assuming the offset of a source location (let's name the location as `S`) in a.cppm is 45 and we will record the value `45` into the BMI `a.pcm`. Then in b.cppm, when we import a, the source manager will allocate a space for module 'a' (according to the recorded number of source locations) as the base offset of module 'a' in the current source location spaces. Let's assume the allocated base offset as 90 in this example. Then when we want to get the location in the current source location space for `S`, we can get it simply by adding `45` to `90` to `135`. Finally we can get the source location for `S` in module B as `135`.
And when we want to write module `b`, we would also write the source location of `S` as `135` directly in the BMI. And to clarify the location `S` comes from module `a`, we also need to record the base offset of module `a`, 90 in the BMI of `b`.
Then the problem comes. Since the base offset of module 'a' is computed by the number source locations in module 'a'. In module 'b', the recorded base offset of module 'a' will change every time the number of source locations in module 'a' increase or decrease. In other words, the contents of BMI of B will change every time the number of locations in module 'a' changes. This is pretty sensitive. Almost every change will change the number of locations. So this is the problem this patch want to solve.
Let's continue with the existing design to understand what's going on. Another interesting case is:
``` // c.cppm export module c; import whatever; import a; import b; ... ```
In `c.cppm`, when we import `a`, we still need to allocate a base location offset for it, let's say the value becomes to `200` somehow. Then when we reach the location `S` recorded in module `b`, we need to translate it into the current source location space. The solution is quite simple, we can get it by `135 + (200 - 90) = 245`. In another word, the offset of a source location in current module can be computed as `Recorded Offset + Base Offset of the its module file - Recorded Base Offset`.
Then we're almost done about how we handle the offset of source locations in serializers.
From the abstract level, what we want to do is to remove the hardcoded base offset of imported modules and remain the ability to calculate the source location in a new module unit. To achieve this, we need to be able to find the module file owning a source location from the encoding of the source location.
So in this patch, for each source location, we will store the local offset of the location and the module file index. For the above example, in `b.pcm`, the source location of `S` will be recorded as `135` directly. And in the new design, the source location of `S` will be recorded as `<1, 45>`. Here `1` stands for the module file index of `a` in module `b`. And `45` means the offset of `S` to the base offset of module `a`.
So the trade-off here is that, to make the BMI more independent, we need to record more abstract information. And I feel it is worthy. The recompilation problem of modules is really annoying and there are still people complaining this. But if we can make this (including stopping other changes transitively), I think this may be a killer feature for modules. And from @Bigcheese , this should be helpful for clang explicit modules too.
And the benchmarking side, I tested this patch against https://github.com/alibaba/async_simple/tree/CXX20Modules. No significant change on compilation time. The size of .pcm files becomes to 204M from 200M. I think the trade-off is pretty fair.
I didn't use another slot to record the module file index. I tried to use the higher 32 bits of the existing source location encodings to store that information. This design may be safe. Since we use `unsigned` to store source locations but we use uint64_t in serialization. And generally `unsigned` is 32 bit width in most platforms. So it might not be a safe problem. Since all the bits we used to store the module file index is not used before. So the new encodings may be:
``` |-----------------------|-----------------------| | A | B | C |
* A: 32 bit. The index of the module file in the module manager + 1. * The +1 here is necessary since we wish 0 stands for the current module file. * B: 31 bit. The offset of the source location to the module file * containing it. * C: The macro bit. We rotate it to the lowest bit so that we can save * some space in case the index of the module file is 0. ```
(The B and C is the existing raw encoding for source locations)
Another reason to reuse the same slot of the source location is to reduce the impact of the patch. Since there are a lot of places assuming we can store and get a source location from a slot. And if I tried to add another slot, a lot of codes breaks. I don't feel it is worhty.
Another impact of this decision is that, the existing small optimizations for encoding source location may be invalided. The key of the optimization is that we can turn large values into small values then we can use VBR6 format to reduce the size. But if we decided to put the module file index into the higher bits, then maybe it simply doesn't work. An example may be the `SourceLocationSequence` optimization.
This will only affect the size of on-disk .pcm files. I don't expect this impact the speed and memory use of compilations. And seeing my small experiments above, I feel this trade off is worthy.
The mental model for handling source location offsets is not so complex and I believe we can solve it by adding module file index to each stored source location.
For the practical side, since the source location is pretty sensitive, and the patch can pass all the in-tree tests and a small scale projects, I feel it should be correct.
I'll continue to work on no transitive decl change and no transitive identifier change (if matters) to achieve the goal to stop the propagation of unnecessary changes. But all of this depends on this patch. Since, clearly, the source locations are the most sensitive thing.
---
The release nots and documentation will be added seperately.
show more ...
|
Revision tags: llvmorg-18.1.5 |
|
#
d333a0de |
| 30-Apr-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
Revert "[Modules] No transitive source location change (#86912)"
This reverts commit 6c3110464bac3600685af9650269b0b2b8669d34.
Required by the post commit comments: https://github.com/llvm/llvm-pro
Revert "[Modules] No transitive source location change (#86912)"
This reverts commit 6c3110464bac3600685af9650269b0b2b8669d34.
Required by the post commit comments: https://github.com/llvm/llvm-project/pull/86912
show more ...
|
#
6c311046 |
| 30-Apr-2024 |
Chuanqi Xu <yedeng.yd@linux.alibaba.com> |
[Modules] No transitive source location change (#86912)
This is part of "no transitive change" patch series, "no transitive
source location change". I talked this with @Bigcheese in the tokyo's
WG
[Modules] No transitive source location change (#86912)
This is part of "no transitive change" patch series, "no transitive
source location change". I talked this with @Bigcheese in the tokyo's
WG21 meeting.
The idea comes from @jyknight posted on LLVM discourse. That for:
```
// A.cppm
export module A;
...
// B.cppm
export module B;
import A;
...
//--- C.cppm
export module C;
import C;
```
Almost every time A.cppm changes, we need to recompile `B`. Due to we
think the source location is significant to the semantics. But it may be
good if we can avoid recompiling `C` if the change from `A` wouldn't
change the BMI of B.
# Motivation Example
This patch only cares source locations. So let's focus on source
location's example. We can see the full example from the attached test.
```
//--- A.cppm
export module A;
export template <class T>
struct C {
T func() {
return T(43);
}
};
export int funcA() {
return 43;
}
//--- A.v1.cppm
export module A;
export template <class T>
struct C {
T func() {
return T(43);
}
};
export int funcA() {
return 43;
}
//--- B.cppm
export module B;
import A;
export int funcB() {
return funcA();
}
//--- C.cppm
export module C;
import A;
export void testD() {
C<int> c;
c.func();
}
```
Here the only difference between `A.cppm` and `A.v1.cppm` is that
`A.v1.cppm` has an additional blank line. Then the test shows that two
BMI of `B.cppm`, one specified `-fmodule-file=A=A.pcm` and the other
specified `-fmodule-file=A=A.v1.pcm`, should have the bit-wise same
contents.
However, it is a different story for C, since C instantiates templates
from A, and the instantiation records the source information from module
A, which is different from `A` and `A.v1`, so it is expected that the
BMI `C.pcm` and `C.v1.pcm` can and should differ.
# Internal perspective of status quo
To fully understand the patch, we need to understand how we encodes
source locations and how we serialize and deserialize them.
For source locations, we encoded them as:
```
|
|
| _____ base offset of an imported module
|
|
|
|_____ base offset of another imported module
|
|
|
|
| ___ 0
```
As the diagram shows, we encode the local (unloaded) source location
from 0 to higher bits. And we allocate the space for source locations
from the loaded modules from high bits to 0. Then the source locations
from the loaded modules will be mapped to our source location space
according to the allocated offset.
For example, for,
```
// a.cppm
export module a;
...
// b.cppm
export module b;
import a;
...
```
Assuming the offset of a source location (let's name the location as
`S`) in a.cppm is 45 and we will record the value `45` into the BMI
`a.pcm`. Then in b.cppm, when we import a, the source manager will
allocate a space for module 'a' (according to the recorded number of
source locations) as the base offset of module 'a' in the current source
location spaces. Let's assume the allocated base offset as 90 in this
example. Then when we want to get the location in the current source
location space for `S`, we can get it simply by adding `45` to `90` to
`135`. Finally we can get the source location for `S` in module B as
`135`.
And when we want to write module `b`, we would also write the source
location of `S` as `135` directly in the BMI. And to clarify the
location `S` comes from module `a`, we also need to record the base
offset of module `a`, 90 in the BMI of `b`.
Then the problem comes. Since the base offset of module 'a' is computed
by the number source locations in module 'a'. In module 'b', the
recorded base offset of module 'a' will change every time the number of
source locations in module 'a' increase or decrease. In other words, the
contents of BMI of B will change every time the number of locations in
module 'a' changes. This is pretty sensitive. Almost every change will
change the number of locations. So this is the problem this patch want
to solve.
Let's continue with the existing design to understand what's going on.
Another interesting case is:
```
// c.cppm
export module c;
import whatever;
import a;
import b;
...
```
In `c.cppm`, when we import `a`, we still need to allocate a base
location offset for it, let's say the value becomes to `200` somehow.
Then when we reach the location `S` recorded in module `b`, we need to
translate it into the current source location space. The solution is
quite simple, we can get it by `135 + (200 - 90) = 245`. In another
word, the offset of a source location in current module can be computed
as `Recorded Offset + Base Offset of the its module file - Recorded Base
Offset`.
Then we're almost done about how we handle the offset of source
locations in serializers.
# The high level design of current patch
From the abstract level, what we want to do is to remove the hardcoded
base offset of imported modules and remain the ability to calculate the
source location in a new module unit. To achieve this, we need to be
able to find the module file owning a source location from the encoding
of the source location.
So in this patch, for each source location, we will store the local
offset of the location and the module file index. For the above example,
in `b.pcm`, the source location of `S` will be recorded as `135`
directly. And in the new design, the source location of `S` will be
recorded as `<1, 45>`. Here `1` stands for the module file index of `a`
in module `b`. And `45` means the offset of `S` to the base offset of
module `a`.
So the trade-off here is that, to make the BMI more independent, we need
to record more abstract information. And I feel it is worthy. The
recompilation problem of modules is really annoying and there are still
people complaining this. But if we can make this (including stopping
other changes transitively), I think this may be a killer feature for
modules. And from @Bigcheese , this should be helpful for clang explicit
modules too.
And the benchmarking side, I tested this patch against
https://github.com/alibaba/async_simple/tree/CXX20Modules. No
significant change on compilation time. The size of .pcm files becomes
to 204M from 200M. I think the trade-off is pretty fair.
# Some low level details
I didn't use another slot to record the module file index. I tried to
use the higher 32 bits of the existing source location encodings to
store that information. This design may be safe. Since we use `unsigned`
to store source locations but we use uint64_t in serialization. And
generally `unsigned` is 32 bit width in most platforms. So it might not
be a safe problem. Since all the bits we used to store the module file
index is not used before. So the new encodings may be:
```
|-----------------------|-----------------------|
| A | B | C |
* A: 32 bit. The index of the module file in the module manager + 1. The +1
here is necessary since we wish 0 stands for the current module file.
* B: 31 bit. The offset of the source location to the module file containing it.
* C: The macro bit. We rotate it to the lowest bit so that we can save some
space in case the index of the module file is 0.
```
(The B and C is the existing raw encoding for source locations)
Another reason to reuse the same slot of the source location is to
reduce the impact of the patch. Since there are a lot of places assuming
we can store and get a source location from a slot. And if I tried to
add another slot, a lot of codes breaks. I don't feel it is worhty.
Another impact of this decision is that, the existing small
optimizations for encoding source location may be invalided. The key of
the optimization is that we can turn large values into small values then
we can use VBR6 format to reduce the size. But if we decided to put the
module file index into the higher bits, then maybe it simply doesn't
work. An example may be the `SourceLocationSequence` optimization.
This will only affect the size of on-disk .pcm files. I don't expect
this impact the speed and memory use of compilations. And seeing my
small experiments above, I feel this trade off is worthy.
# Correctness
The mental model for handling source location offsets is not so complex
and I believe we can solve it by adding module file index to each stored
source location.
For the practical side, since the source location is pretty sensitive,
and the patch can pass all the in-tree tests and a small scale projects,
I feel it should be correct.
# Future Plans
I'll continue to work on no transitive decl change and no transitive
identifier change (if matters) to achieve the goal to stop the
propagation of unnecessary changes. But all of this depends on this
patch. Since, clearly, the source locations are the most sensitive
thing.
---
The release nots and documentation will be added seperately.
show more ...
|
Revision tags: llvmorg-18.1.4, llvmorg-18.1.3, llvmorg-18.1.2, llvmorg-18.1.1, llvmorg-18.1.0, llvmorg-18.1.0-rc4, llvmorg-18.1.0-rc3, llvmorg-18.1.0-rc2, llvmorg-18.1.0-rc1, llvmorg-19-init, llvmorg-17.0.6, llvmorg-17.0.5, llvmorg-17.0.4, llvmorg-17.0.3, llvmorg-17.0.2, llvmorg-17.0.1, llvmorg-17.0.0, llvmorg-17.0.0-rc4, llvmorg-17.0.0-rc3, llvmorg-17.0.0-rc2, llvmorg-17.0.0-rc1, llvmorg-18-init, llvmorg-16.0.6, llvmorg-16.0.5, llvmorg-16.0.4, llvmorg-16.0.3, llvmorg-16.0.2, llvmorg-16.0.1, llvmorg-16.0.0, llvmorg-16.0.0-rc4, llvmorg-16.0.0-rc3, llvmorg-16.0.0-rc2, llvmorg-16.0.0-rc1, llvmorg-17-init, llvmorg-15.0.7, llvmorg-15.0.6, llvmorg-15.0.5, llvmorg-15.0.4, llvmorg-15.0.3, working, llvmorg-15.0.2, llvmorg-15.0.1, llvmorg-15.0.0, llvmorg-15.0.0-rc3, llvmorg-15.0.0-rc2, llvmorg-15.0.0-rc1, llvmorg-16-init, llvmorg-14.0.6, llvmorg-14.0.5, llvmorg-14.0.4, llvmorg-14.0.3, llvmorg-14.0.2, llvmorg-14.0.1, llvmorg-14.0.0, llvmorg-14.0.0-rc4, llvmorg-14.0.0-rc3, llvmorg-14.0.0-rc2, llvmorg-14.0.0-rc1, llvmorg-15-init, llvmorg-13.0.1, llvmorg-13.0.1-rc3, llvmorg-13.0.1-rc2, llvmorg-13.0.1-rc1, llvmorg-13.0.0, llvmorg-13.0.0-rc4, llvmorg-13.0.0-rc3, llvmorg-13.0.0-rc2, llvmorg-13.0.0-rc1, llvmorg-14-init, llvmorg-12.0.1, llvmorg-12.0.1-rc4, llvmorg-12.0.1-rc3, llvmorg-12.0.1-rc2, llvmorg-12.0.1-rc1, llvmorg-12.0.0, llvmorg-12.0.0-rc5, llvmorg-12.0.0-rc4, llvmorg-12.0.0-rc3, llvmorg-12.0.0-rc2, llvmorg-11.1.0, llvmorg-11.1.0-rc3, llvmorg-12.0.0-rc1, llvmorg-13-init, llvmorg-11.1.0-rc2, llvmorg-11.1.0-rc1, llvmorg-11.0.1, llvmorg-11.0.1-rc2, llvmorg-11.0.1-rc1, llvmorg-11.0.0, llvmorg-11.0.0-rc6, llvmorg-11.0.0-rc5, llvmorg-11.0.0-rc4, llvmorg-11.0.0-rc3, llvmorg-11.0.0-rc2, llvmorg-11.0.0-rc1, llvmorg-12-init, llvmorg-10.0.1, llvmorg-10.0.1-rc4, llvmorg-10.0.1-rc3, llvmorg-10.0.1-rc2, llvmorg-10.0.1-rc1, llvmorg-10.0.0, llvmorg-10.0.0-rc6, llvmorg-10.0.0-rc5, llvmorg-10.0.0-rc4, llvmorg-10.0.0-rc3, llvmorg-10.0.0-rc2, llvmorg-10.0.0-rc1, llvmorg-11-init, llvmorg-9.0.1, llvmorg-9.0.1-rc3, llvmorg-9.0.1-rc2, llvmorg-9.0.1-rc1 |
|
#
f7170d17 |
| 22-Nov-2019 |
Duncan P. N. Exon Smith <dexonsmith@apple.com> |
clang/Modules: Move Serialization/Module.{h,cpp} to ModuleFile, NFC
Remove some cognitive load by renaming clang/Serialization/Module.h to clang/Serialization/ModuleFile.h, since it declares the Mod
clang/Modules: Move Serialization/Module.{h,cpp} to ModuleFile, NFC
Remove some cognitive load by renaming clang/Serialization/Module.h to clang/Serialization/ModuleFile.h, since it declares the ModuleFile class. This also makes editing a bit easier, since the basename of the file no long conflicts with clang/Basic/Module.h, which declares the Module class. Also move lib/Serialization/Module.cpp to lib/Serialization/ModuleFile.cpp.
show more ...
|