Dogwood Shipping Out Today

chupasaurus · August 4, 2020, 3:01pm

Thank you for a quick response.

Kyle_Rankin · August 4, 2020, 3:49pm

It’s a good point and what we generally advise people to do who want daily status updates. Our bug tracker and code repos for all our various projects on https://source.puri.sm is the best way to stay informed on daily progress.

I assumed this issue like others is being tracked in there, but perhaps not as, especially at the beginning, it was unclear which project one would assign it to as we are still troubleshooting whether it’s a hardware or software problem. That said, checking out merge requests and Purism developer activity against our linux-next repo would probably give you a good snapshot into what’s being worked on.

maximilian · August 5, 2020, 8:36am

Hi @Kyle_Rankin,

after @amosbatto’s post I also had a look at https://source.puri.sm but I also did not find this issue there, which is as I understood also what happened to amosbatto.

If I had a look at https://source.puri.sm/groups/Librem5/-/issues which are all issues availible for the librem 5 project (also linux next which you mention) there are only 6 issues updated not longer ago then one week and none of them seems to be related to the found issue.

jrial · August 5, 2020, 1:16pm

As a software developer who sometimes has to debug long running processes, I feel your pain…

It should be tracked somewhere, even if it turns out to be the wrong project. You can always move it across projects. Or create a new issue once more is known, and link back to the original issue. You could even set up a “generic/unsure” project for such hard to pinpoint bugs, where they can live until they can be properly categorised.

Kyle_Rankin · August 5, 2020, 4:39pm

Since people seem to be interested in the daily progress, the most recent change to our linux-next package contains a patch that doesn’t completely fix the issue but dramatically increases stability in Dogwood for most use cases:

https://master.pureos.net/export/changelogs/main/l/linux-librem5/amber-phone-staging_changelog

And don’t ask me what it all means, I’m just a user in this case! I can say that after applying this upgrade I haven’t experienced the issue yet.

dos · August 6, 2020, 1:12am

You are right, it’s not there. We usually post things publicly on our issue tracker - that’s our default mode of operation - but that wasn’t the case this time. In fact, there’s not even a private ticket on this issue there. Not an excuse, but an explanation: we are all remote and the bug tracker is usually our main way to synchronize on the work. This case however required higher than usual level of synchronization and brainstorming, so we have been mostly discussing it over less asynchronous means like text and voice chat, and it turns out that nobody bothered to copy things from there to the issue tracker afterwards.

Posting it publicly probably wouldn’t be technically useful - it’s unlikely anyone would be able to help in any meaningful way without having the actual hardware in hand - but it would certainly make you more informed on the progress, so yeah, we could have done that better.

Anyway, what we know now is that the issue is triggered when there’s traffic on the i2c-0 bus. Not all the time, there’s just a small probability of it happening, but depending on your workload there might be high traffic there so the end outcome is that the probability of hitting the bug gets higher with time.

We have a workaround now that Kyle mentioned above - it effectively disables voltage scaling of the CPU ARM cores, since that’s what generated the vast majority of traffic on the i2c-0 bus; so the bug is much harder to trigger with that patch in place. This may slightly increase power consumption, but the difference doesn’t seem to be significant since the 1.0GHz state it affects is the least commonly used one.

However, this is just a workaround, not a fix. Other uses of i2c-0 bus may still trigger the bug, and that includes RTC clock (but that usually gets used only once at boot and once at shutdown, so it doesn’t really matter) and USB-C controller. So, with the workaround applied, the act of (un)plugging the USB cable may still trigger the bug. So far I haven’t seen that happening on my device in the past few days since I got that workaround applied, but the theoretical possibility is still there.

We continue to look for the root cause of the issue. I know that we have some suspicions, but no hard facts yet and those go way outside of my particular area of expertise anyway so I’m probably not the right person to describe them

FWIW, the workaround is already packaged in the staging repos.

JR-Fi · August 7, 2020, 7:40am

This is promising (both, about talking about the issue and troubleshooting). Keep it up!

Fan · August 9, 2020, 5:05pm

Kyle,

Many years ago I had a small computer repair business. Problems like this became my favorite to fix.

First try running a few (more than one) phones in the freezer, literally. …Inside a plastic bag so it can’t get wet. … Transistors are definitely heat sensitive. Of course do keep within the max temperature specs of the chips.

Also run some in an cool oven. They should be able to take a substantial increased ambient temperature, like for example 120 degrees F and not fail. I also used to use a heat gun to spot heat boards up, and then freeze spray to selectively cool parts down.

If this does happen to turn out to be heat related, look at the most high power components first! From lots of experience at this, they are the ones that are most likely to have issues. But of course anything and everything is open for scrutiny.

Also look at what could cause a hard shutdown, and also what could do it intermittently. Like memory, bus, or cpu bit errors. Voltage instability. Noise on bus. That one is a hard one to find. Much easier to over quiet it and see if things get better than to directly observe it.

Is there any way you can isolate parts of the phone, e.g. unplug the modem, display, etc.? Divide and conquer is your best friend if at all possible.

And given that this is intermittent, and not directly repeatable, then it is chaotic, and thus very likely from an overheated or underrated or defective component.

Possibly they are shipping you silicon 2nds, i.e. chips that failed in the factory, but were good enough to fool you. I had to replace a bunch of brand new capacitors on some hardware I was manufacturing once because from time to time they would catch on fire. That was wild. They were rated for 35 volts, and all I had on them was 5 volts. I had purchased them from a non-main stream vendor. That was a mistake on my part.

Stay on it until you’re sure you fix this! You’ll thank yourself later. Love you guys.

Gavaudan · August 9, 2020, 11:32pm

You might direct your suggestions to @dos since he seems to be more directly involved with fixing the issue, or at least closer to those who are.

PrivateWhite · August 10, 2020, 5:54pm

Hi Kyle,
It is really great to read that dogwood started shipping out at the beginning of August. Would it be possible to get to know how large the dogwood batch is going to be?

Kyle_Rankin · August 10, 2020, 6:31pm

We rarely disclose sales numbers outside of crowdfunding targets and also haven’t been disclosing the size of the pre-mass-production batches other than to say they are small.

PrivateWhite · August 10, 2020, 6:37pm

That’s ok. I was asking this to get an impression on whether it makes sense to hope for more detailed reviews - as up until now the feedback I found on received dogwood batch phones came from purism employees. And based on other threads I read, I am not the only one eagerly awaiting some news from end users on their dogwood devices

Kyle_Rankin · August 10, 2020, 6:43pm

Hopefully you’ll start to see some things soon as people start receiving them. We do also have a review unit going out.

PrivateWhite · August 10, 2020, 6:44pm

Great, Thanks for the info.

eugenr · August 30, 2020, 10:10am

We have a workaround now that Kyle mentioned above - it effectively disables voltage scaling of the CPU ARM cores[…]

However, this is just a workaround, not a fix. Other uses of i2c-0 bus may still trigger the bug […]

We continue to look for the root cause of the issue.[…]

Does anybody have any new info on this bug? What causes the problem? Is there any proper fix for it? Are you still looking for one?

Thanks.

eugenr · December 24, 2020, 10:20am

@dos tries to make some light to what happened to the Dogwood bug:

I’m not exactly sure, I’m a software guy there - but I don’t think there was a single particular cause ever identified. There were some suspicions, but unfortunately those go far beyond my understanding of electrical engineering In the end the layout around the PMIC has been carefully reviewed by multiple parties with plenty of cautionary measures added and it turned out to improve things.

From what I can say, the shutdowns were triggered by i2c0 activity. Basically there was a small chance of it happening each time i2c0 was in use - and since the most heavy user of that bus was CPU voltage scaling, we have workaround that by setting the CPU voltage to a fixed value on Dogwood, which made those shutdowns extremely rare there. We have stress-tested i2c0 bus on Evergreen a lot and haven’t been able to reproduce those shutdowns there.

From here: https://teddit.net/r/Purism/comments/kj1lm6/first_ever_librem5_review_which_is_not_biased/ggvxs8x/

Thanks.