ARM test fails on Rawhide due to fsck taking too long
Open, Unbreak Now! Public

Description

So I looked a bit more into the failing ARM test on Rawhide. I extended the timeout for the test on staging to 1200 (which gets multiplied by the scaling factor so it winds up being 6000). That lets the test get farther, but it still fails.
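For reference, the timeout arithmetic can be sketched like this. The scaling factor of 5 is an assumption inferred from 6000 / 1200 (it isn't stated explicitly here), and `effective_timeout` is a hypothetical helper name, not anything from openQA itself.

```python
def effective_timeout(base_seconds: int, scale: int) -> int:
    """Return the timeout that actually applies once the base value is scaled."""
    return base_seconds * scale


# Assumption: a scaling factor of 5, inferred from 1200 -> 6000.
print(effective_timeout(1200, 5))  # prints 6000
```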

It looks like the system is getting stuck running an fsck at boot, and systemd eventually times out and fails the boot because the fsck takes too long. In https://openqa.stg.fedoraproject.org/tests/63905#step/install_arm_image_deployment/8 you see it starting the fsck, in https://openqa.stg.fedoraproject.org/tests/63905#step/install_arm_image_deployment/9 we can see it's been running for 15 minutes, and by https://openqa.stg.fedoraproject.org/tests/63905#step/install_arm_image_deployment/10 we can see the 'Local File Systems' target failing because various devices haven't shown up.

Compare to the f25 Final test: https://openqa.fedoraproject.org/tests/48478 . If we look at the video for that test, we can see the same fsck does actually *happen*, but it completes within the space of 3 seconds: we see 'Starting File System Check' between 168.441856 and 169.935881, and 'Started File System Check' between 169.935881 and 171.030909.
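A rough sketch of how those video timestamps bound the f25 fsck duration. The variable and function names are just for illustration; the only real data is the four timestamps quoted above.

```python
# Windows (in seconds) within which each message first shows up in the video.
starting_seen = (168.441856, 169.935881)  # 'Starting File System Check'
started_seen = (169.935881, 171.030909)   # 'Started File System Check'


def duration_bounds(start_window, end_window):
    """Return (shortest, longest) possible duration given two appearance windows."""
    # Shortest case: it started as late as possible and finished as early as possible.
    shortest = max(0.0, end_window[0] - start_window[1])
    # Longest case: it started as early as possible and finished as late as possible.
    longest = end_window[1] - start_window[0]
    return shortest, longest


shortest, longest = duration_bounds(starting_seen, started_seen)
print(f"fsck took between {shortest:.3f} and {longest:.3f} seconds")
```

So the upper bound works out to roughly 2.6 seconds, compared to the 15+ minutes seen on Rawhide.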

So obviously something very different happens to the fsck process between f25 and Rawhide, but I'm not sure why. It could be that some issue in the image compose process leaves the filesystem actually inconsistent, so it has to be repaired, and that takes forever in emulation. But it needs looking into. Could you please recreate the test manually and figure out what's going on with the fsck? Thanks.

adamwill created this task. Dec 13 2016, 9:01 PM

I can reproduce it locally using QEMU. I'll ask @lbrabec to test it on actual HW to see if it's happening there too.

I suspect on real HW it might actually boot OK, since the fsck will run much quicker...

garretraziel added a comment. Edited Jan 4 2017, 2:53 PM

OK, so I tried to dig deeper into it today and I concluded that ARM is a piece of sh*t technology and people should be punished for using it.

Edit: We weren't able to test it on real HW (with the Fedora-Server-armhfp-Rawhide-20170102.n.0 image), and I suspect a borked SD card (we should get a new one). I saw the same problem when running through virt-manager (as in this guideline). I'll investigate it further.

This is the output from running with virt-manager. I'm not sure whether this is a bug in Rawhide or in ARM virtualization.

jsedlak claimed this task. Mar 6 2017, 4:23 PM
jsedlak added a subscriber: garretraziel.

Don't use a pastebin as a permanent record, it expires :)

I bumped the timeout on openQA at the end of last week, so now the tests get to run for a bit longer, and give us more useful output:

https://openqa.fedoraproject.org/tests/59723

I think at this point we're running into the Alpha blocker https://bugzilla.redhat.com/show_bug.cgi?id=1422634 , so that may be making it difficult to encounter and investigate *this* problem, if it still exists. You may want to try booting with selinux=0 or enforcing=0 .