Re: [DISCUSS] Unsustainable situation with ptests


> For the short term green runs, I think we should @Ignore the tests which
> are known to be failing since many runs. They are anyways not being
> addressed as such. If people think they are important to be run we should
> fix them and only then re-enable them.

I think that is a good idea, as it would minimize the time that we halt development. We can create a JIRA listing all the tests that were failing and that we have disabled to get a clean run. From that moment on, we will have zero tolerance towards committing with failing tests. We also need to pick up those tests that should not stay ignored and re-enable them once they pass. If there is no disagreement, I can start working on that.
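
As a rough sketch of how the list for that JIRA could be seeded, assuming the ptest report lines have the form quoted further down in this thread ("<suite> - did not produce a TEST-*.xml file (likely timed out)"), something like the following could pull out the suite names. The class name TimedOutSuites and the extract helper are hypothetical; they are not part of Hive or ptest.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Hive or ptest.
final class TimedOutSuites {
    // Assumed line format, taken from the report quoted in this thread:
    //   TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    private static final Pattern TIMED_OUT = Pattern.compile(
        "^\\s*(\\w+) - did not produce a TEST-\\*\\.xml file.*$");

    // Returns the suite names from lines that match the assumed format,
    // in the order they appear; all other lines are skipped.
    static List<String> extract(List<String> reportLines) {
        List<String> suites = new ArrayList<>();
        for (String line : reportLines) {
            Matcher m = TIMED_OUT.matcher(line);
            if (m.matches()) {
                suites.add(m.group(1));
            }
        }
        return suites;
    }

    public static void main(String[] args) {
        List<String> report = Arrays.asList(
            "TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)",
            "TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)",
            "some unrelated log line");
        // Print one suite name per line, ready to paste into a JIRA description.
        for (String suite : extract(report)) {
            System.out.println(suite);
        }
    }
}
```

Tests that fail outright (rather than time out) would need a second pattern, which I have left out because the exact wording of those report lines is not quoted here.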

Once I am done, I can try to help with infra tickets too.

-Jesús


On 5/11/18, 1:57 PM, "Vineet Garg" <vgarg@xxxxxxxxxxxxxxx> wrote:

    +1. I strongly vote for freezing commits and getting our test coverage into an acceptable state. We have been struggling to stabilize branch-3 due to test failures, and releasing Hive 3.0 in its current state would be unacceptable.
    
    Currently there are quite a few test suites which are not even running and are timing out. We have been committing patches (to both branch-3 and master) without test coverage for these tests.
    We should immediately figure out what’s going on before we proceed with commits.
    
    For reference, the following test suites are timing out on master (https://issues.apache.org/jira/browse/HIVE-19506):
    
    
    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
    
    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
    
    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
    
    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
    
    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
    
    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    
    
    Vineet
    
    
    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vihang@xxxxxxxxxxxx<mailto:vihang@xxxxxxxxxxxx>> wrote:
    
    +1 There are many problems with the test infrastructure and in my opinion
    it has now become the number one bottleneck for the project. I was looking
    at the infrastructure yesterday and I think the current infrastructure
    (even with its own set of problems) is still under-utilized. I am planning
    to increase the number of threads to process the parallel test batches to
    start with. It needs a restart on the server side. I can do it now, if
    folks are okay with it. Else I can do it over the weekend when the queue
    is small.
    
    I listed the improvements which I thought would be useful under
    https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking I am
    not able to devote as much time to it as I would like. I would
    appreciate it if folks who have some more time could help out.
    
    I think to start with https://issues.apache.org/jira/browse/HIVE-19429 will
    help a lot. We need to pack more test runs in parallel and containers
    provide good isolation.
    
    For the short term green runs, I think we should @Ignore the tests which
    are known to be failing since many runs. They are anyways not being
    addressed as such. If people think they are important to be run we should
    fix them and only then re-enable them.
    
    Also, I feel we need a light-weight test run which we can run locally
    before submitting a patch for the full suite. That way minor issues with
    the patch can be handled locally. Maybe we could create a profile which
    runs a subset of important tests which are consistent. We can apply some
    label indicating that the pre-checkin-local tests ran successfully, and
    only then submit for the full suite.
    
    More thoughts are welcome. Thanks for starting this conversation.
    
    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    jcamacho@xxxxxxxxxx<mailto:jcamacho@xxxxxxxxxx>> wrote:
    
    I believe we have reached a state (maybe we did reach it a while ago) that
    is not sustainable anymore, as there are so many tests failing / timing out
    that it is not possible to verify whether a patch is breaking some critical
    parts of the system or not. It also seems to me that due to the timeouts
    (maybe due to infra, maybe not), ptest runs are taking even longer than
    usual, which in turn creates an even longer queue of patches.
    
    There is an ongoing effort to improve ptests usability (
    https://issues.apache.org/jira/browse/HIVE-19425), but apart from that,
    we need to make an effort to stabilize existing tests and bring that
    failure count to zero.
    
    Hence, I am suggesting *we stop committing any patch before we get a green
    run*. If someone thinks this proposal is too radical, please come up with
    an alternative, because I do not think it is OK to have the ptest runs in
    their current state. Other projects of a certain size (e.g., Hadoop, Spark)
    are always green; we should be able to do the same.
    
    Finally, once we get to zero failures, I suggest we be less tolerant of
    committing without getting a clean ptest run. If there is a failure, we
    need to fix it or revert the patch that caused it, and then we continue
    developing.
    
    Please, let’s all work together as a community to fix this issue; that is
    the only way to get to zero quickly.
    
    Thanks,
    Jesús
    
    PS. I assume the flaky tests will come into the discussion. Let's see
    first how many of those we have, then we can work to find a fix.