Re: [DISCUSS] Unsustainable situation with ptests


Vineet has already been working on disabling the tests that were timing out. I am working on disabling those that have consistently generated different q files over the last n ptest runs. I am keeping track of all these tests in https://issues.apache.org/jira/browse/HIVE-19509.

-Jesús

On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <pjayachandran@xxxxxxxxxxxxxxx> wrote:

    +1 on freezing commits until we get repeated green test runs. To get there, we should probably disable the flaky tests (and track them in a JIRA so that we re-enable them at a later point).
    
    Thanks
    Prasanth
    
    
    
    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fudan@xxxxxxxxx> wrote:
    
    
    +1 to freezing commits until we stabilize
    
    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar
    wrote:
    
    > In order to understand the end-to-end precommit flow I would like to get
    > access to the PreCommit-HIVE-Build Jenkins script. Does anyone know how
    > I can get that?
    >
    > On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
    > jcamacho@xxxxxxxxxx> wrote:
    >
    > > Bq. For the short term green runs, I think we should @Ignore the tests
    > > which have been failing for many runs. They are not being addressed
    > > anyway. If people think they are important to run, we should fix them
    > > and only then re-enable them.
    > >
    > > I think that is a good idea, as we would minimize the time that we halt
    > > development. We can create a JIRA where we list all tests that were
    > > failing and that we have disabled to get the clean run. From that
    > > moment, we will have zero tolerance towards committing with failing
    > > tests. And we need to pick up those tests that should not be ignored
    > > and bring them back, but passing. If there is no disagreement, I can
    > > start working on that.
    > >
    > > Once I am done, I can try to help with infra tickets too.
    > >
    > > -Jesús
    > >
    > >
    > > On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
    > >
    > >     +1. I strongly vote for freezing commits and getting our test
    > >     coverage into an acceptable state. We have been struggling to
    > >     stabilize branch-3 due to test failures, and releasing Hive 3.0 in
    > >     its current state would be unacceptable.
    > >
    > >     Currently there are quite a few test suites which are not even
    > >     running and are being timed out. We have been committing patches (to
    > >     both branch-3 and master) without test coverage for these tests.
    > >     We should immediately figure out what’s going on before we proceed
    > >     with commits.
    > >
    > >     For reference, the following test suites are timing out on master
    > >     (https://issues.apache.org/jira/browse/HIVE-19506):
    > >
    > >
    > >     TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >     TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >     TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >     TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >     TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >     TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
    > >
    > >
    > >     Vineet
    > >
    > >
    > >     On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vihang@xxxxxxxxxxxx> wrote:
    > >
    > >     +1 There are many problems with the test infrastructure and in my
    > >     opinion it has now become the number one bottleneck for the project. I
    > >     was looking at the infrastructure yesterday and I think the current
    > >     infrastructure (even with its own set of problems) is still
    > >     under-utilized. I am planning to increase the number of threads that
    > >     process the parallel test batches to start with. It needs a restart on
    > >     the server side. I can do it now, if folks are okay with it. Otherwise
    > >     I can do it over the weekend when the queue is small.
    > >
    > >     I listed the improvements which I thought would be useful under
    > >     https://issues.apache.org/jira/browse/HIVE-19425 but frankly speaking
    > >     I am not able to devote as much time to it as I would like. I would
    > >     appreciate it if folks who have some more time could help out.
    > >
    > >     I think that, to start with,
    > >     https://issues.apache.org/jira/browse/HIVE-19429 will help a lot. We
    > >     need to pack more test runs in parallel and containers provide good
    > >     isolation.
    > >
    > >     For the short term green runs, I think we should @Ignore the tests
    > >     which have been failing for many runs. They are not being addressed
    > >     anyway. If people think they are important to run, we should fix them
    > >     and only then re-enable them.
    > >
    > >     Also, I feel we need a light-weight test run which we can run locally
    > >     before submitting a patch for the full suite. That way minor issues
    > >     with the patch can be handled locally. Maybe create a profile which
    > >     runs a subset of important tests which are consistent. We can apply
    > >     some label indicating that the pre-checkin-local tests ran
    > >     successfully, and only then submit for the full suite.
    > >
    > >     More thoughts are welcome. Thanks for starting this conversation.
    > >
    > >     On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
    > >     jcamacho@xxxxxxxxxx> wrote:
    > >
    > >     I believe we have reached a state (maybe we reached it a while ago)
    > >     that is not sustainable anymore, as there are so many tests failing /
    > >     timing out that it is not possible to verify whether a patch is
    > >     breaking some critical parts of the system or not. It also seems to me
    > >     that due to the timeouts (maybe due to infra, maybe not), ptest runs
    > >     are taking even longer than usual, which in turn creates an even longer
    > >     queue of patches.
    > >
    > >     There is an ongoing effort to improve ptest usability
    > >     (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
    > >     that, we need to make an effort to stabilize existing tests and bring
    > >     the failure count to zero.
    > >
    > >     Hence, I am suggesting *we stop committing any patch before we get a
    > >     green run*. If someone thinks this proposal is too radical, please come
    > >     up with an alternative, because I do not think it is OK to have the
    > >     ptest runs in their current state. Other projects of a certain size
    > >     (e.g., Hadoop, Spark) are always green; we should be able to do the
    > >     same.
    > >
    > >     Finally, once we get to zero failures, I suggest we are less tolerant
    > >     of committing without getting a clean ptest run. If there is a
    > >     failure, we need to fix it or revert the patch that caused it, then we
    > >     continue developing.
    > >
    > >     Please, let’s all work together as a community to fix this issue; that
    > >     is the only way to get to zero quickly.
    > >
    > >     Thanks,
    > >     Jesús
    > >
    > >     PS. I assume the flaky tests will come into the discussion. Let's see
    > >     first how many of those we have, then we can work to find a fix.
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    > >
    >
    
    
    
    --
    Best regards!
    Rui Li