Re: [DISCUSS] Unsustainable situation with ptests


Will there be a separate voting thread? Or is the voting on this thread sufficient for the lockdown?

Thanks
Prasanth

> On May 14, 2018, at 2:34 PM, Alan Gates <alanfgates@xxxxxxxxx> wrote:
> 
> I see there's support for this, but people are still pouring in commits.
> I propose we have a quick vote on this to lock down the commits until we
> get to green. That way everyone knows we have drawn the line at a specific
> point. Any commits after that point would be reverted. There isn't a
> category in the bylaws that fits this kind of vote, but I suggest lazy
> majority as the most appropriate one (at least 3 votes, more +1s than
> -1s).
> 
> Alan.
> 
> On Mon, May 14, 2018 at 10:34 AM, Vihang Karajgaonkar <vihang@xxxxxxxxxxxx>
> wrote:
> 
>> I worked on a few quick-fix optimizations in the Ptest infrastructure over
>> the weekend, which reduced the execution time from ~90 min to ~70 min per
>> run. I had to restart Ptest multiple times. I was resubmitting the patches
>> which were in the queue manually, but I may have missed a few. In case you
>> have a patch which is pending pre-commit and you don't see it in the queue,
>> please submit it manually, or let me know if you don't have access to the
>> Jenkins job. I will continue to work on the sub-tasks in HIVE-19425 and
>> will do some maintenance next weekend as well.
>> 
>> On Mon, May 14, 2018 at 7:42 AM, Jesus Camacho Rodriguez <
>> jcamacho@xxxxxxxxxx> wrote:
>> 
>>> Vineet has already been working on disabling those tests that were timing
>>> out. I am working on disabling those that have been consistently
>>> generating different q files over the last n ptest runs. I am keeping
>>> track of all these tests in https://issues.apache.org/jira/browse/HIVE-19509.
>>> 
>>> -Jesús
>>> 
>>> On 5/14/18, 2:25 AM, "Prasanth Jayachandran" <
>>> pjayachandran@xxxxxxxxxxxxxxx> wrote:
>>> 
>>>    +1 on freezing commits until we get repeated green test runs. We should
>>> probably disable (and record in a JIRA to re-enable them at a later point)
>>> the tests that are flaky, so we can get repeated green test runs.
>>> 
>>>    Thanks
>>>    Prasanth
>>> 
>>> 
>>> 
>>>    On Mon, May 14, 2018 at 2:15 AM -0700, "Rui Li" <lirui.fudan@xxxxxxxxx> wrote:
>>> 
>>> 
>>>    +1 to freezing commits until we stabilize
>>> 
>>>    On Sat, May 12, 2018 at 6:10 AM, Vihang Karajgaonkar wrote:
>>> 
>>>> In order to understand the end-to-end pre-commit flow, I would like to
>>>> get access to the PreCommit-HIVE-Build Jenkins script. Does anyone know
>>>> how I can get that?
>>>> 
>>>> On Fri, May 11, 2018 at 2:03 PM, Jesus Camacho Rodriguez <
>>>> jcamacho@xxxxxxxxxx> wrote:
>>>> 
>>>>> Bq. For the short-term green runs, I think we should @Ignore the tests
>>>>> which have been known to be failing for many runs. They are not really
>>>>> being addressed anyway. If people think they are important to run, we
>>>>> should fix them and only then re-enable them.
>>>>> 
>>>>> I think that is a good idea, as we would minimize the time that we halt
>>>>> development. We can create a JIRA where we list all the tests that were
>>>>> failing and that we have disabled to get the clean run. From that
>>>>> moment, we will have zero tolerance towards committing with failing
>>>>> tests. And we need to pick up those tests that should not be ignored
>>>>> and bring them back, passing. If there is no disagreement, I can start
>>>>> working on that.
>>>>> 
>>>>> Once I am done, I can try to help with infra tickets too.
>>>>> 
>>>>> -Jesús
>>>>> 
>>>>> 
>>>>> On 5/11/18, 1:57 PM, "Vineet Garg"  wrote:
>>>>> 
>>>>>    +1. I strongly vote for freezing commits and getting our test
>>>>>    coverage into an acceptable state. We have been struggling to
>>>>>    stabilize branch-3 due to test failures, and releasing Hive 3.0 in
>>>>>    its current state would be unacceptable.
>>>>> 
>>>>>    Currently there are quite a few test suites which are not even
>>>>>    running because they time out. We have been committing patches (to
>>>>>    both branch-3 and master) without test coverage for these tests.
>>>>>    We should immediately figure out what's going on before we proceed
>>>>>    with commits.
>>>>> 
>>>>>    For reference, the following test suites are timing out on master
>>>>>    (https://issues.apache.org/jira/browse/HIVE-19506):
>>>>> 
>>>>> 
>>>>>    TestDbNotificationListener - did not produce a TEST-*.xml file (likely timed out)
>>>>>    TestHCatHiveCompatibility - did not produce a TEST-*.xml file (likely timed out)
>>>>>    TestNegativeCliDriver - did not produce a TEST-*.xml file (likely timed out)
>>>>>    TestNonCatCallsWithCatalog - did not produce a TEST-*.xml file (likely timed out)
>>>>>    TestSequenceFileReadWrite - did not produce a TEST-*.xml file (likely timed out)
>>>>>    TestTxnExIm - did not produce a TEST-*.xml file (likely timed out)
>>>>> 
>>>>> 
>>>>>    Vineet
>>>>> 
>>>>> 
>>>>>    On May 11, 2018, at 1:46 PM, Vihang Karajgaonkar <vihang@xxxxxxxxxxxx> wrote:
>>>>> 
>>>>>    +1 There are many problems with the test infrastructure, and in my
>>>>>    opinion it has now become the number-one bottleneck for the project.
>>>>>    I was looking at the infrastructure yesterday, and I think the
>>>>>    current infrastructure (even with its own set of problems) is still
>>>>>    under-utilized. I am planning to increase the number of threads that
>>>>>    process the parallel test batches, to start with. It needs a restart
>>>>>    on the server side. I can do it now, if folks are okay with it. Else
>>>>>    I can do it over the weekend when the queue is small.
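>>>>> 
>>>>>    As a minimal sketch of that approach (the batch names, pool size,
>>>>>    and class name here are made up for illustration, not the actual
>>>>>    ptest code), a driver running test batches on a configurable worker
>>>>>    pool could look like:
>>>>> 
>>>>>        import java.util.Arrays;
>>>>>        import java.util.List;
>>>>>        import java.util.concurrent.ExecutorService;
>>>>>        import java.util.concurrent.Executors;
>>>>>        import java.util.concurrent.TimeUnit;
>>>>> 
>>>>>        public class ParallelBatchSketch {
>>>>>          public static void main(String[] args) throws InterruptedException {
>>>>>            // Hypothetical batches; the real driver would build these
>>>>>            // from the test suites queued for a run.
>>>>>            List<String> batches = Arrays.asList("batch-1", "batch-2", "batch-3");
>>>>>            // This pool size is the "number of threads" knob mentioned above.
>>>>>            ExecutorService pool = Executors.newFixedThreadPool(4);
>>>>>            for (String batch : batches) {
>>>>>              pool.submit(() -> System.out.println("executing " + batch));
>>>>>            }
>>>>>            pool.shutdown();
>>>>>            pool.awaitTermination(1, TimeUnit.HOURS);
>>>>>          }
>>>>>        }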
>>>>> 
>>>>>    I listed the improvements which I thought would be useful under
>>>>>    https://issues.apache.org/jira/browse/HIVE-19425, but frankly
>>>>>    speaking, I am not able to devote as much time to it as I would
>>>>>    like. I would appreciate it if folks who have some more time could
>>>>>    help out.
>>>>> 
>>>>>    I think, to start with, https://issues.apache.org/jira/browse/HIVE-19429
>>>>>    will help a lot. We need to pack in more test runs in parallel, and
>>>>>    containers provide good isolation.
>>>>> 
>>>>>    For the short-term green runs, I think we should @Ignore the tests
>>>>>    which have been known to be failing for many runs. They are not
>>>>>    really being addressed anyway. If people think they are important to
>>>>>    run, we should fix them and only then re-enable them.
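>>>>> 
>>>>>    As a minimal sketch (the class and test names below are hypothetical,
>>>>>    not actual Hive tests), a known-flaky JUnit 4 test could be disabled
>>>>>    like this, with the tracking JIRA noted in the reason string:
>>>>> 
>>>>>        import static org.junit.Assert.assertEquals;
>>>>> 
>>>>>        import org.junit.Ignore;
>>>>>        import org.junit.Test;
>>>>> 
>>>>>        public class TestFlakyExample {
>>>>>          // Disabled to keep the runs green; the tracking JIRA records
>>>>>          // that it must be fixed and re-enabled.
>>>>>          @Ignore("Fails intermittently; see the tracking JIRA before re-enabling")
>>>>>          @Test
>>>>>          public void testSomething() {
>>>>>            assertEquals(4, 2 + 2);
>>>>>          }
>>>>>        }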
>>>>> 
>>>>>    Also, I feel we need a light-weight test run which we can run
>>>>>    locally before submitting a patch for the full suite. That way,
>>>>>    minor issues with the patch can be handled locally. Maybe we can
>>>>>    create a profile which runs a subset of important tests that are
>>>>>    consistent. We could apply some label indicating that the
>>>>>    pre-checkin local tests ran successfully, and only then submit the
>>>>>    patch for the full suite.
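>>>>> 
>>>>>    One way to express such a subset in JUnit 4 (the marker interface
>>>>>    and class names below are hypothetical, not an existing Hive
>>>>>    profile) is with categories:
>>>>> 
>>>>>        import org.junit.Test;
>>>>>        import org.junit.experimental.categories.Categories;
>>>>>        import org.junit.experimental.categories.Categories.IncludeCategory;
>>>>>        import org.junit.experimental.categories.Category;
>>>>>        import org.junit.runner.RunWith;
>>>>>        import org.junit.runners.Suite.SuiteClasses;
>>>>> 
>>>>>        // Hypothetical marker for the "important and consistent" subset.
>>>>>        interface PreCheckin {}
>>>>> 
>>>>>        public class TestCoreSanity {
>>>>>          @Category(PreCheckin.class)
>>>>>          @Test
>>>>>          public void testImportantThing() {
>>>>>            // A fast, deterministic check for the local subset.
>>>>>          }
>>>>> 
>>>>>          @Test
>>>>>          public void testSlowThing() {
>>>>>            // Untagged, so the quick suite skips it.
>>>>>          }
>>>>>        }
>>>>> 
>>>>>        // Running this suite executes only the tests tagged PreCheckin.
>>>>>        @RunWith(Categories.class)
>>>>>        @IncludeCategory(PreCheckin.class)
>>>>>        @SuiteClasses(TestCoreSanity.class)
>>>>>        class PreCheckinSuite {}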
>>>>> 
>>>>>    More thoughts are welcome. Thanks for starting this conversation.
>>>>> 
>>>>>    On Fri, May 11, 2018 at 1:27 PM, Jesus Camacho Rodriguez <
>>>>>    jcamacho@xxxxxxxxxx> wrote:
>>>>> 
>>>>>    I believe we have reached a state (maybe we reached it a while ago)
>>>>>    that is not sustainable anymore, as there are so many tests failing /
>>>>>    timing out that it is not possible to verify whether a patch is
>>>>>    breaking some critical parts of the system or not. It also seems to
>>>>>    me that, due to the timeouts (maybe due to infra, maybe not), ptest
>>>>>    runs are taking even longer than usual, which in turn creates an
>>>>>    even longer queue of patches.
>>>>> 
>>>>>    There is an ongoing effort to improve ptest usability
>>>>>    (https://issues.apache.org/jira/browse/HIVE-19425), but apart from
>>>>>    that, we need to make an effort to stabilize the existing tests and
>>>>>    bring the failure count to zero.
>>>>> 
>>>>>    Hence, I am suggesting *we stop committing any patch before we get
>>>>>    a green run*. If someone thinks this proposal is too radical, please
>>>>>    come up with an alternative, because I do not think it is OK to have
>>>>>    the ptest runs in their current state. Other projects of a certain
>>>>>    size (e.g., Hadoop, Spark) are always green; we should be able to do
>>>>>    the same.
>>>>> 
>>>>>    Finally, once we get to zero failures, I suggest we be less tolerant
>>>>>    of committing without a clean ptest run. If there is a failure, we
>>>>>    need to fix it or revert the patch that caused it; then we continue
>>>>>    developing.
>>>>> 
>>>>>    Please, let's all work together as a community to fix this issue;
>>>>>    that is the only way to get to zero quickly.
>>>>> 
>>>>>    Thanks,
>>>>>    Jesús
>>>>> 
>>>>>    PS. I assume the flaky tests will come into the discussion. Let's
>>>>>    first see how many of those we have; then we can work to find a fix.
>>>>> 
>>> 
>>>    --
>>>    Best regards!
>>>    Rui Li